SQL for Data Science: Filtering, Sorting, and Calculating Data
In our previous article we discussed how to select and retrieve data from databases using SQL. Since many databases can often have millions of records, however, we need a way to filter and sort the data with SQL.
In this guide, we'll discuss several new clauses and operations for filtering data such as:
WHERE
BETWEEN
IN
OR
NOT
LIKE
ORDER BY
GROUP BY
This article is based on notes from Week 2 of this course on SQL for Data Science and is organized as follows:
- The Basic of Filtering with SQL
- Advanced Filtering with
IN
,OR
, andNOT
- Using Wildcards in SQL
- Sorting Data with
ORDER BY
- Math Operators
- Aggregate Functions
- Grouping Data with SQL
Stay up to date with AI
We're an independent group of machine learning engineers, quantitative analysts, and quantum computing enthusiasts. Subscribe to our newsletter and never miss our articles, latest news, etc.
The Basic of Filtering with SQL
In this section, we'll discuss how to use filtering in order to narrow down the data we want to retrieve from a database.
Filtering is also used when you only want to perform analysis on a subset of data or use specific data as part of the model.
In particular, we'll discuss how the use the WHERE
clause with common operators, the BETWEEN
clause, and explain the concept of a NULL
value.
The Benefits of Filtering
As mentioned, filtering is used when you need to be specific about the data you want to retrieve.
Filtering is also used to reduce the number of records you want to retrieve.
This can increase query performance and reduce the strain on the client application.
WHERE
Clause Operators
To do so, we used the WHERE
clause after the SELECT
and FROM
clause:
SELECT column name, column name
FROM table_name
WHERE column name operator value;
In terms of operator values in the WHERE
clause, we can use the following:
- = Equal operator
- <> Not equal
- > Greater than operator
- < Less than operator
- = Equal operator
- >= Greater than or equal
- <= Less than or equal
- BETWEEN Between an inclusive range
- IS NULL Is a null value
When we're filtering values, we can filter using a string or a single value. For example, we can select products above a certain price as follows:
SELECT ProductName, UnitPrice, SupplierID
FROM Products
WHERE UnitPrice >= 75;
Advanced Filtering with IN
, OR
, and NOT
In this section we'll look at several more advanced filtering techniques including IN
, OR
, and NOT
.
IN
Operator
The IN
operator specifies a range of conditions to filter by.
In order to specify the number of conditions, we enclose the IN
operator with parentheses () and a comma delimited list of values:
SELECT ProductId, UnitPrice, SupplierID
FROM Products
WHERE SupplierID IN (9, 10, 11);
OR
Operator
The OR
operator is another useful filtering technique.
It's important to note that a database management system will not evaluate the second condition in a WHERE
clause if the first condition is met.
This means you need to be specific about the order you place items in the query.
SELECT ProductId, UnitPrice, SupplierID
FROM Products
WHERE ProductName = 'Tofu' OR 'Konbu';
It's worth noting that IN
works the same as OR
, although IN
gives you more options in the number of items you can list.
IN
also executes faster than OR
and you don't need to worry about the order in which you place items.
Order of Operations
In writing a query you can use both AND
and OR
operators, although know that SQL processes AND
before OR
.
For this reason, it's best practice to not rely on the default order of operations and instead be specific about the order using ()
when using OR
and AND
together.
Not Operator
The NOT
operator is used to exclude certain options.
This is useful if you want to select everything from a database except certain items:
SELECT *
FROM Employees
WHERE NOT CITY='London' AND
NOT City='Seattle';
Using Wildcards in SQL
In this section, we'll discuss the concept of wildcards in SQL and how to the LIKE
operator with wildcards.
What are Wildcards?
Wildcards are a special character used to match certain parts of a value.
Wildcards search for a pattern from literal text, wildcard character, or a combination of the two.
You'll use LIKE
with wildcards, which is actually a predicate instead of an operator, although they're commonly referred to as operators.
It's important to note that LIKE
can only be used with strings and cannot be used for non-text datatypes.
As such, they are useful for data scientists as you're working with string variables.
Using % Wildcards
In order to use a wildcard you need to add a % sign before, after, or in the middle of a string. For example:
'%Pizza'
gets anything ending with the word pizza'Pizza%'
gets anything after the word pizza'%Pizza%'
gets anything before and after the word pizza'S%E'
gets anything that starts with "S" and ends with "E"
Sorting Data with ORDER BY
Up until now we've focused on filtering data, although there is no logical order to the data. In other words, it is still returned in the same order as it was entered in the database.
By sorting data with the ORDER BY
clause, we can return data based on numerical order or based on ascending or descending order according to the alphabet.
ORDER BY
allows you to sort data based on particular columns as follows:
SELECT something
FROM database
ORDER BY characteristic
A few rules for using ORDER BY
include:
- It takes the name of one or more columns
- Add a comma after each additional column name
- It can sort by a column not retrieved
- It must always be the last clause in a
SELECT
statement
You can also sort by column positions instead of the names by using ORDER BY 2,3
, which sorts by the second and third columns.
Finally, you can sort by directions using:
DESC
for descending orderASC
for ascending order
This only applies to the column names it directly precedes.
Math Operators
After we've selected, filtered, and sorted data, we can then apply mathematical operations to it.
In this section, we'll discuss several basic math calculations you can perform using data and the order of operations.
Below are several simple math operators in SQL:
+
for addition-
for subtraction*
for multiplication/
for division
For example, below is an example that multiplies the total units on order by the unit cost in order to calculate the total order cost:
SELECT ProductID
, UnitsOnOrder
, UnitPrice
, UnitsOnOrder * UnitPrice AS TotalOrderCost
FROM Products
The order of operations for math operators in SQL follows the normal order:
- Parentheses
- Exponents
- Multiplication
- Division
- Addition
- Subtraction
Aggregate Functions
Aggregate functions provide several ways to summarize data with SQL.
A few common use cases of aggregate functions include finding the highest and lowest values in a dataset, the total number of rows, the average value, and so on.
The aggregate functions we can use to analyze data are all quite self-explanatory, including:
AVG
COUNT
MIN
MAX
SUM
For example, below we're selecting the average unit price from all products:
SELECT AVG(UnitPrice) AS avg_price
FROM products
One thing to note about aggregate functions is that if the word DISTINCT
is not used, SQL will assume you want to retrieve ALL
the data. If there are duplicate records in the data, DISTINCT
allows you to remove them from the query.
Grouping Data with SQL
In the previous section on aggregate functions, we just looked at how to pull a single data field and aggregate over it.
Often, however, we need to be able to perform additional aggregations using the GROUP BY
and HAVING
clauses to better sort the data.
For example, if we want to know the number of customers by region we can do this as follows:
SELECT Region,
COUNT(CustomerID) AS total_customers
FROM Customers
GROUP BY Region;
It's important to note that GROUP BY
clauses can contain multiple columns. Also, every column in the SELECT
statement must be present in a GROUP BY
clause, except for aggregated calculations.
In terms of grouping data, we cannot use the WHERE
clause because it filters based on rows.
Instead, we can use the HAVING
clause when filtering for groups with an aggregate function. For example, we can group by customers with more than 2 orders as follows:
SELECT CustomerID
,COUNT (*) AS orders
FROM Orders
GROUP BY CustomerID
HAVING COUNT (*) >=2;
Summary: Filtering, Sorting, and Calculating Data with SQL
In this article, we discussed how to filter, sort, aggregate, calculate, and group data with SQL.
In summary, filtering allows you to narrow down your results, as well as to increase the query and application performance.
Filtering and sorting data also allows you to find specific values, blank values, and ranges of values. All of these functions help understanding your data better and with descriptive statistics.