SQL for Data Science: Filtering, Sorting, and Calculating Data

In our previous article we discussed how to select and retrieve data from databases using SQL. Since many databases can often have millions of records, however, we need a way to filter and sort the data with SQL.

In this guide, we'll discuss several new clauses and operations for filtering data such as:

  • WHERE
  • BETWEEN
  • IN
  • OR
  • NOT
  • LIKE
  • ORDER BY
  • GROUP BY

This article is based on notes from Week 2 of this course on SQL for Data Science and is organized as follows:

  • The Basic of Filtering with SQL
  • Advanced Filtering with IN,  OR, and NOT
  • Using Wildcards in SQL
  • Sorting Data with ORDER BY
  • Math Operators
  • Aggregate Functions
  • Grouping Data with SQL

Stay up to date with AI

We're an independent group of machine learning engineers, quantitative analysts, and quantum computing enthusiasts. Subscribe to our newsletter and never miss our articles, latest news, etc.

Great! Check your inbox and click the link.
Sorry, something went wrong. Please try again.

The Basic of Filtering with SQL

In this section, we'll discuss how to use filtering in order to narrow down the data we want to retrieve from a database.

Filtering is also used when you only want to perform analysis on a subset of data or use specific data as part of the model.

In particular, we'll discuss how the use the WHERE clause with common operators, the BETWEEN clause, and explain the concept of a NULL value.

The Benefits of Filtering

As mentioned, filtering is used when you need to be specific about the data you want to retrieve.

Filtering is also used to reduce the number of records you want to retrieve.

This can increase query performance and reduce the strain on the client application.

WHERE Clause Operators

To do so, we used the WHERE clause after the SELECT and FROM clause:

SELECT column name, column name
FROM table_name
WHERE column name operator value;

In terms of operator values in the WHERE clause, we can use the following:

  • = Equal operator
  • <> Not equal
  • > Greater than operator
  • < Less than operator
  • = Equal operator
  • >= Greater than or equal
  • <= Less than or equal
  • BETWEEN Between an inclusive range
  • IS NULL Is a null value

When we're filtering values, we can filter using a string or a single value. For example, we can select products above a certain price as follows:

SELECT ProductName, UnitPrice, SupplierID
FROM Products
WHERE UnitPrice >= 75;

Advanced Filtering with IN,  OR, and NOT

In this section we'll look at several more advanced filtering techniques including IN, OR, and NOT.

IN Operator

The IN operator specifies a range of conditions to filter by.

In order to specify the number of conditions, we enclose the IN operator with parentheses () and a comma delimited list of values:

SELECT ProductId, UnitPrice, SupplierID
FROM Products
WHERE SupplierID IN (9, 10, 11);

OR Operator

The OR operator is another useful filtering technique.

It's important to note that a database management system will not evaluate the second condition in a WHERE clause if the first condition is met.

This means you need to be specific about the order you place items in the query.

SELECT ProductId, UnitPrice, SupplierID
FROM Products
WHERE ProductName = 'Tofu' OR 'Konbu';

It's worth noting that IN works the same as OR, although IN gives you more options in the number of items you can list.

IN also executes faster than OR and you don't need to worry about the order in which you place items.

Order of Operations

In writing a query you can use both AND and OR operators, although know that SQL processes AND before OR.

For this reason, it's best practice to not rely on the default order of operations and instead be specific about the order using () when using OR and AND together.

Not Operator

The NOT operator is used to exclude certain options.

This is useful if you want to select everything from a database except certain items:

SELECT *
FROM Employees
WHERE NOT CITY='London' AND
NOT City='Seattle';

Using Wildcards in SQL

In this section, we'll discuss the concept of wildcards in SQL and how to the LIKE operator with wildcards.

What are Wildcards?

Wildcards are a special character used to match certain parts of a value.

Wildcards search for a pattern from literal text, wildcard character, or a combination of the two.

You'll use LIKE with wildcards, which is actually a predicate instead of an operator, although they're commonly referred to as operators.

It's important to note that LIKE can only be used with strings and cannot be used for non-text datatypes.

As such, they are useful for data scientists as you're working with string variables.

Using % Wildcards

In order to use a wildcard you need to add a % sign before, after, or in the middle of a string. For example:

  • '%Pizza' gets anything ending with the word pizza
  • 'Pizza%' gets anything after the word pizza
  • '%Pizza%' gets anything before and after the word pizza
  • 'S%E' gets anything that starts with "S" and ends with "E"

Sorting Data with ORDER BY

Up until now we've focused on filtering data, although there is no logical order to the data. In other words, it is still returned in the same order as it was entered in the database.

By sorting data with the ORDER BY clause, we can return data based on numerical order or based on ascending or descending order according to the alphabet.

ORDER BY allows you to sort data based on particular columns as follows:

SELECT something
FROM database
ORDER BY characteristic

A few rules for using ORDER BY include:

  • It takes the name of one or more columns
  • Add a comma after each additional column name
  • It can sort by a column not retrieved
  • It must always be the last clause in a SELECT statement

You can also sort by column positions instead of the names by using ORDER BY 2,3, which sorts by the second and third columns.

Finally, you can sort by directions using:

  • DESC for descending order
  • ASC for ascending order

This only applies to the column names it directly precedes.

Math Operators

After we've selected, filtered, and sorted data, we can then apply mathematical operations to it.

In this section, we'll discuss several basic math calculations you can perform using data and the order of operations.

Below are several simple math operators in SQL:

  • + for addition
  • - for subtraction
  • * for multiplication
  • / for division

For example, below is an example that multiplies the total units on order by the unit cost in order to calculate the total order cost:

SELECT ProductID
, UnitsOnOrder
, UnitPrice
, UnitsOnOrder * UnitPrice AS TotalOrderCost
FROM Products

The order of operations for math operators in SQL follows the normal order:

  • Parentheses
  • Exponents
  • Multiplication
  • Division
  • Addition
  • Subtraction

Aggregate Functions

Aggregate functions provide several ways to summarize data with SQL.

A few common use cases of aggregate functions include finding the highest and lowest values in a dataset, the total number of rows, the average value, and so on.

The aggregate functions we can use to analyze data are all quite self-explanatory, including:

  • AVG
  • COUNT
  • MIN
  • MAX
  • SUM

For example, below we're selecting the average unit price from all products:

SELECT AVG(UnitPrice) AS avg_price
FROM products

One thing to note about aggregate functions is that if the word DISTINCT is not used, SQL will assume you want to retrieve ALL the data. If there are duplicate records in the data, DISTINCT allows you to remove them from the query.

Grouping Data with SQL

In the previous section on aggregate functions, we just looked at how to pull a single data field and aggregate over it.

Often, however, we need to be able to perform additional aggregations using the GROUP BY and HAVING clauses to better sort the data.

For example, if we want to know the number of customers by region we can do this as follows:

SELECT Region,
COUNT(CustomerID) AS total_customers
FROM Customers
GROUP BY Region;

It's important to note that GROUP BY clauses can contain multiple columns. Also, every column in the SELECT statement must be present in a GROUP BY clause, except for aggregated calculations.

In terms of grouping data, we cannot use the WHERE clause because it filters based on rows.

Instead, we can use the HAVING clause when filtering for groups with an aggregate function. For example, we can group by customers with more than 2 orders as follows:

SELECT CustomerID
,COUNT (*) AS orders
FROM Orders
GROUP BY CustomerID
HAVING COUNT (*) >=2;

Summary: Filtering, Sorting, and Calculating Data with SQL

In this article, we discussed how to filter, sort, aggregate, calculate, and group data with SQL.

In summary, filtering allows you to narrow down your results, as well as to increase the query and application performance.

Filtering and sorting data also allows you to find specific values, blank values, and ranges of values. All of these functions help understanding your data better and with descriptive statistics.

Resources