SQL is an essential skill for data scientists as it is the language used to communicate with databases and allows you to collect, organize, and analyze large amounts of data.
In this article, we'll introduce key concepts of SQL for data science, including how to select and retrieve data, common SQL syntax, and more.
This article is based on notes from Week 1 of this course on SQL for Data Science and is organized as follows:
- What is SQL?
- Introduction to Data Models
- The Evolution of Data Models
- Relational vs. Transactional Database Models
- Retrieving Data with the
SELECT
Statement - Creating Tables
- Creating Temporary Tables
- Adding Comments to SQL
Stay up to date with AI
What is SQL?
In this section we'll define how SQL is used by data scientists to communicate with databases and how it differs from other programming languages.
We'll then discuss data relationship within databases, general syntax rules, and how to use the SELECT
statement.
Defining SQL
Structured Query Language (SQL) is a standard computer language for relational database management and data manipulation.
SQL is used to query, insert, update, and modify data in a database.
SQL is made up of statements that are descriptive and relatively easy to learn.
Unlike other programming languages, SQL is a non-procedural language, which means it can't be used to create complete applications.
Instead, it is used to communicate with databases and interact with data.
SQL is used for three main things:
- Reading and retrieving data from a databases
- Writing data to a database, for example adding data to a table
- Updating data and inserting new data
SQL is used by many professions aside from just data scientists, including data architects, backend developers, database administrators, and many more.
A database administrator (DBA) differs from a data scientist in that they're responsible for managing and providing permissions to the entire database, whereas a data scientist is typically an end use of the database.
Data Science and SQL
SQL is fundamental in data science as it allows you to retrieve the data necessary to start building models for analysis or predictions.
Data scientists may also use SQL to create their own tables or test environments.
Often, data scientists will need to combine multiple data sources together, which requires them to write slightly more complex queries to build models.
In short, the main use case of SQL for data science is data retrieval.
SQL and Databased Management Systems
Keep in mind that the syntax used will depend on what database management system (DBMS) you're using.
Each DMBS has its own "dialect", meaning that SQL syntax may may need to be tweaked or translated in order to work properly.
A few of the most common relational database management systems include:
- MySQL
- PostgreSQL
- SQLite
- Microsoft SQL Server
- IBM DB2 Oracle
- Apache Open Office Base
In this article, we'll be focused on the syntax for SQLite.
Introduction to Data Models
In this section we'll discuss data models and ER diagrams.
Thinking about Your Data
It's important to have an understanding of the structure of the data you're querying against in order to be effective at applying SQL to real-world problems.
To start, before writing any SQL it's important to define the problem you're trying to solve, including:
- The business process or subject matter the data is modeling after
- Understanding the business rules
- Understanding how the data is organized and structured in the table
Ultimately, answering these questions will speed up your work and typically produce more accurate results on the first try.
Databases and Tables
First, let's define the several terms:
- Database: A database is a container (usually file or set of files) used to store organized data; it's a set of related information.
- Tables: A table is a structured list of data or a specific type of data.
- Column: A column is a single field in a table - all tables are made up of one or more columns.
- Row: A row is a record in a table.
The Evolution of Data Models
In this section we'll briefly look at the history of data modeling, as well as define a relational database system and discuss the advent fo relational databases in SQL.
What is a Data Model?
Data modeling refers to organizing and structuring information into multiple, related tables.
A data model typically represents a business process or shows relationships between business processes.
A data model should always represent the real world problem its modeling as closely as possible.
Types of Data Models
There are different types of data models, two broad categories of which include:
- Models for prediction that are built by data scientists and machine learning engineers
- Data models as data tables represented and organized in a database
One type of data model that is important to be aware of is NoSQL, which came out in 2009 in order to address the emerging Big Data problem.
NoSQL has less semantics in its data models and is based on schema-less key-value data models. It is best suited for large and sparse data stores.
In short, NoSQL stands for Not Only SQL and is a mechanism for storing and retrieving unstructured data models by means other than tabular relations in relational databases.
Relational vs. Transactional Database Models
In this section we'll look at the difference between relational and transactional database models.
We'll also define entities, attributes, and relationships, including the difference between one-to-one, one-to-many, and many-to-many relationships.
Below are the high-level differences between a relational vs. transactional models:
- Relational Models: These allow for easy querying and data manipulation in logical and intuitive ways.
- Transactional Models: These can be thought of as operational databases, for example if you're in healthcare you could have a database to hold all the insurance claims.
In this article, we'll focus on relational models.
There are three main building blocks for relational models:
- Entities: These include people, places, things, or events and are distinguishable, unique, and distinct.
- Attributes: These are a characteristic of an entity.
- Relationships: These describe associations among entities and include one-to-one, one-to-many, and many-to-many.
Examples of one-to-many relationships are customer to invoices.
Examples of many-to-many are students to classes.
Examples of one-to-one relationships are a manager to a store.
ER Diagrams
In order to understand relationships between tables better, we can use an ER diagram.
An ER model is composed of entity types and specifies relationships that can exist between instances of those entity types.
ER diagrams allow you to visualize relationships between tables, and specfically to see the links between tables. Later, we'll also look at how we can use this diagram to join tables together.
Tables can be joined together using these two keys:
- Primary Keys: These are a column (or set of columns) whose values uniquely identify every row in a table.
- Foreign Keys: These are one or more columns that can be used together to identify a single row in another table.
There are three primary ER diagrams notations, including:
- Chen Notation
- Crow's Foot Notation
- UML Class Diagram Notation
Retrieving Data with the SELECT
Statement
As mentioned, one of the main uses of SQL for data scientists is retrieving data.
In this section, we'll discuss how to do so with the SELECT
statement,
We'll also discuss how to tell a database which table the data will come FROM
, how to SELECT
either all or certain columns from a table in a query, and how to limit the amount of data returned in a query.
With a SELECT
statement we need to specify two pieces fo information:
- The data we want to retrieve
- Where to select it from
For example, if we want to select the column prod_name
from a table called Products
, we would do so like this:
SELECT prod_name
FROM Products;
If we want to retrieve multiple columns, we would do so as follows:
SELECT prod_name, prod_id, prod_price
FROM Products;
If we have a table that has more columns than we want to write out, we can simply request all by using an asterisk *
instead of each column name:
SELECT *
FROM Products;
In summary, anytime we're retrieving data we need to SELECT
the columns and specify FROM
which table to retrieve it from.
If we have a large database and only want to see a sample of the data, we can use a LIMIT
statement to specify the number of records:
SELECT columns you want to see
FROM specific table
LIMIT number of records
Creating Tables
With SQL we can also create new tables and store data in them.
In this section, we'll discuss when this might be beneficial, how to create new tables in an existing database, write data to a new table, and define whether columns can accept NULL
values or not.
The ability to create new tables is useful for data scientists as you create new models and predictions.
Tables can then be used to create dashboards, visualize data, and to extract data from other sources.
In order to create a table we can use the CREATE TABLE
statement, for example below we're creating a table for shoes:
CREATE TABLE Shoes
(
ID char(10) PRIMARY KEY,
Brand char(10) NOT NULL,
Type char(250) NOT NULL,
);
Here we can see we specify whether the number of characters in the column, whether it accepts NULL
values, and the primary key.
Every column in a table is either NULL
or NOT NULL
.
An error will then be returned if one tries to submit no value to a column with NOT NULL
. Also, note that null values are not the same as empty strings and that primary keys cannot be null.
We can then insert new data into this table as follows:
INSERT INFO Shoes
VALUE (
'12356',
'Gucci',
'Slippers'
);
In this case, however, there is no guarantee the data is going into the right column. We can solve this by listing the columns that we want to insert into:
INSERT INTO Shoes
(
ID,
Brand,
Type char(250) NOT NULL,
)
VALUES
(
'12356',
'Gucci',
'Slippers'
);
With this method you know exactly which columns your data is going into.
Creating Temporary Tables
In addition to creating new tables in a database using SQL, we can also create a copy or pull a subset from another table. We can either create a whole table or we can create a temporary table with this.
In this section, we'll discuss how to create temporary tables and describe their limitations.
Temporary tables will be deleted when the current session is terminated.
The advantage of temporary tables is that they're faster than creating a real table and are useful for complex queries using subsets and joins.
Below is an example of creating a temporary tables for sandals from the original shoe table:
CREATE TEMPORARY TABLE Sandals AS
(
SELECT *
FROM shoes
WHERE shoe_type = 'sandals'
)
It's important to note that based on the type of relational database management system, you'll need to look up the specific syntax to create, update and insert tables.
Adding Comments to SQL
Since most queries are often many lines long, it's best practice to add comments in order to make it easy to understand for yourself and others, as well as for troubleshooting.
In case you haven't used comments in other programming languages, they help you remember what you're doing, why, and mute the expression of code.
There are two ways to add comments, either for single lines or a section of comments.
Below we are commenting out brand_id
:
SELECT shoe_ide
--, brand_id
,shoe_name
FROM shoes
Below we are commenting out brand_id
and shoe_name
:
SELECT shoe_ide
/*, brand_id
,shoe_name
*/
FROM shoes
Summary: Selecting and Retrieving Data with SQL
In this article, we introduced SQL, or Structured Query Language, which is the standard language used to communicate with relational database management systems.
We then differentiated between transactional and relational databases.
Next, we defined database basics including primary keys, foreign keys, and table relationships.
Finally, we discussed basic SQL syntax including how to write query statements using SELECT
and FROM
and how to write comments within code.