In this article, we'll look at the key differences and use cases of data lakes vs. data warehouses. We'll also look at how a data lake and data warehouse can be built with Google Cloud Platform.
This article is based on notes from this course on Modernizing Data Lakes and Data Warehouses with GCP and is organized as follows:
- What is a Data Lake?
- Data Lakes vs. Data Warehouses
- Data Lakes: Storage and ETL Options with GCP
- How to Build a Data Lake with Cloud Storage
- Securing Cloud Storage
- Storing Different Data Types
- Using Cloud SQL as a Relational Data Lake
- What is a Data Warehouse?
- Characteristics of the Modern Data Warehouse
- How to Build a Data Warehouse with BigQuery
- Loading Data into a Data Warehouse
What is a Data Lake?
A data lake is a scalable and secure platform that allows enterprises to ingest, store, and analyze any type or volume of data.
Data lakes are used to power data analytics, data science, machine learning workflows, and batch and streaming pipelines.
Data lakes accept all types of data and can be portable, hosted on-premises, or stored in the cloud.
At a high-level, the components of a data engineering ecosystem include:
- Data sources
- Data sinks, which include a central data lake repository and a data warehouse
- Data pipelines (batch and streaming)
- High-level orchestration workflows
In short, a data lake is a consolidated location for raw data that is durable and highly available.
Several options for building a data lake with GCP include Google Cloud Storage and BigQuery.
After data has been stored in a data lake, it will likely end up in other places such as a transformation pipeline that cleans and prepares it for storage in a data warehouse.
To enable an end-to-end data pipeline, a Data Engineer will need to work with a suite of big data products for the following tasks:
- Storage: Examples include Cloud Storage and Cloud SQL.
- Ingestion: Examples include Compute Engine and Kubernetes Engine.
- Analytics: Examples include BigQuery and Cloud Dataproc.
- Machine Learning: Examples include TensorFlow and Cloud TPU.
- Serving: Examples include Dialogflow and App Engine.
Data Lakes vs. Data Warehouses
Before we build a data lake, let's first review the main differences from a data warehouse.
A data lake captures information from every aspect of your business operation. The data is stored in its raw format, usually as object blobs or files.
Data lakes support all types of data and tend to be application-specific, meaning the data format is dependent on the application that writes it.
In contrast, a data warehouse has the following characteristics:
- Data is loaded into the warehouse only when its use case is defined
- Data is processed, organized, and transformed from its raw format
- Since the data is properly formatted, it provides faster insights
- Data warehouses tend to have a consistent schema shared across applications
Data Lakes: Storage and ETL Options with GCP
There are several options for building a data lake with GCP, including:
- Cloud Storage
- Cloud SQL
- Cloud Spanner
- Cloud Firestore
- Cloud Bigtable
Choosing between these options largely depends on where your data is now, how big the data is, where it has to go, and how much transformation is required.
The method you use to load data into the cloud also depends on how much transformation is needed. We'll discuss these in more detail below; the three main ways to load data are:
- Extract and Load (EL) processes
- Extract, Load, and Transform (ELT)
- Extract, Transform, and Load (ETL)
How to Build a Data Lake with Cloud Storage
Creating a data lake with Cloud Storage is a popular choice as data persists beyond the lifetime of virtual machines or clusters.
Cloud Storage is an object store, which means it stores and retrieves binary objects without regard to what data is contained in the objects.
Data in Cloud Storage is durable, meaning it persists until you delete it, and strongly consistent, meaning it's available to read as soon as it's written. You can share data globally, keep it encrypted, or make it completely private if you want. Data is served with moderate latency and high throughput.
The two main entities in Cloud Storage are:
- Buckets: these are containers that hold objects
- Objects: pieces of data held within buckets, stored along with metadata about the object.
When you create a bucket, you need to make several decisions including:
- Location of the Bucket: This is set initially and can never be changed.
- High Availability: If you need high availability, for example during a natural disaster, you can choose multi-region or dual-region storage, which stores replicas in physically separate data centers.
- Frequency of Data Access: You need to choose how often you need to access or change the data. There are discounted storage classes for data that you access less frequently (see the sketch after this list).
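As a minimal sketch of these choices in code, the following uses the google-cloud-storage Python client to create a bucket with an explicit location and storage class and then upload an object. The project ID, bucket name, and file paths are placeholders.

```python
from google.cloud import storage

client = storage.Client(project="my-gcp-project")  # placeholder project ID

bucket = client.bucket("my-data-lake-raw")  # placeholder bucket name
bucket.storage_class = "NEARLINE"  # discounted class for infrequently accessed data

# The location is fixed at creation time and cannot be changed later.
bucket = client.create_bucket(bucket, location="US")  # multi-region for high availability

# Objects are stored inside the bucket along with their metadata.
blob = bucket.blob("sales/2021/orders.csv")
blob.upload_from_filename("orders.csv")
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```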
Securing Cloud Storage
Securing a data lake is an important part of a data engineer's job, so let's review a few key security features that control access to objects.
There are two separate methods for controlling access to objects in Cloud Storage:
- Cloud IAM policies: these are set at the bucket level and apply uniformly to all objects in the bucket (see the sketch after this list).
- Access control lists (ACLs): these can be applied at the bucket level or to individual objects.
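As a hedged sketch of the IAM approach, the snippet below grants a user read access to every object in a bucket by editing the bucket-level IAM policy with the Python client. The bucket name and the principal being granted access are placeholders.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-data-lake-raw")  # placeholder bucket name

# IAM policies are set at the bucket level and apply to every object in it.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {"user:analyst@example.com"},  # placeholder principal
    }
)
bucket.set_iam_policy(policy)
```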
All data is encrypted at rest and in transit using Google-managed encryption keys, or GMEK.
Data locking is different than encrypting. Encryption prevents somebody from understanding the data, whereas locking prevents them from modifying the data.
Storing Different Data Types
As mentioned, Cloud Storage is just one of the options to store data in a data lake on GCP.
Cloud Storage should not be used for transactional workloads, as the latency is not low enough to support high-frequency writes. Instead, transactional workloads should use Cloud SQL if you're using SQL, or Firestore if you're using NoSQL.
You also don't want to use Cloud Storage for analytical workloads of unstructured data as you'll spend a significant amount of compute parsing data. Instead, it's better to use BigQuery or Cloud Bigtable.
Here's a summary of transactional vs. analytical workload requirements:
- Transactional workloads require fast inserts and updates of records. The queries are relatively simple and usually only affect a few records.
- Analytical workloads tend to read the entire dataset and are used for planning or decision support.
In short, transactional systems are write-focused, generally around 80% writes and 20% reads. Analytical workloads are read-focused, typically around 20% writes and 80% reads.
Storing Relational Data in the Cloud
If you're working with transactional data, Cloud SQL is the default GCP option if you're using MySQL, PostgreSQL, or SQL Server.
For analytical workloads, the default option on GCP is BigQuery, although if you require ultra-low latency, Cloud Bigtable is often a better choice.
If you require a globally distributed database, another option is Cloud Spanner.
Using Cloud SQL as a Relational Data Lake
Let's now review using Cloud SQL for OLTP, or Online Transaction Processing workloads.
Cloud SQL is a fully managed database service that simplifies the setup and administration of relational MySQL and PostgreSQL databases in the cloud. This means there is a Compute Engine instance that already has a database like MySQL installed.
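To make this concrete, here's a minimal sketch (not from the course) of connecting to a Cloud SQL MySQL instance from Python using the Cloud SQL Python Connector together with SQLAlchemy; the instance connection name, credentials, and database are placeholders.

```python
from google.cloud.sql.connector import Connector
import sqlalchemy

# Requires the cloud-sql-python-connector and pymysql packages.
connector = Connector()

def getconn():
    # "project:region:instance" is a placeholder instance connection name.
    return connector.connect(
        "my-gcp-project:us-central1:my-instance",
        "pymysql",
        user="app_user",       # placeholder credentials
        password="change-me",
        db="orders",           # placeholder database
    )

# Pool connections through SQLAlchemy; queries run against the managed instance.
engine = sqlalchemy.create_engine("mysql+pymysql://", creator=getconn)
with engine.connect() as conn:
    result = conn.execute(sqlalchemy.text("SELECT NOW()")).fetchall()
    print(result)
```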
A few of the advantages of a fully managed service include:
- Google-level security
- Managed backups
- Vertical scaling (read and write)
- Horizontal scaling (read)
- Automatic replication
The main differences between fully managed and serverless products are outlined below:
- Fully managed means the service runs on hardware you control, i.e. you can SSH into a Cloud SQL instance
- Serverless products can be thought of like an API that you're calling. This means you pay for using the product, but don't have to worry about managing any of the servers.
BigQuery and Cloud Pub/Sub are two examples of serverless products, whereas Cloud Dataproc is fully managed.
What is a Data Warehouse?
In this section, we'll look at what makes a modern data warehouse and how data warehouses differ from data lakes.
An enterprise data warehouse consolidates data from various sources.
A data warehouse takes the raw data from a data lake and standardizes its format in order to make it available for querying and data processing.
A data lake, on the other hand, is simply raw data from various sources.
A data warehouse also has a schema, which TutorialsPoint describes as:
"A logical description of the entire database. It includes the name and description of records of all record types including all associated data-items and aggregates."
In order to use a data warehouse, a data engineer or analyst needs to know the schema of the data.
A data warehouse is built to make the data clean, accurate, and consistent. The purpose of a data warehouse, however, is not to simply store the data—that is the purpose of a data lake.
A data warehouse should be optimized for simplicity of access and high-speed query performance.
Characteristics of the Modern Data Warehouse
A modern data warehouse includes the following key characteristics:
Scalability
As the rate of data generation continues to grow, a modern data warehouse needs to be able to handle datasets that don't fit into memory. This is often gigabytes to terabytes of data, and can occasionally mean petabytes. You typically don't want separate data warehouses for different datasets; instead, you want a single warehouse that can scale to meet the size requirements of your data.
Serverless & NoOps
A modern data warehouse is serverless and NoOps, which means it's completely automated and there's no need for an operations team to manage it. This allows analysts to perform ad hoc queries much faster, ultimately enabling businesses to make decisions faster.
Data Visualization & Reporting
A modern data warehouse allows you to not only perform queries, but should also support data visualization and reporting tools. Ideally, the data warehouse can plug directly into the data visualization or reporting tool you're using.
ETL & Data Processing
Since you need to build data pipelines to bring data into the warehouse, this requires an ecosystem of ETL and data processing tools.
Data Streaming
The data pipelines should be able to constantly refresh data in the warehouse and keep it up-to-date. This requires the ability to stream data into the warehouse, and not rely solely on batch processing.
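As a rough sketch of what this looks like with BigQuery's streaming insert API in Python (the table ID and row fields are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-gcp-project.sales.orders"  # placeholder project.dataset.table

# Streamed rows become queryable shortly after insertion, without a batch load job.
rows = [
    {"order_id": "A-1001", "amount": 42.50},
    {"order_id": "A-1002", "amount": 13.75},
]
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Insert errors:", errors)
```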
Machine Learning
Machine learning and predictive analytics are becoming increasingly important for all businesses and data analysts. A modern data warehouse has to be able to support machine learning without having to move the data out of the warehouse.
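For example, BigQuery ML lets you train a model with SQL, right where the data lives. Here's a hedged sketch that assumes a hypothetical taxi_trips table with a fare_amount column to predict:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a linear regression model in place; the dataset, table, and columns are hypothetical.
query = """
CREATE OR REPLACE MODEL `my_dataset.fare_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['fare_amount']) AS
SELECT trip_miles, trip_seconds, fare_amount
FROM `my_dataset.taxi_trips`
WHERE fare_amount IS NOT NULL
"""
client.query(query).result()  # waits for training to finish
```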
Security & Collaboration
Finally, a modern data warehouse must be able to impose enterprise-grade security such as data exfiltration constraints. It should also allow analysts to share queries with collaborators and data partners.
How to Build a Data Warehouse with BigQuery
In this section, let's look at how to build a data warehouse with BigQuery, which has the capabilities of a modern data warehouse discussed above, including:
- It can scale from gigabytes to petabytes seamlessly
- It is serverless and NoOps, and supports fast ad hoc queries
- It can be used in conjunction with other GCP or external tools for data visualization and reporting
- It has built-in features for analyzing geospatial data, applying machine learning techniques, and streaming data
- It has all the security benefits of other GCP products and allows you to share queries
You can find an example of querying 10 billion rows of Wikipedia data using BigQuery on GitHub here.
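As a smaller sketch of what an ad hoc query looks like through the BigQuery Python client (using the public Shakespeare sample table rather than the Wikipedia dataset mentioned above):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Aggregate a public sample table; there are no servers to provision or manage.
query = """
SELECT corpus, SUM(word_count) AS total_words
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus
ORDER BY total_words DESC
LIMIT 5
"""
for row in client.query(query).result():
    print(row.corpus, row.total_words)
```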
How BigQuery Organizes Data
BigQuery organizes data tables into units called datasets.
To reference a table from the command line, in SQL queries, or in code, you refer to it using the construct project.dataset.table.
Organizing information into multiple scopes (projects, datasets, and tables) helps you structure it more logically.
Every table has a schema that provides structure to the data; the schema can be entered manually through the GCP console or by providing a JSON file.
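Here's a minimal sketch of creating a dataset and a table with an explicit schema using the BigQuery Python client; the project, dataset, table, and field names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project

# Datasets group related tables together.
dataset = client.create_dataset("sales", exists_ok=True)

# The schema gives structure to the table at project.dataset.table.
schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("amount", "NUMERIC"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
]
table = bigquery.Table("my-gcp-project.sales.orders", schema=schema)
client.create_table(table, exists_ok=True)
```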
Permissions are set at the dataset level, which means that granting read or write access to a dataset also grants access to all of its tables.
Loading Data into a Data Warehouse
Recall that the method used to load data into a data warehouse depends on the amount of transformation required. There are three main ways to load data into a data warehouse:
- EL - Extract and Load: This is used when data is loaded into the warehouse as is, without any transformation. This means the data source and target have the same schema.
- ELT - Extract, Load, and Transform: This is used when raw data is loaded directly into the target and then the transformation is performed.
- ETL - Extract, Transform, and Load: This is used when the transformation occurs before the data is loaded into the target.
The most common type of data import uses CSV files, although you can also import JSON data into BigQuery as long as it's newline delimited, as well as files in Avro, Parquet, and ORC formats. You can also import data into BigQuery directly through the API.
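As a sketch of a simple extract-and-load (EL) job, the snippet below loads a CSV file from Cloud Storage into a BigQuery table with the Python client; the bucket, file, and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Load a CSV as-is from Cloud Storage into a BigQuery table (EL: no transformation).
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # infer the schema from the file
)
load_job = client.load_table_from_uri(
    "gs://my-data-lake-raw/sales/2021/orders.csv",  # placeholder URI
    "my-gcp-project.sales.orders",                  # placeholder table ID
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
print(f"Loaded {client.get_table('my-gcp-project.sales.orders').num_rows} rows")
```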
Transferring the data is just the first part of building a data warehouse. If you're building your own system, you would then need to stage the data for cleaning and transform it using ELT or ETL processes in order to get it into its final stable form.
Summary: Data Lakes vs. Data Warehouses
In summary, a data lake captures information from various sources and stores it in its raw format. A data warehouse, on the other hand, stores data that has been processed, organized, and transformed into a usable and accessible format.
Below you'll find additional resources to learn about data engineering, data lakes, and data warehouses.