Chapter 1. Understanding the Architecture of Dask DataFrames
A Note for Early Release Readers
With Early Release ebooks, you get books in their earliest formthe authors raw and unedited content as they writeso you can take advantage of these technologies long before the official release of these titles.
This will be the 3rd chapter of the final book.
If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the author at ccollins@oreilly.com.
Dask DataFrames allow you to scale your pandas workflows. Dask DataFrames overcome two key limitations of pandas:
pandas cannot run on datasets larger than memory
pandas only uses one core when running analyses, which can be slow
Dask DataFrames are designed to overcome these pandas limitations. They can be run on datasets that are larger than memory and use all cores by default for fast execution. Here are the key Dask DataFrame architecture components that allow for Dask to overcome the limitation of pandas:
Lets take a look at the pandas architecture first, so we can better understand how its related to Dask DataFrames.
Youll need to build some new mental models about distributed processing to fully leverage the power of Dask DataFrames. Luckily for pandas programmers, Dask was intentionally designed to have similar syntax. pandas programmers just need to learn the key differences when working with distributed computing systems to make the Dask transition easily.
pandas Architecture
pandas DataFrames are in widespread use today partly because they are easy to use, powerful, and efficient. We dont dig into them deeply in this book, but will quickly review some of their key characteristics. First, they contain rows and values with an index.
Lets create a pandas DataFrame with name
and balance
columns to illustrate:
import pandas as pddf = pd.DataFrame({"name": ["li", "sue", "john", "carlos"], "balance": [10, 20, 30, 40]})
This DataFrame has 4 rows of data, as illustrated in .
Figure 1-1. pandas DataFrame with four rows of data
The DataFrame in also has an index.
Figure 1-2. pandas DataFrame has an index
pandas makes it easy to run analytical queries on the data. It can also be leveraged to build complex models and is a great option for small datasets, but does not work well for larger datasets. Lets look at why pandas doesnt work well for bigger datasets.
pandas Limitations
pandas has two key limitations:
Its DataFrames are limited by the amount of computer memory
Its computations only use a single core, which is slow for large datasets
pandas DataFrames are loaded into the memory of a single computer. The amount of data that can be stored in the RAM of a single computer is limited to the size of the computers RAM. A computer with 8 GB of memory can only hold 8 GB of data in memory. In practice, pandas requires the memory to be much larger than the dataset size. A 2 GB dataset may require 8 GB of memory for example (the exact memory requirement depends on the operations performed and pandas version). illustrates the types of datasets pandas can handle on a computer with 16 GB of RAM.
Figure 1-3. Dataset sizes pandas can handle on a computer with 16 GB of RAM
Furthermore, pandas does not support parallelism. This means that even if you have multiple cores in your CPU, with pandas you are always limited to using only one of the CPU cores at a time. And that means you are regularly leaving much of your hardware potential untapped (see FIgure 3-4).
Figure 1-4. pandas only uses a single core and dont leverage all available computation power
Lets turn our attention to Dask and see how its architected to overcome the scaling and performance limitations of pandas.
How Dask DataFrames Differ from pandas
Dask DataFrames have the same logical structure as pandas DataFrames, and share a lot of the same internals, but have a couple of important architectural differences. As you can see in , pandas stores all data in a single DataFrame, whereas Dask splits up the dataset into a bunch of little pandas DataFrames.