The default file format for any data science workflow
Have you ever used pd.read_csv() in pandas? Well, that command could have run ~50x faster if you had used parquet instead of CSV.
In this post we will discuss Apache Parquet, an extremely efficient and well-supported file format. The post is geared towards data practitioners (ML, DE, DS), so we’ll focus on high-level ideas and use SQL to talk through core concepts, but links for further resources can be found throughout the post and in the comments.
Without further ado, let’s dive in!
Apache Parquet is an open-source file format that provides efficient storage and fast read speed. It uses a hybrid storage format which sequentially stores chunks of columns, which lends itself to high performance when selecting and filtering data. On top of strong compression algorithm support (snappy, gzip, LZO), it also provides some clever tricks for reducing file scans and encoding repeated values.
If you care about speed, you should consider parquet.
Ok, let’s slow down a bit and discuss parquet in plain English.
1 — The Data Storage Problem
Let’s say we are data engineers. We are looking to create a data architecture that facilitates Online Analytical Processing (OLAP) workloads, which are just selective queries geared towards data analysis. Some examples of data functions that are optimized in an OLAP environment are exploratory data analysis and decision science.
But how should we store our data on disk?
Well, there are many considerations when deciding whether a storage format is good or bad, but for OLAP workflows, we’re mainly concerned with two…
- Read speed: how fast we can access and decode relevant information from binary files
- Size on disk: how much space our file needs when in binary format
There are a variety of other measures of success for a file format, such as write speed and metadata support, but we’ll just focus on the above two for now.
Parquet takes 87% less space and queries 34x faster than CSV (1 TB of data, S3 storage) — src
2 — Parquet Core Features
But how is parquet so much more effective than CSV and other popular formats? The first answer is the storage layout…
2.1 — Hybrid Storage Layout
When we convert a 2-dimensional table to a sequence of 0’s and 1’s, we must think carefully about the optimal structure. Should the first column be written, then the second, then the third? Or should we sequentially store rows?
Traditionally there are three main layouts that convert our 2 dimensional table down to 1:
- Row-based: sequentially store rows (CSV).
- Column-based: sequentially store columns (ORC).
- Hybrid-based: sequentially store chunks of columns (Parquet).
We can see a graphical representation of each of these formats in figure 2.
Before moving on, it’s important to note that parquet is often described as a columnar format. However, because it stores chunks of columns, as shown in the bottom of figure 2, a hybrid storage layout is a more precise description.
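The three layouts are easy to see in code. Here’s a pure-Python sketch that flattens the same toy 4x3 table each of the three ways (the cell labels are made up for illustration):

```python
# A toy 2-D table: 4 rows x 3 columns.
table = [
    ["r0c0", "r0c1", "r0c2"],
    ["r1c0", "r1c1", "r1c2"],
    ["r2c0", "r2c1", "r2c2"],
    ["r3c0", "r3c1", "r3c2"],
]

# Row-based (CSV-like): one row after another.
row_layout = [v for row in table for v in row]

# Column-based (ORC-like): one full column after another.
col_layout = [row[c] for c in range(3) for row in table]

# Hybrid (Parquet-like): split rows into groups of 2,
# then store each group column by column.
group_size = 2
hybrid_layout = []
for start in range(0, len(table), group_size):
    group = table[start:start + group_size]
    for c in range(3):
        hybrid_layout.extend(row[c] for row in group)
```

Note that all three lists contain the same 12 values; only the order on "disk" differs.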
Now, hybrid layouts are really effective for OLAP workflows because they support both projection and predicates.
Projection is the process of selecting columns — you can think of it as the SELECT statement in a SQL query. Projection is best supported by a column-based layout. For instance, if we wanted to read the first column of a table using a column-based layout, we could just read the first n indices in our binary file, deserialize them, and present them to the user. Pretty efficient, right?
A predicate is the criteria used to select rows — you can think of it as the WHERE clause in a SQL query. Predicates are best supported by row-based storage. If we wanted all rows matching some criterion, such as Int >= 2, we could just order our table by Int (descending), scan until the criterion is no longer satisfied, and return all rows above that first invalid row.
In both of these scenarios, we are looking to traverse as little of the file as possible. And, since data science unfortunately often requires subsetting both rows and columns, a hybrid-based storage layout gives us a middle ground between columnar and row-based file formats.
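Both access patterns can be sketched in a few lines of pure Python. Here we simulate a column-based layout for a hypothetical two-column table and apply the projection and predicate described above:

```python
# Column-based layout: values stored column after column.
# Columns: "Int" (sorted descending) then "Str"; 4 rows each.
serialized = [4, 3, 2, 1,           # "Int" column
              "d", "c", "b", "a"]   # "Str" column
n_rows = 4

# Projection (SELECT Int): read only the first n_rows values.
int_column = serialized[:n_rows]

# Predicate (WHERE Int >= 2) on the sorted column:
# scan until the criterion first fails, then stop early.
matches = []
for v in int_column:
    if v >= 2:
        matches.append(v)
    else:
        break  # everything after this row also fails
```

In both cases we touched only a prefix of the data rather than the whole file — the point the section is making.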
Great! So if we’re looking to pull data, we can just store sequential chunks of columns and usually get very good performance. But does this method scale?
2.2 — Parquet Metadata
The answer is a resounding “yes.” Parquet leverages metadata to skip parts of our data that can be excluded according to our predicate.
Taking the example table in figure 3, we can see that our row group size is 2, meaning we store 2 rows of a given column, 2 rows of the next column, 2 rows of the third column, and so on.
After we run out of columns, we move to the next set of rows. Note that we only have 3 rows in the above table, so the final row group would only have 1 row.
Now let’s say we are actually storing 100,000 values in each row group instead of 2. If we look to find all rows where a column has a given value (i.e. an equality predicate), the worst-case scenario would involve scanning every row in the table.
Parquet intelligently solves this by storing min and max values for each column in each row group, allowing us to skip entire row groups, as shown in figure 4. But that’s not all! Since parquet often writes many .parquet files to a single directory, we can look at the column metadata for an entire file and determine whether it should be scanned at all.
By including some extra data, we are able to dramatically increase query speeds. For more, see here.
2.3 — Parquet File Structure
Ok, so we’ve hinted at how data are converted from a 2-d format to a 1-d format, but how is the entire file system structured?
Well, as mentioned above, parquet can write many .parquet files for a single write. With small datasets this is a problem, and you should probably repartition your data before the write. However, for larger datasets, subsetting data into multiple files can dramatically improve performance.
Overall, parquet follows the below structure. Let’s take a look at each one in turn…
Root > Parquet Files > Row Groups > Columns > Data Page
First, our file root is just a directory that holds everything. Inside the root we have many individual .parquet files, each containing a partition of our data. A single parquet file is composed of many row groups, and a single row group contains many columns. Finally, within our columns are data pages, which actually hold the raw data and some relevant metadata. We can see a simplified representation of a single .parquet file in figure 4 below.
3 — Additional Optimizations
Since its inception in 2013, parquet has gotten a lot smarter. The underlying structure has remained largely unchanged from the format above, but lots of cool features that improve performance for certain types of data have been added.
Let’s look at some examples…
3.1 — My data has lots of duplicate values!
Solution: Run-Length Encoding (RLE)
Let’s say we have a column with 10,000,000 values, but all of the values are 0. To store this information, we just need two numbers: the value and the number of times it repeats.
RLE does just that. When many sequential duplicates are found, parquet can encode that information as a tuple corresponding to the value and the count. In our example, that would save us from storing 9,999,998 numbers.
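Here’s a toy pure-Python sketch of the idea (parquet’s actual RLE is a bit-level hybrid encoding, so this is the concept, not the real wire format):

```python
def rle_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            # Extend the current run.
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            # Start a new run.
            encoded.append((v, 1))
    return encoded

def rle_decode(encoded):
    """Expand (value, count) pairs back into the original sequence."""
    return [v for v, count in encoded for _ in range(count)]
```

For the all-zeros column above, `rle_encode` turns 10,000,000 stored numbers into a single `(0, 10_000_000)` pair.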
3.2 — My data has very big types!
Solution: Dictionary Encoding
Let’s say we have a column that contains country names, some of which are very long. If we wanted to store “The Democratic Republic of Congo,” we would need a string column that can handle at least 32 characters.
Dictionary encoding replaces each value in our column with a small integer and stores the mapping in our data page’s metadata. When on disk, our encoded values are bit-packed to take up the least amount of space possible, but when we read the data we can still convert our column back to its original values.
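A toy sketch of the concept in pure Python (parquet builds the dictionary per data page and bit-packs the codes, so this only illustrates the mapping):

```python
def dict_encode(values):
    """Map each distinct value to a small integer code."""
    mapping = {}
    codes = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)  # next unused code
        codes.append(mapping[v])
    return codes, mapping

def dict_decode(codes, mapping):
    """Recover the original values from codes + mapping."""
    reverse = {code: value for value, code in mapping.items()}
    return [reverse[c] for c in codes]

countries = ["France", "The Democratic Republic of Congo", "France"]
codes, mapping = dict_encode(countries)
# On disk we'd store the small integer codes (bit-packed),
# plus the mapping once in the data page's metadata.
```
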
3.3 — I’m using complex filters on my data!
Solution: Projection and Predicate Pushdown
In a Spark environment, we can avoid reading an entire table into memory through projection and predicate pushdown. Because Spark operations are lazily evaluated, meaning they aren’t executed until we actually query the data, Spark can pull the minimum amount of data into memory.
As a quick reminder, predicates subset rows and projection subsets columns. So, if we know we only need a few columns and a subset of the rows, we don’t actually have to read the full table into memory — it’s filtered during the read.
3.4 — I have lots of different data!
Solution: Delta Lake
Finally, Delta Lake is an open-source “lakehouse” framework that combines the flexible nature of a data lake with the structure of a data warehouse. If you plan on using parquet as the base of your organization’s data warehouse, additional features such as ACID guarantees and transaction logs are really beneficial.
If you want more, here’s a great beginner-friendly resource for Delta Lake.
To recap, parquet is a really effective file format for real-world use. It’s super effective at minimizing table scans and also compresses data to small sizes. If you are a data scientist, parquet is a good default file type.
Demystifying the Parquet File Format — republished from https://towardsdatascience.com/demystifying-the-parquet-file-format-13adb0206705