Apache Parquet is a columnar data storage format that stores tabular data column by column. Values of the same column, and therefore of the same data type, are stored together in Parquet, which gives better storage, compression, and data retrieval.
What is Row Oriented Storage Format?
In row-oriented storage, data is written to disk row by row.
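As a rough illustration, the sketch below takes a small hypothetical table (the ID, Name, and Salary values are made up for this example) and flattens it the way a row-oriented format would lay it out, one complete record after another. A flat Python list stands in for the bytes on disk.

```python
# Hypothetical sample table (illustrative values only): ID, Name, Salary
rows = [
    (1, "Ann", 50000),
    (2, "Bob", 62000),
    (3, "Cid", 58000),
]

# Row-oriented layout: all fields of a record are written next to each other,
# followed by all fields of the next record, and so on.
row_layout = [value for row in rows for value in row]

print(row_layout)
# [1, 'Ann', 50000, 2, 'Bob', 62000, 3, 'Cid', 58000]
```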
Columnar Storage Format
In a columnar storage format, the same table is stored column by column.
In this format, all the IDs are stored together, as are the names and the salaries. A query that selects only the Name column requires less I/O because all of its values are adjacent, unlike in a row-oriented format.
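The difference can be sketched in plain Python. The table values here are hypothetical, and flat lists stand in for bytes on disk; real Parquet files add row groups, encodings, and compression on top of this basic idea.

```python
# Hypothetical sample table (illustrative values only): ID, Name, Salary
rows = [
    (1, "Ann", 50000),
    (2, "Bob", 62000),
    (3, "Cid", 58000),
]
n = len(rows)

# Columnar layout: transpose the table so each column's values are contiguous.
ids, names, salaries = (list(column) for column in zip(*rows))
column_layout = ids + names + salaries

# Reading the Name column is a single contiguous slice of the columnar layout.
names_from_columns = column_layout[n:2 * n]

# In the row layout the same read has to step over every record's other fields.
row_layout = [value for row in rows for value in row]
names_from_rows = row_layout[1::3]

print(column_layout)
# [1, 2, 3, 'Ann', 'Bob', 'Cid', 50000, 62000, 58000]
print(names_from_columns)
# ['Ann', 'Bob', 'Cid']
```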
Using Apache Parquet
Using the Parquet format has two advantages:
- Reduced storage
- Query performance
Depending on your business use case, Apache Parquet is a good option if your queries usually read only a subset of the columns rather than all of them, and you are not worried about file write time.
The Apache Parquet format is widely supported by Hadoop-based frameworks. Queries that select a few columns from a large set of columns run faster because disk I/O is reduced: the values of each column are homogeneous and stored together, so only the requested columns need to be read.
To use Parquet, we first need to convert existing data into Parquet format. In this article we will convert CSV files to Parquet format and then convert them back.
CSV to Parquet
We will convert CSV files to Parquet format using Apache Spark.
Below is PySpark code to convert CSV to Parquet. You can edit the names and types of the columns to match your input.csv.
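A minimal sketch of such a conversion follows. The column names and types in `CSV_SCHEMA`, the helper name `csv_to_parquet`, the header row, and the overwrite mode are all assumptions made for illustration; adjust them to your own file.

```python
# Hypothetical column names and types (DDL-style schema string);
# edit these to match your input.csv.
CSV_SCHEMA = "id INT, name STRING, salary INT"


def csv_to_parquet(csv_path, parquet_dir, schema=CSV_SCHEMA, has_header=True):
    """Read a CSV file with an explicit schema and write it out as Parquet."""
    # Imported here so this module can be loaded even where Spark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()
    try:
        # Assumption: the first line of the CSV is a header; pass has_header=False if not.
        df = spark.read.csv(csv_path, schema=schema, header=has_header)
        # Assumption: replacing any earlier output in parquet_dir is acceptable.
        df.write.mode("overwrite").parquet(parquet_dir)
    finally:
        spark.stop()
```

With PySpark installed, calling `csv_to_parquet("input.csv", "input-parquet")` would write the Parquet output into the input-parquet directory.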
The above code creates Parquet files in the input-parquet directory. The files are in a binary format, so you will not be able to read them directly. You can check the size of the directory and compare it with the size of the compressed CSV file.
For an 8 MB CSV, when compressed, it generated a 636 KB Parquet file.
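One way to make this comparison is to add up the sizes of all files under the output directory. The helper name `dir_size_bytes` is hypothetical, and only the Python standard library is used.

```python
import os


def dir_size_bytes(path):
    """Return the total size in bytes of all files under the given directory."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

For example, `dir_size_bytes("input-parquet")` can be compared with `os.path.getsize("input.csv")`, or with the size of a compressed copy of the CSV. The directory total also includes the small checksum and marker files that Spark may write next to the data.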
The other way: Parquet to CSV
You can retrieve CSV files back from Parquet files.
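A minimal PySpark sketch of the reverse conversion is given here. The helper name `parquet_to_csv`, the choice to write a header line, and the overwrite mode are assumptions for illustration. No schema is passed because Parquet files carry their own.

```python
def parquet_to_csv(parquet_dir, csv_dir):
    """Read a Parquet directory and write its contents back out as CSV."""
    # Imported here so this module can be loaded even where Spark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-to-csv").getOrCreate()
    try:
        # The schema is read from the Parquet files themselves.
        df = spark.read.parquet(parquet_dir)
        # Assumptions: a header line is wanted, and earlier output may be replaced.
        df.write.mode("overwrite").csv(csv_dir, header=True)
    finally:
        spark.stop()
```

With PySpark installed, `parquet_to_csv("input-parquet", "output-csv")` would write the result into a directory (the name output-csv is hypothetical). Spark writes CSV output as one or more part files inside that directory rather than as a single file.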
You can compare the original and converted CSV files.
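Because Spark may split its output across several part files and does not guarantee row order, a line-by-line diff can report differences that are not real. The sketch below compares the two sides as unordered collections of rows instead. The helper names are hypothetical, and it assumes both the original file and every part file start with a header line.

```python
import csv
import glob
import os
from collections import Counter


def count_rows(paths, has_header=True):
    """Count how often each data row occurs across the given CSV files."""
    counts = Counter()
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            if has_header:
                next(reader, None)  # skip the header line; tolerate empty files
            for row in reader:
                counts[tuple(row)] += 1
    return counts


def same_rows(original_csv, converted_dir, has_header=True):
    """Return True if both sides contain the same rows, ignoring row order."""
    # Spark names its CSV output files part-*.csv inside the output directory.
    part_files = glob.glob(os.path.join(converted_dir, "part-*.csv"))
    return count_rows([original_csv], has_header) == count_rows(part_files, has_header)
```

Values are compared as text, so a type-driven formatting change (for example, `50000` read as a double and written back as `50000.0`) will show up as a mismatch.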
You can provide Parquet files to your Hadoop-based applications instead of plain CSV files.