Convert CSV to Parquet Files

Apache Parquet

Apache Parquet is a columnar data storage format, which provides a way to store tabular data column wise. Columns of same date-time are stored together as rows in Parquet format, so as to offer better storage, compression and data retrieval.

What is Row Oriented Storage Format?

In row oriented storage, data is stored row wise on to the disk.

Columnar Storage Format

In columnar storage format above table will be stored column wise.

As you can see in this format all the IDs are together and so are names and salaries. A Query selecting Name column will require less I/O time as all the values are adjacent unlike in row oriented format.

Using Apache Parquet

Using Parquet format has two advantages

  1. Reduced storage
  2. Query performance

Depending on your business use case, Apache Parquet is a good option if you have to provide partial search features i.e. not querying all the columns, and you are not worried about file write time.

Apache Parquet format is supported in all Hadoop based frameworks. Queries selecting few columns from a big set of columns, run faster because disk I/O is much improved because of homogeneous data stored together.

To use Apache spark we need to convert existing data into parquet format. In this article we will learn to convert CSV files to parquet format and then retrieve them back.

CSV to Parquet


We will convert csv files to parquet format using Apache Spark.

Below is pyspark code to convert csv to parquet. You can edit the names and types of columns as per your input.csv

Above code will create parquet files in input-parquet directory. Files will be in binary format so you will not able to read them. You can check the size of the directory and compare it with size of CSV compressed file.

For a 8 MB csv, when compressed, it generated a 636kb parquet file.

How-to-convert-CSV-to-Parquet-Files – Humble Bits

The other way: Parquet to CSV

You can retrieve csv files back from parquet files.

You can compare the original and converted CSV files.

You can provide parquet files to your Hadoop based applications rather than providing plain CSV files.


Share on facebook
Share on twitter
Share on pinterest
Share on linkedin
<b><strong>Karan Makan</strong></b>

Karan Makan

Technology Engineer and Entrepreneur. Currently working with International Clients and helping them scale their products through different ventures. With over 8 years of experience and strong background in Internet Product Management, Growth & Business Strategy.

On Key

Related Posts

Hook Up on Tinder

Since dating can be stressful, there is the possibility of humor to try to reduce tensions. In a new study published in the Proceedings of

error: Content is protected !!