
Convert CSV to Parquet Files

Apache Parquet

Apache Parquet is a columnar data storage format: tabular data is stored column-wise, with all values of a column, which share the same data type, stored together. This layout offers better storage efficiency, compression, and data retrieval.

What is Row Oriented Storage Format?

In row-oriented storage, data is written to disk row by row: all the values of one record are stored together before the next record begins. Consider a table with columns ID, Name and Salary; on disk it is laid out as the first row's ID, Name and Salary, then the second row's, and so on.

Columnar Storage Format

In the columnar storage format, the same table is stored column-wise: all the ID values first, then all the Name values, then all the Salary values.

As you can see, in this format all the IDs are stored together, and so are the names and salaries. A query selecting only the Name column requires less I/O time because all its values are adjacent on disk, unlike in the row-oriented format.
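To make the difference concrete, here is a small Python sketch, using hypothetical ID, Name and Salary values, that shows how the same three records would be laid out in each format:

```python
# Hypothetical table with columns ID, Name, Salary
rows = [(1, "Alice", 50000), (2, "Bob", 60000), (3, "Carol", 70000)]

# Row-oriented layout: each record's values are stored together
row_layout = [value for row in rows for value in row]
# [1, 'Alice', 50000, 2, 'Bob', 60000, 3, 'Carol', 70000]

# Column-oriented layout: each column's values are stored together
column_layout = [value for column in zip(*rows) for value in column]
# [1, 2, 3, 'Alice', 'Bob', 'Carol', 50000, 60000, 70000]
```

In the column layout, all the names sit in one contiguous run, which is exactly why reading a single column is cheap.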

Using Apache Parquet

Using the Parquet format has two advantages:

  1. Reduced storage
  2. Query performance

Depending on your business use case, Apache Parquet is a good option if your queries read only a subset of the columns rather than all of them, and file write time is not a concern.

The Apache Parquet format is supported in all Hadoop-based frameworks. Queries that select a few columns out of a large set run faster because disk I/O is greatly reduced: only the homogeneous data stored together for those columns needs to be read.
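For example, once data is stored as Parquet, a query selecting a single column can skip the other columns' data entirely. A hypothetical PySpark sketch, assuming a Parquet dataset already exists at input-parquet with a name column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("column-pruning").getOrCreate()

# Spark prunes unread columns: only the 'name' column chunks
# are fetched from the Parquet files on disk.
df = spark.read.parquet("input-parquet")
df.select("name").show()
```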

To use Parquet with Apache Spark, we need to convert existing data into Parquet format. In this article we will learn to convert CSV files to Parquet format and then retrieve them back.

CSV to Parquet

We will convert CSV files to Parquet format using Apache Spark.

Below is PySpark code to convert CSV to Parquet. You can edit the column names and types to match your input.csv.
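This is a minimal sketch; the id, name and salary columns in the schema are placeholders for whatever your CSV actually contains:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField,
                               IntegerType, StringType, DoubleType)

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Hypothetical schema: edit the column names and types
# to match the layout of your input.csv
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("salary", DoubleType(), True),
])

# Read the CSV with an explicit schema, then write it out as Parquet
df = spark.read.csv("input.csv", header=True, schema=schema)
df.write.parquet("input-parquet")
```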

The above code creates Parquet files in the input-parquet directory. The files are in a binary format, so you will not be able to read them directly. You can check the size of the directory and compare it with the size of the compressed CSV file.

For an 8 MB CSV file, the conversion generated a 636 KB Parquet file.


The other way: Parquet to CSV

You can retrieve CSV files back from Parquet files.
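A sketch of the reverse conversion, reusing the paths from above (the output-csv directory name is arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-csv").getOrCreate()

# Read the Parquet files back and write them out as CSV
df = spark.read.parquet("input-parquet")
df.write.csv("output-csv", header=True)
```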

You can compare the original and converted CSV files.

You can provide Parquet files to your Hadoop-based applications rather than plain CSV files.
