file_path = 'data.csv'
df = pd.read_csv(file_path)
# or
df = pd.read_csv('data.csv')
CSV for Comma-Separated Values is a widely-used way of storing data. Its main advantage is that it is “human readable”, meaning that a human can understand the data. However, the fact that it is stored per line and in “plain text” can lead to performance problems. On the other hand, Parquet is a storage mode optimized for data compression, as the data is stored in binary form per column. However, it cannot be understood without a dedicated program.
file_path = 'data.csv'
df = pd.read_csv(file_path)
# or
df = pd.read_csv('data.csv')
file_path = 'data.parquet'
df = pd.read_parquet(file_path)
# or
df = pd.read_parquet('data.parquet')
The following results were obtained through local experiments.
Processor: Intel® Core™ Ultra 5 135U, 2100 MHz, 12 cores, 14 logical processors
RAM: 16 GB
CO2 Emissions Measurement: Using CodeCarbon
To compare the two storage methods, we’re going to use 3 criteria: the carbon emissions involved in saving the same dataframe, the size of this file and the carbon emissions involved in reading this file. We’ll vary the number of lines in the dataframe to assess the extent of the impact as a function of file size.
Local experiments* on CSV vs Parquet format gives the following results on:
1. Carbon emissions during writing:
2. File size:
3. Carbon emission during reading:
The rule seems relevant. The Parquet format is faster to write to, lighter in weight and faster to read data from. It is suitable for use cases where there would be a lot of data I/O, especially with Cloud storage.