Pandas and Parquet
Pandas¶
import pandas as pd
import numpy as np
import time
# Helper function to measure size
def get_size_mb(series):
    return series.memory_usage(deep=True) / 1024 ** 2
SETUP: Generate a large synthetic dataset (10 Million rows)¶
N_ROWS = 10_000_000
print(f"Generating {N_ROWS} rows of data... this might take a moment.\n")
# Generate random data
# 'string_col': Repeating strings with low cardinality (e.g., "Item_A", "Item_B")
# 'int_col': Numbers between 0-100 (fits in int8, but defaults to int64)
data = {
    'string_col': np.random.choice(['Item_A', 'Item_B', 'Item_C', 'Item_D'], N_ROWS),
    'int_col': np.random.randint(0, 100, N_ROWS)
}
df = pd.DataFrame(data)
Generating 10000000 rows of data... this might take a moment.
df.head()
| | string_col | int_col |
|---|---|---|
| 0 | Item_B | 1 |
| 1 | Item_D | 3 |
| 2 | Item_C | 9 |
| 3 | Item_C | 27 |
| 4 | Item_C | 20 |
| Data Structure | Dimensionality | Spreadsheet Analog | Database Analog | Linear Algebra |
|---|---|---|---|---|
| Series | 1D | Column | Column | Column Vector |
| DataFrame | 2D | Single Sheet | Table | Matrix |
| Operation | SQL Syntax | Pandas Syntax |
|---|---|---|
| Select All | SELECT * FROM table_name; | df |
| Select Columns | SELECT string_col, int_col FROM table_name; | df[['string_col', 'int_col']] |
| Filter (Where) | SELECT * FROM table_name WHERE int_col > 50; | df[df['int_col'] > 50] |
| Multiple Filters (And) | SELECT * FROM table_name WHERE int_col > 50 AND string_col = 'Item_A'; | df[(df['int_col'] > 50) & (df['string_col'] == 'Item_A')] |
| Multiple Filters (Or) | SELECT * FROM table_name WHERE string_col = 'Item_A' OR string_col = 'Item_B'; | df[(df['string_col'] == 'Item_A') \| (df['string_col'] == 'Item_B')] or df[df['string_col'].isin(['Item_A', 'Item_B'])] |
| Distinct / Unique | SELECT DISTINCT string_col FROM table_name; | df['string_col'].unique() |
| Sorting | SELECT * FROM table_name ORDER BY int_col DESC; | df.sort_values(by='int_col', ascending=False) |
| Group By (Count) | SELECT string_col, COUNT(*) FROM table_name GROUP BY string_col; | df.groupby('string_col').size() or df['string_col'].value_counts() |
| Group By (Avg) | SELECT string_col, AVG(int_col) FROM table_name GROUP BY string_col; | df.groupby('string_col')['int_col'].mean() |
| Rename Column | SELECT string_col AS item_name FROM table_name; | df.rename(columns={'string_col': 'item_name'}) |
| Limit / Head | SELECT * FROM table_name LIMIT 5; | df.head(5) |
df.columns
Index(['string_col', 'int_col'], dtype='object')
# ---------------------------------------------------------
# TEST 1: The "Object" Trap vs. Categoricals
# Scenario: Grouping by a string column
# ---------------------------------------------------------
print("-" * 30)
print("TEST 1: String Object vs. Category")
print("-" * 30)
# 1. Baseline (Object/String)
size_obj = get_size_mb(df['string_col'])
start = time.time()
res_obj = df.groupby('string_col').count()
end = time.time()
time_obj = end - start
print(f"Object Dtype:")
print(f" - Memory: {size_obj:.2f} MB")
print(f" - GroupBy Time: {time_obj:.4f} sec")
------------------------------
TEST 1: String Object vs. Category
------------------------------
Object Dtype:
 - Memory: 524.52 MB
 - GroupBy Time: 0.3738 sec
Category Optimization¶
start_conv = time.time()
df['cat_col'] = df['string_col'].astype('category')
end_conv = time.time()
conversion_cost = end_conv - start_conv
print(df['cat_col'].cat.categories)
print(df['cat_col'].cat.codes)
Index(['Item_A', 'Item_B', 'Item_C', 'Item_D'], dtype='object')
0 1
1 3
2 2
3 2
4 2
..
9999995 1
9999996 2
9999997 3
9999998 1
9999999 0
Length: 10000000, dtype: int8
size_cat = get_size_mb(df['cat_col'])
start = time.time()
res_cat = df.groupby('cat_col').count()  # note: observed=True becomes the default in a future pandas version
end = time.time()
time_cat = end - start
print(f"Category Dtype:")
print(f" - Memory: {size_cat:.2f} MB ({(size_obj/size_cat):.1f}x smaller)")
print(f" - GroupBy Time: {time_cat:.4f} sec ({(time_obj/time_cat):.1f}x faster)")
print(f" - (Conversion overhead was {conversion_cost:.4f} sec)")
/tmp/ipykernel_45599/3568531628.py:3: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
res_cat = df.groupby('cat_col').count() # observed=False is future default safety
Category Dtype:
 - Memory: 9.54 MB (55.0x smaller)
 - GroupBy Time: 0.3037 sec (1.2x faster)
 - (Conversion overhead was 0.3405 sec)
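The takeaway generalizes: `category` pays off when the number of unique values is small relative to the row count, and can actually cost memory on near-unique columns. A minimal sketch of that rule of thumb (the `maybe_categorize` helper and its 0.5 threshold are my own, not part of pandas):

```python
import pandas as pd
import numpy as np

def maybe_categorize(series, max_unique_ratio=0.5):
    # Convert only when unique values are a small fraction of the rows;
    # for near-unique columns, 'category' stores codes PLUS the full dictionary.
    if series.nunique() / len(series) < max_unique_ratio:
        return series.astype('category')
    return series

s = pd.Series(np.random.choice(['Item_A', 'Item_B', 'Item_C'], 10_000))
print(maybe_categorize(s).dtype)  # category
```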
Save space via integer precision¶
# ---------------------------------------------------------
# TEST 2: Integer Precision (int64 vs. int8)
# Scenario: Basic arithmetic (Memory Bandwidth check)
# ---------------------------------------------------------
print("\n" + "-" * 30)
print("TEST 2: int64 vs. int8 (Downcasting)")
print("-" * 30)
# 1. Baseline (int64 - default in Pandas/NumPy)
# Ensure it is int64
df['int_col'] = df['int_col'].astype('int64')
size_i64 = get_size_mb(df['int_col'])
start = time.time()
# Simple scan operation: Multiply every number by 2
_ = df['int_col'] * 2
end = time.time()
time_i64 = end - start
print(f"int64 Dtype:")
print(f" - Memory: {size_i64:.2f} MB")
print(f" - Operation Time: {time_i64:.4f} sec")
# 2. Optimized (int8)
df['int8_col'] = df['int_col'].astype('int8')
size_i8 = get_size_mb(df['int8_col'])
start = time.time()
_ = df['int8_col'] * 2
end = time.time()
time_i8 = end - start
print(f"int8 Dtype:")
print(f" - Memory: {size_i8:.2f} MB ({(size_i64/size_i8):.1f}x smaller)")
print(f" - Operation Time: {time_i8:.4f} sec ({(time_i64/time_i8):.1f}x faster)")
------------------------------
TEST 2: int64 vs. int8 (Downcasting)
------------------------------
int64 Dtype:
 - Memory: 76.29 MB
 - Operation Time: 0.0171 sec
int8 Dtype:
 - Memory: 9.54 MB (8.0x smaller)
 - Operation Time: 0.0018 sec (9.4x faster)
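Rather than hard-coding `int8`, pandas can pick the smallest safe integer type automatically with `pd.to_numeric(..., downcast=...)`. A quick sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series(np.random.randint(0, 100, 1_000), dtype='int64')
# downcast='integer' chooses the smallest signed integer dtype
# that can hold every value (here int8, since 0-99 fits in [-128, 127])
small = pd.to_numeric(s, downcast='integer')
print(small.dtype)  # int8
```

This is handy when the value range is not known in advance, e.g. right after loading a CSV.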
Python loops vs. vectorization¶
import pandas as pd
import numpy as np
import time
# 1. SETUP: Create a "Medium" sized DataFrame (100k rows)
# We use a fixed seed so results are reproducible
np.random.seed(42)
N = 100_000
df = pd.DataFrame({
    'A': np.random.randint(0, 100, size=N),
    'B': np.random.random(size=N)
})
print(f"--- Benchmarking on {N:,} rows ---\n")
--- Benchmarking on 100,000 rows ---
METHOD 1: Iterrows¶
start = time.time()
res_iter = []
for index, row in df.iterrows():
    if row['A'] > 50:
        res_iter.append(row['B'] * 10)
    else:
        res_iter.append(row['B'])
df['C_iter'] = res_iter
end = time.time()
time_iter = end - start
print(f"1. Iterrows: {time_iter:.4f} sec (Baseline: Slowest)")
1. Iterrows: 1.3712 sec (Baseline: Slowest)
METHOD 2: .apply()¶
start = time.time()
def logic(row):
    if row['A'] > 50:
        return row['B'] * 10
    else:
        return row['B']
df['C_apply'] = df.apply(logic, axis=1)
end = time.time()
time_apply = end - start
print(f"2. .apply(axis=1): {time_apply:.4f} sec (approx {time_iter/time_apply:.1f}x faster)")
2. .apply(axis=1): 0.4061 sec (approx 3.4x faster)
METHOD 3: .iloc loop (The Manual Fetch)¶
# Note: Doing this manually is usually even slower or similar to iterrows
# because of the repeated Python-to-C API calls for every single index.
start = time.time()
res_iloc = []
for i in range(len(df)):
    # Accessing scalar values one by one
    val_a = df.iloc[i, 0]  # Column A is index 0
    val_b = df.iloc[i, 1]  # Column B is index 1
    if val_a > 50:
        res_iloc.append(val_b * 10)
    else:
        res_iloc.append(val_b)
df['C_iloc'] = res_iloc
end = time.time()
time_iloc = end - start
print(f"3. .iloc Loop: {time_iloc:.4f} sec ({time_iter/time_iloc:.1f}x the speed of iterrows, i.e. slower)")
3. .iloc Loop: 2.0270 sec (0.7x the speed of iterrows, i.e. slower)
METHOD 4: Vectorized Masking (The Columnar Way)¶
start = time.time()
# Step 1: Create mask (SIMD-friendly comparison over the entire A array)
mask = df['A'] > 50
# Step 2: Initialize result with default value (Column B)
df['C_vec'] = df['B']
# Step 3: Apply transformation only where mask is True
df.loc[mask, 'C_vec'] = df['B'] * 10
end = time.time()
time_vec = end - start
print(f"4. Vectorized: {time_vec:.4f} sec (approx {time_iter/time_vec:.1f}x faster)")
print("\n--- Summary ---")
print(f"Vectorization was {time_iter / time_vec:.0f} times faster than iterrows.")
4. Vectorized: 0.0034 sec (approx 401.7x faster)

--- Summary ---
Vectorization was 402 times faster than iterrows.
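The same branching logic can also be written with `np.where`, which evaluates both branches once and selects per element, collapsing the mask-plus-assignment pattern into a single expression:

```python
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({'A': np.random.randint(0, 100, size=100_000),
                   'B': np.random.random(size=100_000)})
# np.where(condition, value_if_true, value_if_false), fully vectorized
df['C_np'] = np.where(df['A'] > 50, df['B'] * 10, df['B'])
```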
df[["A","B"]].to_csv("df.csv", index=False) # 2.1MB
df[["A","B"]].to_parquet("df.parquet", index=False) # 1 MB
df[["A","B"]].sort_values(by="A").to_parquet("sorted_df.parquet", index=False) # 982.3 KB
df[["A","B"]].sort_values(by="A")
| | A | B |
|---|---|---|
| 57418 | 0 | 0.245712 |
| 57059 | 0 | 0.216732 |
| 57327 | 0 | 0.655700 |
| 57120 | 0 | 0.829777 |
| 99905 | 0 | 0.060866 |
| ... | ... | ... |
| 3433 | 99 | 0.771281 |
| 15807 | 99 | 0.512312 |
| 11 | 99 | 0.518606 |
| 15768 | 99 | 0.863714 |
| 70946 | 99 | 0.961537 |
100000 rows × 2 columns
Parquet¶
Apache Parquet is widely considered the "gold standard" for open-source columnar storage. It is the default format for tools like Spark, AWS Athena, and Hadoop.

Unit of Storage: "Pages"¶
Parquet doesn't just write one giant column. It breaks a column down into chunks called Pages (typically 1MB to 8MB in size).
- Encodings and compression are applied per page.
- This means one page of a column could use Dictionary Encoding, while the next page (if the data characteristics change) could use Delta Encoding. This adaptivity is key to Parquet's efficiency.
Encodings¶
Dictionary Encoding (The Default)¶
Parquet is aggressive about Dictionary Encoding. When it starts writing a column page, it defaults to a mode called PLAIN_DICTIONARY (or RLE_DICTIONARY in newer versions).
- The Process:
- Parquet extracts unique values and builds a "Dictionary Page" at the start of the column chunk.
- It replaces the actual values in the "Data Page" with integer IDs (indices) pointing to the dictionary.
- The "Fall Back" Mechanism: Parquet monitors the size of the dictionary. If the column has high cardinality (e.g., millions of unique UUIDs) and the dictionary becomes too large to be efficient, Parquet automatically stops dictionary encoding for subsequent pages and falls back to PLAIN encoding.
Hybrid RLE and Bit-Packing¶
This is where Parquet gets clever. It doesn't just choose between Run-Length Encoding (RLE) and Bit-Packing; it combines them into a single stream known as Hybrid RLE/Bit-Packing.
It is primarily used for Dictionary Indices, Boolean values, and Repetition/Definition Levels (metadata used to handle nested data and NULLs).
How it works: The encoder looks ahead at the stream of integers.
If it sees a run of identical values: It outputs an RLE packet (e.g., "The value 5 repeats 50 times").
If the values are random: It outputs a Bit-Packed packet (e.g., "Here are the next 8 literal values, packed into 3 bits each").
Why this matters: Pure RLE is terrible for random data (it actually expands the size). Pure Bit-Packing is suboptimal for repeating data. Parquet switches back and forth dynamically within the same page to ensure optimal storage.
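A toy Python sketch of that switching logic (packet tuples instead of the real bit format; the `min_run` threshold is arbitrary here, whereas Parquet's actual rule operates on groups of 8 values):

```python
def hybrid_encode(values, min_run=4):
    """Toy hybrid RLE/bit-packing: long runs become RLE packets,
    everything else is grouped into literal (bit-packed) packets."""
    packets, i = [], 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        if j - i >= min_run:
            packets.append(('rle', values[i], j - i))  # "value repeats N times"
            i = j
        else:
            # accumulate literals until the next long run starts
            lit_start = i
            while i < len(values):
                j = i
                while j < len(values) and values[j] == values[i]:
                    j += 1
                if j - i >= min_run:
                    break
                i = j
            packets.append(('literal', values[lit_start:i]))
    return packets

print(hybrid_encode([5] * 6 + [1, 2, 3] + [7] * 5))
# [('rle', 5, 6), ('literal', [1, 2, 3]), ('rle', 7, 5)]
```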
Delta Encoding Variants¶
For numerical data where dictionary encoding isn't suitable (like unique timestamps or IDs), Parquet supports DELTA_BINARY_PACKED.
- Mini-Blocks: It divides data into small blocks (e.g., 128 integers).
- Min-Delta: It calculates the minimum delta (difference) in that block.
- Bit-Packing: It stores the variations from that minimum using the smallest number of bits possible.
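The per-block arithmetic can be sketched in a few lines (illustrative only, not the real wire format):

```python
def delta_block_stats(values, block_size=128):
    """For each mini-block: compute deltas, the block's min delta, and the
    bit width needed to store (delta - min_delta) for every element."""
    out = []
    for start in range(0, len(values) - 1, block_size):
        deltas = [values[i + 1] - values[i]
                  for i in range(start, min(start + block_size, len(values) - 1))]
        min_d = min(deltas)
        bits = max(d - min_d for d in deltas).bit_length()
        out.append({'min_delta': min_d, 'bit_width': bits})
    return out

# Sorted timestamps with deltas of 1 or 2: min_delta=1, offsets fit in 1 bit
ts = [1000, 1001, 1003, 1004, 1005, 1007]
print(delta_block_stats(ts))  # [{'min_delta': 1, 'bit_width': 1}]
```

For monotonically increasing IDs or timestamps this shrinks 64-bit values to a handful of bits each.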
Final Layer: Compression Codecs¶
Everything described above is just "Encoding." Once the data is RLE-encoded, Bit-Packed, or Dictionary-replaced, Parquet then applies a general-purpose block compression algorithm to the resulting bytes.
- Snappy: (Default in many systems) Offers very fast compression/decompression speeds but moderate compression ratios. Great for interactive queries.
- GZIP: High compression ratio, but slow (high CPU cost). Good for "cold" archival data.
- ZSTD (Zstandard): The modern favorite. It offers compression ratios similar to GZIP but with speeds closer to Snappy.
- LZO / Brotli: Also supported but less common defaults.
Life of a Parquet Column¶
If you save a list of "Cities" to a Parquet file, here is the journey the data takes:
- Input: ["New York", "New York", "Paris", "London", "London"]
- Dictionary Encoding:
  - Dictionary Page: 0: New York, 1: Paris, 2: London
  - Data Stream: 0, 0, 1, 2, 2
- Hybrid RLE/Bit-Packing:
  - The stream 0, 0, 1, 2, 2 is converted to compact bits.
  - (e.g., "0 repeats twice" [RLE], then "1, 2, 2" [Bit-Packed])
- Compression: The resulting bits are compressed using Snappy.
- Disk: The compressed bytes are written to the file.
This multi-stage process is why Parquet files are often 90% smaller than the original CSV or JSON files.