Pandas and Parquet
Pandas¶
import pandas as pd
import numpy as np
import time
# Helper function to measure size
def get_size_mb(series):
    return series.memory_usage(deep=True) / 1024 ** 2
SETUP: Generate a large synthetic dataset (10 Million rows)¶
N_ROWS = 10_000_000
print(f"Generating {N_ROWS} rows of data... this might take a moment.\n")
# Generate random data
# 'string_col': Repeating strings with low cardinality (e.g., "Item_A", "Item_B")
# 'int_col': Numbers between 0-100 (fits in int8, but defaults to int64)
data = {
    'string_col': np.random.choice(['Item_A', 'Item_B', 'Item_C', 'Item_D'], N_ROWS),
    'int_col': np.random.randint(0, 100, N_ROWS)
}
df = pd.DataFrame(data)
Generating 10000000 rows of data... this might take a moment.
df.head()
| | string_col | int_col |
|---|---|---|
| 0 | Item_B | 1 |
| 1 | Item_D | 3 |
| 2 | Item_C | 9 |
| 3 | Item_C | 27 |
| 4 | Item_C | 20 |
| Data Structure | Dimensionality | Spreadsheet Analog | Database Analog | Linear Algebra |
|---|---|---|---|---|
| Series | 1D | Column | Column | Column Vector |
| DataFrame | 2D | Single Sheet | Table | Matrix |
| Operation | SQL Syntax | Pandas Syntax |
|---|---|---|
| Select All | SELECT * FROM table_name; | df |
| Select Columns | SELECT string_col, int_col FROM table_name; | df[['string_col', 'int_col']] |
| Filter (Where) | SELECT * FROM table_name WHERE int_col > 50; | df[df['int_col'] > 50] |
| Multiple Filters (And) | SELECT * FROM table_name WHERE int_col > 50 AND string_col = 'Item_A'; | df[(df['int_col'] > 50) & (df['string_col'] == 'Item_A')] |
| Multiple Filters (Or) | SELECT * FROM table_name WHERE string_col = 'Item_A' OR string_col = 'Item_B'; | df[(df['string_col'] == 'Item_A') \| (df['string_col'] == 'Item_B')] or df[df['string_col'].isin(['Item_A', 'Item_B'])] |
| Distinct / Unique | SELECT DISTINCT string_col FROM table_name; | df['string_col'].unique() |
| Sorting | SELECT * FROM table_name ORDER BY int_col DESC; | df.sort_values(by='int_col', ascending=False) |
| Group By (Count) | SELECT string_col, COUNT(*) FROM table_name GROUP BY string_col; | df.groupby('string_col').size() or df['string_col'].value_counts() |
| Group By (Avg) | SELECT string_col, AVG(int_col) FROM table_name GROUP BY string_col; | df.groupby('string_col')['int_col'].mean() |
| Rename Column | SELECT string_col AS item_name FROM table_name; | df.rename(columns={'string_col': 'item_name'}) |
| Limit / Head | SELECT * FROM table_name LIMIT 5; | df.head(5) |
df.columns
Index(['string_col', 'int_col'], dtype='object')
# ---------------------------------------------------------
# TEST 1: The "Object" Trap vs. Categoricals
# Scenario: Grouping by a string column
# ---------------------------------------------------------
print("-" * 30)
print("TEST 1: String Object vs. Category")
print("-" * 30)
# 1. Baseline (Object/String)
size_obj = get_size_mb(df['string_col'])
start = time.time()
res_obj = df.groupby('string_col').count()
end = time.time()
time_obj = end - start
print(f"Object Dtype:")
print(f" - Memory: {size_obj:.2f} MB")
print(f" - GroupBy Time: {time_obj:.4f} sec")
------------------------------
TEST 1: String Object vs. Category
------------------------------
Object Dtype:
 - Memory: 524.52 MB
 - GroupBy Time: 0.3738 sec
Category Optimization¶
start_conv = time.time()
df['cat_col'] = df['string_col'].astype('category')
end_conv = time.time()
conversion_cost = end_conv - start_conv
print(df['cat_col'].cat.categories)
print(df['cat_col'].cat.codes)
Index(['Item_A', 'Item_B', 'Item_C', 'Item_D'], dtype='object')
0 1
1 3
2 2
3 2
4 2
..
9999995 1
9999996 2
9999997 3
9999998 1
9999999 0
Length: 10000000, dtype: int8
size_cat = get_size_mb(df['cat_col'])
start = time.time()
res_cat = df.groupby('cat_col').count()  # note: observed=True becomes the default in a future pandas version
end = time.time()
time_cat = end - start
print(f"Category Dtype:")
print(f" - Memory: {size_cat:.2f} MB ({(size_obj/size_cat):.1f}x smaller)")
print(f" - GroupBy Time: {time_cat:.4f} sec ({(time_obj/time_cat):.1f}x faster)")
print(f" - (Conversion overhead was {conversion_cost:.4f} sec)")
/tmp/ipykernel_45599/3568531628.py:3: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
res_cat = df.groupby('cat_col').count() # observed=False is future default safety
Category Dtype:
 - Memory: 9.54 MB (55.0x smaller)
 - GroupBy Time: 0.3037 sec (1.2x faster)
 - (Conversion overhead was 0.3405 sec)
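The takeaway generalizes: `category` pays off when the number of unique values is small relative to the row count, and can actually cost memory on near-unique columns. A minimal sketch of that rule of thumb (the `maybe_categorize` helper and its 0.5 threshold are my own, not part of pandas):

```python
import pandas as pd
import numpy as np

def maybe_categorize(series, max_unique_ratio=0.5):
    # Convert only when unique values are a small fraction of the rows;
    # for near-unique columns, 'category' stores codes PLUS the full dictionary.
    if series.nunique() / len(series) < max_unique_ratio:
        return series.astype('category')
    return series

s = pd.Series(np.random.choice(['Item_A', 'Item_B', 'Item_C'], 10_000))
print(maybe_categorize(s).dtype)  # category
```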
Save space via integer precision¶
# ---------------------------------------------------------
# TEST 2: Integer Precision (int64 vs. int8)
# Scenario: Basic arithmetic (Memory Bandwidth check)
# ---------------------------------------------------------
print("\n" + "-" * 30)
print("TEST 2: int64 vs. int8 (Downcasting)")
print("-" * 30)
# 1. Baseline (int64 - default in Pandas/NumPy)
# Ensure it is int64
df['int_col'] = df['int_col'].astype('int64')
size_i64 = get_size_mb(df['int_col'])
start = time.time()
# Simple scan operation: Multiply every number by 2
_ = df['int_col'] * 2
end = time.time()
time_i64 = end - start
print(f"int64 Dtype:")
print(f" - Memory: {size_i64:.2f} MB")
print(f" - Operation Time: {time_i64:.4f} sec")
# 2. Optimized (int8)
df['int8_col'] = df['int_col'].astype('int8')
size_i8 = get_size_mb(df['int8_col'])
start = time.time()
_ = df['int8_col'] * 2
end = time.time()
time_i8 = end - start
print(f"int8 Dtype:")
print(f" - Memory: {size_i8:.2f} MB ({(size_i64/size_i8):.1f}x smaller)")
print(f" - Operation Time: {time_i8:.4f} sec ({(time_i64/time_i8):.1f}x faster)")
------------------------------
TEST 2: int64 vs. int8 (Downcasting)
------------------------------
int64 Dtype:
 - Memory: 76.29 MB
 - Operation Time: 0.0171 sec
int8 Dtype:
 - Memory: 9.54 MB (8.0x smaller)
 - Operation Time: 0.0018 sec (9.4x faster)
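Rather than hard-coding `int8`, pandas can pick the smallest safe integer type automatically with `pd.to_numeric(..., downcast=...)`. A quick sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series(np.random.randint(0, 100, 1_000), dtype='int64')
# downcast='integer' chooses the smallest signed integer dtype
# that can hold every value (here int8, since 0-99 fits in [-128, 127])
small = pd.to_numeric(s, downcast='integer')
print(small.dtype)  # int8
```

This is handy when the value range is not known in advance, e.g. right after loading a CSV.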
Python loops vs. vectorization¶
import pandas as pd
import numpy as np
import time
# 1. SETUP: Create a "Medium" sized DataFrame (100k rows)
# We use a fixed seed so results are reproducible
np.random.seed(42)
N = 100_000
df = pd.DataFrame({
    'A': np.random.randint(0, 100, size=N),
    'B': np.random.random(size=N)
})
print(f"--- Benchmarking on {N:,} rows ---\n")
--- Benchmarking on 100,000 rows ---
METHOD 1: Iterrows¶
start = time.time()
res_iter = []
for index, row in df.iterrows():
    if row['A'] > 50:
        res_iter.append(row['B'] * 10)
    else:
        res_iter.append(row['B'])
df['C_iter'] = res_iter
end = time.time()
time_iter = end - start
print(f"1. Iterrows: {time_iter:.4f} sec (Baseline: Slowest)")
1. Iterrows: 1.3712 sec (Baseline: Slowest)
METHOD 2: .apply()¶
start = time.time()
def logic(row):
    if row['A'] > 50:
        return row['B'] * 10
    else:
        return row['B']
df['C_apply'] = df.apply(logic, axis=1)
end = time.time()
time_apply = end - start
print(f"2. .apply(axis=1): {time_apply:.4f} sec (approx {time_iter/time_apply:.1f}x faster)")
2. .apply(axis=1): 0.4061 sec (approx 3.4x faster)
METHOD 3: .iloc loop (The Manual Fetch)¶
# Note: Doing this manually is usually even slower or similar to iterrows
# because of the repeated Python-to-C API calls for every single index.
start = time.time()
res_iloc = []
for i in range(len(df)):
    # Accessing scalar values one by one
    val_a = df.iloc[i, 0]  # Column A is index 0
    val_b = df.iloc[i, 1]  # Column B is index 1
    if val_a > 50:
        res_iloc.append(val_b * 10)
    else:
        res_iloc.append(val_b)
df['C_iloc'] = res_iloc
end = time.time()
time_iloc = end - start
print(f"3. .iloc Loop: {time_iloc:.4f} sec ({time_iter/time_iloc:.1f}x the speed of iterrows, i.e. slower)")
3. .iloc Loop: 2.0270 sec (0.7x the speed of iterrows, i.e. slower)
METHOD 4: Vectorized Masking (The Columnar Way)¶
start = time.time()
# Step 1: Create mask (SIMD-friendly comparison over the entire A array)
mask = df['A'] > 50
# Step 2: Initialize result with default value (Column B)
df['C_vec'] = df['B']
# Step 3: Apply transformation only where mask is True
df.loc[mask, 'C_vec'] = df['B'] * 10
end = time.time()
time_vec = end - start
print(f"4. Vectorized: {time_vec:.4f} sec (approx {time_iter/time_vec:.1f}x faster)")
print("\n--- Summary ---")
print(f"Vectorization was {time_iter / time_vec:.0f} times faster than iterrows.")
4. Vectorized: 0.0034 sec (approx 401.7x faster)

--- Summary ---
Vectorization was 402 times faster than iterrows.
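The same branching logic can also be written with `np.where`, which evaluates both branches once and selects per element, collapsing the mask-plus-assignment pattern into a single expression:

```python
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({'A': np.random.randint(0, 100, size=100_000),
                   'B': np.random.random(size=100_000)})
# np.where(condition, value_if_true, value_if_false), fully vectorized
df['C_np'] = np.where(df['A'] > 50, df['B'] * 10, df['B'])
```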
df[["A","B"]].to_csv("df.csv", index=False) # 2.1MB
df[["A","B"]].to_parquet("df.parquet", index=False) # 1 MB
df[["A","B"]].sort_values(by="A").to_parquet("sorted_df.parquet", index=False) # 982.3 KB
df[["A","B"]].sort_values(by="A")
| | A | B |
|---|---|---|
| 57418 | 0 | 0.245712 |
| 57059 | 0 | 0.216732 |
| 57327 | 0 | 0.655700 |
| 57120 | 0 | 0.829777 |
| 99905 | 0 | 0.060866 |
| ... | ... | ... |
| 3433 | 99 | 0.771281 |
| 15807 | 99 | 0.512312 |
| 11 | 99 | 0.518606 |
| 15768 | 99 | 0.863714 |
| 70946 | 99 | 0.961537 |
100000 rows × 2 columns
Parquet¶
Apache Parquet is widely considered the "gold standard" for open-source columnar storage. It is the default format for tools like Spark, AWS Athena, and Hadoop.

Unit of Storage: "Pages"¶
Parquet doesn't just write one giant column. It breaks a column down into chunks called Pages (typically 1MB to 8MB in size).
- Encodings and compression are applied per page.
- This means one page of a column could use Dictionary Encoding, while the next page (if the data characteristics change) could use Delta Encoding. This adaptivity is key to Parquet's efficiency.
Encodings¶
Dictionary Encoding (The Default)¶
Parquet is aggressive about Dictionary Encoding. When it starts writing a column page, it defaults to a mode called PLAIN_DICTIONARY (or RLE_DICTIONARY in newer versions).
- The Process:
- Parquet extracts unique values and builds a "Dictionary Page" at the start of the column chunk.
- It replaces the actual values in the "Data Page" with integer IDs (indices) pointing to the dictionary.
- The "Fall Back" Mechanism: Parquet monitors the size of the dictionary. If the column has high cardinality (e.g., millions of unique UUIDs) and the dictionary becomes too large to be efficient, Parquet automatically stops dictionary encoding for subsequent pages and falls back to PLAIN encoding.
Hybrid RLE and Bit-Packing¶
This is where Parquet gets clever. It doesn't just choose between Run-Length Encoding (RLE) and Bit-Packing; it combines them into a single stream known as Hybrid RLE/Bit-Packing.
It is primarily used for Dictionary Indices, Boolean values, and Repetition/Definition Levels (metadata used to handle nested data and NULLs).
How it works: The encoder looks ahead at the stream of integers.
If it sees a run of identical values: It outputs an RLE packet (e.g., "The value 5 repeats 50 times").
If the values are random: It outputs a Bit-Packed packet (e.g., "Here are the next 8 literal values, packed into 3 bits each").
Why this matters: Pure RLE is terrible for random data (it actually expands the size). Pure Bit-Packing is suboptimal for repeating data. Parquet switches back and forth dynamically within the same page to ensure optimal storage.
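A toy Python sketch of that switching logic (packet tuples instead of the real bit format; the `min_run` threshold is arbitrary here, whereas Parquet's actual rule operates on groups of 8 values):

```python
def hybrid_encode(values, min_run=4):
    """Toy hybrid RLE/bit-packing: long runs become RLE packets,
    everything else is grouped into literal (bit-packed) packets."""
    packets, i = [], 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        if j - i >= min_run:
            packets.append(('rle', values[i], j - i))  # "value repeats N times"
            i = j
        else:
            # accumulate literals until the next long run starts
            lit_start = i
            while i < len(values):
                j = i
                while j < len(values) and values[j] == values[i]:
                    j += 1
                if j - i >= min_run:
                    break
                i = j
            packets.append(('literal', values[lit_start:i]))
    return packets

print(hybrid_encode([5] * 6 + [1, 2, 3] + [7] * 5))
# [('rle', 5, 6), ('literal', [1, 2, 3]), ('rle', 7, 5)]
```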
Delta Encoding Variants¶
For numerical data where dictionary encoding isn't suitable (like unique timestamps or IDs), Parquet supports DELTA_BINARY_PACKED.
- Mini-Blocks: It divides data into small blocks (e.g., 128 integers).
- Min-Delta: It calculates the minimum delta (difference) in that block.
- Bit-Packing: It stores the variations from that minimum using the smallest number of bits possible.
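The per-block arithmetic can be sketched in a few lines (illustrative only, not the real wire format):

```python
def delta_block_stats(values, block_size=128):
    """For each mini-block: compute deltas, the block's min delta, and the
    bit width needed to store (delta - min_delta) for every element."""
    out = []
    for start in range(0, len(values) - 1, block_size):
        deltas = [values[i + 1] - values[i]
                  for i in range(start, min(start + block_size, len(values) - 1))]
        min_d = min(deltas)
        bits = max(d - min_d for d in deltas).bit_length()
        out.append({'min_delta': min_d, 'bit_width': bits})
    return out

# Sorted timestamps with deltas of 1 or 2: min_delta=1, offsets fit in 1 bit
ts = [1000, 1001, 1003, 1004, 1005, 1007]
print(delta_block_stats(ts))  # [{'min_delta': 1, 'bit_width': 1}]
```

For monotonically increasing IDs or timestamps this shrinks 64-bit values to a handful of bits each.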
Final Layer: Compression Codecs¶
Everything described above is just "Encoding." Once the data is RLE-encoded, Bit-Packed, or Dictionary-replaced, Parquet then applies a general-purpose block compression algorithm to the resulting bytes.
- Snappy: (Default in many systems) Offers very fast compression/decompression speeds but moderate compression ratios. Great for interactive queries.
- GZIP: High compression ratio, but slow (high CPU cost). Good for "cold" archival data.
- ZSTD (Zstandard): The modern favorite. It offers compression ratios similar to GZIP but with speeds closer to Snappy.
- LZO / Brotli: Also supported but less common defaults.
Life of a Parquet Column¶
If you save a list of "Cities" to a Parquet file, here is the journey the data takes:
- Input: ["New York", "New York", "Paris", "London", "London"]
- Dictionary Encoding:
  - Dictionary Page: 0: New York, 1: Paris, 2: London
  - Data Stream: 0, 0, 1, 2, 2
- Hybrid RLE/Bit-Packing:
  - The stream 0, 0, 1, 2, 2 is converted to compact bits.
  - (e.g., "0 repeats twice" [RLE], then "1, 2, 2" [Bit-Packed])
- Compression: The resulting bits are compressed using Snappy.
- Disk: The compressed bytes are written to the file.
This multi-stage process is why Parquet files are often 90% smaller than the original CSV or JSON files.