Python profiling¶
Table of contents:¶
- The Fundamental Computer System
- Profiling techniques
- Python lists
- Dictionary and namespaces
! pip install memory_profiler
%load_ext memory_profiler
# for %timeit and %memit
Requirement already satisfied: memory_profiler in /home/ankush/workplace/da_colab/DS614/.venv/lib/python3.12/site-packages (0.61.0)
Requirement already satisfied: psutil in /home/ankush/workplace/da_colab/DS614/.venv/lib/python3.12/site-packages (from memory_profiler) (7.2.1)
Fundamental Computer System¶
- Computing units: provide the ability to transform the bits they receive into other bits, or to change the state of a process. Eg: CPU (commonly used), GPU (auxiliary unit)
- Memory units: used to store bits. Eg: spinning disk, Solid State Drive (SSD), RAM, L1/L2/L3 cache, CPU registers
- Communication layers: how fast we can move data from one place to another. Eg: frontside bus (between RAM and L1/L2 cache), external bus (hardware devices to CPU and main memory)
Computing units¶
Provide the ability to transform the bits they receive into other bits, or to change the state of a process.
Measurements:
How many cycles per second? (Clock speed)
A CPU cycle is the basic unit of time used by a CPU to execute an instruction. It consists of fetch, decode, execute, and write-back phases.
How many instructions per cycle? (IPC)
Having a high IPC can drastically affect computing by changing the level of vectorization that is possible.
Commands:¶
lscpu
# eg: output
# CPU max MHz: 4680.0000 # 1 MHz means 1 million cycles per second
# CPU min MHz: 400.0000
# BogoMIPS: 6587.72 # rough ("bogus") measure of millions of instructions per second
Memory units¶
Used to store bits.
Measurements:
- How much data can it hold?
- How fast can it read/write data?
- How long does it take to locate data?
Tiered structure:
- Spinning hard drive
- Solid-state drive (SSD)
- RAM
- L1/L2/L3 cache
- CPU registers
Relative time latency¶

Image credits: Relative time latencies and computer programming
Communication layers¶
How fast we can move data from one place to another
Measurements:
How much data can a bus move in one transfer? (Bus width)
How many transfers can the bus make in one second? (Bus frequency)
Types of bus:
- Frontside bus: connects RAM to the L1/L2 cache
- External bus: connects external devices to system memory/CPU
- PCI bus: for peripheral devices
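As a rough illustration (with hypothetical numbers, not measured from any particular machine), a bus's peak bandwidth is its width multiplied by its transfer frequency:

```python
# Hypothetical bus: 64 bits wide, 1.6 billion transfers per second
bus_width_bytes = 64 // 8            # 8 bytes moved per transfer
transfers_per_second = 1.6e9         # bus frequency
bandwidth_gb_per_s = bus_width_bytes * transfers_per_second / 1e9
print(bandwidth_gb_per_s)            # 12.8 GB/s peak
```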
Language choices¶
Python is a garbage-collected language
Memory is automatically allocated and freed when needed. This can lead to memory fragmentation.
Dynamic typing and the language not being compiled
If code is static, the compiler can make many optimizations to lay out objects in memory efficiently so that the CPU runs certain instructions more efficiently. Python being a dynamic language reduces the scope of such optimizations.
Global interpreter lock (GIL)
Python avoids race conditions and deadlocks using the GIL. The GIL ensures that a Python process runs only one instruction at a time, regardless of the number of cores it is currently using.
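A minimal sketch of the GIL's effect (the function and counts here are illustrative): two threads running a CPU-bound countdown take about as long as running it twice sequentially, because only one thread can execute Python bytecode at a time.

```python
import threading
import time

def countdown(n):
    # pure-Python, CPU-bound work
    while n > 0:
        n -= 1

N = 2_000_000

start = time.time()
countdown(N)
countdown(N)
sequential = time.time() - start

t1 = threading.Thread(target=countdown, args=(N,))
t2 = threading.Thread(target=countdown, args=(N,))
start = time.time()
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.time() - start

# With the GIL, `threaded` is usually not meaningfully faster than `sequential`
print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```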
Why use python?¶
- It's a "batteries included" language, with key features either built in or accessible via external libraries.
- Managed environments and shells are available.
- It's faster to prototype ideas in Python.
Profiling techniques¶
- Simple approaches
- linux time command
- line_profiler
- perf module
Simple approaches to timing - print and decorator¶
# time module
import time
s_time = time.time()
time.sleep(4)
print(f"took {time.time() - s_time} secs")
took 4.0003697872161865 secs
# log a function's execution time using a decorator
import time
from functools import wraps
def timer_func(func):
    # This decorator reports the execution time of
    # the function object passed
    @wraps(func)
    def wrap_func(*args, **kwargs):
        t1 = time.time()
        result = func(*args, **kwargs)
        t2 = time.time()
        print(f'Function {func.__name__!r} executed in {(t2-t1):.4f}s')
        return result
    return wrap_func

@timer_func
def lengthy_function():
    time.sleep(4)

lengthy_function()
Function 'lengthy_function' executed in 4.0003s
Simple Timing Using the Unix time Command¶
! time python3 -u profiling_examples/insertion_sort.py 10000
! time python3 -u profiling_examples/merge_sort.py 10000
real 0m2.514s user 0m2.507s sys 0m0.006s
real 0m0.026s user 0m0.024s sys 0m0.002s
line_profiler¶
# installation:
pip install line_profiler
# decorate functions to be profiled with @profile
# how to run
kernprof -l -v python_script.py
line_profiler reports the following for each line:
Line #: line number
Hits: number of times the line was executed
Time: total time spent executing the line
Per Hit: average time per execution of the line
% Time: percentage of total time spent executing the line
Line Contents: the line's content
% Time is the most useful column when looking for CPU bottlenecks.
! kernprof -l -v profiling_examples/insertion_sort.py
[100, 99, 98, 97, 96, 95, 94, 93, 92, 91]
Wrote profile results to 'insertion_sort.py.lprof'
Timer unit: 1e-06 s
Total time: 0.00398147 s
File: profiling_examples/insertion_sort.py
Function: insertion_sort at line 13
Line # Hits Time Per Hit % Time Line Contents
==============================================================
13 @profile
14 def insertion_sort(arr):
15 # Traverse through 1 to len(arr)
16 99 29.1 0.3 0.7 for i in range(1, len(arr)):
17 98 27.9 0.3 0.7 key = arr[i]
18 # Move elements of arr[0..i-1], that are
19 # greater than key, to one position ahead
20 # of their current position
21 98 26.4 0.3 0.7 j = i - 1
22 4949 1324.4 0.3 33.3 while j >= 0 and key < arr[j]:
23 4851 1328.8 0.3 33.4 arr[j + 1] = arr[j]
24 4851 1217.1 0.3 30.6 j -= 1
25 98 27.7 0.3 0.7 arr[j + 1] = key
! kernprof -l -v profiling_examples/merge_sort.py
Wrote profile results to 'merge_sort.py.lprof'
Timer unit: 1e-06 s
Total time: 0.00140043 s
File: profiling_examples/merge_sort.py
Function: merge_sort at line 17
Line # Hits Time Per Hit % Time Line Contents
==============================================================
17 @profile
18 def merge_sort(arr):
19 197 94.5 0.5 6.7 if len(arr) > 1:
20
21 # Finding the mid of the array
22 98 31.2 0.3 2.2 mid = len(arr) // 2
23
24 # Dividing the array elements
25 98 36.5 0.4 2.6 L = arr[:mid]
26
27 # into 2 halves
28 98 30.1 0.3 2.2 R = arr[mid:]
29
30 # Sorting the first half
31 98 124.6 1.3 8.9 merge_sort(L)
32
33 # Sorting the second half
34 98 121.5 1.2 8.7 merge_sort(R)
35
36 98 25.9 0.3 1.8 i = j = k = 0
37
38 # Copy data to temp arrays L[] and R[]
39 450 134.2 0.3 9.6 while i < len(L) and j < len(R):
40 352 101.8 0.3 7.3 if L[i] <= R[j]:
41 arr[k] = L[i]
42 i += 1
43 else:
44 352 97.4 0.3 7.0 arr[k] = R[j]
45 352 92.7 0.3 6.6 j += 1
46 352 92.3 0.3 6.6 k += 1
47
48 # Checking if any element was left
49 410 113.4 0.3 8.1 while i < len(L):
50 312 86.0 0.3 6.1 arr[k] = L[i]
51 312 85.0 0.3 6.1 i += 1
52 312 78.3 0.3 5.6 k += 1
53
54 98 55.1 0.6 3.9 while j < len(R):
55 arr[k] = R[j]
56 j += 1
57 k += 1
perf module¶
Probing how well memory is being moved to the CPU can be quite hard; however, in Linux the perf tool can be used to get amazing amounts of insight into how the CPU is dealing with the program being run.
# installation:
sudo apt update
sudo apt install linux-tools-common
# how to run?
sudo perf stat -e cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,cache-references,cache-misses,branches,branch-misses,task-clock,faults,minor-faults,cs,migrations -r 3 python3 high_performance_python_examples/insertion_sort.py
# cycles: how many clock cycles the task took
# instructions: how many instructions the code issued to the CPU
# cache-references: data accesses attempted via the cache
# cache-misses: data references not found in the cache and fetched from RAM instead
# branches: points in the code where execution flow changes, eg: if..else
# branch-misses: the CPU tries to predict the flow and preload instructions; a branch miss is when that prediction was incorrect
# faults: page faults, eg: when newly allocated memory is first touched
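Fault counts can also be read from inside Python via the standard-library resource module (Unix only); minor faults climb as freshly allocated pages are first touched. This is a rough sketch, not a substitute for perf's counters:

```python
import resource

before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
buf = bytearray(50 * 1024 * 1024)          # allocate ~50 MiB
buf[::4096] = b'\x01' * len(buf[::4096])   # write one byte per 4 KiB page
after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
print(f"minor faults during allocate + touch: {after - before}")
```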
Strategies for profiling code¶
Disable TurboBoost in the BIOS.
Disable the operating system’s ability to override the SpeedStep (you will find this in your BIOS if you’re allowed to control it).
Only use mains power (never battery power).
Disable background tools like backups and Dropbox while running experiments.
Run the experiments many times to obtain a stable measurement.
Possibly drop to run level 1 (Unix) so that no other tasks are running.
Reboot and rerun the experiments to double-confirm the results.
Data structures and tradeoffs and quirks¶
- Dynamic lists and caching
- Resizing and memory allocation
- Difference between python lists and Numpy arrays
- Performance analysis of dictionaries
Lists¶
Python lists are dynamic in nature.
Pros:
- Users don't need to know the list size beforehand.
- Users don't have to handle resizing.
Cons:
- Resizing (in the background) makes appending to the list slower.
- Due to over-allocation during resizing, lists usually take more space than the elements they hold.
%timeit [i*i for i in range(10000)]
258 μs ± 7.41 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit l = []
for i in range(10000):
    l.append(i*i)
# takes more time because of list resizing
386 μs ± 885 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%memit [i*i for i in range(1000000)]
peak memory: 85.82 MiB, increment: 0.00 MiB
%%memit l = []
for i in range(1000000):
    l.append(i*i)
# peak memory is higher because of list over-allocation during resizing
peak memory: 114.71 MiB, increment: 29.62 MiB
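The extra space comes from over-allocation: sys.getsizeof shows the list's size growing in occasional jumps rather than on every append, so most appends reuse spare capacity that was allocated earlier.

```python
import sys

l = []
sizes = []
for i in range(20):
    sizes.append(sys.getsizeof(l))
    l.append(i)

# the reported size jumps only occasionally; between jumps,
# appends fill capacity that was over-allocated on the last resize
print(sizes)
```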
Key Factors¶
Memory allocations:
Memory allocations are not cheap. Whenever you assign a value to a variable, the interpreter needs to talk to the operating system to allocate new space and then iterate over that space to access it. Try to allocate space fewer times and reuse that space throughout the program.
Memory fragmentation:
Python lists don't natively support vectorization, because lists don't store the actual data, only pointers to it.
Cache misses:
Since the data is fragmented across different memory locations, each piece must be moved over individually instead of as one block, missing out on the benefits of caching.
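That a list holds pointers rather than the data itself can be seen with a shallow copy: the copy is a new list object, but its slots reference the very same element objects.

```python
a = [10**6, 2 * 10**6]   # large ints, so CPython's small-int caching is not involved
b = a[:]                 # shallow copy: new list, same element pointers

assert a is not b        # distinct list objects...
assert a[0] is b[0]      # ...whose slots point at the same int object
```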
Lists vs numpy arrays¶
from array import array
import numpy as np
def norm_square_list(vector):
    norm = 0
    for v in vector:
        norm += v*v
    return norm

def norm_square_list_comprehension(vector):
    return sum([v*v for v in vector])

def norm_square_numpy(vector):
    return np.sum(vector * vector)

def norm_square_numpy_dot(vector):
    return np.dot(vector, vector)
%%timeit
vector = list(range(1000000))
norm_square_list(vector)
56.2 ms ± 135 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
vector = list(range(1000000))
norm_square_list_comprehension(vector)
61.1 ms ± 338 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
vector_numpy = np.arange(1000000)
norm_square_numpy(vector_numpy)
# numpy has specialized code in C that takes advantage of any vectorization the CPU supports.
# Also, numpy arrays are laid out sequentially in memory as low-level data types
907 μs ± 37.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
vector_numpy = np.arange(1000000)
norm_square_numpy_dot(vector_numpy)
# uses specialized code that doesn't need to store the intermediate output of (vector * vector)
734 μs ± 5.57 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
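The standard-library array module (imported in the snippet above but not used there) also stores elements contiguously as low-level C types, and numpy can wrap such a buffer without copying. A small sketch:

```python
from array import array
import numpy as np

vec = array('d', range(1000))                   # contiguous C doubles
vec_np = np.frombuffer(vec, dtype=np.float64)   # zero-copy view of the same buffer

assert vec_np[10] == 10.0
# numpy operations now work on the array-module buffer directly
assert np.dot(vec_np, vec_np) == sum(v * v for v in range(1000))
```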
Dictionaries and namespaces¶
Python’s namespace management heavily uses dictionaries to do its lookups. Whenever a variable, function, or module is invoked in Python, there is a hierarchy that determines where it looks for these objects. First, Python looks inside the locals() array, which has entries for all local variables. Python works hard to make local variable lookups fast, and this is the only part of the chain that doesn’t require a dictionary lookup. If the name doesn’t exist there, then the globals() dictionary is searched. Finally, if the object isn’t found there, the builtins module is searched.
import math
from math import sin
import dis
def test1(x):
    """
    resolve `math` from globals()
    resolve `sin` from the math module
    local lookup for x
    """
    return math.sin(x)

def test2(x):
    """
    resolve sin from globals()
    resolve x from locals()
    """
    return sin(x)

def test3(x, sin=math.sin):
    """
    resolve sin from locals()
    resolve x from locals()
    """
    return sin(x)
%timeit test1(123456)
#2 dictionary lookup, 1 local lookup
58.5 ns ± 0.157 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
%timeit test2(123456)
#1 dictionary lookup, 1 local lookup
65 ns ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
%timeit test3(123456)
# 2 local lookups
55.9 ns ± 0.107 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
dis.dis(test1)
5 0 RESUME 0
12 2 LOAD_GLOBAL 0 (math)
12 LOAD_ATTR 3 (NULL|self + sin)
32 LOAD_FAST 0 (x)
34 CALL 1
42 RETURN_VALUE
dis.dis(test2)
15 0 RESUME 0
20 2 LOAD_GLOBAL 1 (NULL + sin)
12 LOAD_FAST 0 (x)
14 CALL 1
22 RETURN_VALUE
dis.dis(test3)
23 0 RESUME 0
28 2 PUSH_NULL
4 LOAD_FAST 1 (sin)
6 LOAD_FAST 0 (x)
8 CALL 1
16 RETURN_VALUE
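The lookup difference matters most inside hot loops. A common trick, mirroring test3 above (the function names here are illustrative), is to hoist the global/attribute lookup into a local default argument so each iteration pays only for a local lookup:

```python
import math

def loop_global(n):
    total = 0.0
    for i in range(n):
        total += math.sin(i)   # global lookup of `math` + attribute lookup of `sin` every iteration
    return total

def loop_local(n, sin=math.sin):
    total = 0.0
    for i in range(n):
        total += sin(i)        # single local-variable lookup every iteration
    return total

# both compute the same result; loop_local typically benchmarks slightly faster
assert abs(loop_global(1000) - loop_local(1000)) < 1e-9
```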
Further exploration¶
- How can we write modules for Python in C/Fortran/Rust?
- Multiprocessing
- How can we write code that runs across a cluster and uses more machines?
- How can we make programs use less RAM?