Python profiling¶
Table of contents:¶
- The Fundamental Computer System
- Profiling techniques
- Python lists
- Dictionary and namespaces
! pip install memory_profiler
%load_ext memory_profiler
# for %timeit and %memit
Requirement already satisfied: memory_profiler in /home/ankush/workplace/da_colab/DS614/.venv/lib/python3.12/site-packages (0.61.0)
Requirement already satisfied: psutil in /home/ankush/workplace/da_colab/DS614/.venv/lib/python3.12/site-packages (from memory_profiler) (7.2.1)
Fundamental Computer System¶
- Computing units: provide the ability to transform the bits they receive into other bits, or to change the state of a process. Eg: CPU (commonly used), GPU (auxiliary unit)
- Memory units: used to store bits. Eg: spinning disk, Solid State Drive (SSD), RAM, L1/L2/L3 cache, CPU registers
- Communication layers: how fast we can move data from one place to another. Eg: frontside bus (between RAM and L1/L2 cache), external bus (hardware devices to CPU and main memory)
Computing units¶
Provide the ability to transform the bits they receive into other bits, or to change the state of a process.
Measurements:
How many cycles per second? (Clock speed)
A CPU cycle is the basic unit of time used by a CPU to execute an instruction. It consists of fetch, decode, execute, and write-back phases.
How many instructions per cycle? (IPC)
Having a high IPC can drastically affect computing by changing the level of vectorization that is possible.
Commands:¶
lscpu
# eg: output
# CPU max MHz: 4680.0000 # 1 MHz means 1 million cycles per second
# CPU min MHz: 400.0000
# BogoMIPS: 6587.72 # rough ("bogus") measure of millions of instructions per second
Memory units¶
Used to store bits.
Measurements:
- How much data can it hold?
- How fast can it read/write data?
- How long does it take to locate data?
Tiered structure:
- Spinning hard drive
- Solid-state drive (SSD)
- RAM
- L1/L2/L3 cache
- CPU registers
Relative time latency¶

Image credits: Relative time latencies and computer programming
Communication layers¶
How fast we can move data from one place to another
Measurements:
How much data can a bus move in one transfer? (Bus width)
How many transfers can the bus make in one second? (Bus frequency)
Types of bus:
- Frontside bus: connects RAM to the L1/L2 cache
- External bus: connects external devices to system memory/CPU
- PCI bus: for peripheral devices
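As a rough illustration (with hypothetical numbers, not measured from any particular machine), a bus's peak bandwidth is its width multiplied by its transfer frequency:

```python
# Hypothetical bus: 64 bits wide, 1.6 billion transfers per second
bus_width_bytes = 64 // 8            # 8 bytes moved per transfer
transfers_per_second = 1.6e9         # bus frequency
bandwidth_gb_per_s = bus_width_bytes * transfers_per_second / 1e9
print(bandwidth_gb_per_s)            # 12.8 GB/s peak
```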
Language choices¶
Python is a garbage-collected language
Memory is automatically allocated and freed when needed. This can lead to memory fragmentation.
Dynamic typing and the language not being compiled
If code is static, the compiler can make many optimizations to lay out objects in memory efficiently so that the CPU runs certain instructions more efficiently. Python being a dynamic language reduces the scope of such optimizations.
Global interpreter lock (GIL)
Python avoids race conditions and deadlocks using the GIL. The GIL ensures that a Python process runs only one instruction at a time, regardless of the number of cores it is currently using.
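A minimal sketch of the GIL's effect (the function and counts here are illustrative): two threads running a CPU-bound countdown take about as long as running it twice sequentially, because only one thread can execute Python bytecode at a time.

```python
import threading
import time

def countdown(n):
    # pure-Python, CPU-bound work
    while n > 0:
        n -= 1

N = 2_000_000

start = time.time()
countdown(N)
countdown(N)
sequential = time.time() - start

t1 = threading.Thread(target=countdown, args=(N,))
t2 = threading.Thread(target=countdown, args=(N,))
start = time.time()
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.time() - start

# With the GIL, `threaded` is usually not meaningfully faster than `sequential`
print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```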
Why use python?¶
- It's a "batteries included" language, with key features either built in or accessible via external libraries.
- Managed environments and shells are available.
- It's faster to prototype ideas in Python.
Profiling techniques¶
- Simple approaches
- linux time command
- line_profiler
- perf module
Simple approaches to timing - print and decorator¶
# time module
import time
s_time = time.time()
time.sleep(4)
print(f"took {time.time() - s_time} secs")
took 4.0003697872161865 secs
# log a function's execution time using a decorator
import time
from functools import wraps
def timer_func(func):
    # This decorator reports the execution time of
    # the function object passed
    @wraps(func)
    def wrap_func(*args, **kwargs):
        t1 = time.time()
        result = func(*args, **kwargs)
        t2 = time.time()
        print(f'Function {func.__name__!r} executed in {(t2-t1):.4f}s')
        return result
    return wrap_func

@timer_func
def lengthy_function():
    time.sleep(4)

lengthy_function()
Function 'lengthy_function' executed in 4.0003s
Simple Timing Using the Unix time Command¶
! time python3 -u profiling_examples/insertion_sort.py 10000
! time python3 -u profiling_examples/merge_sort.py 10000
real 0m2.514s user 0m2.507s sys 0m0.006s
real 0m0.026s user 0m0.024s sys 0m0.002s
line_profiler¶
# installation:
pip install line_profiler
# decorate functions to be profiled with @profile
# how to run
kernprof -l -v python_script.py
line_profiler reports the following for each line:
Line #: line number
Hits: number of times the line was executed
Time: total time spent executing the line
Per Hit: average time per execution of the line
% Time: percentage of total time spent executing the line
Line Contents: the line's content
% Time is the most useful column when looking for CPU bottlenecks.
! kernprof -l -v profiling_examples/insertion_sort.py
[100, 99, 98, 97, 96, 95, 94, 93, 92, 91]
Wrote profile results to 'insertion_sort.py.lprof'
Timer unit: 1e-06 s
Total time: 0.00398147 s
File: profiling_examples/insertion_sort.py
Function: insertion_sort at line 13
Line # Hits Time Per Hit % Time Line Contents
==============================================================
13 @profile
14 def insertion_sort(arr):
15 # Traverse through 1 to len(arr)
16 99 29.1 0.3 0.7 for i in range(1, len(arr)):
17 98 27.9 0.3 0.7 key = arr[i]
18 # Move elements of arr[0..i-1], that are
19 # greater than key, to one position ahead
20 # of their current position
21 98 26.4 0.3 0.7 j = i - 1
22 4949 1324.4 0.3 33.3 while j >= 0 and key < arr[j]:
23 4851 1328.8 0.3 33.4 arr[j + 1] = arr[j]
24 4851 1217.1 0.3 30.6 j -= 1
25 98 27.7 0.3 0.7 arr[j + 1] = key
! kernprof -l -v profiling_examples/merge_sort.py
Wrote profile results to 'merge_sort.py.lprof'
Timer unit: 1e-06 s
Total time: 0.00140043 s
File: profiling_examples/merge_sort.py
Function: merge_sort at line 17
Line # Hits Time Per Hit % Time Line Contents
==============================================================
17 @profile
18 def merge_sort(arr):
19 197 94.5 0.5 6.7 if len(arr) > 1:
20
21 # Finding the mid of the array
22 98 31.2 0.3 2.2 mid = len(arr) // 2
23
24 # Dividing the array elements
25 98 36.5 0.4 2.6 L = arr[:mid]
26
27 # into 2 halves
28 98 30.1 0.3 2.2 R = arr[mid:]
29
30 # Sorting the first half
31 98 124.6 1.3 8.9 merge_sort(L)
32
33 # Sorting the second half
34 98 121.5 1.2 8.7 merge_sort(R)
35
36 98 25.9 0.3 1.8 i = j = k = 0
37
38 # Copy data to temp arrays L[] and R[]
39 450 134.2 0.3 9.6 while i < len(L) and j < len(R):
40 352 101.8 0.3 7.3 if L[i] <= R[j]:
41 arr[k] = L[i]
42 i += 1
43 else:
44 352 97.4 0.3 7.0 arr[k] = R[j]
45 352 92.7 0.3 6.6 j += 1
46 352 92.3 0.3 6.6 k += 1
47
48 # Checking if any element was left
49 410 113.4 0.3 8.1 while i < len(L):
50 312 86.0 0.3 6.1 arr[k] = L[i]
51 312 85.0 0.3 6.1 i += 1
52 312 78.3 0.3 5.6 k += 1
53
54 98 55.1 0.6 3.9 while j < len(R):
55 arr[k] = R[j]
56 j += 1
57 k += 1
perf module¶
Probing how well memory is being moved to the CPU can be quite hard; however, in Linux the perf tool can be used to get amazing amounts of insight into how the CPU is dealing with the program being run.
# installation:
sudo apt update
sudo apt install linux-tools-common
# how to run?
sudo perf stat -e cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,cache-references,cache-misses,branches,branch-misses,task-clock,faults,minor-faults,cs,migrations -r 3 python3 high_performance_python_examples/insertion_sort.py
# cycles: how many clock cycles the task took
# instructions: how many instructions the code issued to the CPU
# cache-references: data accesses attempted via the cache
# cache-misses: data references not found in the cache and fetched from RAM instead
# branches: points in the code where execution flow changes, eg: if..else
# branch-misses: the CPU tries to predict the flow and preload instructions; a branch miss is when that prediction was incorrect
# faults: page faults, eg: when newly allocated memory is first touched
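Fault counts can also be read from inside Python via the standard-library resource module (Unix only); minor faults climb as freshly allocated pages are first touched. This is a rough sketch, not a substitute for perf's counters:

```python
import resource

before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
buf = bytearray(50 * 1024 * 1024)          # allocate ~50 MiB
buf[::4096] = b'\x01' * len(buf[::4096])   # write one byte per 4 KiB page
after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
print(f"minor faults during allocate + touch: {after - before}")
```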
Strategies for profiling code¶
Disable TurboBoost in the BIOS.
Disable the operating system’s ability to override the SpeedStep (you will find this in your BIOS if you’re allowed to control it).
Only use mains power (never battery power).
Disable background tools like backups and Dropbox while running experiments.
Run the experiments many times to obtain a stable measurement.
Possibly drop to run level 1 (Unix) so that no other tasks are running.
Reboot and rerun the experiments to double-confirm the results.
Data structures and tradeoffs and quirks¶
- Dynamic lists and caching
- Resizing and memory allocation
- Difference between python lists and Numpy arrays
- Performance analysis of dictionaries
Lists¶
Python lists are dynamic in nature.
Pros:
- Users don't need to know the list size beforehand.
- Users don't have to handle resizing.
Cons:
- Resizing (in the background) makes appending to the list slower.
- Due to over-allocation during resizing, lists usually take more space than the elements they hold.
%timeit [i*i for i in range(10000)]
258 μs ± 7.41 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit l = []
for i in range(10000):
    l.append(i*i)
# takes more time because of list resizing
386 μs ± 885 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%memit [i*i for i in range(1000000)]
peak memory: 85.82 MiB, increment: 0.00 MiB
%%memit l = []
for i in range(1000000):
    l.append(i*i)
# peak memory is higher because of list over-allocation during resizing
peak memory: 114.71 MiB, increment: 29.62 MiB
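The extra space comes from over-allocation: sys.getsizeof shows the list's size growing in occasional jumps rather than on every append, so most appends reuse spare capacity that was allocated earlier.

```python
import sys

l = []
sizes = []
for i in range(20):
    sizes.append(sys.getsizeof(l))
    l.append(i)

# the reported size jumps only occasionally; between jumps,
# appends fill capacity that was over-allocated on the last resize
print(sizes)
```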
Key Factors¶
Memory allocations:
Memory allocations are not cheap. Whenever you assign a value to a variable, the interpreter needs to talk to the operating system to allocate new space and then iterate over that space to access it. Try to allocate space fewer times and reuse that space throughout the program.
Memory fragmentation:
Python lists don't natively support vectorization, because lists don't store the actual data, only pointers to it.
Cache misses:
Since the data is fragmented across different memory locations, each piece must be moved over individually instead of as one block, missing out on the benefits of caching.
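That a list holds pointers rather than the data itself can be seen with a shallow copy: the copy is a new list object, but its slots reference the very same element objects.

```python
a = [10**6, 2 * 10**6]   # large ints, so CPython's small-int caching is not involved
b = a[:]                 # shallow copy: new list, same element pointers

assert a is not b        # distinct list objects...
assert a[0] is b[0]      # ...whose slots point at the same int object
```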
Lists vs numpy arrays¶
from array import array
import numpy as np
def norm_square_list(vector):
    norm = 0
    for v in vector:
        norm += v*v
    return norm

def norm_square_list_comprehension(vector):
    return sum([v*v for v in vector])

def norm_square_numpy(vector):
    return np.sum(vector * vector)

def norm_square_numpy_dot(vector):
    return np.dot(vector, vector)
%%timeit
vector = list(range(1000000))
norm_square_list(vector)
56.2 ms ± 135 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
vector = list(range(1000000))
norm_square_list_comprehension(vector)
61.1 ms ± 338 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
vector_numpy = np.arange(1000000)
norm_square_numpy(vector_numpy)
# numpy has specialized code in C that takes advantage of any vectorization the CPU supports.
# Also, numpy arrays are laid out sequentially in memory as low-level data types
907 μs ± 37.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
vector_numpy = np.arange(1000000)
norm_square_numpy_dot(vector_numpy)
# uses specialized code that doesn't need to store the intermediate output of (vector * vector)
734 μs ± 5.57 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
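The standard-library array module (imported in the snippet above but not used there) also stores elements contiguously as low-level C types, and numpy can wrap such a buffer without copying. A small sketch:

```python
from array import array
import numpy as np

vec = array('d', range(1000))                   # contiguous C doubles
vec_np = np.frombuffer(vec, dtype=np.float64)   # zero-copy view of the same buffer

assert vec_np[10] == 10.0
# numpy operations now work on the array-module buffer directly
assert np.dot(vec_np, vec_np) == sum(v * v for v in range(1000))
```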
Dictionaries and namespaces¶
Python’s namespace management heavily uses dictionaries to do its lookups. Whenever a variable, function, or module is invoked in Python, there is a hierarchy that determines where it looks for these objects. First, Python looks inside the locals() array, which has entries for all local variables. Python works hard to make local variable lookups fast, and this is the only part of the chain that doesn’t require a dictionary lookup. If the name doesn’t exist there, then the globals() dictionary is searched. Finally, if the object isn’t found there, the builtins module is searched.
import math
from math import sin
import dis
def test1(x):
    """
    resolve `math` from globals()
    resolve `sin` from the math module
    local lookup for x
    """
    return math.sin(x)

def test2(x):
    """
    resolve sin from globals()
    resolve x from locals()
    """
    return sin(x)

def test3(x, sin=math.sin):
    """
    resolve sin from locals()
    resolve x from locals()
    """
    return sin(x)
%timeit test1(123456)
#2 dictionary lookup, 1 local lookup
58.5 ns ± 0.157 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
%timeit test2(123456)
#1 dictionary lookup, 1 local lookup
65 ns ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
%timeit test3(123456)
# 2 local lookups
55.9 ns ± 0.107 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
dis.dis(test1)
5 0 RESUME 0
12 2 LOAD_GLOBAL 0 (math)
12 LOAD_ATTR 3 (NULL|self + sin)
32 LOAD_FAST 0 (x)
34 CALL 1
42 RETURN_VALUE
dis.dis(test2)
15 0 RESUME 0
20 2 LOAD_GLOBAL 1 (NULL + sin)
12 LOAD_FAST 0 (x)
14 CALL 1
22 RETURN_VALUE
dis.dis(test3)
23 0 RESUME 0
28 2 PUSH_NULL
4 LOAD_FAST 1 (sin)
6 LOAD_FAST 0 (x)
8 CALL 1
16 RETURN_VALUE
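The lookup difference matters most inside hot loops. A common trick, mirroring test3 above (the function names here are illustrative), is to hoist the global/attribute lookup into a local default argument so each iteration pays only for a local lookup:

```python
import math

def loop_global(n):
    total = 0.0
    for i in range(n):
        total += math.sin(i)   # global lookup of `math` + attribute lookup of `sin` every iteration
    return total

def loop_local(n, sin=math.sin):
    total = 0.0
    for i in range(n):
        total += sin(i)        # single local-variable lookup every iteration
    return total

# both compute the same result; loop_local typically benchmarks slightly faster
assert abs(loop_global(1000) - loop_local(1000)) < 1e-9
```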
Further exploration¶
- How can we write modules for Python in C/Fortran/Rust?
- Multiprocessing
- How can we write code that runs across a cluster and uses more machines?
- How can we make programs use less RAM?