Practical session 3


Remember that there is no code faster than no code. - Taligent's Guide to Designing Programs


Development Tools for Scientific Computing 2024/2025

Pasquale Claudio Africa, Dario Coscia
20 Feb 2025

Part 1: How to make kNN faster?

In the previous practical session (number 2), we implemented a basic k-Nearest Neighbors (kNN) algorithm. In today's session, we will focus on optimizing the performance of the kNN implementation—an essential step when dealing with large datasets.
We will look at one specific library called Numba, but other options are available, such as FAISS, Cython, or PyPy.

The objective for today is to implement a fast kNN and utilize a profiler to identify bottlenecks in the code that can be optimized further. By the end of the session, you will have a more efficient version of the kNN algorithm, suitable for handling large-scale data.

Notes on Numba

Numba is a compiler for Python array and numerical functions that gives
you the power to speed up your applications with high performance
functions written directly in Python.

Numba generates optimized machine code from pure Python code using
the LLVM compiler infrastructure. With a few simple
annotations, array-oriented and math-heavy Python code can be
just-in-time optimized to performance similar to that of C, C++ and Fortran, without
having to switch languages or Python interpreters.

Numba's main features are:

  • on-the-fly code generation via jit (at import time or runtime, at the
    user's preference)
  • integration with the Python scientific software stack (thanks to Numpy)

Here is what a Numba-optimized function taking a Numpy array as argument
might look like:

import numba
from numba.pycc import CC

# JIT compilation: the function is compiled on its first call.
@numba.jit(nopython=True)
def sum2d_jit(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result

# AOT compilation: the function is compiled into an extension module
# named 'module'. The signature must be given explicitly (f8 = float64;
# the input is a 2D array, hence f8[:,:]).
cc = CC('module')

@cc.export('sum2d_aot', 'f8(f8[:,:])')
def sum2d_aot(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result

if __name__ == "__main__":
    cc.compile()  # <- we need to compile the code!
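
After running this script once (e.g. python sum2d.py, assuming that file name), cc.compile() produces a compiled extension named module that can then be imported like any other Python module:

# Usage sketch (the file name above is hypothetical): the AOT-compiled
# function imports like regular Python code and does not require Numba
# at runtime.
import numpy as np
from module import sum2d_aot

arr = np.random.rand(100, 100)
print(sum2d_aot(arr))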

JIT or AOT?

There are two ways to compile the code:

  1. Just-In-Time (JIT) compilation: compilation of a function at execution time.
  2. Ahead-Of-Time (AOT) compilation: compilation of a function in a separate step before running the program, producing an on-disk binary object which can be distributed independently. This is the traditional kind of compilation known
    from languages such as C, C++ or Fortran.

While Numba's main use case is Just-in-Time compilation (easier), it also
provides a facility for Ahead-of-Time compilation (AOT).
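
To see the difference in practice, note that a JIT-compiled function pays the compilation cost on its first call; a minimal, self-contained sketch:

import time
import numba
import numpy as np

@numba.jit(nopython=True)
def sum2d_jit(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result

arr = np.random.rand(1000, 1000)

t0 = time.perf_counter()
sum2d_jit(arr)  # first call: triggers compilation
t1 = time.perf_counter()
sum2d_jit(arr)  # second call: reuses the cached machine code
t2 = time.perf_counter()

print(f"first call (incl. compilation): {t1 - t0:.4f} s")
print(f"second call:                    {t2 - t1:.4f} s")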

More on AOT

AOT compilation produces a compiled extension module which does not depend
on Numba: you can distribute the module on machines which do not have
Numba installed (but Numpy is required). However, there are some limitations:

  • you have to specify function signatures explicitly;
  • each exported function can have only one signature;
  • AOT compilation produces generic code for your CPU's architectural family
    (for example "x86-64"), while JIT compilation produces code optimized for
    your particular CPU model.

Today we will use AOT compilation.

Part 2: Profiling the code

We will use line_profiler to profile the code.
Go into devtools_scicomp_project_2025 and, starting from the knn_classifier branch, create a new one titled optimized_knn_classifier. This is the branch we will work on today.

  1. Install Numba and Line Profiler:
  • Activate the conda environment devtools_scicomp and install numba and line_profiler. Add them to the requirements.txt.
  2. Modify and profile the old code:
  • Add the @profile decorator to the distance and majority_vote functions in src/pyclassify/utils.py and to all
    methods (except for __init__) inside src/pyclassify/classify.py (see the sketch after this list). Note that profiling does not happen if the LINE_PROFILE flag is not set to 1.
  • Create a directory called logs inside the root directory; this is where we will store the profiler files. Profile the code by running:
    python -m kernprof -l -o logs/profile.dat scripts/run.py --config=experiments/config
    
  • Inspect the results (e.g. with python -m line_profiler logs/profile.dat) and save them into logs/slow_knn_classifier.txt. Where is the bottleneck?
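
For step 2, here is a minimal sketch of how the decorator could look in src/pyclassify/utils.py; the function body is only a placeholder for your implementation from the previous session, and it assumes line_profiler >= 4.1, whose profile decorator stays inactive unless the LINE_PROFILE environment variable is set to 1:

from line_profiler import profile

@profile
def distance(point1, point2):
    # Placeholder body: your plain-Python distance from practical session 2.
    return sum((x - y) ** 2 for x, y in zip(point1, point2))

# Enable profiling by setting the environment variable, e.g.:
#   LINE_PROFILE=1 python -m kernprof -l -o logs/profile.dat scripts/run.py --config=experiments/config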

Part 3: Profiling the Numpy code

  • Modify the code to work also with numpy arrays.
    • The kNN class __init__ must take an extra argument called backend; this argument must be either plain (the default) or numpy, otherwise a ValueError is raised. Change the tests accordingly.
    • In src/pyclassify/utils.py create a function called distance_numpy which is the numpy implementation of the distance function.
    • Create an attribute distance in kNN which stores the function used to compute the distance between points. This attribute must be set dynamically depending on the backend (plain => distance, numpy => distance_numpy). Change __call__ accordingly to handle the data types. NOTE: you don't need to change the whole code, you just need to add an if statement in __call__, while leaving the other methods unchanged (see the sketch after this list).
  • Change the scripts/run.py file in order to pass backend from the config. Create in experiments a config_numpy.yaml file, where backend is set to numpy.

  • Profile and inspect the results, and save them into logs/numpy_knn_classifier.txt. Do we get similar performance? How much speedup do we obtain by using the distance_numpy function?
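
A minimal sketch of the two pieces above; the exact distance formula should mirror your plain implementation, and the __init__ fragment is only indicative:

import numpy as np

def distance_numpy(point1, point2):
    # Vectorized (squared) Euclidean distance: the loop over coordinates
    # runs inside compiled NumPy code instead of the Python interpreter.
    diff = point1 - point2
    return np.dot(diff, diff)

# Inside kNN.__init__, the backend selects the distance function:
#     if backend == "plain":
#         self.distance = distance
#     elif backend == "numpy":
#         self.distance = distance_numpy
#     else:
#         raise ValueError(f"Unknown backend: {backend}")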

Part 4: Optimize code for Big Datasets

We will now use a much larger dataset called Spambase, in order to classify emails as spam or not. The dataset comes from the UCI Machine Learning Repository.

  • In the shell directory create a file called submit_spam.sh containing bash code to download the Spambase dataset. NOTE: the script must only download the dataset and place it in the data/ folder; any additional files downloaded must be removed by the bash script itself.
  • Create in experiments a config_spambase.yaml file where the dataset entry points to the Spambase dataset and the backend is set to numpy. Also modify the read_file function in src/pyclassify/utils.py to handle binary labels (0, 1). Finally, modify scripts/run.py to shuffle the data, if not already done in the previous lecture.
  • Profile and inspect the results, save them into logs/spam_numpy_knn_classifier.txt. Which part of the code needs optimization?
  • In src/pyclassify/utils.py write a highly optimized distance_numba function using AOT compilation (see the first sketch after this list). HINT: I suggest creating a separate Python file (to be deleted later) containing only the functions to optimize, and running some speedup tests on both small and large arrays. Note that, since we are doing AOT compilation, the code needs to be compiled before use. To optimize further, have a look at the available Numba compilation flags.
  • Modify the kNN.__init__ method to allow numba as backend.
  • Once the code is optimized, profile and inspect the results, save them into logs/spam_numba_knn_classifier.txt. How much was the speedup? Why?
  • Perform a study on how fast your numba implementation of distance is compared to plain numpy for varying dimensions of the input arrays (see the second sketch after this list). Plot the figure and save it (scalability.png) inside the logs/ directory.
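
As a starting point for distance_numba, here is a minimal AOT sketch; the module name distance_module is illustrative, and the distance formula should match your other backends:

from numba.pycc import CC

cc = CC('distance_module')

@cc.export('distance_numba', 'f8(f8[:], f8[:])')
def distance_numba(point1, point2):
    # Explicit loop: Numba compiles it to machine code, avoiding both
    # Python interpretation and NumPy temporary arrays.
    result = 0.0
    for i in range(point1.shape[0]):
        diff = point1[i] - point2[i]
        result += diff * diff
    return result

if __name__ == "__main__":
    cc.compile()  # run this file once to build the extension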
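
And a possible timing sketch for the scalability study, assuming distance_numpy from src/pyclassify/utils.py and the compiled distance_module from the sketch above are importable:

import time
import numpy as np
import matplotlib.pyplot as plt

from pyclassify.utils import distance_numpy
from distance_module import distance_numba

dims = [10, 100, 1_000, 10_000, 100_000]
reps = 1_000
t_numpy, t_numba = [], []

for d in dims:
    x1, x2 = np.random.rand(d), np.random.rand(d)

    t0 = time.perf_counter()
    for _ in range(reps):
        distance_numpy(x1, x2)
    t_numpy.append((time.perf_counter() - t0) / reps)

    t0 = time.perf_counter()
    for _ in range(reps):
        distance_numba(x1, x2)
    t_numba.append((time.perf_counter() - t0) / reps)

plt.loglog(dims, t_numpy, marker="o", label="numpy")
plt.loglog(dims, t_numba, marker="s", label="numba")
plt.xlabel("input dimension")
plt.ylabel("time per call [s]")
plt.legend()
plt.savefig("logs/scalability.png")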

Solutions

The repository with the solutions is available here: GitHub repo