Practical session 3


Remember that there is no code faster than no code. - Taligent's Guide to Designing Programs


Development Tools for Scientific Computing 2024/2025

Pasquale Claudio Africa, Dario Coscia
20 Feb 2025

Part 1: How to make kNN faster?

In the previous practical session (number 2), we implemented a basic k-Nearest Neighbors (kNN) algorithm. In today's session, we will focus on optimizing the performance of the kNN implementation—an essential step when dealing with large datasets.
We will look at one specific library called Numba, but other options are available, such as FAISS, Cython, or PyPy.

The objective for today is to implement a fast kNN and utilize a profiler to identify bottlenecks in the code that can be optimized further. By the end of the session, you will have a more efficient version of the kNN algorithm, suitable for handling large-scale data.

Notes on Numba

Numba is a compiler for Python array and numerical functions that gives
you the power to speed up your applications with high performance
functions written directly in Python.

Numba generates optimized machine code from pure Python code using
the LLVM compiler infrastructure. With a few simple
annotations, array-oriented and math-heavy Python code can be
just-in-time optimized to performance similar to that of C, C++ and Fortran, without
having to switch languages or Python interpreters.

Numba's main features are:

  • on-the-fly code generation via jit (at import time or runtime, at the
    user's preference)
  • integration with the Python scientific software stack (thanks to Numpy)

Here is what a Numba-optimized function taking a Numpy array as argument
might look like:

import numba
from numba.pycc import CC

# JIT compilation: the function is compiled on its first call.
@numba.jit(nopython=True)
def sum2d_jit(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result

# AOT compilation: the function is compiled into an extension module
# named 'module'. The signature must be given explicitly (f8 = float64;
# the input is a 2D array, hence f8[:,:]).
cc = CC('module')

@cc.export('sum2d_aot', 'f8(f8[:,:])')
def sum2d_aot(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result

if __name__ == "__main__":
    cc.compile()  # <- we need to compile the code!
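
After running this script once (e.g. python sum2d.py, assuming that file name), cc.compile() produces a compiled extension named module that can then be imported like any other Python module:

# Usage sketch (the file name above is hypothetical): the AOT-compiled
# function imports like regular Python code and does not require Numba
# at runtime.
import numpy as np
from module import sum2d_aot

arr = np.random.rand(100, 100)
print(sum2d_aot(arr))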

JIT or AOT?

There are two ways to compile the code:

  1. Just-In-Time (JIT) compilation: compilation of a function at execution time.
  2. Ahead-Of-Time (AOT) compilation: compilation of a function in a separate step before running the program, producing an on-disk binary object which can be distributed independently. This is the traditional kind of compilation known
    from languages such as C, C++ or Fortran.

While Numba's main use case is Just-in-Time compilation (easier), it also
provides a facility for Ahead-of-Time compilation (AOT).
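
To see the difference in practice, note that a JIT-compiled function pays the compilation cost on its first call; a minimal, self-contained sketch:

import time
import numba
import numpy as np

@numba.jit(nopython=True)
def sum2d_jit(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result

arr = np.random.rand(1000, 1000)

t0 = time.perf_counter()
sum2d_jit(arr)  # first call: triggers compilation
t1 = time.perf_counter()
sum2d_jit(arr)  # second call: reuses the cached machine code
t2 = time.perf_counter()

print(f"first call (incl. compilation): {t1 - t0:.4f} s")
print(f"second call:                    {t2 - t1:.4f} s")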

More on AOT

AOT compilation produces a compiled extension module which does not depend
on Numba: you can distribute the module on machines which do not have
Numba installed (but Numpy is required). However, there are some limitations:

  • you have to specify function signatures explicitly;
  • each exported function can have only one signature;
  • AOT compilation produces generic code for your CPU's architectural family
    (for example "x86-64"), while JIT compilation produces code optimized for
    your particular CPU model.

Today we will use AOT compilation.

Part 2: Profiling the code

We will use line_profiler to profile the code.
Go into devtools_scicomp_project_2025 and, starting from the knn_classifier branch, create a new one titled optimized_knn_classifier. This is the branch we will work on today.

  1. Install Numba and Line Profiler:
  • Activate the conda environment devtools_scicomp and install numba and line_profiler. Add them to the requirements.txt.
  2. Modify and profile the old code:
  • Add the @profile decorator to the distance and majority_vote functions in src/pyclassify/utils.py and to all
    methods (except for __init__) inside src/pyclassify/classify.py (see the sketch after this list). Note that profiling does not happen if the LINE_PROFILE flag is not set to 1.
  • Create a directory called logs inside the root directory; this is where we will store the profiler files. Profile the code by running:
    python -m kernprof -l -o logs/profile.dat scripts/run.py --config=experiments/config
    
  • Inspect the results (e.g. with python -m line_profiler logs/profile.dat) and save them into logs/slow_knn_classifier.txt. Where is the bottleneck?
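
For step 2, here is a minimal sketch of how the decorator could look in src/pyclassify/utils.py; the function body is only a placeholder for your implementation from the previous session, and it assumes line_profiler >= 4.1, whose profile decorator stays inactive unless the LINE_PROFILE environment variable is set to 1:

from line_profiler import profile

@profile
def distance(point1, point2):
    # Placeholder body: your plain-Python distance from practical session 2.
    return sum((x - y) ** 2 for x, y in zip(point1, point2))

# Enable profiling by setting the environment variable, e.g.:
#   LINE_PROFILE=1 python -m kernprof -l -o logs/profile.dat scripts/run.py --config=experiments/config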

Part 3: Profiling the Numpy code

  • Modify the code to work also with numpy arrays.
    • The kNN class __init__ must take an extra argument called backend; this argument must be either plain (the default) or numpy, otherwise a ValueError is raised. Change the tests accordingly.
    • In src/pyclassify/utils.py create a function called distance_numpy which is the numpy implementation of the distance function.
    • Create an attribute distance in kNN which stores the function used to compute the distance between points. This attribute must be set dynamically depending on the backend (plain => distance, numpy => distance_numpy). Change __call__ accordingly to handle the data types. NOTE: you don't need to change the whole code, you just need to add an if statement in __call__, while leaving the other methods unchanged (see the sketch after this list).
  • Change the scripts/run.py file in order to pass backend from the config. Create in experiments a config_numpy.yaml file, where backend is set to numpy.

  • Profile and inspect the results, and save them into logs/numpy_knn_classifier.txt. Do we get similar performance? How much speedup do we obtain by using the distance_numpy function?
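
A minimal sketch of the two pieces above; the exact distance formula should mirror your plain implementation, and the __init__ fragment is only indicative:

import numpy as np

def distance_numpy(point1, point2):
    # Vectorized (squared) Euclidean distance: the loop over coordinates
    # runs inside compiled NumPy code instead of the Python interpreter.
    diff = point1 - point2
    return np.dot(diff, diff)

# Inside kNN.__init__, the backend selects the distance function:
#     if backend == "plain":
#         self.distance = distance
#     elif backend == "numpy":
#         self.distance = distance_numpy
#     else:
#         raise ValueError(f"Unknown backend: {backend}")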

Part 4: Optimize code for Big Datasets

We will now use a much larger dataset called Spambase, in order to classify emails as spam or not. The dataset comes from the UCI Machine Learning Repository.

  • In the shell directory create a file called submit_spam.sh containing bash code to download the Spambase dataset. NOTE: the script must only download the dataset and place it in the data/ folder; any additional files downloaded must be removed by the bash script itself.
  • Create in experiments a config_spambase.yaml file where the dataset entry points to the Spambase dataset and the backend is set to numpy. Also modify the read_file function in src/pyclassify/utils.py to handle binary labels (0, 1). Finally, modify scripts/run.py to shuffle the data, if not already done in the previous lecture.
  • Profile and inspect the results, save them into logs/spam_numpy_knn_classifier.txt. Which part of the code needs optimization?
  • In src/pyclassify/utils.py write a highly optimized distance_numba function using AOT compilation (see the first sketch after this list). HINT: I suggest creating a separate Python file (to be deleted later) containing only the functions to optimize, and running some speedup tests on both small and large arrays. Note that, since we are doing AOT compilation, the code needs to be compiled before use. To optimize further, have a look at the available Numba compilation flags.
  • Modify the kNN.__init__ method to allow numba as backend.
  • Once the code is optimized, profile and inspect the results, save them into logs/spam_numba_knn_classifier.txt. How much was the speedup? Why?
  • Perform a study on how fast your numba implementation of distance is compared to plain numpy for varying dimensions of the input arrays (see the second sketch after this list). Plot the figure and save it (scalability.png) inside the logs/ directory.
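
As a starting point for distance_numba, here is a minimal AOT sketch; the module name distance_module is illustrative, and the distance formula should match your other backends:

from numba.pycc import CC

cc = CC('distance_module')

@cc.export('distance_numba', 'f8(f8[:], f8[:])')
def distance_numba(point1, point2):
    # Explicit loop: Numba compiles it to machine code, avoiding both
    # Python interpretation and NumPy temporary arrays.
    result = 0.0
    for i in range(point1.shape[0]):
        diff = point1[i] - point2[i]
        result += diff * diff
    return result

if __name__ == "__main__":
    cc.compile()  # run this file once to build the extension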
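
And a possible timing sketch for the scalability study, assuming distance_numpy from src/pyclassify/utils.py and the compiled distance_module from the sketch above are importable:

import time
import numpy as np
import matplotlib.pyplot as plt

from pyclassify.utils import distance_numpy
from distance_module import distance_numba

dims = [10, 100, 1_000, 10_000, 100_000]
reps = 1_000
t_numpy, t_numba = [], []

for d in dims:
    x1, x2 = np.random.rand(d), np.random.rand(d)

    t0 = time.perf_counter()
    for _ in range(reps):
        distance_numpy(x1, x2)
    t_numpy.append((time.perf_counter() - t0) / reps)

    t0 = time.perf_counter()
    for _ in range(reps):
        distance_numba(x1, x2)
    t_numba.append((time.perf_counter() - t0) / reps)

plt.loglog(dims, t_numpy, marker="o", label="numpy")
plt.loglog(dims, t_numba, marker="s", label="numba")
plt.xlabel("input dimension")
plt.ylabel("time per call [s]")
plt.legend()
plt.savefig("logs/scalability.png")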

Solutions

The repository with the solutions is available here: GitHub repo