Original Photo by nappy from Pexels

Welcome to my first blog post. I recently started developing my own package, and my priority is maximizing speed. I found that many things in Sklearn run quite slowly. This post is not a tutorial on Sklearn, NumPy, or Numba; it is a speed comparison across these libraries.
TL;DR: just skip to the conclusion.

Here are my machine specs:

  • CPU : Ryzen 3500U
  • RAM : 6 GB
  • OS : Windows 10

My software:

  • Python : 3.8.5
  • numpy : 1.18.5
  • scikit-learn : 0.23.2
  • numba : 0.50.1

For evaluation, I use the IPython magic functions %timeit and %memit to measure the runtime and the memory used by the process. %timeit works out of the box, but %memit requires installing memory_profiler and loading it first.

# Run this inside a JupyterLab/Jupyter Notebook cell
!pip install memory_profiler
%load_ext memory_profiler
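
Outside a notebook, a similar measurement can be approximated with the standard library alone. The sketch below is only an illustration and is not what produced the numbers in this post. `bench` and `sum_of_squares` are hypothetical names. The helper calls the function once before timing, which matters for Numba because a jitted function is compiled on its first call. It reports the best run rather than the mean ± standard deviation that %timeit prints, and `tracemalloc` only tracks allocations it can see from Python, so its peak is not directly comparable to the process-level figure reported by %memit.

```python
import timeit
import tracemalloc

def bench(func, *args, repeat=5, number=10):
    """Return (best seconds per call, peak traced bytes) for func(*args)."""
    func(*args)  # warm-up call, so one-time costs such as JIT compilation are excluded
    times = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    best = min(times) / number

    # Measure the peak of traced allocations during a single call
    tracemalloc.start()
    func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return best, peak

def sum_of_squares(n):
    # Toy workload standing in for a metric function
    return sum([i * i for i in range(n)])

best, peak = bench(sum_of_squares, 10_000)
print(f"{best * 1e3:.3f} ms per call, peak {peak / 1024:.1f} KiB")
```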

1. Binary Classification Evaluation

The first section compares the packages on binary classification metrics. Here’s the dummy data, generated from uniform distributions, with a total length of 1,000,000:

import numpy as np
from numpy.random import uniform
size = 500000
n_positive = int(size*0.8) 
n_negative = int(size*1.2)
actual = np.array([0]*n_negative + [1]*n_positive)
pred = np.append(uniform(0.2,0.6,n_negative), uniform(0.4,0.8,n_positive))

1.1 Logloss calculation

from sklearn.metrics import log_loss as logloss_sklearn
def logloss_numpy(actual,pred):
     return -1*np.mean(actual*np.log(pred) + (1-actual)*np.log(1-pred))
from numba import jit
@jit(nopython=True)  # compile the loop to machine code
def logloss_numba(actual,pred):
     logloss = 0
     for i in range(actual.shape[0]):  
         logloss += actual[i]*np.log(pred[i]) + (1-actual[i])*np.log(1-pred[i])
     return -1*logloss/actual.shape[0]
%timeit logloss_sklearn(actual,pred)
%timeit logloss_numpy(actual,pred)
%timeit logloss_numba(actual,pred)
%memit logloss_sklearn(actual,pred)
%memit logloss_numpy(actual,pred)
%memit logloss_numba(actual,pred)
Method  | Time              | Peak memory | Increment
Sklearn | 227 ms ± 7.52 ms  | 260.78 MiB  | 41.03 MiB
Numpy   | 36.8 ms ± 1.25 ms | 244.09 MiB  | 23.61 MiB
Numba   | 16.9 ms ± 1.32 ms | 220.52 MiB  | 0.01 MiB
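
One piece of functionality the hand-written versions leave out is clipping. Sklearn's log_loss has an eps parameter (1e-15 by default in this version) that keeps predictions away from exactly 0 and 1, while the plain NumPy formula returns nan as soon as a prediction hits one of those values. The dummy data here stays inside (0.2, 0.8), so it does not matter for the benchmark. Below is a minimal sketch of the same idea in NumPy. `logloss_numpy_clipped` is a hypothetical name, and this variant was not part of the timings.

```python
import numpy as np

def logloss_numpy_clipped(actual, pred, eps=1e-15):
    # Keep predictions strictly inside (0, 1) so np.log never sees 0
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(actual * np.log(pred) + (1 - actual) * np.log(1 - pred))

actual_small = np.array([1, 0, 1, 0])
pred_small = np.array([1.0, 0.0, 0.8, 0.2])  # contains an exact 0 and an exact 1
loss = logloss_numpy_clipped(actual_small, pred_small)
print(loss)  # finite, roughly 0.1116
```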

1.2 Confusion Matrix

The following code is taken from a Stack Overflow answer.

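from sklearn.metrics import confusion_matrix as CM_sklearn
# Assumption: pred_class holds hard 0/1 labels obtained by thresholding pred at 0.5
pred_class = (pred >= 0.5).astype(np.int64)
@jit(nopython=True)  # compile the loop to machine code
def CM_numba(true, pred):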
    #by cgnorthcutt
    K = len(np.unique(true))
    result = np.zeros((K, K))
    for i in range(len(true)):
        result[true[i]][pred[i]] += 1
    return result
def CM_numpy(true,pred):
    #by rytido
    classes = len(np.unique(true))
    return np.bincount(true * classes + pred).reshape((classes, classes))
%timeit CM_sklearn(actual,pred_class)
%timeit CM_numpy(actual,pred_class)
%timeit CM_numba(actual,pred_class)
%memit CM_sklearn(actual,pred_class)
%memit CM_numpy(actual,pred_class)
%memit CM_numba(actual,pred_class)
Method  | Time              | Peak memory | Increment
Sklearn | 888 ms ± 27.2 ms  | 248.26 MiB  | 18.79 MiB
Numpy   | 27.5 ms ± 1.61 ms | 240.56 MiB  | 12.42 MiB
Numba   | 34.3 ms ± 1.58 ms | 232.95 MiB  | 3.83 MiB
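
The NumPy version works by encoding every (true, pred) pair as one integer and counting how often each integer occurs. A small worked example makes the trick easier to see. The sketch below is my own variation, not the Stack Overflow code: `CM_numpy_minlength` is a hypothetical name, the number of classes is passed in explicitly, and minlength is added so the reshape still works when the last cell of the matrix is empty.

```python
import numpy as np

def CM_numpy_minlength(true, pred, n_classes):
    # Encode each (true, pred) pair as a single integer in [0, n_classes**2)
    flat_index = true * n_classes + pred
    # minlength pads the counts so the reshape works even if some cells are empty
    counts = np.bincount(flat_index, minlength=n_classes * n_classes)
    return counts.reshape((n_classes, n_classes))

true_small = np.array([0, 0, 1, 0])
pred_small = np.array([0, 1, 0, 0])
# flat_index = [0, 1, 2, 0]; the pair (1, 1) never occurs
cm = CM_numpy_minlength(true_small, pred_small, 2)
print(cm)
# [[2 1]
#  [1 0]]
```

Rows are the true classes and columns are the predicted classes, the same layout Sklearn uses.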

1.3 ROC-AUC score

The code is taken from https://www.kaggle.com/c/microsoft-malware-prediction/discussion/76013

from sklearn.metrics import roc_auc_score as rocauc_sklearn
def rocauc_numpy(y_true, y_prob):
    # Same algorithm as rocauc_numba, written with vectorized NumPy operations
    # to check whether NumPy's vectorization is faster than Numba
    y_true = np.asarray(y_true, dtype=np.float64)
    y_true = y_true[np.argsort(y_prob)]
    n = len(y_true)
    nfalse = (1-y_true).cumsum()
    auc = (y_true*nfalse).sum()
    last_nfalse = nfalse[-1]
    auc /= (last_nfalse* (n - last_nfalse))
    return auc
from numba import jit
@jit(nopython=True)  # compile the loop to machine code
def rocauc_numba(y_true, y_prob):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_prob)]
    nfalse = 0
    auc = 0
    n = len(y_true)
    for i in range(n):
        y_i = y_true[i]
        nfalse += (1 - y_i)
        auc += y_i * nfalse
    auc /= (nfalse * (n - nfalse))
    return auc
%timeit rocauc_sklearn(actual,pred)
%timeit rocauc_numpy(actual, pred)
%timeit rocauc_numba(actual, pred)
%memit rocauc_sklearn(actual,pred)
%memit rocauc_numpy(actual, pred)
%memit rocauc_numba(actual, pred)
Method  | Time             | Peak memory | Increment
Sklearn | 752 ms ± 33.4 ms | 298.05 MiB  | 46.12 MiB
Numpy   | 282 ms ± 25.4 ms | 258.04 MiB  | 15.32 MiB
Numba   | 358 ms ± 42.7 ms | 250.51 MiB  | 7.72 MiB
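
Both hand-written versions rely on the same idea: after sorting by predicted score, every positive sample "wins" against all negatives that appear before it, and the AUC is the fraction of (positive, negative) pairs that are ranked correctly. The pure-Python sketch below checks this on four samples against a brute-force pair count. The function names are hypothetical, and the example has no tied scores. With ties, this cumulative approach depends on the sort order, whereas Sklearn gives a tied pair half credit.

```python
def rocauc_cumulative(y_true, y_prob):
    # Visit samples from lowest to highest predicted score
    order = sorted(range(len(y_prob)), key=lambda i: y_prob[i])
    nfalse = 0
    auc = 0
    for i in order:
        nfalse += 1 - y_true[i]    # negatives seen so far
        auc += y_true[i] * nfalse  # a positive outranks all of those negatives
    return auc / (nfalse * (len(y_true) - nfalse))

def rocauc_pairs(y_true, y_prob):
    # Brute force: fraction of (positive, negative) pairs ranked correctly
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(p > n for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true_small = [0, 0, 1, 1]
y_prob_small = [0.1, 0.4, 0.35, 0.8]
print(rocauc_cumulative(y_true_small, y_prob_small))  # 0.75
print(rocauc_pairs(y_true_small, y_prob_small))       # 0.75
```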

2. Regression Evaluation

The second section compares the libraries on regression metrics.

2.1 Root Mean Squared Error

from sklearn.metrics import mean_squared_error as RMSE_sklearn
@jit(nopython=True)  # compile the loop to machine code
def RMSE_numba(actual,pred):
    Squared_Error = 0
    for i in range(actual.shape[0]):  
        Squared_Error += (actual[i]-pred[i])**2
    return (Squared_Error/actual.shape[0])**0.5     
def RMSE_numpy(actual,pred):
    return np.sqrt(np.mean((actual-pred)**2))
%timeit RMSE_sklearn(actual,pred,squared=False)
%timeit RMSE_numpy(actual,pred)
%timeit RMSE_numba(actual,pred)
%memit RMSE_sklearn(actual,pred,squared=False)
%memit RMSE_numpy(actual,pred)
%memit RMSE_numba(actual,pred)
Method  | Time              | Peak memory | Increment
Sklearn | 10.8 ms ± 533 µs  | 488.31 MiB  | 15.27 MiB
Numpy   | 8.9 ms ± 491 µs   | 488.31 MiB  | 15.27 MiB
Numba   | 1.06 ms ± 35.7 µs | 250.05 MiB  | 0.00 MiB

2.2 R-Squared

from sklearn.metrics import r2_score as R2_sklearn
def R2_numpy(actual,pred):
    avg = actual.mean()
    SSE = ((pred-actual)**2).sum()
    SST = ((actual-avg)**2).sum()
    return 1-SSE/SST
@jit(nopython=True)  # compile the loops to machine code
def R2_numba(actual,pred):
    avg = 0
    len_array = actual.shape[0]
    for i in range(len_array):
        avg += actual[i]
    avg /= len_array
    SSE = 0
    SST = 0
    for i in range(len_array) :
        SSE += (pred[i]-actual[i])**2
        SST += (actual[i]-avg)**2
    return 1-SSE/SST
%timeit R2_sklearn(actual,pred)
%timeit R2_numpy(actual,pred)
%timeit R2_numba(actual,pred)
%memit R2_sklearn(actual,pred)
%memit R2_numpy(actual,pred)
%memit R2_numba(actual,pred)

Method  | Time              | Peak memory | Increment
Sklearn | 27.7 ms ± 1.59 ms | 480.68 MiB  | 7.63 MiB
Numpy   | 17.7 ms ± 803 µs  | 488.31 MiB  | 15.27 MiB
Numba   | 1.42 ms ± 46.8 µs | 473.05 MiB  | 0.00 MiB
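
To make the formula concrete, here is the same SSE/SST arithmetic on four points in plain Python. This is only a worked example with made-up numbers and a hypothetical function name, not a fourth implementation to benchmark. It also shows a handy sanity check: predicting the mean of the actual values for every sample gives an R-squared of exactly 0.

```python
def r2_plain(actual, pred):
    avg = sum(actual) / len(actual)
    sse = sum((p - a) ** 2 for a, p in zip(actual, pred))  # error of the model
    sst = sum((a - avg) ** 2 for a in actual)              # error of the mean
    return 1 - sse / sst

actual_small = [1.0, 2.0, 3.0, 4.0]
pred_small = [1.1, 1.9, 3.2, 3.8]

# SSE = 0.01 + 0.01 + 0.04 + 0.04 = 0.10, SST = 2.25 + 0.25 + 0.25 + 2.25 = 5.0
r2_model = r2_plain(actual_small, pred_small)
r2_mean = r2_plain(actual_small, [2.5] * 4)
print(round(r2_model, 4), r2_mean)  # 0.98 0.0
```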

3. Linear Regression

The third section compares the packages on fitting a multiple linear regression. Here’s the dummy data, generated from a normal distribution:

X = np.random.normal(0,1,(100000,200))
Y = np.random.normal(0,1,(100000))
from sklearn.linear_model import LinearRegression as linreg_sklearn
#Numpy version
def least_square_numpy(X,y):
    return np.linalg.multi_dot([np.linalg.inv(X.T.dot(X)),X.T,y])
class linreg_numpy :
    def __init__(self):
        self.beta = 0
    def fit(self,X,y):
        X_with1 = np.append(np.ones((len(X),1)),X,axis=1)
        self.beta = least_square_numpy(X_with1,y)
        return self
    def predict(self,X):
        X_with1 = np.append(np.ones((len(X),1)),X,axis=1)
        return X_with1@self.beta
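#Numba version
@jit(nopython=True)  # jitted wrapper around the matrix inverse
def inv_nla_jit(A):
    return np.linalg.inv(A)
@jit(nopython=True)  # jitted wrapper around the matrix product
def nb_dot(A, B):
    return np.dot(A,B)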
def least_square_numba(X,y):
    inside = nb_dot(X.T,X)
    outside = nb_dot(X.T,y)
    return nb_dot(inv_nla_jit(inside),outside)
class linreg_numba :
    def __init__(self):
        self.beta = 0
    def fit(self,X,y):
        X_with1 = np.append(np.ones((len(X),1)),X,axis=1)
        self.beta = least_square_numba(X_with1,y)
        return self
    def predict(self,X):
        X_with1 = np.append(np.ones((len(X),1)),X,axis=1)
        return X_with1@self.beta
%timeit linreg_sklearn().fit(X,Y.reshape(-1,1))
%timeit linreg_numpy().fit(X,Y.reshape(-1,1))
%timeit linreg_numba().fit(X,Y.reshape(-1,1))
%memit linreg_sklearn().fit(X,Y.reshape(-1,1))
%memit linreg_numpy().fit(X,Y.reshape(-1,1))
%memit linreg_numba().fit(X,Y.reshape(-1,1))
Method  | Time             | Peak memory | Increment
Sklearn | 999 ms ± 28.2 ms | 911.35 MiB  | 305.18 MiB
Numpy   | 330 ms ± 20.2 ms | 760.29 MiB  | 153.35 MiB
Numba   | 381 ms ± 10.1 ms | 762.25 MiB  | 155.32 MiB
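
Both hand-written versions solve the normal equations with an explicit matrix inverse. That is fine for well-behaved data like this, but inverting X.T @ X can lose accuracy when the columns of X are nearly collinear. An alternative that I did not benchmark is np.linalg.lstsq, which solves the least-squares problem without forming the inverse. The sketch below uses a hypothetical function name and small noise-free data (so the true coefficients are known) and checks that both approaches recover them. It says nothing about which one is faster.

```python
import numpy as np

def least_square_lstsq(X, y):
    # Solve min ||X b - y|| directly, without computing inv(X.T @ X)
    beta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
X_small = rng.normal(0, 1, (500, 3))
X_with1 = np.append(np.ones((len(X_small), 1)), X_small, axis=1)
true_beta = np.array([2.0, 1.0, -3.0, 0.5])  # intercept first
y_small = X_with1 @ true_beta                # noise-free target

# Normal equations with an explicit inverse, as in least_square_numpy
beta_inv = np.linalg.inv(X_with1.T @ X_with1) @ X_with1.T @ y_small
beta_lstsq = least_square_lstsq(X_with1, y_small)
print(np.allclose(beta_inv, true_beta), np.allclose(beta_lstsq, true_beta))
```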


4. Conclusion

From these six experiments, I conclude that:

  1. Sklearn was slower than both NumPy and Numba in all six experiments. Does that mean we need to ditch Sklearn? Obviously not, since the Sklearn functions offer a lot of functionality that I did not implement. But if you are sure that you don’t need those features, a NumPy/Numba implementation is a faster option.
  2. Numba is faster than NumPy for heavy but simple element-wise calculations such as addition, subtraction, multiplication, and division (log loss, RMSE, R-squared). I also noticed that the less NumPy code there is inside a Numba function, the lower its memory usage: the loop-only Numba functions show a memory increment of about 0 MiB.
  3. NumPy and Numba have similar speed for more complex algorithms (confusion matrix, ROC-AUC, linear regression). In these situations, I would pick NumPy because it needs less code.

If you have any suggestions or questions, feel free to reach out to me on LinkedIn!