Original Photo by nappy from Pexels
Welcome to my first blog post. I recently started developing my own package, and my top priority is speed. I found that many things in Sklearn run quite slowly. This is not a tutorial on Sklearn, NumPy, or Numba; it is a speed comparison across these three libraries.
TL;DR: just skip to the conclusion.
Here are my machine specs:
- CPU : Ryzen 3500U
- RAM : 6 GB
- OS : Windows 10
My software versions:
- Python : 3.8.5
- numpy : 1.18.5
- scikit-learn : 0.23.2
- numba : 0.50.1
For evaluation, I use the magic functions %timeit and %memit to measure the runtime and the memory used by the process. %timeit works out of the box, but to use %memit we need to install memory_profiler and load it first.
# Run this inside jupyterlab/notebook cell
!pip install memory_profiler
%load_ext memory_profiler
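Outside a notebook the two magics are not available. A rough stand-in, assuming only the standard library, could look like the sketch below. `measure` is a hypothetical helper name, and `tracemalloc` only sees allocations that go through Python's tracked allocators, so its numbers will not match %memit exactly.

```python
import time
import tracemalloc


def measure(func, *args, repeat=5):
    """Return (best wall-clock seconds, peak traced bytes) for func(*args)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)

    # Memory is measured in a separate run so tracing does not distort timing.
    tracemalloc.start()
    func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return best, peak


seconds, peak_bytes = measure(sum, range(1_000_000))
print(f"best of 5: {seconds * 1e3:.2f} ms, peak: {peak_bytes / 2**20:.4f} MiB")
```

Taking the best of several runs is a design choice of this sketch; %timeit reports a mean and standard deviation instead.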
1. Binary Classification Evaluation
The first section compares the packages on metrics for binary classification results. Here is the dummy data, generated from uniform distributions, with a total length of 1,000,000 (600,000 negatives and 400,000 positives).
import numpy as np
from numpy.random import uniform
size = 500000
n_positive = int(size*0.8)
n_negative = int(size*1.2)
actual = np.array([0]*n_negative + [1]*n_positive)
# The positive scores need n_positive entries so that pred has the same length as actual
pred = np.append(uniform(0.2,0.6,n_negative),uniform(0.4,0.8,n_positive))
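If reproducible numbers are wanted, the same setup can be wrapped in a seeded function. This is only a sketch: `make_binary_data` and the `seed` argument are hypothetical additions that are not part of the original benchmark.

```python
import numpy as np


def make_binary_data(size=500_000, seed=0):
    """Build 0/1 labels and overlapping uniform scores of equal length."""
    rng = np.random.default_rng(seed)  # the seed is an assumption, not in the post
    n_positive = int(size * 0.8)
    n_negative = int(size * 1.2)
    actual = np.concatenate([np.zeros(n_negative, dtype=np.int64),
                             np.ones(n_positive, dtype=np.int64)])
    # Negatives score in [0.2, 0.6), positives in [0.4, 0.8), so they overlap.
    pred = np.concatenate([rng.uniform(0.2, 0.6, n_negative),
                           rng.uniform(0.4, 0.8, n_positive)])
    return actual, pred


actual_small, pred_small = make_binary_data(size=1_000)
print(actual_small.shape, pred_small.shape)  # both arrays have the same length
```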
1.1 Logloss calculation
from sklearn.metrics import log_loss as logloss_sklearn
def logloss_numpy(actual,pred):
    return -1*np.mean(actual*np.log(pred) + (1-actual)*np.log(1-pred))
from numba import jit
@jit(nopython=True)
def logloss_numba(actual,pred):
    logloss = 0
    for i in range(actual.shape[0]):
        logloss += actual[i]*np.log(pred[i]) + (1-actual[i])*np.log(1-pred[i])
    return -1*logloss/actual.shape[0]
%timeit logloss_sklearn(actual,pred)
%timeit logloss_numpy(actual,pred)
%timeit logloss_numba(actual,pred)
%memit logloss_sklearn(actual,pred)
%memit logloss_numpy(actual,pred)
%memit logloss_numba(actual,pred)
Method | Time | Memory Usage |
---|---|---|
Sklearn | 227 ms ± 7.52 ms | peak : 260.78 MiB, increment: 41.03 MiB |
Numpy | 36.8 ms ± 1.25 ms | peak : 244.09 MiB, increment: 23.61 MiB |
Numba | 16.9 ms ± 1.32 ms | peak : 220.52 MiB, increment: 0.01 MiB |
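One caveat about the hand-written versions: they take the logarithm of the predictions directly, so a prediction of exactly 0 or 1 produces nan or inf. Scikit-learn's `log_loss` clips predictions away from 0 and 1 (the `eps` argument in the version used here), which is part of the extra work it does. A minimal sketch of the same guard in NumPy, where `logloss_numpy_clipped` is a hypothetical name:

```python
import numpy as np


def logloss_numpy_clipped(actual, pred, eps=1e-15):
    """Binary log loss with predictions clipped to [eps, 1 - eps]."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(actual * np.log(pred) + (1 - actual) * np.log(1 - pred))


y = np.array([0, 1, 1, 0])
p = np.array([0.0, 1.0, 0.9, 0.2])  # contains an exact 0 and 1 on purpose
print(logloss_numpy_clipped(y, p))  # finite, whereas the unclipped version gives nan
```

The benchmark data above stays inside (0.2, 0.8), so clipping does not change any of the reported results.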
1.2 Confusion Matrix
The following code is adapted from answers on Stack Overflow.
from sklearn.metrics import confusion_matrix as CM_sklearn
@jit(nopython=True)
def CM_numba(true, pred):
    #by cgnorthcutt
    K = len(np.unique(true))
    result = np.zeros((K, K))
    for i in range(len(true)):
        result[true[i]][pred[i]] += 1
    return result
def CM_numpy(true,pred):
    #by rytido
    classes = len(np.unique(true))
    return np.bincount(true * classes + pred).reshape((classes, classes))
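# pred_class is not defined in the original post; a 0.5 threshold is assumed here
pred_class = (pred > 0.5).astype(actual.dtype)
%timeit CM_sklearn(actual,pred_class)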
%timeit CM_numpy(actual,pred_class)
%timeit CM_numba(actual,pred_class)
%memit CM_sklearn(actual,pred_class)
%memit CM_numpy(actual,pred_class)
%memit CM_numba(actual,pred_class)
Method | Time | Memory Usage |
---|---|---|
Sklearn | 888 ms ± 27.2 ms | peak : 248.26 MiB, increment: 18.79 MiB |
Numpy | 27.5 ms ± 1.61 ms | peak : 240.56 MiB, increment: 12.42 MiB |
Numba | 34.3 ms ± 1.58 ms | peak : 232.95 MiB, increment: 3.83 MiB |
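The NumPy version relies on a small trick: each (true, pred) pair is mapped to a single integer `true * classes + pred`, and `np.bincount` counts how often each integer occurs. A tiny worked example makes the layout visible. `confusion_matrix_bincount` is a hypothetical name, and passing `minlength` is an addition of this sketch so the reshape still works when the last cell is empty.

```python
import numpy as np


def confusion_matrix_bincount(true, pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    flat = true * n_classes + pred  # map each (true, pred) pair to one bin
    counts = np.bincount(flat, minlength=n_classes * n_classes)
    return counts.reshape(n_classes, n_classes)


true = np.array([0, 0, 1, 1, 1])
pred = np.array([0, 1, 1, 1, 0])
print(confusion_matrix_bincount(true, pred, n_classes=2))
# [[1 1]
#  [1 2]]
```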
1.3 ROC-AUC score
The code is adapted from https://www.kaggle.com/c/microsoft-malware-prediction/discussion/76013
from sklearn.metrics import roc_auc_score as rocauc_sklearn
def rocauc_numpy(y_true, y_prob):
    #This function is basically the same as rocauc_numba
    #I want to see if NumPy's native vectorization is faster than Numba
    y_true = np.asarray(y_true, dtype=np.float64)
    y_true = y_true[np.argsort(y_prob)]
    n = len(y_true)
    nfalse = (1-y_true).cumsum()
    auc = (y_true*nfalse).sum()
    last_nfalse = nfalse[-1]
    auc /= (last_nfalse* (n - last_nfalse))
    return auc
from numba import jit
@jit(nopython=True)
def rocauc_numba(y_true, y_prob):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_prob)]
    nfalse = 0
    auc = 0
    n = len(y_true)
    for i in range(n):
        y_i = y_true[i]
        nfalse += (1 - y_i)
        auc += y_i * nfalse
    auc /= (nfalse * (n - nfalse))
    return auc
%timeit rocauc_sklearn(actual,pred)
%timeit rocauc_numpy(actual, pred)
%timeit rocauc_numba(actual, pred)
%memit rocauc_sklearn(actual,pred)
%memit rocauc_numpy(actual, pred)
%memit rocauc_numba(actual, pred)
Method | Time | Memory Usage |
---|---|---|
Sklearn | 752 ms ± 33.4 ms | peak : 298.05 MiB, increment: 46.12 MiB |
Numpy | 282 ms ± 25.4 ms | peak : 258.04 MiB, increment: 15.32 MiB |
Numba | 358 ms ± 42.7 ms | peak : 250.51 MiB, increment: 7.72 MiB |
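Both hand-written versions use the same idea: after sorting by score, every positive contributes the number of negatives ranked below it, and the total is divided by the number of (positive, negative) pairs. The sketch below checks that idea against a brute-force pair count on a tiny example. The function names are hypothetical, and like the code above it assumes there are no tied scores.

```python
import numpy as np


def auc_by_rank(y_true, y_score):
    """Cumulative-sum AUC, the same idea as rocauc_numpy (ties not handled)."""
    y_sorted = np.asarray(y_true, dtype=np.float64)[np.argsort(y_score)]
    n_false = (1 - y_sorted).cumsum()  # negatives ranked at or below each item
    n_neg = n_false[-1]
    return (y_sorted * n_false).sum() / (n_neg * (len(y_sorted) - n_neg))


def auc_by_pairs(y_true, y_score):
    """Brute force: fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    return (pos[:, None] > neg[None, :]).mean()


y = np.array([0, 0, 1, 0, 1, 1])
s = np.array([0.10, 0.35, 0.30, 0.50, 0.60, 0.90])
print(auc_by_rank(y, s), auc_by_pairs(y, s))  # both equal 7/9
```

The brute-force version needs memory proportional to positives times negatives, so it is only useful as a check on small inputs.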
2. Regression Evaluation
The second section compares the libraries on regression evaluation metrics.
2.1 Root Mean Squared Error
from sklearn.metrics import mean_squared_error as RMSE_sklearn
@jit(nopython=True)
def RMSE_numba(actual,pred):
    Squared_Error = 0
    for i in range(actual.shape[0]):
        Squared_Error += (actual[i]-pred[i])**2
    return (Squared_Error/actual.shape[0])**0.5
def RMSE_numpy(actual,pred):
    return np.sqrt(np.mean((actual-pred)**2))
%timeit RMSE_sklearn(actual,pred,squared=False)
%timeit RMSE_numpy(actual,pred)
%timeit RMSE_numba(actual,pred)
%memit RMSE_sklearn(actual,pred,squared=False)
%memit RMSE_numpy(actual,pred)
%memit RMSE_numba(actual,pred)
Method | Time | Memory Usage |
---|---|---|
Sklearn | 10.8 ms ± 533 µs | peak : 488.31 MiB, increment: 15.27 MiB |
Numpy | 8.9 ms ± 491 µs | peak : 488.31 MiB, increment: 15.27 MiB |
Numba | 1.06 ms ± 35.7 µs | peak : 250.05 MiB, increment: 0.00 MiB |
2.2 R-Squared
from sklearn.metrics import r2_score as R2_sklearn
def R2_numpy(actual,pred):
    avg = actual.mean()
    SSE = ((pred-actual)**2).sum()
    SST = ((actual-avg)**2).sum()
    return 1-SSE/SST
@jit
def R2_numba(actual,pred):
    avg = 0
    len_array = actual.shape[0]
    for i in range(len_array):
        avg += actual[i]
    avg /= len_array
    SSE = 0
    SST = 0
    for i in range(len_array) :
        SSE += (pred[i]-actual[i])**2
        SST += (actual[i]-avg)**2
    return 1-SSE/SST
%timeit R2_sklearn(actual,pred)
%timeit R2_numpy(actual,pred)
%timeit R2_numba(actual,pred)
%memit R2_sklearn(actual,pred)
%memit R2_numpy(actual,pred)
%memit R2_numba(actual,pred)
Method | Time | Memory Usage |
---|---|---|
Sklearn | 27.7 ms ± 1.59 ms | peak : 480.68 MiB, increment: 7.63 MiB |
Numpy | 17.7 ms ± 803 µs | peak : 488.31 MiB, increment: 15.27 MiB |
Numba | 1.42 ms ± 46.8 µs | peak : 473.05 MiB, increment: 0.00 MiB |
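As a quick sanity check of the 1 - SSE/SST formula used in both hand-written versions, here is a tiny worked example. The function name `r2` is hypothetical; the formula is the one from R2_numpy above.

```python
import numpy as np


def r2(actual, pred):
    """1 - SSE/SST, the same formula as R2_numpy above."""
    sse = ((pred - actual) ** 2).sum()
    sst = ((actual - actual.mean()) ** 2).sum()
    return 1 - sse / sst


y = np.array([1.0, 2.0, 3.0, 4.0])
print(r2(y, y))                          # perfect fit: 1.0
print(r2(y, np.full_like(y, y.mean())))  # always predicting the mean: 0.0
print(r2(y, y[::-1]))                    # worse than the mean: -3.0
```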
3. Linear Regression
The third section compares the packages on fitting a multiple linear regression. Here is the dummy data, generated from a normal distribution.
X = np.random.normal(0,1,(100000,200))
Y = np.random.normal(0,1,(100000))
from sklearn.linear_model import LinearRegression as linreg_sklearn
#Numpy version
def least_square_numpy(X,y):
    return np.linalg.multi_dot([np.linalg.inv(X.T.dot(X)),X.T,y])
class linreg_numpy :
    def __init__(self):
        self.beta = 0
    def fit(self,X,y):
        X_with1 = np.append(np.ones((len(X),1)),X,axis=1)
        self.beta = least_square_numpy(X_with1,y)
        return self
    def predict(self,X):
        X_with1 = np.append(np.ones((len(X),1)),X,axis=1)
        return X_with1@self.beta
#Numba version
@jit
def inv_nla_jit(A):
    return np.linalg.inv(A)
@jit
def nb_dot(A, B):
    return np.dot(A,B)
def least_square_numba(X,y):
    inside = nb_dot(X.T,X)
    outside = nb_dot(X.T,y)
    return nb_dot(inv_nla_jit(inside),outside)
class linreg_numba :
    def __init__(self):
        self.beta = 0
    def fit(self,X,y):
        X_with1 = np.append(np.ones((len(X),1)),X,axis=1)
        self.beta = least_square_numba(X_with1,y)
        return self
    def predict(self,X):
        X_with1 = np.append(np.ones((len(X),1)),X,axis=1)
        return X_with1@self.beta
%timeit linreg_sklearn().fit(X,Y.reshape(-1,1))
%timeit linreg_numpy().fit(X,Y.reshape(-1,1))
%timeit linreg_numba().fit(X,Y.reshape(-1,1))
%memit linreg_sklearn().fit(X,Y.reshape(-1,1))
%memit linreg_numpy().fit(X,Y.reshape(-1,1))
%memit linreg_numba().fit(X,Y.reshape(-1,1))
Method | Time | Memory Usage |
---|---|---|
Sklearn | 999 ms ± 28.2 ms | peak : 911.35 MiB, increment: 305.18 MiB |
Numpy | 330 ms ± 20.2 ms | peak : 760.29 MiB, increment: 153.35 MiB |
Numba | 381 ms ± 10.1 ms | peak : 762.25 MiB, increment: 155.32 MiB |
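Both hand-written fits solve the normal equation with an explicit matrix inverse. A common alternative is to solve the linear system directly, or to call a least-squares routine, which avoids forming the inverse. The sketch below uses a different technique from the benchmark above and makes no timing claim. `least_squares_solve` and the demo coefficients are hypothetical.

```python
import numpy as np


def least_squares_solve(X, y):
    """Solve (X^T X) beta = X^T y without forming the inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)


rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
X1 = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
true_beta = np.arange(6, dtype=np.float64)   # made-up coefficients for the demo
y = X1 @ true_beta                           # noise-free target

beta_solve = least_squares_solve(X1, y)
beta_lstsq, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(np.allclose(beta_solve, true_beta), np.allclose(beta_lstsq, true_beta))
```

With noise-free data both approaches recover the coefficients; on ill-conditioned problems `np.linalg.lstsq` is usually the more robust of the two.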
Conclusion
From these six experiments, I conclude that:
- Sklearn was the slowest option in every experiment. Does that mean we need to ditch Sklearn? Obviously not, since the Sklearn functions offer a lot of functionality that I did not implement. But if you are sure that you don't need those features, just use a NumPy/Numba implementation.
- Numba is faster than NumPy for heavy but simple calculations built from addition, subtraction, multiplication, division, and so on (log loss, RMSE, and R-squared here). I also noticed that the less NumPy code there is inside a Numba function, the lower its memory increment, often drastically so.
- NumPy and Numba have similar speed for the more complex algorithms (confusion matrix, ROC-AUC, and linear regression here). In this kind of situation I will pick NumPy, because it needs less code.
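One thing to keep in mind when reproducing the Numba numbers: the first call to a jitted function also compiles it, so a single timed call can look much slower than the steady state. A minimal sketch, assuming Numba is installed; the `try`/`except` fallback and the `sum_of_squares` function are hypothetical additions so the script still runs without Numba.

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:  # fallback so the sketch still runs without Numba
    def njit(func):
        return func


@njit
def sum_of_squares(values):
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return total


data = np.arange(100_000, dtype=np.float64)

start = time.perf_counter()
first = sum_of_squares(data)   # with Numba, this call also compiles the function
first_call = time.perf_counter() - start

start = time.perf_counter()
second = sum_of_squares(data)  # reuses the already compiled machine code
second_call = time.perf_counter() - start

print(f"first call: {first_call:.4f} s, second call: {second_call:.4f} s")
```

Calling the function once before timing it keeps the compilation cost out of the measurement.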
If you have any suggestions or questions, feel free to reach out to me on LinkedIn!