Training Omikuji from scipy.sparse.csr_matrix #55

CarloNicolini · 2024-02-20T18:20:15Z

I've added an alternative way to train the Omikuji model from the Python wrapper that bypasses writing the data set to disk.

The main work is a new method in the lib.rs file called load_omikuji_data_set_from_features_labels.

It takes the three numpy arrays that define the underlying structure of a scipy.sparse.csr_matrix.
In other words, I map the scipy.sparse.csr_matrix.{indices, indptr, data} arrays into Rust vectors, and then rebuild the feature matrix together with the label sets, in a way similar to the train_on_data method.
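To illustrate the mapping, here is a minimal pure-Python sketch of how the three CSR arrays can be unpacked into per-row lists of (column, value) pairs. It is not the code from this PR: the helper name csr_to_rows and the toy matrix are made up for the example, and the Rust side may organise things differently.

```python
from typing import List, Sequence, Tuple


def csr_to_rows(
    indptr: Sequence[int],
    indices: Sequence[int],
    data: Sequence[float],
) -> List[List[Tuple[int, float]]]:
    """Unpack CSR arrays into one list of (column, value) pairs per row.

    Hypothetical helper for illustration only; not part of omikuji.
    """
    if len(indices) != len(data):
        raise ValueError("indices and data must have the same length")
    if len(indptr) == 0 or indptr[0] != 0 or indptr[-1] != len(data):
        raise ValueError("indptr must start at 0 and end at len(data)")

    rows = []
    # Row i owns the half-open slice indptr[i]:indptr[i + 1] of indices/data.
    for start, end in zip(indptr[:-1], indptr[1:]):
        if start > end:
            raise ValueError("indptr must be non-decreasing")
        rows.append([(int(indices[k]), float(data[k])) for k in range(start, end)])
    return rows


# Toy 3x4 matrix (the middle row is empty):
# [[1.0, 0.0, 2.0, 0.0],
#  [0.0, 0.0, 0.0, 0.0],
#  [0.0, 3.0, 0.0, 4.0]]
indptr = [0, 2, 2, 4]
indices = [0, 2, 1, 3]
data = [1.0, 2.0, 3.0, 4.0]

rows = csr_to_rows(indptr, indices, data)
```

With a real scipy.sparse.csr_matrix X, the same three inputs would be X.indptr, X.indices and X.data.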

juhoinkinen · 2024-02-22T09:29:11Z

Just a fellow Omikuji fan dropping by to ask how big a speed-up you think can be achieved with this. Could you give some measured numbers?

CarloNicolini · 2024-02-22T15:10:33Z

@juhoinkinen the speed-up depends on whether you have to fit many times over a large dataset, as in a GridSearchCV. In that case you don't incur the I/O costs.
I'm going to measure the actual difference between the two cases and post it here.

CarloNicolini · 2024-02-23T20:13:49Z

@juhoinkinen
Training on the EurLEX-4k train set from the omikuji repository itself dropped from an average of 10.4 seconds to an average of 7.1 seconds. The data set has about 15k rows and 5k features, and this was measured on a MacBook Pro M1.
The advantage shows up when running a large grid search with cross-validation and joblib with multiple parallel jobs, where one can avoid too much I/O pressure.

I am new to Rust but have a lot of C++ background. So far I've only managed to make this work with float32 for the features and uint32 for the labels. It would be nice to make it a bit more generic, though.

tomtung · 2024-02-24T03:11:00Z

Thanks for the contribution! I've been traveling and don't have my computer with me. I'll try to take a look in early March. In the meantime, it would be great if you could make all the tests pass :D

carlo added 2 commits February 20, 2024 13:13

training by passing sparse features and labels

92748d7

Corrected python wrapper

2c79255
