Main document (PDF ebook), 247 pages (27-12-2022), available for direct download: book_pytorch_scikit_learn_numpy.pdf
Available Files in this repository
Datasets, main .py file and .ipynb notebooks at ./notebooks
Main Features
- Theory of linear models, with implementations in pytorch and scikit-learn
- Practice of deep learning with pytorch for feedforward neural networks
- Many examples and exercises to practice and to understand the contents further
- Very large datasets (450,000 and 11,000,000 rows) processed on a home computer with a few gigabytes of memory
- Step-by-step theory and code (requires only minimal knowledge of python and maths)
- Learn the mathematical basics without compromise before consolidating towards advanced models
- Generic python functions to train and alter deep models for tabular data in a blink
Abstract
This book is an introduction to computational statistics for generalized linear models (glm) and to machine learning with the python language. The glm is extended with nonlinearities through the hidden layer(s) of a neural network, for linear and nonlinear regression or classification. This allows classical statistics and current deep learning to be presented side by side. The loglikelihoods and the corresponding loss functions are explained. The gradient and the hessian matrix are discussed and implemented for these linear and nonlinear models. Several methods are implemented from scratch with numpy for prediction (linear, logistic and poisson regression) and for dimension reduction (principal component analysis, random projection). The gradient descent, newton-raphson, natural gradient and l-bfgs algorithms are implemented. The datasets considered have 10 to 10^7 rows and are tabular, with images and texts vectorized. The data are stored in a compressed format (memmap or hdf5) and loaded in chunks for several case studies with pytorch or scikit-learn. Pytorch is presented for training with minibatches via a generic implementation suited to studying the models with computer programs. Scikit-learn is presented first on small examples, then for processing large datasets via the partial fit. Sixty exercises are proposed at the end of the chapters, with selected solutions, to go beyond the contents.
Chapters
- Introduction
  - Polynomial regression
  - Error on a train sample
  - Error on a test sample
- Linear models with numpy and scikit-learn (chapter02_book.ipynb)
  - Theory for linear regression
  - Theory for logistic regression
  - Loglikelihood and loss function
  - Analytical expression of the derivatives
  - Implementation with numpy
  - Implementation with scikit-learn
- First-order training of linear models (chapter03_book.ipynb)
  - Algorithm with one datum and with one minibatch
  - Implementation of the algorithms with numpy
  - Implementation of the algorithms with pytorch
- Neural networks for (deep) glm (chapter04_book.ipynb)
  - Presentation of the different loss functions from pytorch
  - Generic implementation of the algorithms with pytorch
  - Example of a nonlinear frontier with a small dataset
- Lasso selection for (deep) glm (chapter05_book.ipynb)
  - Penalization of the regression for a sparse solution
  - Implementation with pytorch for a neural network
  - Selection of the hyperparameters (grid and bayesian)
- Hessian and covariance for (deep) glm (chapter06_book.ipynb)
  - Notion of variance of the parameters
  - Implementation with statsmodels for linear models
  - Implementation with pytorch for a neural network
- Second-order training of (deep) glm (chapter07_book.ipynb)
  - Expression of the first-order update for poisson regression
  - Expression of the second-order update for poisson regression
  - Implementation of gradient descent for poisson regression
  - Implementation of newton-raphson and natural gradient with numpy
  - Implementation of the l-bfgs algorithm with pytorch for deep regressions
  - Notion of quality of the estimation for comparison
- Autoencoder compared to ipca and t-sne (chapter08_book.ipynb)
  - Introduction to the algebra for principal component analysis
  - Step-by-step implementation of principal component analysis
  - Implementation with scikit-learn of pca and (non)linear autoencoders
  - Implementation of t-sne in python with two modules
  - Implementation of random projection for large datasets
  - Notion of quality of the visualization for comparison
- Solutions to selected exercises (chapter09_book.ipynb)
  - Several solutions for large datasets with scikit-learn
  - Several solutions for neural networks with pytorch
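
To give a flavour of the second-order training listed for chapter07_book.ipynb, here is a minimal numpy sketch of newton-raphson for a poisson regression with a log link. It is an illustration under stated assumptions and not the book's implementation: the function name `poisson_newton`, the simulated data and the stopping rule are hypothetical.

```python
import numpy as np


def poisson_newton(X, y, n_iter=25, tol=1e-10):
    """Fit a poisson regression (log link) by newton-raphson (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)              # mean of each observation under the log link
        grad = X.T @ (mu - y)              # gradient of the negative loglikelihood
        hess = X.T @ (X * mu[:, None])     # hessian: X^T diag(mu) X
        step = np.linalg.solve(hess, grad)
        beta = beta - step                 # second-order update
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Simulated counts with known coefficients (assumed values, for illustration only).
rng = np.random.default_rng(0)
n = 5_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(X @ beta_true))

beta_hat = poisson_newton(X, y)
final_grad = X.T @ (np.exp(X @ beta_hat) - y)
```

Each iteration solves a small linear system with the hessian instead of taking a fixed-size gradient step, which is why a handful of iterations usually suffices on a well-conditioned problem like this one.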