CON-3189 Fix the piscine structure in AI branch #2748

Draft
wants to merge 6 commits into
base: master
50 changes: 28 additions & 22 deletions subjects/ai/classification/README.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,15 @@
# Classification
## Classification

### Overview

The goal of this day is to understand practical classification with Scikit Learn.

### Role play

Imagine you're a data scientist working for a cutting-edge medical research company. Your team has been tasked with developing a machine learning model to assist doctors in diagnosing breast cancer. You'll be using logistic regression to classify tumors as benign or malignant based on various features.

### Learning Objectives

Today we will learn a different approach in Machine Learning: classification, which is a large domain in the field of statistics and machine learning. Generally, it can be broken down into two areas:

- **Binary classification**, where we wish to group an outcome into one of two groups.
Expand Down Expand Up @@ -45,23 +53,19 @@ The **logloss** or **cross entropy** is the loss used for classification. Simila

_Version of Scikit Learn I used to do the exercises: 0.22_. I suggest using the most recent one. Scikit Learn 1.0 is finally available after ... 14 years.

### **Resources**

### Logistic regression
### Resources

- https://towardsdatascience.com/understanding-logistic-regression-9b02c2aec102
- [Logistic regression](https://towardsdatascience.com/understanding-logistic-regression-9b02c2aec102)

### Logloss
- [Logloss](https://www.datacamp.com/tutorial/the-cross-entropy-loss-function-in-machine-learning)

- https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451

- https://medium.com/swlh/what-is-logistic-regression-62807de62efa
- [More on logistic regression](https://medium.com/swlh/what-is-logistic-regression-62807de62efa)

---

---

# Exercise 0: Environment and libraries
### Exercise 0: Environment and libraries

The goal of this exercise is to set up the Python work environment with the required libraries.

Expand All @@ -73,13 +77,13 @@ I recommend to use:
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required

1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.

---

---

# Exercise 1: Logistic regression in Scikit-learn
### Exercise 1: Logistic regression in Scikit-learn

The goal of this exercise is to learn to use Scikit-learn to classify data.
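As a minimal sketch of the fit/predict/score workflow, the snippet below trains `LogisticRegression` on a small one-feature data set. The `X` and `y` values here are hypothetical and chosen only for illustration; the exercise provides its own data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data (not the exercise data).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit the classifier with scikit-learn's default settings.
clf = LogisticRegression()
clf.fit(X, y)

# Predict the class of two new points and compute the training accuracy.
predicted = clf.predict(np.array([[0.5], [4.5]]))
accuracy = clf.score(X, y)
print(predicted, accuracy)
```

`score` returns the mean accuracy on the data it is given, so on real data it is more informative when called on a held-out set.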

Expand All @@ -98,7 +102,7 @@ y = [0,0,0,1,1,1,0]

---

# Exercise 2: Sigmoid
### Exercise 2: Sigmoid

The goal of this exercise is to learn to compute and plot the sigmoid function.
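For reference, the sigmoid maps any real number to the interval (0, 1) through `1 / (1 + exp(-x))`. The sketch below is one possible NumPy version; the range `[-10, 10]` and the 100 sample points are assumptions made for illustration, not values taken from the exercise.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# Assumed plotting range; adjust to match the exercise statement.
x = np.linspace(-10, 10, 100)
y = sigmoid(x)
print(y[:3], y[-3:])
```

Passing `x` and `y` to `matplotlib.pyplot.plot` would then draw the familiar S-shaped curve.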

Expand All @@ -120,11 +124,11 @@ The plot should look like this:

---

# Exercise 3: Decision boundary
### Exercise 3: Decision boundary

The goal of this exercise is to learn to fit a logistic regression on simple examples and to understand how the algorithm separated the data from the different classes.

## 1 dimension
#### 1 dimension

First, we will start as usual with features data in 1 dimension. Use `make_classification` from Scikit-learn to generate 100 data points:
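A possible call is sketched here. The parameter values (in particular `random_state=0`) are assumptions; the exercise gives its own arguments. With a single feature, `n_informative`, `n_redundant` and `n_clusters_per_class` have to be set explicitly, because the defaults assume more features.

```python
from sklearn.datasets import make_classification

# Assumed parameters for a 1-dimensional, two-class data set.
X, y = make_classification(
    n_samples=100,
    n_features=1,
    n_informative=1,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=0,
)
print(X.shape, y.shape)
```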

Expand Down Expand Up @@ -191,7 +195,7 @@ def predict_probability(coefs, X):

[ex3q6]: ./w2_day2_ex3_q5.png "Scatter plot + Logistic regression + predictions"

## 2 dimensions
#### 2 dimensions

Now, let us repeat this process on 2-dimensional data. The goal is to focus on the decision boundary and to understand how Logistic Regression creates a line that separates the data. The code to plot the decision boundary is provided; however, it is important to understand how it works.

Expand Down Expand Up @@ -247,7 +251,7 @@ The plot should look like this:

---

# Exercise 4: Train test split
### Exercise 4: Train test split

The goal of this exercise is to learn to split a classification data set. The idea is the same as splitting a regression data set, but there's one important detail specific to classification: the proportion of each class in the train set and test set.
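The class-proportion point can be illustrated with the `stratify` argument of `train_test_split`. In this sketch the feature matrix, the 80/20 split and `random_state=0` are assumptions; only the 70/30 label layout mirrors the snippet in the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features; labels are 70 zeros followed by 30 ones.
X = np.arange(100).reshape(-1, 1)
y = np.zeros(100, dtype=int)
y[70:] = 1

# stratify=y keeps the share of each class the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(train_ratio, test_ratio)
```

Without `stratify`, a random split can leave the test set with noticeably more or fewer positive examples than the train set.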

Expand All @@ -271,7 +275,7 @@ y[70:] = 1

---

# Exercise 5: Breast Cancer prediction
### Exercise 5: Breast Cancer prediction

The goal of this exercise is to use Logistic Regression to predict breast cancer. It is always important to understand the data before training any Machine Learning algorithm. The data is described in **breast-cancer-wisconsin.names**. I suggest manually adding the column names to the DataFrame.
Expand All @@ -296,10 +300,10 @@ Preliminary:
- [Database](data/breast-cancer-wisconsin.data) and [database information](data/breast-cancer-wisconsin.names)

---
---

---

# Exercise 6: Multi-class (Optional)
### Exercise 6: Multi-class (Optional)

The goal of this exercise is to learn to train a classification algorithm on multi-class labelled data.
Some algorithms, such as SVM or Logistic Regression, do not natively support multi-class (more than 2 classes). There are some approaches that make it possible to use these algorithms on multi-class data.
Expand All @@ -310,7 +314,7 @@ Let's assume we work with 3 classes: A, B and C.

More details:

- https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/
- https://medium.com/@agrawalsam1997/multiclass-classification-onevsrest-and-onevsone-classification-strategy-2c293a91571a

Let's implement the One-vs-Rest approach from `LogisticRegression`.
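The combining step of One-vs-Rest can be sketched independently of the classifiers: each binary model gives the probability of "its" class versus the rest, and the predicted class is the one with the highest probability. The helper name `combine_one_vs_rest` and the probability values are hypothetical and only serve the illustration.

```python
import numpy as np

def combine_one_vs_rest(prob_a, prob_b, prob_c):
    """Pick, for each sample, the class whose binary model is most confident."""
    scores = np.column_stack([prob_a, prob_b, prob_c])
    return np.argmax(scores, axis=1)

# Hypothetical "class vs rest" probabilities for three samples.
prob_a = np.array([0.9, 0.2, 0.1])
prob_b = np.array([0.3, 0.7, 0.2])
prob_c = np.array([0.1, 0.4, 0.8])

classes = combine_one_vs_rest(prob_a, prob_b, prob_c)
print(classes)
```

Note that the three probabilities for a sample do not have to sum to 1, since they come from three separately trained binary models.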

Expand Down Expand Up @@ -354,6 +358,8 @@ def predict_one_vs_all(X, clf0, clf1, clf2 ):
return classes
```

- https://randerson112358.medium.com/python-logistic-regression-program-5e1b32f964db
Resources :

- https://www.kaggle.com/code/rahulrajpandey31/logistic-regression-from-scratch-iris-data-set

- https://towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a
8 changes: 4 additions & 4 deletions subjects/ai/classification/audit/README.md
Expand Up @@ -6,7 +6,7 @@

##### Run `python --version`

###### Does it print `Python 3.x`? x >= 8?
###### Does it print `Python 3.x`? x >= 9?

###### Does `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error?
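A quick pre-check can be sketched with the standard library only. The helper `missing_packages` is a hypothetical name; it only tests whether each package can be located, which is a weaker check than actually importing it.

```python
import importlib.util
import sys

def missing_packages(names):
    """Return the top-level package names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]

required = ["jupyter", "numpy", "pandas", "matplotlib", "sklearn"]
print("Python", sys.version.split()[0])
print("Missing:", missing_packages(required))
```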

Expand All @@ -31,7 +31,6 @@ Score:
0.7142857142857143
```


---

---
Expand Down Expand Up @@ -73,9 +72,9 @@ Coefficient: [[1.18866075]]

###### For question 4, does `predict_probability` output the same probabilities as `predict_proba`? Note that the values have to match one of the class probabilities, not both. To do so, compare the output with: `clf.predict_proba(X)[:,1]`. The shape of the arrays is not important.
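One way to run this comparison is sketched below. It assumes that `coefs` holds the intercept first and the slope second, and that the data is 1-dimensional; the data-generation parameters are also assumptions, so adapt them to the student's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def predict_probability(coefs, X):
    """Assumed layout: coefs[0] is the intercept, coefs[1] is the slope."""
    z = coefs[0] + coefs[1] * X.ravel()
    return 1.0 / (1.0 + np.exp(-z))

# Assumed data-generation parameters (1 feature, 2 classes).
X, y = make_classification(
    n_samples=100, n_features=1, n_informative=1, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)
clf = LogisticRegression().fit(X, y)

coefs = [clf.intercept_[0], clf.coef_[0, 0]]
manual = predict_probability(coefs, X)
reference = clf.predict_proba(X)[:, 1]
matches = np.allclose(manual, reference)
print(matches)
```

`np.allclose` is used rather than `==` because the two computations can differ by floating-point rounding.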

###### Does `predict_class` output the same classes as `clf.predict(X)` for question 5? The shape of the arrays is not important.
###### Does `predict_class` output the same classes as `clf.predict(X)` for question 5? The shape of the arrays is not important.

###### Does the plot for question 6 look like the plot below? As mentioned, it is not required to shift the class prediction to make the plot easier to understand.
###### Does the plot for question 6 look like the plot below? As mentioned, it is not required to shift the class prediction to make the plot easier to understand.

![alt text][ex3q6]

Expand Down Expand Up @@ -193,6 +192,7 @@ As said, for some reasons, the results may be slightly different from mine becau
---

#### Bonus

#### Exercise 6: Multi-class (Optional)

##### The exercise is validated if all questions of the exercise are validated
30 changes: 19 additions & 11 deletions subjects/ai/data-wrangling/README.md
@@ -1,13 +1,21 @@
# Data wrangling
## Data wrangling

Data wrangling is one of the crucial tasks in data science and analysis which includes operations like:
### Overview

Data wrangling is one of the crucial tasks in data science and analysis.

### Role Play

You are a newly hired data analyst at a major e-commerce company. Your first assignment is to clean and prepare various datasets for analysis. The company's data comes from multiple sources and in different formats. Your manager has tasked you with combining these datasets, dealing with missing or inconsistent data, and preparing summary reports. You'll need to use your data wrangling skills to transform raw data into a format suitable for analysis and visualization.

### Learning Objectives

- Data Sorting: To rearrange values in ascending or descending order.
- Data Filtration: To create a subset of available data.
- Data Reduction: To eliminate or replace unwanted values.
- Data Access: To read or write data files.
- Data Processing: To perform aggregation, statistical, and similar operations on specific values.
Ax explained before, Pandas is an open source library, specifically developed for data science and analysis. It is built upon the Numpy (to handle numeric data in tabular form) package and has inbuilt data structures to ease-up the process of data manipulation, aka data munging/wrangling.
As explained before, Pandas is an open source library, specifically developed for data science and analysis. It is built upon the Numpy (to handle numeric data in tabular form) package and has inbuilt data structures to ease-up the process of data manipulation, aka data munging/wrangling.

### Exercises of the day

Expand Down Expand Up @@ -45,7 +53,7 @@ I suggest to use the most recent one.

---

# Exercise 0: Environment and libraries
### Exercise 0: Environment and libraries

The goal of this exercise is to set up the Python work environment with the required libraries.

Expand All @@ -57,13 +65,13 @@ I recommend to use:
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required

1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `tabulate` and `jupyter`.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy`, `tabulate` and `jupyter`.

---

---

# Exercise 1: Concatenate
### Exercise 1: Concatenate

The goal of this exercise is to learn to concatenate DataFrames. The logic is the same for the Series.
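As a rough illustration, `pd.concat` stacks DataFrames that share the same columns. The contents of `df1` and the column names below are assumptions; only the two rows of `df2` follow the snippet in the exercise.

```python
import pandas as pd

# df1 and the column names are hypothetical.
df1 = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])
df2 = pd.DataFrame([["c", 1], ["d", 2]], columns=["letter", "number"])

# ignore_index=True rebuilds a clean 0..n-1 index on the result.
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
```

Without `ignore_index=True`, the original index labels (0, 1, 0, 1) would be kept, which is often not what is wanted.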

Expand All @@ -82,7 +90,7 @@ df2 = pd.DataFrame([['c', 1], ['d', 2]],

---

# Exercise 2: Merge
### Exercise 2: Merge

The goal of this exercise is to learn to merge DataFrames.
The logic of merging DataFrames in Pandas is quite similar to the one used in SQL.
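The SQL analogy can be sketched with `pd.merge`: `how="inner"` behaves like an inner join and `how="outer"` like a full outer join. The two small DataFrames and the suffix names below are hypothetical, not the ones used in the exercise.

```python
import pandas as pd

# Hypothetical tables sharing the key column "id".
left = pd.DataFrame({"id": [1, 2, 3], "Feature1": ["A", "C", "E"]})
right = pd.DataFrame({"id": [2, 3, 4], "Feature1": ["K", "M", "O"]})

# Inner join: only the ids present in both tables.
inner = pd.merge(left, right, on="id", how="inner",
                 suffixes=("_left", "_right"))

# Outer join: all ids, with NaN where one side has no match.
outer = pd.merge(left, right, on="id", how="outer",
                 suffixes=("_left", "_right"))
print(inner)
print(outer)
```

The `suffixes` argument disambiguates columns that carry the same name in both inputs, so they do not have to be renamed by hand afterwards.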
Expand Down Expand Up @@ -132,7 +140,7 @@ df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])

---

# Exercise 3: Merge MultiIndex
### Exercise 3: Merge MultiIndex

The goal of this exercise is to learn to merge DataFrames with MultiIndex.
Use the code below to generate the DataFrames. `market_data` contains fake market data. In finance, the market is available during the trading days (business days). `alternative_data` contains fake alternative data from social media. This data is available every day. However, for some reason, the Data Engineer lost the last 15 days of alternative data.
Expand Down Expand Up @@ -171,7 +179,7 @@ Use the code below to generate the DataFrames. `market_data` contains fake marke

---

# Exercise 4: Groupby Apply
### Exercise 4: Groupby Apply

The goal of this exercise is to learn to group the data and apply a function on the groups.
The use case we will work on is computing
Expand Down Expand Up @@ -241,7 +249,7 @@ Here is what the function should output:

---

# Exercise 5: Groupby Agg
### Exercise 5: Groupby Agg

The goal of this exercise is to learn to compute different types of aggregations on the groups. This small DataFrame contains products and prices.
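A minimal sketch of several aggregations computed in one pass is given here. The product names, prices and the choice of `min`, `max` and `mean` are assumptions for illustration; the exercise defines its own DataFrame and expected statistics.

```python
import pandas as pd

# Hypothetical products and prices.
df = pd.DataFrame({
    "product": ["chair", "chair", "table", "table"],
    "price": [10.0, 20.0, 100.0, 300.0],
})

# One row per product, one column per aggregation.
summary = df.groupby("product")["price"].agg(["min", "max", "mean"])
print(summary)
```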

Expand Down Expand Up @@ -269,7 +277,7 @@ Note: The columns don't have to be MultiIndex

---

# Exercise 6: Unstack
### Exercise 6: Unstack

The goal of this exercise is to learn to unstack a MultiIndex.
Let's assume we trained a machine learning model that predicts a daily score on the companies (tickers) below. It may be very useful to unstack the MultiIndex: plot the time series, vectorize the backtest, ...
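To make the idea concrete, the sketch below builds a small Series indexed by date and ticker and moves the ticker level into the columns. The dates, tickers, level names and scores are all hypothetical.

```python
import pandas as pd

# Hypothetical (Date, Ticker) MultiIndex with one score per pair.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2021-01-01", "2021-01-04"]), ["AAPL", "FB"]],
    names=["Date", "Ticker"],
)
scores = pd.Series([0.1, 0.2, 0.3, 0.4], index=index, name="Prediction")

# One row per date, one column per ticker.
wide = scores.unstack(level="Ticker")
print(wide)
```

In this wide layout each ticker is a separate column, which is convenient for plotting one time series per company.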
4 changes: 2 additions & 2 deletions subjects/ai/data-wrangling/audit/README.md
Expand Up @@ -6,7 +6,7 @@

##### Run `python --version`.

###### Does it print `Python 3.x`? x >= 8
###### Does it print `Python 3.x`? x >= 9

###### Does `import jupyter`, `import numpy` and `import pandas` run without any error?

Expand Down Expand Up @@ -52,7 +52,7 @@
| 5 | 6 | nan | nan | O | P |
| 6 | 7 | nan | nan | Q | R |
| 7 | 8 | nan | nan | S | T |

Note: Check that the suffixes are set using the `suffixes` parameter rather than by manually renaming the columns.

---
30 changes: 20 additions & 10 deletions subjects/ai/keras-2/README.md
@@ -1,4 +1,14 @@
# Keras 2
## Keras 2

### Overview

This exercise set focuses on advanced applications of Keras for building and training neural networks. You'll work on both regression and multi-class classification problems, using real-world datasets like the Auto MPG and Iris datasets.

### Role Play

You're a data scientist at a biotech company developing AI-powered systems for various applications. Your current project involves creating neural networks for both regression and multi-class classification tasks. You'll be working on predicting car fuel efficiency and classifying flower species, showcasing the versatility of neural networks in different domains.

### Learning Objectives

The goal of this day is to learn to use Keras to build Neural Networks and train them on small data sets. This helps to understand the specifics of networks for classification and regression.

Expand Down Expand Up @@ -28,15 +38,15 @@ The audit will provide the code and output because it is not straightforward to
_Version of Keras I used to do the exercises: 2.4.3_.
I suggest using the most recent one.

### **Resources**
### Resources

- https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
- [Neural network](https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/)

---

---

# Exercise 0: Environment and libraries
### Exercise 0: Environment and libraries

The goal of this exercise is to set up the Python work environment with the required libraries.

Expand All @@ -48,13 +58,13 @@ I recommend to use:
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required

1. Create a virtual environment named with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter` and `keras`.
1. Create a virtual environment named with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy`, `jupyter` and `keras`.

---

---

# Exercise 1: Regression - Optimize
### Exercise 1: Regression - Optimize

The goal of this exercise is to learn to set up the optimization for a regression neural network. There's no code to run in that exercise. In W2D2E3, we implemented a neural network designed for regression. We will be using this neural network:

Expand Down Expand Up @@ -88,7 +98,7 @@ https://keras.io/api/metrics/regression_metrics/

---

# Exercise 2: Regression example
### Exercise 2: Regression example

The goal of this exercise is to learn to train a neural network to perform a regression on a data set.
The data set is [Auto MPG Dataset](auto-mpg.csv) and the goal is to build a model to predict the fuel efficiency of late-1970s and early 1980s automobiles. To do this, provide the model with a description of many automobiles from that time period. This description includes attributes like: cylinders, displacement, horsepower, and weight.
Expand All @@ -109,7 +119,7 @@ https://www.tensorflow.org/tutorials/keras/regression

---

# Exercise 3: Multi classification - Softmax
### Exercise 3: Multi classification - Softmax

The goal of this exercise is to learn to build a neural network architecture for multi-class data. This is an important type of problem on which to practice with neural networks because the three class values require specialized handling. A multi-classification neural network uses a **softmax** layer as its output layer. The **softmax** activation function is an extension of the sigmoid, as it is designed to output the probabilities of belonging to each class in a multi-class problem. This output layer has to contain as many neurons as there are classes in the multi-classification problem. This article explains in detail how it works: https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax
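To see what a softmax output layer computes, here is a NumPy sketch (not Keras code). The input scores are hypothetical; subtracting the row maximum before exponentiating is a common numerical-stability choice and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

# Hypothetical raw scores for one sample and three classes.
logits = np.array([[2.0, 1.0, 0.1]])
probs = softmax(logits)
print(probs, probs.sum())
```

The largest score receives the largest probability, and every class keeps a non-zero share.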

Expand All @@ -126,7 +136,7 @@ Let us assume we want to classify images and we know they contain either apples,

---

# Exercise 4: Multi classification - Optimize
### Exercise 4: Multi classification - Optimize

The goal of this exercise is to learn to optimize a multi-classification neural network. As learnt previously, the loss function used in binary classification is the log loss - also called in Keras `binary_crossentropy`. This function is defined for binary classification and can be extended to multi-classification. In Keras, the extended loss that supports multi-classification is `categorical_crossentropy`. There's no code to run in that exercise.
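For intuition, the multi-class extension of the log loss (exposed in Keras as `categorical_crossentropy` for one-hot targets) averages `-sum(y_true * log(y_pred))` over the samples. The NumPy function below is an illustrative re-implementation under that definition, not the Keras source, and the labels and predictions are hypothetical.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Mean over samples of -sum(y_true * log(y_pred)) with one-hot targets."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

# Hypothetical one-hot labels and predicted probabilities (3 classes).
y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])

loss = categorical_crossentropy(y_true, y_pred)
print(loss)
```

Only the probability assigned to the true class contributes to each sample's loss, so confident correct predictions drive the value towards 0.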

Expand All @@ -142,7 +152,7 @@ model.compile(loss='',#TODO1

---

# Exercise 5 Multi classification example
### Exercise 5: Multi classification example

The goal of this exercise is to learn to use a neural network to classify a multiclass data set. The data set used is the Iris data set, which makes it possible to classify flowers given basic features such as the flower's measurements.

3 changes: 1 addition & 2 deletions subjects/ai/keras-2/audit/README.md
Expand Up @@ -6,7 +6,7 @@

##### Run `python --version`.

###### Does it print `Python 3.x`? x >= 8
###### Does it print `Python 3.x`? x >= 9

###### Do `import jupyter`, `import numpy`, `import pandas` and `import keras` run without any error?

Expand Down Expand Up @@ -131,7 +131,6 @@ model.compile(loss='categorical_crossentropy',

---


#### Exercise 5: Multi classification example

##### The exercise is validated if all questions of the exercise are validated