Hello potential Laine colleague!
If you are reading this, you are probably applying for a Machine Learning engineering job at Laine. This coding challenge will evaluate whether you have the right skills for the job!
Completing the challenge should take about half a day if you have the relevant experience. In this challenge, you will build a job description classifier and deploy your model on Google Cloud Platform (GCP) Vertex AI.
A job description is a paragraph of text that describes a certain job position. For example:
You will mostly work with TensorFlow and Python to solve hard Machine Learning tasks and help to put these in production.
For the challenge, we ask you to create an ML model that classifies such texts into 5 categories: IT Jobs, Sales Jobs, HR & Recruitment Jobs, Accounting & Finance Jobs, and Customer Services Jobs.
This repository contains the template code you can start with. You can clone this repository using the command:
git clone [email protected]:ml6team/laine-engineer-coding-challenge.git
In the end you'll need to deploy your model on GCP, so you need to register a Google Cloud account. You'll need a credit card for the registration, but you'll receive some free credits from Google so you can start your development for free.
You need to install the `gcloud` command on your system; it is part of the Google Cloud SDK. You'll use this command in several steps.
This challenge requires Python 3.7 for compatibility with TensorFlow 2.1.0. If you don't have Python 3.7 installed, you can install it using pyenv:
# Install pyenv
curl https://pyenv.run | bash
export PATH="$HOME/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
# Install Python 3.7
pyenv install 3.7.16
pyenv global 3.7.16
# Create virtual environment
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
Before you begin implementing your classification model, you need to download the data used for training and local evaluation from Google Cloud Storage and place the `data` folder in the base folder. To download the data, execute the following command:
gsutil -m cp -r gs://ml6_junior_ml_engineer_challenge_cv_job_description_data/data .
For your purposes, the data has already been split into a training set and a validation set, stored in the `train.csv` and `eval.csv` files respectively. There are five job classes: Sales Jobs, Customer Services Jobs, IT Jobs, HR & Recruitment Jobs and Accounting & Finance Jobs, labeled 0 to 4 as defined in `trainer/config.py`. If you want, you can inspect the data. The code that loads the CSV files into texts and labels is already provided to you.
The test set will be used for the final evaluation when you submit your solution. Hence, the `test.csv` file is not provided.
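If you want a quick look at the data before writing any code, a minimal inspection sketch is shown below; the label-column guess is an assumption, and the provided loader in `trainer/task.py` remains the source of truth for how the CSVs are parsed.

```python
# Quick, optional inspection of the training data.
import pandas as pd

train_df = pd.read_csv("data/train.csv")

print(train_df.shape)             # number of rows and columns
print(train_df.columns.tolist())  # discover the actual column names
print(train_df.head())

# Assuming the label column is the last one (an assumption; verify against
# trainer/config.py, which defines the 0-4 label mapping).
label_col = train_df.columns[-1]
print(train_df[label_col].value_counts())
```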
You can start your development on your local PC / remote server / Cloud VM. The goal is to train a job description classifier that achieves as high an accuracy as possible. To do so, you need to implement the data preprocessor, the model definition, the model exporter and the model loader. The main program logic is written in `task.py`, where the dataset loader and model training code are also provided.
The general local workflow is as follows:
The template Python files are provided in the `trainer` folder:
task.py: containing the main training logic, such as loading the CSV dataset, preprocessing the data, model training, model evaluation and model exporting. Normally you don't need to modify this file.
config.py: containing the global configuration. You can change existing definitions such as EPOCHS and BATCH_SIZE to control the training, or define additional global variables as needed.
preprocess.py: containing the preprocessor class. You should implement this class to preprocess the input data. Notice that the preprocessing may need to differ between the training, validation and test sets (one possible approach is sketched further below).
model.py: containing the model definition. You should implement your model in this file.
export.py: containing the code for exporting and loading the model. You should implement these methods so that your model can be exported to file and loaded from file.
predictor.py: containing the entry code for the online prediction (used by the containerized server).
In principle, feel free to change any files in the trainer folder. Just remember that the objective is to deploy your classifier on Google Cloud Vertex AI: your model will then accept API requests containing the job description text and return the job category as the classification result.
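To make the split across `preprocess.py`, `model.py` and `export.py` more concrete, here is a rough, self-contained sketch of one possible approach (a small Keras text classifier under TensorFlow 2.1). The function names, hyperparameters and tokenizer-based preprocessing are illustrative assumptions; the templates define the actual classes and signatures you should implement.

```python
# Illustrative sketch only; adapt it to the class/method signatures of the
# provided templates. Assumes TensorFlow 2.1 and the 5 classes from config.py.
import tensorflow as tf

VOCAB_SIZE = 20000   # assumption: tune as needed
MAX_LEN = 200        # assumption: tokens kept per job description
NUM_CLASSES = 5

# preprocess.py idea: fit a tokenizer on the training texts, reuse it everywhere
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=VOCAB_SIZE, oov_token="<unk>")

def fit_preprocessor(train_texts):
    tokenizer.fit_on_texts(train_texts)

def preprocess(texts):
    sequences = tokenizer.texts_to_sequences(texts)
    return tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=MAX_LEN)

# model.py idea: a small embedding + pooling classifier
def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, 64, input_length=MAX_LEN),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# export.py idea: save/load the model as a TensorFlow SavedModel
def export_model(model, export_dir="output/saved_model"):
    model.save(export_dir, save_format="tf")

def load_model(export_dir="output/saved_model"):
    return tf.keras.models.load_model(export_dir)
```

Note that whatever fitted preprocessing state your model needs at prediction time (here, the tokenizer) also has to be exported alongside the model, because the deployed container only sees what you save.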
You can train your model by running the below command after implementing the files above:
python3 trainer/task.py
In the main folder you will see two main files for deploying to Vertex AI:
**server.py**: Specifies how your app should serve and process requests once deployed on Vertex AI. It has been specifically tailored to format your model inputs/outputs into those expected by the Vertex AI APIs. You should NOT need to change this file.
**Dockerfile**: Specifies how the container image should be built. You should NOT need to change this file.
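For reference, Vertex AI sends prediction requests to a custom container as a JSON body with an `instances` list and expects a JSON response with a `predictions` list; `server.py` takes care of this translation for you. A minimal illustration of the shapes involved (the exact content of each instance depends on what your predictor expects):

```python
# Illustrative request/response shapes for a Vertex AI custom container.
request_body = {
    "instances": ["You will mostly work with TensorFlow and Python ..."]
}

# What the container sends back: one prediction per instance.
response_body = {
    "predictions": [0]   # a class id between 0 and 4
}
```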
After training the model locally, you can create a GCP project and deploy your model on GCP Vertex AI. Notice that the GCP deployment is also part of the coding challenge, to see whether you can quickly get familiar with a cloud environment; it's as important as the model training part. Please read the guidelines carefully and follow the deployment steps in order.
Since we want to provide flexibility on the approaches that you can choose, we don't restrict your solution to be a TensorFlow model with fixed input / output format. In order to deploy a customized model on GCP Vertex AI, you will use GCP's custom container feature, which allows you to control the logic of model loading, data preprocessing and results postprocessing.
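As a rough illustration of what that prediction-time logic can look like, here is a sketch that builds on the hypothetical `preprocess` and `load_model` helpers from the earlier sketch; the class name is made up, and the actual entry points are defined by the `predictor.py` template.

```python
# Sketch of the flow a custom container lets you control: load the exported
# model once, preprocess incoming texts, run inference, postprocess the scores.
import numpy as np

class JobClassifierPredictor:  # hypothetical name, not necessarily the template's
    def __init__(self, export_dir="output/saved_model"):
        self.model = load_model(export_dir)           # model loading (export.py)

    def predict(self, instances):
        inputs = preprocess(instances)                # same preprocessing as training
        scores = self.model.predict(inputs)           # model inference
        return np.argmax(scores, axis=1).tolist()     # class ids 0-4, one per instance
```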
The general online workflow is as follows:
Now we explain the model deployment step-by-step:
1. Before deploying, make sure that:
- You have completed the missing methods that were required for model training, and the training was successful.
- You have exported your trained model to a file, e.g. a TensorFlow model saved in `output/saved_model`.
- You have installed the required dependencies and can run the training and the local prediction server:
source venv/bin/activate
pip install -r requirements.txt
python3 trainer/task.py   # runs the training end-to-end
python3 server.py         # starts the prediction server locally
2. Ensure the Google Cloud CLI is set up and your environment has the required APIs enabled:
gcloud auth login
gcloud services enable aiplatform.googleapis.com
gcloud services enable artifactregistry.googleapis.com
3. Build your container image with Cloud Build and push it to the registry:
gcloud builds submit --tag gcr.io/<PROJECT_ID>/job-classifier:v1 .
4. Upload your model to Vertex AI and create an endpoint:
gcloud ai models upload \
--region=europe-west1 \
--display-name=<MODEL_NAME> \
--container-image-uri=gcr.io/<PROJECT_ID>/job-classifier:v1
gcloud ai endpoints create \
--region=europe-west1 \
--display-name=<ENDPOINT_NAME>
You can check the IDs of the created resources with:
gcloud ai models list --region=europe-west1
gcloud ai endpoints list --region=europe-west1
5. Finally, deploy your model to the endpoint. It's important to use the model and endpoint IDs, not the display names you gave them.
gcloud ai endpoints deploy-model <ENDPOINT_ID> \
--region=europe-west1 \
--model=<MODEL_ID> \
--display-name=job-classifier-deployment \
--traffic-split=0=100
THIS MAY TAKE A WHILE.
Before you submit your solution, you can check if your deployed model works by listing the created endpoint and running a test on it.
gcloud ai endpoints predict <ENDPOINT_ID> \
--region=europe-west1 \
--json-request=check_deployed_model/test.json
Check whether you are able to get a prediction out of the `gcloud` command. If you get errors, try to resolve them before submitting the solution. The output of the command should look something like this (the numbers will probably be different):
{
"predictions": [0]
}
The value to use for `<ENDPOINT_ID>` can be found by running the list command above. You will need this value and your Google Cloud Project ID to submit your coding test.
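If you prefer to test the deployed endpoint from Python rather than via `gcloud`, the `google-cloud-aiplatform` client library can do the same thing. The snippet below is a sketch; the instance format (a list of raw job description strings) is an assumption, so mirror whatever your `check_deployed_model/test.json` contains.

```python
# Optional sanity check from Python (pip install google-cloud-aiplatform).
from google.cloud import aiplatform

aiplatform.init(project="<PROJECT_ID>", location="europe-west1")

endpoint = aiplatform.Endpoint("<ENDPOINT_ID>")   # numeric endpoint id
response = endpoint.predict(
    instances=["You will mostly work with TensorFlow and Python ..."]
)
print(response.predictions)   # e.g. [0]
```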
To pass the coding test, you should reach an accuracy of 75% on our secret dataset of job descriptions (which you don't have access to). If your accuracy turns out to be below 75% after we evaluate it, you can keep submitting new solutions until you reach 75%.
Once you are able to execute the command above without errors, you can add us to your project:
- Go to the menu of your project
- Click IAM & admin
- Click Add
- Add laine-coding-challenge-eval@zippy-carving-465819-p5.iam.gserviceaccount.com as a member with the role Project Owner
After you have added us to your project, you should fill in this form so we are able to automatically evaluate your solution to the coding test. Once you've filled in the form, someone from Laine will run the eval pipeline and get back to you. We hope with you that your results are good enough to land an interview at Laine. If they aren't, you can resubmit a new solution as many times as you want, so don't give up!
If you are invited for an interview at Laine afterwards, make sure to bring your laptop with a copy of the code you wrote, so you can explain your `model.py` file to us.
Once you have finished the coding challenge and (hopefully!) received a positive outcome, you can run the commands below to tear down the resources you have created.
# get deployed model id
gcloud ai endpoints describe <ENDPOINT_ID> --region=europe-west1
# then run the command below with the deployed model id from the output above
gcloud ai endpoints undeploy-model <ENDPOINT_ID> \
--deployed-model-id=<DEPLOYED_MODEL_ID> \
--region=europe-west1 \
--quiet
gcloud ai endpoints delete <ENDPOINT_ID> --region=europe-west1 --quiet
gcloud ai models delete <MODEL_ID> --region=europe-west1 --quiet
gcloud container images delete gcr.io/<PROJECT_ID>/job-classifier:v1 --quiet --force-delete-tags
Now your project should be clean once more! Note: you will need to do this for each submission you created if you changed the model and endpoint names. You can verify that the cleanup was successful by running:
gcloud ai models list --region=europe-west1
gcloud ai endpoints list --region=europe-west1