This repository contains code and logic for building a categorization engine for bank transactions.
- Synthetic Data Generation
- Cleaning the Transactional Query
- Embeddings/Features Creation
- Training Pipeline
- Inferencing Pipeline
- Deploying as an API endpoint
- Results
I took the sample schema of the dataset that was shared and created synthetic data following the same pattern.
Shared Sample Dataset
Synthetically Generated Dataset
The configs for generating the data are defined in the config_data.py file. The generator takes in num_of_users and the places-by-tag mapping from the config file and creates random transactions for these users. The generation function handles the following:
- Loop over the number of users specified in the config file.
- Randomly select a place for the transaction that hasn't already been used for that user.
- Add random noise to the transaction, such as 'XXX' sequences and transaction prefixes like POS, plus random alphanumeric characters.
Dump the synthetic data using data_generator.py. You can specify the output folder where you want to dump it.
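For reference, here is a rough sketch of what the noise injection looks like conceptually (the helper name and config values below are illustrative, not the exact ones in config_data.py):

```python
import random
import string

# Hypothetical config values; the real ones live in config_data.py
NUM_OF_USERS = 100
PLACES_BY_TAG = {
    "Shopping": ["IKEA INDIA PVT L", "DMART"],
    "Medical": ["APOLLO PHARMACY PVT L"],
    "Food": ["ZOMATO", "SWIGGY"],
}
PREFIXES = ["POS", "MPS", "BIL"]

def make_noisy_transaction(place: str) -> str:
    """Wrap a clean merchant name in the kind of noise seen in real statements."""
    prefix = random.choice(PREFIXES)
    masked_card = "X" * 12 + "".join(random.choices(string.digits, k=4))
    suffix = "".join(random.choices(string.ascii_uppercase + string.digits, k=6))
    return f"{prefix} {masked_card} {place} {suffix}"

rows = []
for user_id in range(NUM_OF_USERS):
    tag = random.choice(list(PLACES_BY_TAG))
    # The real generator also avoids reusing a place for the same user
    place = random.choice(PLACES_BY_TAG[tag])
    rows.append({"user_id": f"User{user_id}", "transaction": make_noisy_transaction(place), "tag": tag})
```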
- Remove punctuation
- Remove alphanumeric characters
- Remove unnecessary sequences of 'XXX'
- Remove transaction prefixes such as POS, MPS, BIL, etc.
This is a custom way of cleaning the transaction query: we want to remove all the noise from the data and focus just on the merchant.
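Here is a minimal regex-based sketch of such a cleaner (the exact patterns used in the repository may differ):

```python
import re
import string

# Prefixes commonly seen at the start of bank transaction strings (assumed list)
PREFIXES = ("pos", "mps", "bil", "upi", "neft")

def clean_transaction(text: str) -> str:
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove runs of 'x' used to mask card numbers
    text = re.sub(r"\bx{2,}\w*\b", " ", text)
    # Remove tokens that mix letters and digits (reference numbers, card fragments)
    text = re.sub(r"\b(?=\w*\d)(?=\w*[a-z])\w+\b", " ", text)
    # Remove known transaction prefixes
    text = re.sub(r"\b(" + "|".join(PREFIXES) + r")\b", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_transaction("POS XXXXXXXXXXXX1111 IKEA INDIA PVT L A1B2C3"))
# -> "ikea india pvt l"
```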
The results after doing this are:
You can see we are able to reduce much of the noise, though some of it still remains.
Just to experiment with something new, I tried giving the transaction query to a chat LLM to see if it could give me the exact merchant name.
The results were very impressive. The constraints of this approach would require us to handle the following:
- How can we send multiple transaction queries in one pass? (We will keep the context window in mind.)
- Profile the time required to do so: what kind of latency are we looking at?
- Can we use a lighter open-source model and still achieve similar results? (It would be easier to deploy for a custom use case.)
- Would we need to fine-tune a base model for this task, or would prompt engineering with few-shot examples work for us?
We need to answer the above questions before using this approach to extract the merchant from the transaction. You can follow along in this notebook to see how to extract the merchant from the transaction query using LangChain and ChatGPT.
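As a rough sketch, an extraction call with LangChain and an OpenAI chat model could look like this (the model name and prompt wording are assumptions, not necessarily what the notebook uses):

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Assumed model and prompt; requires OPENAI_API_KEY to be set
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract only the merchant name from the bank transaction string. "
               "Reply with the merchant name and nothing else."),
    ("human", "{transaction}"),
])

chain = prompt | llm
result = chain.invoke({"transaction": "POS XXXXXXXXXXXX1111 IKEA INDIA PVT L"})
print(result.content)
```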
To create our feature set, we use embeddings to represent our transaction words as contextual vectors (similar words end up close together in the vector space). For the purpose of this project I did not experiment with anything large. What I kept in mind were these criteria:
- Multilingual embeddings
- Small size of the embedding model
- Good accuracy across the MTEB leaderboard
So I headed over to the MTEB leaderboard to find such a model and ended up choosing all-MiniLM-L6-v2.
The chosen embedding model projects our transactions into a 384-dimensional vector space, and we use each dimension as a feature.
Head over to this notebook to follow how I create embeddings from the transaction queries.
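For illustration, encoding cleaned transactions with sentence-transformers looks roughly like this (variable names are illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

cleaned_transactions = ["ikea india pvt l", "apollo pharmacy pvt l", "zomato"]
embeddings = model.encode(cleaned_transactions)  # numpy array of shape (3, 384)

print(embeddings.shape)
```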
Training is kept quite simple because we are working with synthetic data. I decided to go with a bagging approach using SVC and a gradient boosting algorithm. I run a randomized search CV over the hyperparameters to find the best combination, then store the test results and the model file in the assets folder.
trainer.py is the module where you can find the code for training the model.
Some of the things that I kept in mind while preparing the data for training:
- Use user_id as a feature in itself. We want to incorporate the user's past preferences to increase the chances of predicting the right tag for that user.
- Split train/test stratified on user_id. We want to keep every user represented in our training set, as it is essential for our model to make predictions based on their previous taggings.
- Since this data is synthetically created and every user has similar places and transactions (although with different noise), this model is definitely going to overfit.
The model has clearly overfitted (because of common keywords across multiple users). But we are not benchmarking scores on synthetic data; that would only make sense once we have a large amount of actual transaction data.
The training took approximately 6 minutes to complete. You can follow along in this notebook to see how to train on the synthetic data.
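Below is a condensed sketch of the training setup described above, assuming scikit-learn >= 1.2 (where the bagged base model is passed as estimator). The parameter grid and dummy data are illustrative; the actual code lives in trainer.py.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Dummy stand-ins for the real feature matrix: 384-dim embeddings plus an
# encoded user_id column, with a tag label per row.
rng = np.random.default_rng(42)
users = np.tile(np.arange(20), 10)                          # 20 users, 10 rows each
X = np.hstack([rng.normal(size=(200, 384)), users.reshape(-1, 1)])
y = rng.integers(0, 5, size=200)                            # 5 dummy tags

# Stratify the split on user_id so every user stays represented in training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=users, random_state=42
)

# Assumed search space, not the exact one used in trainer.py
param_distributions = {
    "n_estimators": [5, 10, 20],
    "estimator__C": np.logspace(-1, 2, 10),
}

search = RandomizedSearchCV(
    BaggingClassifier(estimator=SVC(probability=True)),
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    n_jobs=-1,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```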
Inference is pretty simple: we use the same feature_generator.py module that was used at training time, but with the inference settings, to create the features. For the pipeline, refer to this inference pipeline. At inference time we:
- Preprocess/clean the data the same way we did at training time.
- Check whether the user is already in the user base and the model was trained on that user's data. If not, we assign the user a value of -1.
- Predict the top three labels with their probability scores.
Follow this notebook for a walkthrough of how inference is done.
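A rough sketch of the top-3 prediction step (the function and variable names are illustrative, not the actual ones from the inference module):

```python
import numpy as np

# Users seen during training (illustrative); unseen users are mapped to -1
user_to_index = {"User2": 2, "User3": 3}
user_id = user_to_index.get("User200", -1)   # User200 was not in the training data

def predict_top3(model, label_names, features):
    """Return the three most likely tags with their probability scores."""
    proba = model.predict_proba(features)            # shape: (n_samples, n_classes)
    results = []
    for row in proba:
        top3 = np.argsort(row)[::-1][:3]             # indices of the 3 highest probabilities
        results.append([(label_names[i], float(row[i])) for i in top3])
    return results
```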
I built a Flask app around the inference module and containerized it for hosting.
I used the ECS service from AWS to host the Docker image, with the default scaling and security groups.
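A minimal sketch of what the Flask endpoint could look like (the route matches the API below; run_inference is a stand-in for the actual inference module call):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_inference(users, transactions):
    """Stand-in for the actual inference module; returns top-3 (tag, prob) lists."""
    return [[["Shopping", 0.6], ["Medical", 0.2], ["Travel", 0.1]]] * len(transactions)

@app.route("/predict_tag", methods=["POST"])
def predict_tag():
    payload = request.get_json()
    probs = run_inference(payload["users"], payload["transactions"])
    return jsonify({"result": "success", "data": {"prob": probs}})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```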
API ENDPOINT: http://13.233.3.50:8080/predict_tag
To send a sample request through curl:
curl -X POST http://13.233.3.50:8080/predict_tag -H "Content-Type: application/json" -d "{\"users\": [\"User200\", \"User3\"], \"transactions\":[\"POS XXXXXXXXXXXX1111 IKEA INDIA PVT L\", \"POS XXXXXXXXXXXX1111 APOLLO PHARMACY PVT L\"]}"
Output:
{"data":{"prob":[[["Shopping",0.6422143460736127],["Medical",0.18530665751017797],["Travel",0.11028613262101186]],[["Medical",0.9988152915012749],["Subscription",0.0006008496836324658],["Shopping",0.0004127586966547764]]]},"result":"success"}
- Since we have used the user as a feature, let us see the effect on the probability score when the user is in the training data vs. when they are not. In the above image, User2 is in the training data and User200 isn't, so there is a decrease of 10% in confidence.
- Since we are using embeddings, similar things should be close to each other. So let us try sending a request for Anjana Sweets, which is not in our training set. We see that it is able to correctly predict the tag for Anjana Sweets.
However, the embedding space is not contextual enough to place zomato and swiggy in the same context.
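One quick way to sanity-check this is to look at the cosine similarity between the two merchant embeddings (a rough sketch, assuming the same all-MiniLM-L6-v2 model):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["zomato", "swiggy", "apollo pharmacy"])

print(util.cos_sim(emb[0], emb[1]))  # zomato vs swiggy
print(util.cos_sim(emb[0], emb[2]))  # zomato vs an unrelated merchant, for comparison
```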
We could do several things here:
- Try another set of embeddings that is particularly tuned for this domain.
- Build an embedding model ourselves by labelling enough such data points.