This repository contains code and logic for building a categorization engine for bank transactions.
- Synthetic Data Generation
- Cleaning the Transactional Query
- Embeddings/Features Creation
- Training Pipeline
- Inferencing Pipeline
- Deploying as an API endpoint
- Results
I took the sample schema of the dataset that was shared and created synthetic data following the same pattern.
Shared Sample Dataset
Synthetically Generated Dataset
The configs for generating the data are defined in the config_data.py file. The generator takes in num_of_users and the places-by-tag mapping from the config file and creates random transactions for these users. The generation function handles the following:
- Loop over the number of users specified in the config file.
- Randomly select a place for the transaction that hasn't already been used for that user.
- Add random noise to the transaction, such as 'XXX' sequences and transaction prefixes like POS, plus random alphanumeric characters.
Dump the synthetic data using data_generator.py. You can specify the output folder where you want to dump it.
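For reference, here is a rough sketch of what the noise injection looks like conceptually (the helper name and config values below are illustrative, not the exact ones in config_data.py):

```python
import random
import string

# Hypothetical config values; the real ones live in config_data.py
NUM_OF_USERS = 100
PLACES_BY_TAG = {
    "Shopping": ["IKEA INDIA PVT L", "DMART"],
    "Medical": ["APOLLO PHARMACY PVT L"],
    "Food": ["ZOMATO", "SWIGGY"],
}
PREFIXES = ["POS", "MPS", "BIL"]

def make_noisy_transaction(place: str) -> str:
    """Wrap a clean merchant name in the kind of noise seen in real statements."""
    prefix = random.choice(PREFIXES)
    masked_card = "X" * 12 + "".join(random.choices(string.digits, k=4))
    suffix = "".join(random.choices(string.ascii_uppercase + string.digits, k=6))
    return f"{prefix} {masked_card} {place} {suffix}"

rows = []
for user_id in range(NUM_OF_USERS):
    tag = random.choice(list(PLACES_BY_TAG))
    # The real generator also avoids reusing a place for the same user
    place = random.choice(PLACES_BY_TAG[tag])
    rows.append({"user_id": f"User{user_id}", "transaction": make_noisy_transaction(place), "tag": tag})
```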
- Remove punctuation
- Remove alphanumeric characters
- Remove unnecessary sequences of 'XXX'
- Remove transaction prefixes such as POS, MPS, BIL, etc.
This is a custom way of cleaning the transaction query: we want to remove all the noise from the data and focus just on the merchant.
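Here is a minimal regex-based sketch of such a cleaner (the exact patterns used in the repository may differ):

```python
import re
import string

# Prefixes commonly seen at the start of bank transaction strings (assumed list)
PREFIXES = ("pos", "mps", "bil", "upi", "neft")

def clean_transaction(text: str) -> str:
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove runs of 'x' used to mask card numbers
    text = re.sub(r"\bx{2,}\w*\b", " ", text)
    # Remove tokens that mix letters and digits (reference numbers, card fragments)
    text = re.sub(r"\b(?=\w*\d)(?=\w*[a-z])\w+\b", " ", text)
    # Remove known transaction prefixes
    text = re.sub(r"\b(" + "|".join(PREFIXES) + r")\b", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_transaction("POS XXXXXXXXXXXX1111 IKEA INDIA PVT L A1B2C3"))
# -> "ikea india pvt l"
```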
The results after doing this are:
You can see we are able to reduce much of the noise, though some of it still remains.
Just to experiment with something new, I tried giving the transaction query to a chat LLM to see if it could give me the exact merchant name.
The results were very impressive. The constraints of this approach would require us to handle the following:
- How can we send multiple transaction queries in one pass? (We will keep the context window in mind.)
- Profile the time required to do so: what kind of latency are we looking at?
- Can we use a lighter open-source model and still achieve similar results? (It would be easier to deploy for a custom use case.)
- Would we need to fine-tune a base model for this task, or would prompt engineering with few-shot examples work for us?
We need to answer the above questions before using this approach to extract the merchant from the transaction. You can follow along in this notebook to see how to extract the merchant from the transaction query using LangChain and ChatGPT.
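As a rough sketch, an extraction call with LangChain and an OpenAI chat model could look like this (the model name and prompt wording are assumptions, not necessarily what the notebook uses):

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Assumed model and prompt; requires OPENAI_API_KEY to be set
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract only the merchant name from the bank transaction string. "
               "Reply with the merchant name and nothing else."),
    ("human", "{transaction}"),
])

chain = prompt | llm
result = chain.invoke({"transaction": "POS XXXXXXXXXXXX1111 IKEA INDIA PVT L"})
print(result.content)
```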
To create our feature set, we use embeddings to represent our transaction words as contextual vectors (similar words end up close together in the vector space). For the purpose of this project I did not experiment with anything large. What I kept in mind were these criteria:
- Multilingual embeddings
- Small size of the embedding model
- Good accuracy across the MTEB leaderboard
So I headed over to the MTEB leaderboard to find such a model and ended up choosing all-MiniLM-L6-v2.
The chosen embedding model projects our transactions into a 384-dimensional vector space, and we use each dimension as a feature.
Head over to this notebook to follow how I create embeddings from the transaction queries.
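For illustration, encoding cleaned transactions with sentence-transformers looks roughly like this (variable names are illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

cleaned_transactions = ["ikea india pvt l", "apollo pharmacy pvt l", "zomato"]
embeddings = model.encode(cleaned_transactions)  # numpy array of shape (3, 384)

print(embeddings.shape)
```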
Training is kept quite simple because we are working with synthetic data. I decided to go with a bagging approach using SVC and a gradient boosting algorithm. I run a randomized search CV over the hyperparameters to find the best combination, then store the test results and the model file in the assets folder.
trainer.py is the module where you can find the code for training the model.
Some of the things that I kept in mind while preparing the data for training:
- Use user_id as a feature in itself. We want to incorporate the user's past preferences to increase the chances of predicting the right tag for that user.
- Split train/test stratified on user_id. We want to keep every user represented in our training set, as it is essential for our model to make predictions based on their previous taggings.
- Since this data is synthetically created and every user has similar places and transactions (although with different noise), this model is definitely going to overfit.
The model has clearly overfitted (because of common keywords across multiple users). But we are not benchmarking scores on synthetic data; that would only make sense once we have a large amount of actual transaction data.
The training took approximately 6 minutes to complete. You can follow along in this notebook to see how to train on the synthetic data.
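Below is a condensed sketch of the training setup described above, assuming scikit-learn >= 1.2 (where the bagged base model is passed as estimator). The parameter grid and dummy data are illustrative; the actual code lives in trainer.py.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Dummy stand-ins for the real feature matrix: 384-dim embeddings plus an
# encoded user_id column, with a tag label per row.
rng = np.random.default_rng(42)
users = np.tile(np.arange(20), 10)                          # 20 users, 10 rows each
X = np.hstack([rng.normal(size=(200, 384)), users.reshape(-1, 1)])
y = rng.integers(0, 5, size=200)                            # 5 dummy tags

# Stratify the split on user_id so every user stays represented in training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=users, random_state=42
)

# Assumed search space, not the exact one used in trainer.py
param_distributions = {
    "n_estimators": [5, 10, 20],
    "estimator__C": np.logspace(-1, 2, 10),
}

search = RandomizedSearchCV(
    BaggingClassifier(estimator=SVC(probability=True)),
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    n_jobs=-1,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```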
Inference is pretty simple: we use the same feature_generator.py module that was used at training time, but with the inference settings, to create the features. For the pipeline, refer to this inference pipeline. At inference time we:
- Preprocess/clean the data the same way we did at training time.
- Check whether the user is already in the user base and the model was trained on that user's data. If not, we assign the user a value of -1.
- Predict the top three labels with their probability scores.
Follow this notebook for a walkthrough of how inference is done.
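A rough sketch of the top-3 prediction step (the function and variable names are illustrative, not the actual ones from the inference module):

```python
import numpy as np

# Users seen during training (illustrative); unseen users are mapped to -1
user_to_index = {"User2": 2, "User3": 3}
user_id = user_to_index.get("User200", -1)   # User200 was not in the training data

def predict_top3(model, label_names, features):
    """Return the three most likely tags with their probability scores."""
    proba = model.predict_proba(features)            # shape: (n_samples, n_classes)
    results = []
    for row in proba:
        top3 = np.argsort(row)[::-1][:3]             # indices of the 3 highest probabilities
        results.append([(label_names[i], float(row[i])) for i in top3])
    return results
```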
I built a Flask app around the inference module and containerized it for hosting.
I used the ECS service from AWS to host the Docker image, with the default scaling and security groups.
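A minimal sketch of what the Flask endpoint could look like (the route matches the API below; run_inference is a stand-in for the actual inference module call):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_inference(users, transactions):
    """Stand-in for the actual inference module; returns top-3 (tag, prob) lists."""
    return [[["Shopping", 0.6], ["Medical", 0.2], ["Travel", 0.1]]] * len(transactions)

@app.route("/predict_tag", methods=["POST"])
def predict_tag():
    payload = request.get_json()
    probs = run_inference(payload["users"], payload["transactions"])
    return jsonify({"result": "success", "data": {"prob": probs}})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```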
API ENDPOINT: http://13.233.3.50:8080/predict_tag
To send a sample request through curl:
curl -X POST http://13.233.3.50:8080/predict_tag -H "Content-Type: application/json" -d "{\"users\": [\"User200\", \"User3\"], \"transactions\":[\"POS XXXXXXXXXXXX1111 IKEA INDIA PVT L\", \"POS XXXXXXXXXXXX1111 APOLLO PHARMACY PVT L\"]}"
Output:
{"data":{"prob":[[["Shopping",0.6422143460736127],["Medical",0.18530665751017797],["Travel",0.11028613262101186]],[["Medical",0.9988152915012749],["Subscription",0.0006008496836324658],["Shopping",0.0004127586966547764]]]},"result":"success"}
- Since we have used the user as a feature, let us see the effect on the probability score when the user is in the training data vs. when they are not. In the above image, User2 is in the training data and User200 isn't, so there is a decrease of 10% in confidence.
- Since we are using embeddings, similar things should be close to each other. So let us try sending a request for Anjana Sweets, which is not in our training set. We see that it is able to correctly predict the tag for Anjana Sweets.
However, the embedding space is not contextual enough to place zomato and swiggy in the same context.
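One quick way to sanity-check this is to look at the cosine similarity between the two merchant embeddings (a rough sketch, assuming the same all-MiniLM-L6-v2 model):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["zomato", "swiggy", "apollo pharmacy"])

print(util.cos_sim(emb[0], emb[1]))  # zomato vs swiggy
print(util.cos_sim(emb[0], emb[2]))  # zomato vs an unrelated merchant, for comparison
```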
We could do several things here:
- Try another set of embeddings that is particularly tuned for this domain.
- Build an embedding model ourselves by labelling enough such data points.