VladimirKadlec/similar_images

Similar images demo application

Problem definition

A client asks us to create a solution for finding the most similar content. A user uploads an item (e.g. a photo), and the solution presents the most similar items available.

Solution

In the following, we work with images/photos as an example of a user item. The demonstration application is built in two steps:

  • create an image database
  • build a demo application taking an image as an input and returning similar images from the database as an output.
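
The idea behind these two steps can be illustrated with a minimal sketch. It assumes every image has already been turned into a feature vector and that similarity is measured by Euclidean (L2) distance; the tiny 3-dimensional vectors, the file names, and the helper `most_similar` are hypothetical and only serve the illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(query, database, k=3):
    """Return the k (name, distance) pairs closest to `query`, nearest first."""
    scored = [(name, l2_distance(query, vec)) for name, vec in database.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Step 1: the "image database" (hypothetical toy vectors, not real features).
database = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.8, 0.3, 0.1],
    "car.jpg": [0.0, 0.2, 0.9],
}

# Step 2: the "demo application" takes a query vector and returns neighbours.
query = [0.85, 0.15, 0.05]
results = most_similar(query, database, k=2)
for name, distance in results:
    print(f"{name}: {distance:.3f}")
```

This is an exact brute-force search: every database vector is compared with the query, which is fine for small collections.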

The images from the COCO dataset "2014 Val images" are used to populate the image database; see the COCO website for details. Our database contains 40504 images, most of which are photos of common real-world objects.

The demo application is a Jupyter Notebook. You can search for random images from the database, or upload your own image as an input. The results are images from the database (i.e. these 40504 images from the COCO dataset).

Technical description

Image database

There are several options for efficient search. The implementation in build_search_db.py is based on product quantization: 16 sub-quantizers with 8 bits each, 100 centroids. The resulting index is just 8 MB. Note that the quantization requires training; we use the whole dataset as the training set because our database is static. For demo purposes even a brute-force exact search would be fast enough, as 40k images is a really small number.
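
To make the idea of product quantization more concrete, here is a toy sketch in plain Python. It is not the code from build_search_db.py and does not use Faiss: the codebooks are hand-picked rather than trained, the helpers `pq_encode` and `pq_decode` are hypothetical, and the size calculation assumes the 4096-dimensional feature vectors are stored as 32-bit floats.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Encode `vector` as one centroid index per sub-quantizer.

    `codebooks[m]` holds the centroids for the m-th slice of the vector.
    """
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for m, centroids in enumerate(codebooks):
        sub = vector[m * sub_dim:(m + 1) * sub_dim]
        nearest = min(range(len(centroids)),
                      key=lambda c: squared_l2(sub, centroids[c]))
        codes.append(nearest)
    return codes


def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    approx = []
    for code, centroids in zip(codes, codebooks):
        approx.extend(centroids[code])
    return approx


# Toy setup: 4-dim vectors, 2 sub-quantizers, 2 hand-picked centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
vector = [0.9, 0.8, 0.1, 0.7]
codes = pq_encode(vector, codebooks)
approx = pq_decode(codes, codebooks)
print(codes, approx)

# Size per vector: raw float32 vs. 16 sub-quantizers with 8 bits each.
raw_bytes = 4096 * 4        # assumption: 4 bytes per float32 component
code_bytes = 16 * 8 // 8    # 16 codes of 8 bits = 16 bytes
print(raw_bytes, code_bytes, raw_bytes // code_bytes)
```

In a real index the centroids are learned from the training set (typically with k-means), which is why the quantization needs a training step.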

Demo application

The demo application is a Jupyter Notebook, see Similar image.ipynb.

Installation

The development was done in Python 3 on an Ubuntu 22.04.1 LTS server with 4 CPU cores and 4 GB of RAM.

Requirements

Use virtualenv + pip, conda, or a similar tool to install the following packages:

click
tensorflow
pillow
notebook
ipywidgets
faiss-cpu

Note: You have to enable widgetsnbextension before starting Jupyter Notebook. If you use a virtual environment, run the following command before launching Jupyter Notebook:

jupyter nbextension enable --py --sys-prefix widgetsnbextension

Steps

The following steps should work on an Ubuntu 22.04.1 LTS Linux server. The whole installation requires about 14 GB of disk space. Don't forget to activate your virtual environment with the installed packages.

  1. Clone this repo and enter the directory:
$ git clone 'git@github.com:VladimirKadlec/similar_images.git'
$ cd similar_images
  2. Download and extract the MS COCO images (6.2 GB file, 6.8 GB extracted):
$ ./get_ms_coco_dataset.sh
  3. Download or compute the feature vectors for the images:
  • download the vectors from my Dropbox (225 MB file, 633 MB extracted):
$ ./get_ms_coco_vectors.sh
  • or compute them on your own; this takes 83 min on a 4-core CPU and about 2 GB of RAM:
$ ./extract_image_vectors.py -l ms_coco_file_list.txt
  4. Build the Faiss index (this takes under 50 seconds):
$ ./build_search_db.py -i ms_coco_file_list.txt.saved_predictions.npy -o ms_coco_file_list.faiss.index
  5. Run Jupyter Notebook and open Similar image.ipynb. You may need to click Kernel/Restart & Run All the first time.
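
Before opening the notebook it can help to check that the earlier steps produced their files. The following is a small, hypothetical helper (it is not part of this repo); it assumes the three files named in the commands above end up in the repository root.

```python
from pathlib import Path

# File names taken from the commands in the steps above.
EXPECTED_FILES = [
    "ms_coco_file_list.txt",
    "ms_coco_file_list.txt.saved_predictions.npy",
    "ms_coco_file_list.faiss.index",
]


def missing_artifacts(repo_dir="."):
    """Return the expected files that are not present in `repo_dir`."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts()
    if missing:
        print("Missing files, re-run the corresponding step:")
        for name in missing:
            print("  " + name)
    else:
        print("All files found, the notebook should be ready to run.")
```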

Discussion

  1. How to evaluate the quality of the similarity search?
  • A/B testing against the current solution. If there is no current solution, get explicit feedback from a small number of users (e.g. via a questionnaire).
  2. Is it fast enough for millions of images?
  • Feature extraction takes about 150 ms for a single image, plus 2 ms for the similarity search on CPU. In general, feature extraction is the slow part; the Faiss library is fast enough.
  • It depends on the actual setup and requirements. The speed can be improved by:
    • batch processing (2x-3x speedup on CPU with enough memory)
    • using GPU machines (20x-100x speedup)
    • using a different network for feature extraction, e.g. EfficientNet
    • using a cluster of machines (linear speedup with respect to the number of machines)
  3. The feature vector size is 4096. Isn't that too big?
  • The vectors are quantized by the Faiss library during indexing to about 16 x 8 bits (16 bytes) each.
  4. VGG16 vs EfficientNetB0:
  • EfficientNetB0 is much faster and much smaller, but its results in our experiments weren't satisfactory.
  5. Is it possible to add new images to the index?
  • Yes, new images (feature vectors) can be added to the index without rebuilding the whole index.
  6. What about the similarity of texts?
  • See the proof of concept of a proof of concept in Text_similarity.ipynb. The database is 20k tweets from the Sentiment140 dataset; the index is pre-computed and stored in this repo.
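
The speed figures quoted for the "millions of images" question can be turned into a rough back-of-the-envelope estimate. The sketch below uses the 150 ms extraction and 2 ms search times from above; the function names are hypothetical, and it assumes that the speedups multiply and that the search time stays at 2 ms, which was measured on the 40k-image index and will likely differ for a much larger one.

```python
def indexing_hours(num_images, extract_ms=150.0, speedup=1.0, machines=1):
    """Estimated wall-clock hours to extract features for `num_images`."""
    total_seconds = num_images * (extract_ms / 1000.0) / (speedup * machines)
    return total_seconds / 3600.0


def query_latency_ms(extract_ms=150.0, search_ms=2.0, speedup=1.0):
    """Estimated latency of one query: feature extraction plus index search."""
    return extract_ms / speedup + search_ms


one_million = 1_000_000
print(f"1 CPU machine:       {indexing_hours(one_million):.1f} h")
print(f"1 GPU machine (20x): {indexing_hours(one_million, speedup=20):.1f} h")
print(f"4 GPU machines:      {indexing_hours(one_million, speedup=20, machines=4):.1f} h")
print(f"Query on CPU:        {query_latency_ms():.1f} ms")
print(f"Query on GPU (20x):  {query_latency_ms(speedup=20):.1f} ms")
```

Under these assumptions the one-off feature extraction dominates the cost; per-query latency is driven by extracting features for the uploaded image, not by the index lookup.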
