This guide describes how to generate the search database with Elasticsearch and PostgreSQL, using Ray to accelerate indexing across multiple nodes, and how to deploy the search pipeline with a web UI.
We use the StackOverflow dataset in this guide; you can follow the same steps to integrate your own dataset and model.
Download StackOverflow dataset
# create stackoverflow dataset folder
mkdir stackoverflow
cd stackoverflow
# copy downloaded StackOverflow dataset into stackoverflow dataset folder
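As a concrete sketch, the steps above might look like the following; the file names are only examples taken from the pipeline configs later in this guide, so substitute whatever you actually downloaded:

```shell
# sketch: create the dataset folder and copy the downloaded files into it
# (train_v2.1.json, Questions-S.csv, Answers-S.csv are example names used
# by the StackOverflow pipelines later in this guide)
mkdir -p stackoverflow
# cp ~/Downloads/train_v2.1.json stackoverflow/
# cp ~/Downloads/Questions-S.csv stackoverflow/
# cp ~/Downloads/Answers-S.csv stackoverflow/
ls -d stackoverflow   # the folder now exists and should hold your dataset files
```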
Note: If you will not try the ColBERT pipelines, skip this step.
# download the colbert model and build the haystack-ray image
wget https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz
tar -xvzf colbertv2.0.tar.gz
First, make sure you are in the applications/indexing folder of the repo:
cd applications/indexing
Refer to stackoverflow_indexing_workflow.yml. The YAML file must include a head node, which should be the local machine that launches the indexing workflow. You can add worker nodes under the nodes component. The pipelines component declares the pipelines to execute and the database used by each pipeline. The YAML files of these pipelines are included in $workspace_dir, which is ./stackoverflow_indexing for the StackOverflow dataset.
nodes:
  - node: $host_ip # IP address of the head node, which is launched on the local machine
    type: head # value: [head, worker]. Must exist.
    cores: $core_range # for example 0-7
    image: intel/ai-workflows:odqa-haystack-api
    dataset_dir: $dataset_dir # mounted to /home/user/dataset of the container; should include the dataset files
    customer_dir: $customer_dir # mounted to /home/user/data of the container
    workspace_dir: $workspace_dir # mounted to /home/user/workspace of the container. $workspace_dir should be ./stackoverflow_indexing for stackoverflow indexing.
  - node: $host_ip # IP address of the worker node
    type: worker
    image: intel/ai-workflows:odqa-haystack-api
    cores: $core_range # for example 0-7
    user: $user # user name of the worker node for remote access
    password: $password # password of the worker node for remote access and sudo execution
    dataset_dir: $dataset_dir # if you use NFS shared storage, it should be the same as the path on the head node
    customer_dir: $customer_dir # if you use NFS shared storage, it should be the same as the path on the head node
    workspace_dir: $workspace_dir # if you use NFS shared storage, it should be the same as the path on the head node
pipelines:
  - name: colbert_indexing_pipeline.yml # must match the file name of a pipeline file included in $workspace_dir
    database: # the database containers will be launched on the head node
      type: elasticsearch # value: [elasticsearch, postgres]. Must be consistent with the database type of the DocumentStore configured in the pipeline file.
      image: elasticsearch:7.9.2 # for elasticsearch, version 7.9.2 is preferred; if you change it, make sure the query pipeline uses the same version
      cores: $core_range
      data_dir: $data_dir # data directory of the database, mounted to /usr/share/elasticsearch/data of the elasticsearch container, or /var/lib/postgresql/data of the postgresql container
  - name: emr_indexing_pipeline.yml
    database:
      type: elasticsearch
      image: elasticsearch:7.9.2
      cores: $core_range
      data_dir: $data_dir
  - name: faiss_indexing_pipeline.yml
    database: # the database containers will run on the head node
      type: postgres
      image: postgres:14.1-alpine
      cores: $core_range
      data_dir: $data_dir
Directories mounted into the head and worker containers:
- $workspace_dir:
  Mount point: /home/user/workspace
  Includes the pipeline YAML files, prepare_env.sh, and the Python file of the dataset class for preprocessing and converting dataset data into documents or Ray dataset files.
  Note: Pipeline YAML files and prepare_env.sh cannot be placed in subfolders of $workspace_dir.
  For example:
  $ ls ./marco_indexing
  colbert_indexing_pipeline.yml emr_indexing_pipeline.yml faiss_indexing_pipeline.yml marco_dataset.py prepare_env.sh
- $customer_dir:
  Mount point: /home/user/data
  Logs are saved in this directory. You can also copy your model into this directory. For example, the indexing workflow of colbert_indexing_pipeline.yml needs the downloaded colbertv2.0 model; copy the colbertv2.0 folder into this directory.
- $dataset_dir:
  Mount point: /home/user/dataset
  This directory includes the dataset files.
Directories mounted into the database container:
- $data_dir:
  If $data_dir does not exist, it will be created. Please use a different directory path for each pipeline.
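For instance, one way to keep the three example pipelines separated is to give each its own data directory; the paths below are only an illustrative convention, not something the workflow requires:

```shell
# sketch: one database data directory per pipeline (example paths)
mkdir -p /tmp/odqa-data/es-colbert   # $data_dir for colbert_indexing_pipeline.yml
mkdir -p /tmp/odqa-data/es-emr       # $data_dir for emr_indexing_pipeline.yml
mkdir -p /tmp/odqa-data/pg-faiss     # $data_dir for faiss_indexing_pipeline.yml
ls /tmp/odqa-data
```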
In the pipeline YAML files there are some parameters that need to be modified. These pipeline YAML files are under your $workspace_dir.
For example:
./stackoverflow_indexing/colbert_indexing_pipeline.yml
./stackoverflow_indexing/emr_indexing_pipeline.yml
./stackoverflow_indexing/faiss_indexing_pipeline.yml
Edit these YAML files according to your local runtime environment.
- For the DocumentStore you need to specify the address of the linked database. $host_ip should be the hostname or host IP of the head node, because the database container is launched on the head node.
ElasticsearchDocumentStore:
- name: DocumentStore
  type: ElasticsearchDocumentStore
  actor: True
  params:
    host: $host_ip
FAISSDocumentStore:
- name: DocumentStore
  type: FAISSDocumentStore
  faiss_index_path: /home/user/data/faiss-index-so.faiss
  actor: False
  params:
    sql_url: postgresql://postgres:postgres@$host_ip/haystack
    faiss_index_factory_str: HNSW
- Check the file paths under $customer_dir and $dataset_dir. They are mounted into the containers of the Ray head and workers, and they should include the dataset files or fine-tuned models you downloaded. If there are subdirectories in these directories, modify the corresponding paths.
For example:
stackoverflow_indexing/faiss_indexing_pipeline.yml
- name: Dataset
  type: StackoverflowDataset
  path: /home/user/workspace/stackoverflow_dataset.py
  actor: False
  params:
    file: /home/user/dataset/train_v2.1.json
    batch_size: 200000
stackoverflow_indexing/colbert_indexing_pipeline.yml
- name: Ranker
  type: ColBERTRanker
  actor: True
  params:
    model_path: /home/user/data/colbertv2.0
...
- name: Dataset
  type: StackoverflowDataset
  path: /home/user/workspace/stackoverflow_dataset.py
  actor: False
  params:
    file: /home/user/dataset/train_v2.1.json
    batch_size: 200000
stackoverflow_indexing/emr_indexing_pipeline.yml
- name: Dataset
  type: StackoverflowDataset
  path: /home/user/workspace/stackoverflow_dataset.py
  actor: False
  params:
    question_file: /home/user/dataset/Questions-S.csv
    answer_file: /home/user/dataset/Answers-S.csv
    batch_size: 200000
- Adjust the actor number and batch_size according to the hardware resources of the Ray cluster. Too large an actor number or batch_size may cause out-of-memory errors. num_replicas x num_cpus should be less than the total number of CPUs configured in the indexing workflow YAML.
serve_deployment_kwargs:
  num_replicas: 80 # number of actors to create on the Ray cluster
  batch_size: 256
  num_cpus: 2
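As a quick sanity check of the rule above (the total CPU count here is an assumed example figure, not taken from the workflow file):

```shell
# sketch: CPUs requested by the deployment vs. CPUs in the cluster
num_replicas=80
num_cpus=2
total_cluster_cpus=192   # assumed example: sum of all core ranges in the workflow YAML
echo $(( num_replicas * num_cpus ))   # 160 CPUs requested
[ $(( num_replicas * num_cpus )) -lt "$total_cluster_cpus" ] && echo "fits"
```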
Launch the Ray cluster for indexing workflows.
#Launch Ray cluster for stackoverflow indexing workflow
$ python launch_indexing_workflow.py -w stackoverflow_indexing_workflow.yml
Run the indexing workflows on the Ray cluster. The -p option value is the name of a pipeline YAML file, or all.
Note: For faster debugging and demonstration, you can choose the Marco dataset and use the -s option to take only 500 samples from the dataset for indexing.
#Run all pipelines defined in stackoverflow_indexing_workflow.yml
$ python launch_indexing_workflow.py -w stackoverflow_indexing_workflow.yml -p all
#Run faiss_indexing_pipeline.yml defined in the stackoverflow_indexing_workflow.yml
$ python launch_indexing_workflow.py -w stackoverflow_indexing_workflow.yml -p faiss_indexing_pipeline.yml
#Run faiss_indexing_pipeline.yml defined in the stackoverflow_indexing_workflow.yml, only take 500 samples from dataset for indexing pipeline debugging or demo.
$ python launch_indexing_workflow.py -w stackoverflow_indexing_workflow.yml -p faiss_indexing_pipeline.yml -s 1
After generating the database with the FAISS pipeline, copy and save the index files from $customer_dir so they are not overwritten by a new FAISS indexing workflow.
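A minimal backup sketch, assuming $customer_dir and the index file name from the FAISSDocumentStore example above; both paths are examples to adjust to your setup:

```shell
# sketch: back up the FAISS index files before re-running the workflow
customer_dir=/tmp/odqa-customer                 # assumed example path
backup_dir=/tmp/odqa-backup                     # assumed example path
mkdir -p "$customer_dir" "$backup_dir"
touch "$customer_dir/faiss-index-so.faiss"      # stand-in for the real index file
cp "$customer_dir"/faiss-index-so.faiss "$backup_dir"/
ls "$backup_dir"
```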
To clean up containers from previous runs, use the following commands on all nodes of your Ray cluster.
# Clean all the Ray and database containers running on the local node
$ ./run-ray-cluster.sh -r clean_all
# Clean all the database containers running on the local node
$ ./run-ray-cluster.sh -r clean_db
First, make sure you are in the applications/odqa_pipelines folder of the repo:
cd applications/odqa_pipelines
Note: Please make sure you have completed the emr_indexing_pipeline.yml indexing pipeline.
ElasticsearchDocumentStore->EmbeddingRetriever(deepset/sentence_bert)->Docs2Answers
Modify the config file config/env.stackoverflow.esds_emr_faq
# set $data_dir to the data folder of the Elasticsearch database; refer to applications/indexing/stackoverflow_indexing_workflow.yml
DATA_DIR=$data_dir
Run workflow:
#deploy search pipeline with stackoverflow database
docker-compose -f docker-compose/docker-compose.yml --env-file config/env.stackoverflow.esds_emr_faq up
Note: Please make sure you have completed the colbert_indexing_pipeline.yml indexing pipeline.
ElasticsearchDocumentStore->BM25Retriever->ColbertRanker-> Docs2Answers
Modify the config file config/env.stackoverflow.esds_bm25r_colbert
# set $data_dir to the data folder of the Elasticsearch database; refer to applications/indexing/stackoverflow_indexing_workflow.yml
DATA_DIR=$data_dir
# set $customer_dir to the absolute path of the ColbertV2.0 model you placed
CUSTOMER_DIR=$customer_dir
Modify the docker compose file docker-compose/docker-compose.yml and uncomment the following lines:
#volumes:
# - $CUSTOMER_DIR:/home/user/data
Run workflow:
#deploy search pipeline with stackoverflow database
docker-compose -f docker-compose/docker-compose.yml --env-file config/env.stackoverflow.esds_bm25r_colbert up
Note: Please make sure you have completed the faiss_indexing_pipeline.yml indexing pipeline.
FAISSDocumentStore->DensePassageRetriever->Docs2Answers
Modify the config file config/env.stackoverflow.faiss_dpr
# set $data_dir to the data folder of the PostgreSQL database; refer to applications/indexing/stackoverflow_indexing_workflow.yml
DATA_DIR=$data_dir
# set $customer_dir to the absolute path where you store the FAISS index files
CUSTOMER_DIR=$customer_dir
Modify the docker compose file docker-compose/docker-compose-dpr.yml and uncomment the following lines:
#volumes:
# - $CUSTOMER_DIR:/home/user/data
Run workflows:
#deploy search pipeline with stackoverflow database
docker-compose -f docker-compose/docker-compose-dpr.yml --env-file config/env.stackoverflow.faiss_dpr up
Once you have deployed the search pipeline successfully, open a browser and enter the following URL to access the web UI:
<host_ip>:8501