This is our group's project for the Information Retrieval course: a simple information retrieval system built with Python 3 and Spark.
Please make sure you have installed the required components (MySQL, Redis, Spark) and their corresponding runtime environments.
First, run update/__init__.py, which calls the crawler and then builds the indexes and models, such as the per-word posting lists and the word co-occurrence model (sketched below).
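For intuition, here is a minimal sketch of what those two structures look like. The names are hypothetical; the real logic lives in update/__init__.py:

```python
# Hypothetical sketch: a posting list maps each word to the documents that
# contain it, and the co-occurrence model counts word pairs per document.
from collections import defaultdict

def build_indexes(docs):
    """docs: {doc_id: [token, ...]} -> (posting lists, co-occurrence counts)."""
    postings = defaultdict(set)   # word -> ids of documents containing it
    cooccur = defaultdict(int)    # (word_a, word_b) -> number of shared docs
    for doc_id, tokens in docs.items():
        for token in tokens:
            postings[token].add(doc_id)
        unique = sorted(set(tokens))
        for i, a in enumerate(unique):
            for b in unique[i + 1:]:  # each unordered pair once per document
                cooccur[(a, b)] += 1
    return postings, cooccur

postings, cooccur = build_indexes({1: ["spark", "index"], 2: ["spark", "redis"]})
print(postings["spark"])            # {1, 2}
print(cooccur[("index", "spark")])  # 1
```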
Second, run main.py, which starts the back-end server and provides the various search services.
Finally, run ir201712-front_end/search_engine.py to start the front-end server, then visit 127.0.0.1:5000 in a browser to try it out.
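Port 5000 is Flask's default, so the front end is presumably a Flask app; a generic sketch of such a server (hypothetical, not the project's actual code):

```python
# Hypothetical Flask-style front end: accept a query parameter and return
# results. Route and parameter names are illustrative only.
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def search():
    query = request.args.get("q", "")
    # In the real project, the query would be forwarded to the back-end
    # server started by main.py; here we just echo it.
    return "You searched for: {}".format(query)

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```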
- install git
- set the initial git configuration
git config --global user.name <github_name>
git config --global user.email <github_email>
git clone https://github.com/xuesu/ir201712.git
- install mysql
- In Ubuntu:
sudo apt install mysql-server
- open the MySQL terminal (note: always use the charset 'utf8mb4')
- Create a new db to avoid database-scale change in program:
CREATE DATABASE ir character set UTF8mb4 collate utf8mb4_bin;
- Then create a test db:
CREATE DATABASE ir_test character set UTF8mb4 collate utf8mb4_bin;
- Create a new user:
CREATE USER 'IRDBA'@'localhost' IDENTIFIED BY 'complexpwd';
- Grant privilege to the user:
GRANT ALL ON ir.* TO 'IRDBA'@'localhost';
GRANT ALL ON ir_test.* TO 'IRDBA'@'localhost';
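To check that the databases and user above work (and use the right charset), a quick connectivity test along these lines should do. It assumes the pymysql client, which may or may not be the driver listed in requirements.list:

```python
# Connectivity check for the 'ir' database created above (pymysql assumed).
import pymysql

conn = pymysql.connect(host="localhost", user="IRDBA", password="complexpwd",
                       database="ir", charset="utf8mb4")
with conn.cursor() as cur:
    cur.execute("SELECT @@character_set_database, @@collation_database")
    print(cur.fetchone())  # expect ('utf8mb4', 'utf8mb4_bin')
conn.close()
```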
- install Redis
- In Ubuntu:
sudo apt-get install redis-server
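A similar sanity check for Redis, assuming the redis-py client:

```python
# Connectivity check for the Redis server installed above (redis-py assumed).
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)
print(r.ping())                 # True if the server is reachable
r.set("ir:smoke_test", "ok")
print(r.get("ir:smoke_test"))   # b'ok'
```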
- install anaconda
- create a new virtualenv
conda create -n <env_name> python=3
- activate the virtualenv
source activate <env_name>
pip install -r requirements.list
- download & unzip https://www.apache.org/dyn/closer.lua/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
- set the environment variables:
- In Ubuntu:
export SPARK_HOME="/XXXX/spark-2.2.1-bin-hadoop2.7"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
- If you are using the pyspark terminal, you can start now.
- If you are using PyCharm, you need to add
<spark_home>/python/pyspark
and <spark_home>/python/lib/py4j-0.9-src.zip
to the content roots.
It takes about 810 MB of memory, unfortunately.
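To verify the Spark setup above, a short smoke test using the standard pyspark API (nothing project-specific):

```python
# Spark smoke test: run a trivial local job through the pyspark API.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("smoke-test")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).map(lambda x: x * x).sum())  # 328350
sc.stop()
```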
cd emotions
- create a new virtualenv
conda create -n <env_name2> python=2
- NOTE: this subproject is written in Python 2, not Python 3!
- activate the virtualenv
source activate <env_name2>
pip install -r requirements.list
python demo_service.py
- Follow the basic coding conventions if you can, but it is OK to write in your own style.
- Try to write some test cases.
- Always pull from master and push to dev!
git pull origin master
git add -A
git commit -m "<my_change>"
git push origin <branch>:dev
Please start by reading update/__init__.py.