
Capstone Design (CSI4101), 2020 Spring Semester

"Effective Ways to Select Dataset from Large Corpus"

Team: 'input.txt' (E-mail: [email protected])

  1. Joochan Kim (Leader, Code, Idea)
  2. Dobreva Iva (Code)
  3. Yujin Kim (Documentation, Presentation)

The idea originated with Prof. Jinyeong Yeo @ Convei lab, Yonsei Univ.

Assistance was provided by Gayeon Lee @ Convei lab, Yonsei Univ.

Most NLP datasets are so large that developers spend substantial time and money training models on them. To reduce this burden, we propose an approach that selects a smaller subset of the data, cutting the training time and cost required while improving performance.


Models

- CEDR: Contextualized Embeddings for Document Ranking

- TIM_PLUS: Two-phase Influence Maximization

Dataset

- Robust04: TREC Robust 2004 document collection, used for the retrieval task
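
How the pieces fit together: graph/graph-generator.py turns the corpus into a graph, TIM_PLUS picks an influential seed set on that graph, and the CEDR ranker is then trained only on the selected documents. Below is a minimal sketch of the graph-building idea, assuming cosine similarity between document embeddings and a plain "u v weight" edge list; the file names, threshold, and output format are illustrative assumptions, and the actual script and TIM_PLUS input format may differ.

```python
# Sketch: turn document embeddings into a weighted similarity graph
# for an influence-maximization tool such as TIM_PLUS.
# Paths, threshold, and output format are illustrative assumptions.
import pickle
import numpy as np

def build_graph(emb_path="docs.pkl", out_path="graph.txt", threshold=0.8):
    with open(emb_path, "rb") as f:
        docs = pickle.load(f)                  # assumed: {doc_id: embedding}
    ids = list(docs)
    mat = np.stack([np.asarray(docs[i], dtype=float) for i in ids])
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)   # unit-normalize rows
    sims = mat @ mat.T                         # pairwise cosine similarity
    edges = [(a, b, float(sims[a, b]))
             for a in range(len(ids))
             for b in range(a + 1, len(ids))
             if sims[a, b] >= threshold]       # keep only strong links
    with open(out_path, "w") as f:
        f.write(f"{len(ids)} {len(edges)}\n")  # header: node count, edge count
        for a, b, w in edges:
            f.write(f"{a} {b} {w:.4f}\n")      # weight acts as influence prob.
```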


How to Run?

  1. Download the models and the dataset, then unzip them.
  2. Run graph/graph-generator.py to build the graph (change the data location at line 105 to /filename.pkl; check README.md).
  3. Run TIM_PLUS on the graph from step 2 (check its README).
  4. Copy the seed set printed by TIM_PLUS into seed.txt.
  5. Run /graph/create-set.py (needs the data .pkl and seed.txt; check README.md; a sketch of this filtering step follows the list).
  6. Run /Robust-Ranker-Master/main.py (check its README).
  7. Compare MAP! ^-^ (a sketch of the MAP metric appears after the results)
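
Steps 4 and 5 amount to filtering the full dataset down to the documents whose node ids appear in seed.txt. A minimal sketch of that step follows; the file names and pickle layout are assumptions, and the actual create-set.py may differ.

```python
# Sketch of the seed-based filtering behind create-set.py (step 5).
# Assumes seed.txt holds one node id per line and the pickle maps
# node ids to dataset records; the real script may differ.
import pickle

with open("seed.txt") as f:
    seeds = {int(line) for line in f if line.strip()}

with open("data.pkl", "rb") as f:               # assumed full dataset
    full_data = pickle.load(f)                  # assumed: {node_id: record}

subset = {i: rec for i, rec in full_data.items() if i in seeds}

with open("subset.pkl", "wb") as f:
    pickle.dump(subset, f)                      # reduced set for the ranker
```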

Result

Number of training examples: 110,000 -> 50,000

Training time: 11 h -> 6 h
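
Step 7 compares MAP (mean average precision) between training on the full corpus and on the selected subset; tools such as trec_eval report it directly. A minimal sketch of the metric itself, assuming a run given as {query_id: ranked doc ids} and qrels as {query_id: set of relevant doc ids}:

```python
# Sketch: mean average precision (MAP) over a set of queries.
# run:   {query_id: ranked list of doc ids, best first}
# qrels: {query_id: set of relevant doc ids}
def mean_average_precision(run, qrels):
    aps = []
    for qid, ranking in run.items():
        relevant = qrels.get(qid, set())
        if not relevant:
            continue                            # skip unjudged queries
        hits, prec_sum = 0, 0.0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                prec_sum += hits / rank         # precision at this hit
        aps.append(prec_sum / len(relevant))    # average precision for qid
    return sum(aps) / len(aps)

# One query, relevant docs at ranks 1 and 3 -> AP = (1 + 2/3) / 2 ~ 0.833
print(mean_average_precision({"q1": ["d1", "d2", "d3"]}, {"q1": {"d1", "d3"}}))
```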

