iknocho/level2_mrc_nlp-level3-nlp-08
ODQA(Open Domain Question Answering)

Introduction

์ง€๋ฌธ์ด ์ฃผ์–ด์ง„ ์ƒํƒœ์—์„œ ์งˆ์˜์— ํ•ด๋‹นํ•˜๋Š” ๋‹ต์„ ์ฐพ๋Š” Task๋ฅผ MRC(Machine Reading Comprehension)๋ผ๊ณ  ํ•œ๋‹ค.
ODQA๋Š” ์ง€๋ฌธ์ด ์ฃผ์–ด์ง„ ์ƒํƒœ๊ฐ€ ์•„๋‹ˆ๋ผ wiki๋‚˜ ์›น ์ „์ฒด ๋“ฑ๊ณผ ๊ฐ™์€ ๋‹ค์–‘ํ•œ documents๋“ค ์ค‘ ์ ์ ˆํ•œ ์ง€๋ฌธ์„ ์ฐพ๋Š” retrieval ๋‹จ๊ณ„์™€ ์ถ”์ถœ๋œ ์ง€๋ฌธ๋“ค ์‚ฌ์ด์—์„œ ์ ์ ˆํ•œ ๋‹ต์„ ์ฐพ๋Š” reader ๋‹จ๊ณ„, 2-stage๋กœ ์ด๋ฃจ์–ด์ง„ Task๋ฅผ ์˜๋ฏธํ•œ๋‹ค.

Retrieval

๋ฌธ์„œ๋“ค์„ ์ถ”์ถœํ•˜๊ธฐ ์œ„ํ•ด ์ผ๋ จ์˜ ๋ฒกํ„ฐ ํ˜•ํƒœ๋กœ ํ‘œํ˜„ํ•ด์•ผ ํ•˜๋Š”๋ฐ ๋Œ€ํ‘œ์ ์œผ๋กœ Sparse Embedding ๋ฐฉ์‹๊ณผ Dense Embedding ๋ฐฉ์‹์œผ๋กœ ๋‚˜๋ˆ„์–ด ์ง„๋‹ค.

Reader

์งˆ์˜์— ๋งž๋Š” ์ ์ ˆํ•œ ๋‹ต์„ ์ถ”์ถœ๋œ ๋ฌธ์„œ๋“ค ์‚ฌ์ด์—์„œ ์ฐพ๋Š” ๋‹จ๊ณ„๋กœ, ์ •๋‹ต์˜ span์„ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ํ•™์Šต๋œ๋‹ค.


Team

๊น€ํ˜„์ˆ˜ ์ด์„ฑ๊ตฌ ์ดํ˜„์ค€ ์กฐ๋ฌธ๊ธฐ ์กฐ์ต๋…ธ
Elasticsearch setup
KoELECTRA training and evaluation
BERT (multilingual) training and evaluation
ColBERT retriever application and improvement
ColBERT and BM25 ensemble
Data preprocessing
Data augmentation
klue/RoBERTa-large training and evaluation
K-fold implementation
Final ensemble implementation
BM25 implementation
Elasticsearch implementation

Requirements

# data (51.2 MB)
tar -xzf data.tar.gz

# Install the required Python packages.
bash ./install/install_requirements.sh

ํŒŒ์ผ ๊ตฌ์„ฑ

์ €์žฅ์†Œ ๊ตฌ์กฐ

.
|-- README.md
|
|-- arguments.py # manages data, model, and training arguments
|
|-- colbert # dense retrieval (ColBERT) training and inference
|   |-- evaluate.py
|   |-- inference.py
|   |-- model.py
|   |-- tokenizer.py
|   `-- train.py
|
|-- es_retrieval.py # sparse retrieval (Elasticsearch) connection
|-- retrieval.py # TF-IDF, BM25, and Elasticsearch retrieval classes
|-- settings.json # Elasticsearch settings
|
|-- kfold_ensemble_hard.py # k-fold hard voting
|-- kfold_ensemble_soft.py # k-fold soft voting
|-- make_folds.py
|
|-- models # model storage
|   `-- model_folder
|
|-- outputs # matrix storage
|   `-- output_folder
|
|-- train.py # reader training
|-- train_kfold.py # reader training (with k-fold)
|-- inference.py # retrieval + reader (end-to-end) evaluation and inference
|-- trainer_qa.py # Trainer class
`-- utils_qa.py # utility functions

๋ฐ์ดํ„ฐ ์†Œ๊ฐœ

Competition Datasets (Retrieval)

./data/                        # ์ „์ฒด ๋ฐ์ดํ„ฐ
    ./wikipedia_documents.json # ์œ„ํ‚คํ”ผ๋””์•„ ๋ฌธ์„œ ์ง‘ํ•ฉ. retrieval์„ ์œ„ํ•ด ์“ฐ์ด๋Š” corpus.

์ด ์•ฝ 60000๊ฐœ์˜ ์œ„ํ‚คํ”ผ๋””์•„ ๋ฌธ์„œ์—์„œ context ๊ฐ€ ์˜จ์ „ํžˆ ๋™์ผํ•œ ๋ฐ์ดํ„ฐ์— ํ•œํ•ด์„œ ์ค‘๋ณต ์ œ๊ฑฐ๋ฅผ ํ†ตํ•ด ์•ฝ 56000๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋กœ ์ง„ํ–‰.


Competition Datasets (Reader)

๋ฐ์ดํ„ฐ ๋ถ„ํฌ

๋ฐ์ดํ„ฐ์…‹์€ ํŽธ์˜์„ฑ์„ ์œ„ํ•ด Huggingface ์—์„œ ์ œ๊ณตํ•˜๋Š” datasets๋ฅผ ์ด์šฉํ•˜์—ฌ pyarrow ํ˜•์‹์˜ ๋ฐ์ดํ„ฐ๋กœ ์ €์žฅ๋˜์–ด ์žˆ์Œ. ๋ฐ์ดํ„ฐ์…‹์˜ ๊ตฌ์„ฑ

./data/                        # ์ „์ฒด ๋ฐ์ดํ„ฐ
    ./train_dataset/           # ํ•™์Šต์— ์‚ฌ์šฉํ•  ๋ฐ์ดํ„ฐ์…‹. train ๊ณผ validation ์œผ๋กœ ๊ตฌ์„ฑ 
    ./test_dataset/            # ์ œ์ถœ์— ์‚ฌ์šฉ๋  ๋ฐ์ดํ„ฐ์…‹. validation ์œผ๋กœ ๊ตฌ์„ฑ 

Additional Datasets (Reader)

Adding the external datasets KorQuAD and Ko-WIKI yields a combined training set of about 120,000 examples.

./data/                           # full data
    ./wiki_korQuAD_aug_dataset/   # external dataset used for training.

The data-related arguments can be found in DataTrainingArguments in arguments.py.
To configure the arguments yourself, refer to arguments.py.


Usage

When using a RoBERTa model, the options of the tokenizer call in the function below must be adjusted.
The tokenizer is called to preprocess the train and validation sets (train.py) and the test set (inference.py).
(Set the tokenizer's return_token_type_ids=False.)

# train.py
def prepare_train_features(examples):
        # Tokenize with truncation and padding (only when the sequence is short),
        # keeping overflowing tokens via the stride so that each example
        # overlaps slightly with the previous context.
        tokenized_examples = tokenizer(
            examples[question_column_name if pad_on_right else context_column_name],
            examples[context_column_name if pad_on_right else question_column_name],
            truncation="only_second" if pad_on_right else "only_first",
            max_length=max_seq_length,
            stride=data_args.doc_stride,
            return_overflowing_tokens=True,
            return_offsets_mapping=True,
            # return_token_type_ids=False, # set False for RoBERTa models, True for BERT models
            padding="max_length" if data_args.pad_to_max_length else False,
        )

train

# training example (using train_dataset)
python train.py --output_dir ./models/train_dataset --do_train
  • In train.py, training and saving the sparse embedding does not take long, so the corresponding argument defaults to True. After running, sparse_embedding.bin and tfidfv.bin are saved. If you modify the sparse retrieval code, be sure to delete both files and rerun; otherwise the existing files will be loaded.

  • The model will not be saved into the same folder unless --overwrite_cache is added.

  • Likewise, the ./outputs/ folder will not be reused unless --overwrite_output_dir is added.


eval

MRC ๋ชจ๋ธ์˜ ํ‰๊ฐ€๋Š”(--do_eval) ๋”ฐ๋กœ ์„ค์ •ํ•ด์•ผ ํ•จ. ์œ„ ํ•™์Šต ์˜ˆ์‹œ์— ๋‹จ์ˆœํžˆ --do_eval ์„ ์ถ”๊ฐ€๋กœ ์ž…๋ ฅํ•ด์„œ ํ›ˆ๋ จ ๋ฐ ํ‰๊ฐ€๋ฅผ ๋™์‹œ์— ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ์Œ.

# mrc ๋ชจ๋ธ ํ‰๊ฐ€ (train_dataset ์‚ฌ์šฉ)
python train.py --output_dir ./outputs/train_dataset --model_name_or_path ./models/train_dataset/ --do_eval 

inference

Once the retrieval and MRC models are trained, you can run ODQA with inference.py.

  • To submit the trained model's results on the test_dataset, run prediction only (--do_predict).

  • To see how well the trained model performs at ODQA on the train_dataset, run evaluation (--do_eval).

# run ODQA (using test_dataset)
# If you are logged in to wandb, results are automatically logged to wandb; otherwise they are simply printed.
python inference.py --output_dir ./outputs/test_dataset/ --dataset_name ../data/test_dataset/ --model_name_or_path ./models/train_dataset/ --do_predict

Models

Retrieval

Sparse Embedding

  • TF-IDF

$$ TF(t,d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d} $$

$$ IDF(t) = \log \frac{N}{DF(t)} $$

$$ TF\text{-}IDF(t,d) = TF(t,d) \times IDF(t) $$

A scoring function that combines how often a term appears in a document (TF) with how much information the term carries (IDF).

  • BM25

An algorithm that reduces the influence of TF compared to plain TF-IDF by capping TF so that it saturates within a bounded range.
BM25 also assigns a higher weight to shorter documents.
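Both properties show up directly in the scoring formula: k1 bounds the TF term, and b scales the length penalty. A self-contained sketch of Okapi BM25 (using the Lucene-style non-negative IDF; not the repo's BM25 implementation):

```python
import math

def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Okapi BM25: TF saturates via k1, and b penalizes long documents
    relative to the average document length."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        # Lucene/Elasticsearch-style IDF, always >= 0
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        f = doc_tokens.count(term)
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score
```

As f grows, the TF factor approaches (k1 + 1), so a term repeated many times cannot dominate the score the way it can in raw TF-IDF.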

  • Elasticsearch

Elasticsearch uses the BM25 algorithm as its default scoring function.
The difference from our BM25 above is the parameters k1=1.2 and b=0.75. As a search engine, Elasticsearch supports a variety of plugins; we used the nori analyzer, a Korean morphological analyzer. See the settings.json file for the detailed settings.

Dense Embedding

  • ColBERT

ColBERT is a model built on BERT-based encoders, consisting of a query encoder $f_Q$ and a document encoder $f_D$.
A single BERT model is shared between the two, with the special tokens *[Q]* and *[D]* used to distinguish the inputs; cosine similarity is used to compute the relevance score between a query and a document.
ColBERT produces a score for each document and is optimized with a cross-entropy loss.
After various experiments, we adopted an ensemble that weights ColBERT against BM25; the detailed experiments can be found here.
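ColBERT's relevance score is computed by late interaction over token-level embeddings: each query token takes its maximum cosine similarity against the document tokens, and these maxima are summed. A sketch assuming NumPy arrays of per-token embeddings (illustrative; the actual scoring lives in colbert/model.py and may differ in detail):

```python
import numpy as np

def colbert_score(query_emb, doc_emb):
    """Late-interaction ("MaxSim") relevance: for each query token embedding,
    take the max cosine similarity over document token embeddings, then sum."""
    # Normalize rows so that dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T                      # (num_query_tokens, num_doc_tokens)
    return sim.max(axis=1).sum()
```

During training, the scores of a positive and several negative documents for the same query are treated as logits for a cross-entropy loss.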


Reader


BERT

Pre-trained with Masked Language Modeling and Next Sentence Prediction

  • klue/bert
    • data : 62 GB of Korean data including the Modu Corpus, Wikipedia, and news
    • vocab size : 32000
    • wordpiece tokenizer

  • bert-multilingual
    • data : Wikipedia data in 104 languages
    • vocab size : 119547
    • wordpiece tokenizer

RoBERTa

Uses dynamic masking, longer training, and larger data

  • klue/RoBERTa
    • data : Wikipedia, news, etc.
    • vocab size : 32000
    • wordpiece

  • xlm/RoBERTa
    • data : filtered CommonCrawl covering 100 languages (2.5 TB)
    • vocab size : 250002
    • sentencepiece

ELECTRA

  • KoELECTRA
    • data : news, Wikipedia, Namuwiki, the Modu Corpus, etc. (20 GB)
    • vocab size : 35000
    • wordpiece

Ensemble

Hard Voting

๊ฐ ๋ชจ๋ธ๋“ค์˜ ๊ฒฐ๊ณผ๊ฐ’์˜ ํ™•๋ฅ ๋ถ„ํฌ๋ฅผ ํ†ตํ•ด ๊ฐ€์žฅ ํฐ ํ™•๋ฅ ์„ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฐ’์— ํ•ด๋‹นํ•˜๋Š” ๊ฐ’์„ ์ตœ์ข… ๊ฒฐ๊ณผ๊ฐ’์œผ๋กœ ์ฑ„ํƒํ•˜๋Š” ๋ฐฉ์‹

Soft Voting

๊ฐ ๋ชจ๋ธ๋“ค์ด ์ถ”๋ก ํ•œ N๊ฐœ์˜ ํ™•๋ฅ  ๊ฐ’์„ ๋™์ผํ•œ ์ •๋‹ต์„ ๊ฐ€์ง„ ํ™•๋ฅ ์„ ๋ชจ๋‘ ๋”ํ•˜์—ฌ ๊ฐ€์žฅ ๋†’์€ ํ™•๋ฅ ์„ ์ฑ„ํƒํ•˜๋Š” ๋ฐฉ์‹

Weighted Voting

Soft Voting์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ด ํ•ฉ์ด 10์ด ๋˜๋„๋ก ํ•˜๋Š” ๊ฐ€์ค‘์น˜๋ฅผ ๋ถ„๋ฐฐํ•˜์—ฌ weighted voting์„ ์ˆ˜ํ–‰


Results

| Model                  | Exact Match | F1 score |
|------------------------|-------------|----------|
| klue/bert-base         | 46.25       | 54.89    |
| bert-base-multilingual | 49.58       | 55.50    |
| klue/roberta-large     | 69.58       | 76.84    |
| xlm-roberta-large      | 61.67       | 71.85    |
| KoELECTRA              | 57.5        | 63.11    |

Competition Result

Public & Private 1st

About

level2_mrc_nlp-level3-nlp-08 created by GitHub Classroom
