KcBERT: Korean comments BERT

** Updates on 2022.11.07 **

  • The expanded text dataset (v2022.3Q) used to train KcELECTRA v2022 has been released.
  • https://github.com/Beomi/KcBERT/releases/tag/v2022.3Q
  • Compared to the v1 dataset, it grows from 11GB to 45GB and from about 90 million to about 340 million samples, roughly a 4x increase.

** Updates on 2022.10.08 **

  • The KcELECTRA-base-v2022 (formerly dev) model has been renamed.
  • It improves on KcELECTRA-base (v2021) by about 1%p on most downstream tasks.

** Updates on 2022.09.14 **

  • The preprocessing code has been updated slightly to match the emoji v2.0.0 release.

** Updates on 2021.04.07 **

  • KcELECTRA has been released! 🤗
  • KcELECTRA is trained on a larger dataset with a larger general vocab, and outperforms KcBERT on every task.
  • Try it via the GitHub link below!
  • https://github.com/Beomi/KcELECTRA

** Updates on 2021.03.14 **

  • Added the KcBERT paper citation (BibTeX).
  • Added KcBERT fine-tuning performance scores to this README.

** Updates on 2020.12.04 **

The tutorial code has been updated slightly to match the Hugging Face Transformers v4.0.0 release.

Updated KcBERT-Large NSMC fine-tuning Colab: Open In Colab

** Updates on 2020.09.11 **

A tutorial for pretraining KcBERT on a TPU in Google Colab is now available! Click the button below.

Pretrain KcBERT on a Colab TPU: Open In Colab

To keep training manageable, the tutorial uses only a subset (144MB) of the full 12GB text corpus.

It uses the Korpora package, which makes Korean datasets/corpora easier to work with.

** Updates on 2020.09.08 **

The training data has been uploaded via GitHub Releases.

Due to the 2GB-per-file limit, it is split into multiple compressed archives.

Download it via the link below. (No sign-up required; split archives.)

If you prefer a single file, or want to explore the data on Kaggle, use the Kaggle dataset below.

** Updates on 2020.08.22 **

Pretraining dataset released

The dataset cleaned for training (processed with the clean function below) has been released on Kaggle!

์ง์ ‘ ๋‹ค์šด๋ฐ›์œผ์…”์„œ ๋‹ค์–‘ํ•œ Task์— ํ•™์Šต์„ ์ง„ํ–‰ํ•ด๋ณด์„ธ์š” :)


Most publicly released Korean BERT models are trained on well-curated text such as Korean Wikipedia, news articles, and books. Comment-style datasets such as NSMC, however, are not curated: they are colloquial, full of newly coined words, and contain typos and other expressions that rarely appear in formal writing.

KcBERT is a BERT model pretrained from scratch, tokenizer included, on comments and replies collected from online news articles, precisely to handle datasets with these characteristics.

KcBERT can be loaded directly through Hugging Face's Transformers library (no separate file download is required).

KcBERT Performance

| Model | Size | NSMC (acc) | Naver NER (F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | KorQuaD (Dev) (EM/F1) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KcBERT-Base | 417M | 89.62 | 84.34 | 66.95 | 74.85 | 75.57 | 93.93 | 60.25 / 84.39 |
| KcBERT-Large | 1.2G | 90.68 | 85.53 | 70.15 | 76.99 | 77.49 | 94.06 | 62.16 / 86.64 |
| KoBERT | 351M | 89.63 | 86.11 | 80.65 | 79.00 | 79.64 | 93.93 | 52.81 / 80.27 |
| XLM-Roberta-Base | 1.03G | 89.49 | 86.26 | 82.95 | 79.92 | 79.09 | 93.53 | 64.70 / 88.94 |
| HanBERT | 614M | 90.16 | 87.31 | 82.40 | 80.89 | 83.33 | 94.19 | 78.74 / 92.02 |
| KoELECTRA-Base | 423M | 90.21 | 86.87 | 81.90 | 80.85 | 83.21 | 94.20 | 61.10 / 89.59 |
| KoELECTRA-Base-v2 | 423M | 89.70 | 87.02 | 83.90 | 80.61 | 84.30 | 94.72 | 84.34 / 92.58 |
| DistilKoBERT | 108M | 88.41 | 84.13 | 62.55 | 70.55 | 73.21 | 92.48 | 54.12 / 77.80 |

*HanBERT's size includes both the BERT model and the tokenizer DB.

*Results were produced with the configs' default settings; additional hyperparameter tuning may yield better scores.

How to use

Requirements

  • pytorch <= 1.8.0
  • transformers ~= 3.0.1
    • transformers ~= 4.0.0 is also compatible.
  • emoji ~= 2.0.0 (the clean function below uses emoji.replace_emoji, which requires emoji v2; see the 2022.09.14 update)
  • soynlp ~= 0.0.493
from transformers import AutoTokenizer, AutoModelWithLMHead

# Base Model (108M)
tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-base")
model = AutoModelWithLMHead.from_pretrained("beomi/kcbert-base")

# Large Model (334M)
tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-large")
model = AutoModelWithLMHead.from_pretrained("beomi/kcbert-large")
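
For a quick sanity check, the loaded tokenizer and model can be applied to a comment-style sentence (a minimal sketch; the sample text is an illustrative assumption):

# Tokenize an illustrative comment and run the masked-LM head.
inputs = tokenizer("이 영화 진짜 재밌다 ㅋㅋㅋ", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))  # WordPiece tokens incl. [CLS]/[SEP]
outputs = model(**inputs)  # masked-LM output (logits over the 30000-token vocab)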

Pretrain & Finetune Colab links

Pretrain Data

Pretrain Code

Pretrain KcBERT on a Colab TPU: Open In Colab

Finetune Samples

KcBERT-Base NSMC Finetuning with PyTorch-Lightning (Colab) Open In Colab

KcBERT-Large NSMC Finetuning with PyTorch-Lightning (Colab) Open In Colab

The two notebooks differ only in the pretrained model (Base vs. Large) and batch size; the rest of the code is identical.

Train Data & Preprocessing

Raw Data

The training data consists of comments and replies collected from heavily commented news articles written between 2019.01.01 and 2020.06.15.

๋ฐ์ดํ„ฐ ์‚ฌ์ด์ฆˆ๋Š” ํ…์ŠคํŠธ๋งŒ ์ถ”์ถœ์‹œ ์•ฝ 15.4GB์ด๋ฉฐ, 1์–ต1์ฒœ๋งŒ๊ฐœ ์ด์ƒ์˜ ๋ฌธ์žฅ์œผ๋กœ ์ด๋ค„์ ธ ์žˆ์Šต๋‹ˆ๋‹ค.

Preprocessing

The preprocessing applied before pretraining the language model was as follows.

  1. Korean, English, special characters, and even emoji (🥳)!

    A regular expression was used to keep Korean, English, special characters, and even emoji in the training data.

    The Korean range was specified as ㄱ-ㅎ가-힣, which excludes the Hanja characters that lie between ㄱ and 힣.

  2. Collapsing repeated characters within comments

    Repeated characters such as ㅋㅋㅋㅋㅋ were collapsed to ㅋㅋ.

  3. Cased Model

    KcBERT is a cased model that preserves upper/lower case for English text.

  4. Removing texts of 10 characters or fewer

    Texts shorter than 10 characters were removed, since they usually consist of a single word.

  5. Deduplication

    Duplicate comments were merged into one to remove repeatedly posted comments.

The final training data produced by this process is 12.5GB, about 89 million sentences.

After installing the packages below via pip, cleaning your text with the clean function below improves downstream-task performance (fewer [UNK] tokens).

pip install soynlp emoji

Apply the clean function below to your text data.

import re
import emoji
from soynlp.normalizer import repeat_normalize

# Whitelist: spaces, basic punctuation, ASCII, and Korean (ㄱ-ㅣ, 가-힣); anything else becomes a space.
pattern = re.compile(f'[^ .,?!/@$%~％·∼()\x00-\x7Fㄱ-ㅣ가-힣]+')
url_pattern = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')

def clean(x):
    x = pattern.sub(' ', x)
    x = emoji.replace_emoji(x, replace='')  # remove emoji (requires emoji >= 2.0)
    x = url_pattern.sub('', x)
    x = x.strip()
    x = repeat_normalize(x, num_repeats=2)  # collapse repeated characters (e.g. ㅋㅋㅋㅋㅋ -> ㅋㅋ)
    return x
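
For example, applying clean to a raw comment (the sample below is illustrative) removes the URL and emoji and collapses repeated characters:

print(clean("ㅋㅋㅋㅋㅋ 좋아요😀 https://example.com 최고!!"))
# repeated ㅋ is collapsed to ㅋㅋ, and the emoji and URL are stripped out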

Cleaned Data (Released on Kaggle)

A ~12GB set of txt files, produced by cleaning the raw data with the clean function above, can be downloaded from the Kaggle dataset below :)

https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments

Tokenizer Train

The tokenizer was trained with Hugging Face's Tokenizers library.

Specifically, the BertWordPieceTokenizer was used, with a vocab size of 30000.

The tokenizer was trained on a 1/10 sample of the data; to make the sample more even, it was stratified by date before training.
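
For reference, a tokenizer with these settings could be trained roughly as follows (a minimal sketch, not the exact training script; the input file path and listed special tokens are assumptions):

from tokenizers import BertWordPieceTokenizer

# Train a cased WordPiece tokenizer with a 30000-token vocab.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["sampled_comments.txt"],  # hypothetical 1/10 date-stratified sample
    vocab_size=30000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt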

BERT Model Pretrain

  • KcBERT Base config
{
    "max_position_embeddings": 300,
    "hidden_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "initializer_range": 0.02,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 30000,
    "hidden_size": 768,
    "attention_probs_dropout_prob": 0.1,
    "directionality": "bidi",
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "architectures": [
        "BertForMaskedLM"
    ],
    "model_type": "bert"
}
  • KcBERT Large config
{
    "type_vocab_size": 2,
    "initializer_range": 0.02,
    "max_position_embeddings": 300,
    "vocab_size": 30000,
    "hidden_size": 1024,
    "hidden_dropout_prob": 0.1,
    "model_type": "bert",
    "directionality": "bidi",
    "pad_token_id": 0,
    "layer_norm_eps": 1e-12,
    "hidden_act": "gelu",
    "num_hidden_layers": 24,
    "num_attention_heads": 16,
    "attention_probs_dropout_prob": 0.1,
    "intermediate_size": 4096,
    "architectures": [
        "BertForMaskedLM"
    ]
}

The BERT model configs use the standard Base and Large defaults (e.g., 15% MLM masking).
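
As a reference for how these configs map onto the Transformers API, the Base config can be loaded into a freshly initialized masked-LM model (a minimal sketch, not the actual pretraining script; the local config.json path is an assumption):

from transformers import BertConfig, BertForMaskedLM

# Build a randomly initialized BERT MLM model from the Base config above.
config = BertConfig.from_json_file("config.json")  # hypothetical path to the Base config
model = BertForMaskedLM(config)
print(model.num_parameters())  # roughly 108M parameters for the Base config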

Training ran on a TPU v3-8 for 3 days (Base) and N days (the Large model was still training at the time of writing); the checkpoints currently published on Hugging Face were trained for 1M (one million) steps.

๋ชจ๋ธ ํ•™์Šต Loss๋Š” Step์— ๋”ฐ๋ผ ์ดˆ๊ธฐ 200k์— ๊ฐ€์žฅ ๋น ๋ฅด๊ฒŒ Loss๊ฐ€ ์ค„์–ด๋“ค๋‹ค 400k์ดํ›„๋กœ๋Š” ์กฐ๊ธˆ์”ฉ ๊ฐ์†Œํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • Base Model Loss

KcBERT-Base Pretraining Loss

  • Large Model Loss

KcBERT-Large Pretraining Loss

Training was done on a GCP TPU v3-8; the Base model trained for about 2.5 days, and the Large model for about 5 days, after which the checkpoint with the lowest loss was selected.

Example

HuggingFace MASK LM

You can test it on the Hugging Face kcbert-base model page as shown below.

오늘은 날씨가 "좋네요" ("The weather is nice today"), KcBERT-Base

Of course, the kcbert-large model can be tested the same way.

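The same masked-LM check can be scripted with Transformers' fill-mask pipeline (a minimal sketch; output fields may differ slightly across Transformers versions):

from transformers import pipeline

# Predict the [MASK] token with kcbert-base, as in the widget example above.
fill = pipeline("fill-mask", model="beomi/kcbert-base")
for pred in fill('오늘은 날씨가 [MASK]'):
    print(pred["sequence"], pred["score"])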

NSMC Binary Classification

๋„ค์ด๋ฒ„ ์˜ํ™”ํ‰ ์ฝ”ํผ์Šค ๋ฐ์ดํ„ฐ์…‹์„ ๋Œ€์ƒ์œผ๋กœ Fine Tuning์„ ์ง„ํ–‰ํ•ด ์„ฑ๋Šฅ์„ ๊ฐ„๋‹จํžˆ ํ…Œ์ŠคํŠธํ•ด๋ณด์•˜์Šต๋‹ˆ๋‹ค.

The code for fine-tuning the Base model can be run directly via Open In Colab.

The code for fine-tuning the Large model can be run directly via Open In Colab.

  • On a single P100 GPU, one epoch takes about 2-3 hours; on a TPU, under an hour per epoch.
  • On 4x RTX Titan GPUs, one epoch takes about 30 minutes.
  • The example code was developed with pytorch-lightning; a simplified plain-Transformers sketch follows below.
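
Outside the notebooks, the fine-tuning setup can be sketched with plain Transformers and PyTorch (a simplified illustration, not the pytorch-lightning notebook code; the sample texts, labels, and learning rate are assumptions):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Simplified NSMC-style binary classification fine-tuning step.
tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-base")
model = AutoModelForSequenceClassification.from_pretrained("beomi/kcbert-base", num_labels=2)

texts = ["영화 정말 재밌어요 ㅋㅋ", "시간 아깝다..."]  # illustrative review samples
labels = torch.tensor([1, 0])                      # 1 = positive, 0 = negative
batch = tokenizer(texts, padding=True, truncation=True, max_length=300, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**batch, labels=labels)[0]  # index 0 is the loss in both Transformers 3.x and 4.x
loss.backward()
optimizer.step()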

์‹คํ—˜๊ฒฐ๊ณผ

  • KcBERT-Base model: val acc 0.8905

    KcBERT Base finetune on NSMC

  • KcBERT-Large model: val acc 0.9089

    KcBERT-Large fine-tuning result on NSMC

๋” ๋‹ค์–‘ํ•œ Downstream Task์— ๋Œ€ํ•ด ํ…Œ์ŠคํŠธ๋ฅผ ์ง„ํ–‰ํ•˜๊ณ  ๊ณต๊ฐœํ•  ์˜ˆ์ •์ž…๋‹ˆ๋‹ค.

Citation

When citing KcBERT, please use the BibTeX entry below.

@inproceedings{lee2020kcbert,
  title={KcBERT: Korean Comments BERT},
  author={Lee, Junbum},
  booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
  pages={437--440},
  year={2020}
}

Acknowledgement

The GCP/TPU environment used to train the KcBERT models was supported by the TensorFlow Research Cloud (TFRC) program.

๋ชจ๋ธ ํ•™์Šต ๊ณผ์ •์—์„œ ๋งŽ์€ ์กฐ์–ธ์„ ์ฃผ์‹  Monologg ๋‹˜ ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค :)

Reference

Github Repos

Papers

Blogs
