Ko-AgentBench ✨

A comprehensive evaluation benchmark for Korean tool-calling agents

Ko-AgentBench evaluates how effectively AI agents use the tools that Korean users actually rely on, such as Naver Search, KakaoMap, cryptocurrency exchanges, and stock lookups.

🎯 Key Features

  • 🇰🇷 Korea-specific: uses the APIs of Korean services such as Naver, Kakao, TMAP, Upbit/Bithumb, LS Securities, and Aladin
  • 🔑 No API keys required: cache-based pseudo APIs allow immediate evaluation without real API keys
  • 🎯 Seven independent evaluations: measures tool selection, sequential/parallel reasoning, error handling, efficiency, long-term context, and more
  • 🔄 Reproducible: runs can be repeated under identical conditions, which supports research reproducibility

Tip

Why Ko-AgentBench?

Limitations of existing benchmarks

  • Most tool-calling benchmarks are English-centric and do not reflect the Korean-language environment or the real use cases of Korean users.
  • They only check whether the correct API was called and ignore the complexity of real workflows (error handling, efficiency, context retention, and so on).

What sets Ko-AgentBench apart

  • Evaluation based on tools Koreans actually use: Naver search/blog, Kakao map/place search, TMAP route guidance, Upbit/Bithumb cryptocurrency trading, LS Securities stock lookup, Korea Tourism Organization festival information, Aladin book search, and other everyday services
  • Realistic multi-turn scenarios: rather than a single API call, tasks reflect real workflows that chain several tools and pass data between them
  • Comprehensive evaluation framework: built on seven principles (realism, clarity, discriminative power, robustness, efficiency, reproducibility, and extensibility)

💡 Cache System: Get Started Without API Keys

Ko-AgentBench ships a cache of pre-collected API responses, so the benchmark can run without making real API calls.

  • Read mode (default): uses the cache only, no API keys required → anyone can evaluate immediately
  • Write mode: calls the real APIs and saves the responses to the cache → used when extending the dataset
# Run without API keys (cache read mode)
uv run run_benchmark_with_logging.py --levels L1 --model openai/gpt-4

# Call the real APIs (API keys required)
uv run run_benchmark_with_logging.py --cache-mode write
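
The difference between the two modes can be pictured with a small sketch. It is not the project's implementation: `ResponseCache`, the SHA-256 key scheme, the in-memory store, and the `market` argument are all hypothetical and chosen only to illustrate the read/write behaviour described above.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Optional

ToolArgs = Dict[str, Any]


class ResponseCache:
    """Hypothetical tool-response cache with "read" and "write" modes."""

    def __init__(self, mode: str = "read", store: Optional[Dict[str, Any]] = None) -> None:
        if mode not in ("read", "write"):
            raise ValueError(f"unknown cache mode: {mode!r}")
        self.mode = mode
        # In this sketch the cache is an in-memory dict; a real cache would live on disk.
        self.store: Dict[str, Any] = {} if store is None else store

    @staticmethod
    def make_key(tool: str, args: ToolArgs) -> str:
        # Sorting keys makes the hash independent of argument order.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, tool: str, args: ToolArgs,
             fetch: Optional[Callable[[str, ToolArgs], Any]] = None) -> Any:
        key = self.make_key(tool, args)
        if self.mode == "read":
            # Read mode never calls the real API, so a miss cannot be recovered.
            if key not in self.store:
                raise KeyError(f"no cached response for {tool}")
            return self.store[key]
        # Write mode: call the real API (this is where keys are needed) and record the response.
        if fetch is None:
            raise ValueError("write mode requires a fetch function")
        response = fetch(tool, args)
        self.store[key] = response
        return response


def fake_fetch(tool: str, args: ToolArgs) -> Dict[str, Any]:
    """Stand-in for a real API client (hypothetical)."""
    return {"tool": tool, "echo": args}


writer = ResponseCache("write")
writer.call("CryptoPrice_upbit", {"market": "KRW-BTC"}, fetch=fake_fetch)

# A read-mode cache sharing the same store replays the response without any fetch.
reader = ResponseCache("read", store=writer.store)
replayed = reader.call("CryptoPrice_upbit", {"market": "KRW-BTC"})
```

Because read mode raises on a miss instead of falling back to the network, every run sees exactly the same tool responses, which is what makes results repeatable.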

🛠️ Available API Tools

Ko-AgentBench provides APIs for a range of services that Korean users actually use.

| Service | Tools | Description |
| --- | --- | --- |
| Naver Search | Search_naver_web, Search_naver_blog, Search_naver_news | Naver web, blog, and news search APIs |
| Kakao Local | AddressToCoord_kakao, CoordToAddress_kakao, PlaceSearch_kakao, CategorySearch_kakao | Address-coordinate conversion, place search, category search |
| Upbit | CryptoPrice_upbit, MarketList_upbit, CryptoCandle_upbit | Cryptocurrency current price, market list, candle chart data |
| Bithumb | CryptoPrice_bithumb, OrderBook_bithumb, MarketList_bithumb, CryptoCandle_bithumb | Cryptocurrency current price, order book, market list, candle chart |
| LS Securities | StockPrice_ls, MarketIndex_ls, OrderBook_ls, SectorStock_ls, StockTrades_ls | Domestic and overseas stock prices, market indices, order book, stocks by sector, trade history |
| Korea Investment & Securities | StockPrice_kis, USStockPrice_kis, StockChart_kis | Domestic stock price, US stock price, chart data |
| Aladin | ItemSearch_aladin, ItemList_aladin, ItemLookup_aladin | Book search, bestseller/new release lists, book details |
| TMAP | POISearch_tmap, Geocoding_tmap, ReverseGeocoding_tmap, CarRoute_tmap, CategorySearch_tmap | POI search, address-coordinate conversion, car route guidance, category search |
| Naver Map | Directions_naver | Public transit, car, and walking directions |

Note: All APIs are cache-backed. In read mode (the default) they can be used without real API keys.


📊 Seven Independent Evaluation Dimensions

An agent's tool-calling ability cannot be measured along a single dimension. "Using tools well" combines several independent competencies: selecting the right tool, planning how to chain multiple tools, staying robust when errors occur, and operating efficiently. Ko-AgentBench separates these competencies into seven and measures each one independently. The levels are not difficulty tiers; they evaluate different aspects of ability.

| Level | Task | Description |
| --- | --- | --- |
| L1 | Single tool call | Verifies the ability to execute a given single tool with the correct parameters |
| L2 | Tool selection | Evaluates the ability to choose, among several tools, the one best suited to the user's request |
| L3 | Sequential reasoning | Evaluates sequential planning and execution, where one tool's output becomes the next tool's input |
| L4 | Parallel reasoning | Evaluates the ability to call several tools at once and synthesize their results into a conclusion |
| L5 | Error handling and robustness | Evaluates how the agent copes with exceptional situations such as API call failures and missing information |
| L6 | Efficient tool use | Evaluates efficiency: reusing tool-call results from earlier in the conversation to avoid unnecessary repeated calls |
| L7 | Long-context memory | Evaluates the ability to remember and use the full context of a long conversation to call the appropriate tool |

The seven competencies are measured independently, and the overall score is their weighted average. This gives a fine-grained view of a model's strengths and weaknesses and makes the areas that need improvement easy to identify.


🚀 Quick Start

1) Installation

# Clone the repository
git clone https://github.com/Hugging-Face-KREW/Ko-AgentBench
cd Ko-AgentBench

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync

2) Set the LLM API key

Only the API key for the model being evaluated is needed (tool API keys are not required in cache mode).

# LLM model API key (required; set the one for the provider you evaluate)
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GEMINI_API_KEY="your-gemini-key"

3) Run and evaluate

# Run the benchmark (level L1, cache read mode)
uv run run_benchmark_with_logging.py --levels L1 --model openai/gpt-4

# Evaluate (pass the run date in YYYYMMDD format)
uv run evaluate_model_run.py --date 20251022 --model openai/gpt-4 --format all

Project Structure

Folder structure

Ko-AgentBench/
├─ bench/
│  ├─ tasks/       # YAML task definitions
│  ├─ tools/       # Tool specs and adapters
│  ├─ runner/      # Execution engine and metrics
│  └─ cache/       # API response cache
├─ logs/           # Run logs
├─ reports/        # Evaluation reports
├─ configs/        # Configuration files
├─ run_benchmark_with_logging.py
├─ evaluate_model_run.py
└─ README.md

Pipeline structure

[1] Run: run_benchmark_with_logging.py
    └─ Run logs → logs/ directory
         (conversation, tool calls, parameters, cache hit rate, errors)

[2] Evaluate: evaluate_model_run.py
    └─ Evaluation report → reports/{model}_{date}/
         (metric aggregation, JSON/CSV/Markdown formats)

Running and evaluating are separate steps, so each can be reproduced and automated independently.
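
One way to automate the two steps is to build both command lines from the same model ID and run date. The sketch below only assembles and prints the commands, using flags documented in this README; the helper names `build_run_command` and `build_eval_command` are hypothetical.

```python
import shlex
from datetime import date
from typing import List, Optional


def build_run_command(model: str, levels: Optional[List[str]] = None) -> List[str]:
    """Command for step [1]: run the benchmark and write logs."""
    cmd = ["uv", "run", "run_benchmark_with_logging.py", "--model", model]
    if levels:
        cmd += ["--levels", ",".join(levels)]
    return cmd


def build_eval_command(model: str, run_date: date, fmt: str = "all") -> List[str]:
    """Command for step [2]: evaluate the logs from a given run date."""
    return ["uv", "run", "evaluate_model_run.py",
            "--date", run_date.strftime("%Y%m%d"),
            "--model", model, "--format", fmt]


if __name__ == "__main__":
    steps = [build_run_command("openai/gpt-4", ["L1", "L2"]),
             build_eval_command("openai/gpt-4", date(2025, 10, 22))]
    for cmd in steps:
        # Print only; pass each list to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Keeping the commands as argument lists avoids shell quoting problems if they are later handed to `subprocess.run`.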


⚡ Running the Benchmark

Basic usage

Use run_benchmark_with_logging.py to run a model through the benchmark.

# Run all levels (cache read mode)
uv run run_benchmark_with_logging.py

# Run specific levels only
uv run run_benchmark_with_logging.py --levels L1,L2,L3

# Specify a model
uv run run_benchmark_with_logging.py --model openai/gpt-4

# Use a local model
uv run run_benchmark_with_logging.py --use-local --model Qwen/Qwen2.5-7B-Instruct

# Call the real APIs and save the responses to the cache
uv run run_benchmark_with_logging.py --cache-mode write

# Repeat each task 3 times
uv run run_benchmark_with_logging.py --repetitions 3

Key options

Dataset selection

  • --levels: levels to run (e.g. L1,L2,L6,L7); default: all

Model settings

  • --model: model ID (e.g. openai/gpt-4, anthropic/claude-3-5-sonnet-20241022)
  • --use-local: use local Transformers
  • --quantization: 4bit/8bit
  • --device: cuda/cpu/auto
  • --dtype: auto/float16/bfloat16/float32

Execution control

  • --max-steps: maximum steps per task (default: 10)
  • --timeout: time limit per task in seconds (default: 60)
  • --repetitions: number of repeated runs (used to compute Pass@k)
  • --no-save-logs: disable log saving

Cache mode

  • --cache-mode:
    • read (default): use the stored cache only
    • write: call the real APIs and save the responses to the cache (API keys required in configs/secrets)

Where results are saved

  • Default: logs/benchmark_results/by_model/{model_name}/{timestamp}/

Evaluation

Usage

evaluate_model_run.py analyzes the run logs and generates a report.

# Basic evaluation
uv run evaluate_model_run.py --date 20251022 --model azure/gpt-4o

# Quick test (1 task per level)
uv run evaluate_model_run.py --date 20251022 --model azure/gpt-4o --quick

Key options

  • --date: benchmark run date (YYYYMMDD format)
  • --model: ID of the model to evaluate
  • --judge-models: judge model(s) (default: gpt-4o alone; pass several models for an ensemble)
  • --sample N: evaluate only N tasks per level
  • --quick: evaluate only 1 task per level (sampling)
  • --format: output format (json/csv/markdown/all)

Where results are saved

  • Output: reports/{model}_{date}/
    • evaluation_report.json
    • evaluation_summary.csv
    • evaluation_report.md

🧩 Evaluation Levels and Tasks

The agent's tool-calling ability is evaluated across seven levels.

| Level | Evaluation area | Example | Key metrics |
| --- | --- | --- | --- |
| L1 | Single tool call | "How many minutes does it take to drive from Pangyo Station to Jamsil Baseball Stadium?" | ToolAcc, ArgAcc, CallEM, RespOK |
| L2 | Tool selection | "I want to check the current order book for POSCO Holdings stock" | SelectAcc |
| L3 | Sequential tool reasoning | "Find a university near Cheongnyangni Station, then check how many hospitals are near that school" | FSM, PSM, ΔSteps_norm, ProvAcc |
| L4 | Parallel tool reasoning | "Query Bitcoin prices on several exchanges at once, then compare them" | Coverage, SourceEPR |
| L5 | Error handling and robustness | "Search for the iPhone 17 release date" (fallback path when an API fails) | AdaptiveRoutingScore, FallbackSR |
| L6 | Efficient tool use | "Find books on algorithmic trading with Python" (avoiding duplicate calls) | ReuseRate, RedundantCallRate, EffScore |
| L7 | Long-context memory | "I've recently become interested in Bitcoin..." (multi-turn conversation) | ContextRetention, RefRecall |

🧩 Evaluation Metrics

Common metrics (all levels)

| Metric | Description | Calculation |
| --- | --- | --- |
| SR (Success Rate) | Task completion score | LLM judge score on a 1-5 scale → (score - 1) / 4 |
| EPR/CVR | Ratio of valid tool calls | valid calls / total calls |
| Pass@k | Success rate over k attempts | successful attempts / k |
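
As a rough illustration, the three formulas in the table translate directly into code. The function names and the zero-call and empty-input handling are assumptions made for this sketch, not the benchmark's actual implementation.

```python
from typing import Sequence


def normalize_judge_score(score: float) -> float:
    """Map a 1-5 judge score onto 0-1 via (score - 1) / 4."""
    if not 1 <= score <= 5:
        raise ValueError("judge scores are expected on a 1-5 scale")
    return (score - 1) / 4


def valid_call_ratio(valid_calls: int, total_calls: int) -> float:
    """EPR/CVR: valid calls / total calls."""
    # Returning 0.0 when no calls were made is an assumption of this sketch.
    return valid_calls / total_calls if total_calls else 0.0


def pass_at_k(attempts: Sequence[bool]) -> float:
    """Pass@k as defined in the table: successful attempts / k, with k = len(attempts)."""
    if not attempts:
        raise ValueError("at least one attempt is required")
    return sum(attempts) / len(attempts)


sr = normalize_judge_score(4)                 # a judge score of 4 maps to 0.75
epr = valid_call_ratio(3, 4)                  # 3 valid calls out of 4
p_at_3 = pass_at_k([True, False, True])       # 2 successes in 3 attempts
```

Note that Pass@k here is the plain success fraction given in the table, not the unbiased estimator used by some code-generation benchmarks.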

Level-specific metrics

L1: Single tool call

| Metric | Description | Calculation |
| --- | --- | --- |
| ToolAcc | Correct tool selected | match = 1, mismatch = 0 |
| ArgAcc | Argument accuracy | LLM judge score 1-5 → 0-1 |
| CallEM (Call Exact Match) | Tool and arguments match exactly | 0 or 1 |
| RespOK | Response follows the expected format | 0 or 1 |
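
ToolAcc and CallEM are both binary checks against a reference call, which a short sketch makes concrete. The tuple representation of a call and the `start`/`end` argument names are hypothetical; the real task format may differ.

```python
from typing import Any, Dict, Tuple

# A tool call is modelled here as (tool name, arguments); this shape is an assumption.
Call = Tuple[str, Dict[str, Any]]


def tool_acc(predicted: Call, expected: Call) -> int:
    """1 if the tool names match, else 0."""
    return int(predicted[0] == expected[0])


def call_em(predicted: Call, expected: Call) -> int:
    """1 only if both the tool name and every argument match exactly."""
    return int(predicted[0] == expected[0] and predicted[1] == expected[1])


expected_call: Call = ("CarRoute_tmap", {"start": "Pangyo Station", "end": "Jamsil Baseball Stadium"})
predicted_call: Call = ("CarRoute_tmap", {"start": "Pangyo Station", "end": "Jamsil Stadium"})

right_tool = tool_acc(predicted_call, expected_call)   # tool name matches
exact = call_em(predicted_call, expected_call)         # the "end" argument differs
```

The example shows why the two metrics are reported separately: a model can pick the right tool and still fail the exact match because of one argument.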

L2: Tool selection

| Metric | Description | Calculation |
| --- | --- | --- |
| SelectAcc | Correct tool selection rate | 0 or 1 |

L3: Sequential reasoning

| Metric | Description | Calculation |
| --- | --- | --- |
| FSM (Full Sequence Match) | Call order matches exactly | 0 or 1 |
| PSM (Partial Sequence Match) | Share of required tools included | required tools included / all required tools |
| ΔSteps_norm | Efficiency relative to the minimal path | min(1, minimal steps / actual steps) |
| ProvAcc | Accuracy of argument passing between tools | correct data flows / all data flows |
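
The three formula-based L3 metrics can be sketched as follows, treating a run as an ordered list of tool names. That representation, the function names, and the edge-case handling are assumptions; ProvAcc is left out because it depends on how data flows are recorded.

```python
from typing import Sequence


def fsm(actual: Sequence[str], expected: Sequence[str]) -> int:
    """Full Sequence Match: 1 if the call order is identical, else 0."""
    return int(list(actual) == list(expected))


def psm(actual: Sequence[str], required: Sequence[str]) -> float:
    """Partial Sequence Match: required tools that were called / all required tools."""
    required_set = set(required)
    if not required_set:
        return 1.0  # assumption: nothing required means nothing missed
    return len(required_set & set(actual)) / len(required_set)


def delta_steps_norm(min_steps: int, actual_steps: int) -> float:
    """min(1, minimal steps / actual steps); 0.0 if no steps were taken (assumption)."""
    if actual_steps <= 0:
        return 0.0
    return min(1.0, min_steps / actual_steps)


expected_seq = ["PlaceSearch_kakao", "CategorySearch_kakao"]
actual_seq = ["PlaceSearch_kakao", "AddressToCoord_kakao", "CategorySearch_kakao"]

full_match = fsm(actual_seq, expected_seq)                            # extra call breaks the exact match
partial = psm(actual_seq, expected_seq)                               # both required tools were called
efficiency = delta_steps_norm(len(expected_seq), len(actual_seq))    # 2 minimal steps vs 3 actual
```

In this example the agent reaches every required tool but takes one detour, so PSM stays at its maximum while FSM and ΔSteps_norm are penalized.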

L4: Parallel reasoning

| Metric | Description | Calculation |
| --- | --- | --- |
| Coverage | Share of required tools executed | required tools that succeeded / all required tools |
| SourceEPR | Mean valid-call rate per tool | mean over tools of (valid calls / total calls) |
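
A minimal sketch of both L4 metrics, assuming each call is logged as a pair of tool name and a success flag. That log shape, and treating "valid" and "succeeded" as the same flag, are simplifications made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# (tool name, whether the call was valid and succeeded); a simplifying assumption.
CallResult = Tuple[str, bool]


def coverage(calls: Sequence[CallResult], required: Sequence[str]) -> float:
    """Required tools with at least one successful call / all required tools."""
    required_set = set(required)
    if not required_set:
        return 1.0  # assumption: nothing required means full coverage
    succeeded = {tool for tool, ok in calls if ok}
    return len(required_set & succeeded) / len(required_set)


def source_epr(calls: Sequence[CallResult]) -> float:
    """Average, over tools, of each tool's valid calls / total calls."""
    totals: Dict[str, int] = defaultdict(int)
    valid: Dict[str, int] = defaultdict(int)
    for tool, ok in calls:
        totals[tool] += 1
        valid[tool] += int(ok)
    if not totals:
        return 0.0
    return sum(valid[t] / totals[t] for t in totals) / len(totals)


log: List[CallResult] = [
    ("CryptoPrice_upbit", True),
    ("CryptoPrice_bithumb", False),  # first attempt failed
    ("CryptoPrice_bithumb", True),   # retry succeeded
]
cov = coverage(log, ["CryptoPrice_upbit", "CryptoPrice_bithumb"])
epr_by_source = source_epr(log)
```

Here Coverage is 1.0 because both exchanges were eventually queried, while SourceEPR is 0.75 because one of the two Bithumb calls was wasted.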

L5: Error handling and robustness

| Metric | Description | Calculation |
| --- | --- | --- |
| AdaptiveRoutingScore | How quickly the agent switches to a fallback path after a failure | 1 / (1 + steps of delay before switching) |
| FallbackSR | Fallback path success rate | fallback successes / fallback attempts |
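
Both L5 formulas are simple ratios. The sketch below shows how the routing score decays as the agent delays its switch; the function names and the handling of zero attempts are assumptions.

```python
def adaptive_routing_score(delay_steps: int) -> float:
    """1 / (1 + delay), where delay is the number of steps taken before switching."""
    if delay_steps < 0:
        raise ValueError("delay_steps must be non-negative")
    return 1 / (1 + delay_steps)


def fallback_sr(fallback_successes: int, fallback_attempts: int) -> float:
    """Fallback successes / fallback attempts; 0.0 with no attempts (assumption)."""
    return fallback_successes / fallback_attempts if fallback_attempts else 0.0


# Switching immediately scores 1.0; each extra step of hesitation lowers the score.
scores_by_delay = {delay: adaptive_routing_score(delay) for delay in range(4)}
```

With this formula an immediate switch scores 1.0, a one-step delay 0.5, and a three-step delay 0.25, so repeated retries of a failing API are penalized quickly.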

L6: Efficient tool use

| Metric | Description | Calculation |
| --- | --- | --- |
| ReuseRate | Reuse rate | reuses / (reuses + redundant calls) |
| RedundantCallRate | Rate at which redundant calls are avoided | 1 - (redundant calls / reuse opportunities) |
| EffScore | Efficiency score on success | min(1, minimal steps / actual steps) |
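
A short sketch of the two reuse-related formulas, with hypothetical function names and counts. How the benchmark actually detects a "reuse" or a "reuse opportunity" is not specified here, so the inputs are taken as given counts.

```python
def reuse_rate(reuses: int, redundant_calls: int) -> float:
    """reuses / (reuses + redundant calls); 0.0 if neither occurred (assumption)."""
    total = reuses + redundant_calls
    return reuses / total if total else 0.0


def redundant_call_rate(redundant_calls: int, reuse_opportunities: int) -> float:
    """1 - (redundant calls / reuse opportunities); higher means fewer wasted calls."""
    if reuse_opportunities <= 0:
        return 1.0  # assumption: no opportunities means nothing could be wasted
    return 1 - redundant_calls / reuse_opportunities


# Four chances to reuse an earlier result: three were reused, one was called again.
rr = reuse_rate(reuses=3, redundant_calls=1)
rcr = redundant_call_rate(redundant_calls=1, reuse_opportunities=4)
```

Despite its name, RedundantCallRate as defined in the table rewards avoiding redundancy, so a higher value is better for both metrics.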

L7: Long-context memory

| Metric | Description | Calculation |
| --- | --- | --- |
| ContextRetention | Ability to retain context | LLM judge score 1-5 → 0-1 |
| RefRecall | Accuracy of information recall | LLM judge score 1-5 → 0-1 |

Judge evaluation

  • Judge models: an ensemble of GPT-4o, Claude, and Gemini
  • Score aggregation: mean or median
  • Blind evaluation with model names removed, for fairness
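
Combining several judges could look like the sketch below, which aggregates 1-5 scores by mean or median and then rescales to 0-1. The function name, the judge labels, and the choice to aggregate before rescaling are assumptions; since the rescaling is linear and increasing, either order gives the same result.

```python
from statistics import mean, median
from typing import Dict


def aggregate_judges(scores: Dict[str, float], method: str = "mean") -> float:
    """Combine per-judge 1-5 scores and rescale the result to 0-1."""
    if not scores:
        raise ValueError("at least one judge score is required")
    values = list(scores.values())
    if method == "mean":
        combined = mean(values)
    elif method == "median":
        combined = median(values)
    else:
        raise ValueError(f"unknown aggregation method: {method!r}")
    return (combined - 1) / 4


# Hypothetical judge labels; names are stripped from the evaluated output, not from the judges.
judge_scores = {"judge_a": 5, "judge_b": 4, "judge_c": 2}
by_mean = aggregate_judges(judge_scores, "mean")
by_median = aggregate_judges(judge_scores, "median")
```

The median is less sensitive to a single outlying judge: in this example it yields 0.75, while the mean is pulled down to about 0.67 by the lowest score.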

🧮 Overall Score

  • Basic ability = mean of L1-L3 metrics (40%)
  • Error handling = mean of L5 metrics (20%)
  • Efficiency = mean of L6 metrics (25%)
  • Context handling = mean of L7 metrics (15%)
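
The weighting above amounts to a weighted sum of four group scores. The sketch assumes each group score is already a 0-1 mean of its level metrics; the dictionary keys and function name are hypothetical.

```python
from typing import Dict

# Weights taken from the list above; the key names are illustrative.
WEIGHTS: Dict[str, float] = {
    "basic": 0.40,           # mean of L1-L3 metrics
    "error_handling": 0.20,  # mean of L5 metrics
    "efficiency": 0.25,      # mean of L6 metrics
    "context": 0.15,         # mean of L7 metrics
}


def overall_score(group_scores: Dict[str, float]) -> float:
    """Weighted sum of the four group scores, each expected in the 0-1 range."""
    missing = set(WEIGHTS) - set(group_scores)
    if missing:
        raise ValueError(f"missing group scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * group_scores[name] for name in WEIGHTS)


example = overall_score({
    "basic": 0.8,
    "error_handling": 0.5,
    "efficiency": 0.6,
    "context": 0.4,
})
# 0.40*0.8 + 0.20*0.5 + 0.25*0.6 + 0.15*0.4 = 0.63 (up to floating-point rounding)
```

Because the weights sum to 1, a model scoring 1.0 in every group gets an overall score of 1.0.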

📊 Leaderboard

Evaluation results are saved automatically to reports/{model}_{date}/:

  • JSON (evaluation_report.json)
  • CSV (evaluation_summary.csv)
  • Markdown (evaluation_report.md)

Usage examples

# 1) Evaluate levels L1-L3 with GPT-4
uv run run_benchmark_with_logging.py --levels L1,L2,L3 --model openai/gpt-4

# 2) Evaluate all levels with Claude and build the cache
uv run run_benchmark_with_logging.py --model anthropic/claude-3-5-sonnet-20241022 --cache-mode write

# 3) Evaluate a local model with 4-bit quantization
uv run run_benchmark_with_logging.py --use-local --model Qwen/Qwen2.5-7B-Instruct --quantization 4bit --device cuda

# 4) Evaluate the multi-turn conversation levels
uv run run_benchmark_with_logging.py --levels L6,L7 --max-steps 20

# 5) Generate an evaluation report (default: single judge)
uv run evaluate_model_run.py --date 20251022 --model azure/gpt-4o --format all

# 6) Quick sample evaluation
uv run evaluate_model_run.py --date 20251022 --model azure/gpt-4o --quick

⚖️ License

Released under the Apache-2.0 license.
