A comprehensive collection of multilingual datasets and large language models, meticulously curated for evaluating and enhancing the performance of large language models across diverse languages and tasks.

zabir-nabil/awesome-multilingual-large-language-models

Awesome Multilingual Large Language Models

Datasets

Dataset Year Languages GitHub Download
OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models
2024 Chinese (zh) (🇨🇳), Russian (ru) (🇷🇺), French (fr) (🇫🇷), Spanish (es) (🇪🇸), Arabic (ar) (🇸🇦) GitHub Data
MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property
2024 Chinese (zh) (🇨🇳), English (en) (🇬🇧), German (de) (🇩🇪), Japanese (ja) (🇯🇵), French (fr) (🇫🇷), Korean (ko) (🇰🇷), Russian (ru) (🇷🇺), Spanish (es) (🇪🇸), Portuguese (pt) (🇵🇹), Catalan (ca) (🇦🇩) GitHub Data
MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models
2024 English (en) (🇬🇧), Chinese (zh) (🇨🇳), Japanese (ja) (🇯🇵), French (fr) (🇫🇷), German (de) (🇩🇪) GitHub Data
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
2023 English (🇺🇸), Chinese (🇨🇳), Italian (🇮🇹), Portuguese (🇧🇷), Vietnamese (🇻🇳), Thai (🇹🇭), Swahili (🇰🇪), Afrikaans (🇿🇦), Javanese (🇮🇩) GitHub Data
Language models are multilingual chain-of-thought reasoners
2023 Bengali (🇧🇩), Chinese (🇨🇳), French (🇫🇷), German (🇩🇪), Japanese (🇯🇵), Russian (🇷🇺), Spanish (🇪🇸), Swahili (🇰🇪), Telugu (🇮🇳), Thai (🇹🇭) GitHub Data
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
2023 English [🇬🇧], Russian [🇷🇺], Spanish [🇪🇸], German [🇩🇪], French [🇫🇷], Chinese [🇨🇳], Italian [🇮🇹], Portuguese [🇵🇹], Polish [🇵🇱], Japanese [🇯🇵], Vietnamese [🇻🇳], Dutch [🇳🇱], Arabic [🇸🇦], Turkish [🇹🇷], Czech [🇨🇿], Persian [🇮🇷], Hungarian [🇭🇺], Greek [🇬🇷], Romanian [🇷🇴], Swedish [🇸🇪], Ukrainian [🇺🇦], Finnish [🇫🇮], Korean [🇰🇷], Danish [🇩🇰], Bulgarian [🇧🇬], Norwegian [🇳🇴], Hindi [🇮🇳], Slovak [🇸🇰], Thai [🇹🇭], Lithuanian [🇱🇹], Catalan [🇪🇸], Indonesian [🇮🇩], Bangla [🇧🇩], Estonian [🇪🇪], Slovenian [🇸🇮], Latvian [🇱🇻], Hebrew [🇮🇱], Serbian [🇷🇸], Tamil [🇮🇳], Albanian [🇦🇱], Azerbaijani [🇦🇿] 🤗 Data
Wiki-40B: Multilingual Language Model Dataset
2020 English (🇺🇸), German (🇩🇪), French (🇫🇷), Russian (🇷🇺), Spanish (🇪🇸), Italian (🇮🇹), Japanese (🇯🇵), Chinese Simplified (🇨🇳), Chinese Traditional (🇹🇼), Polish (🇵🇱), Ukrainian (🇺🇦), Dutch (🇳🇱), Swedish (🇸🇪), Portuguese (🇵🇹), Serbian (🇷🇸), Hungarian (🇭🇺), Catalan (🇪🇸), Czech (🇨🇿), Finnish (🇫🇮), Arabic (🇸🇦), Korean (🇰🇷), Persian (🇮🇷), Norwegian (🇳🇴), Vietnamese (🇻🇳), Hebrew (🇮🇱), Indonesian (🇮🇩), Romanian (🇷🇴), Turkish (🇹🇷), Bulgarian (🇧🇬), Estonian (🇪🇪), Malay (🇲🇾), Danish (🇩🇰), Slovak (🇸🇰), Croatian (🇭🇷), Greek (🇬🇷), Lithuanian (🇱🇹), Slovenian (🇸🇮), Thai (🇹🇭), Hindi (🇮🇳), Latvian (🇱🇻), Filipino (🇵🇭) 👁️ Data
Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning
2021 English (🇺🇸), German (🇩🇪), French (🇫🇷), Russian (🇷🇺), Spanish (🇪🇸), Hindi (🇮🇳), Vietnamese (🇻🇳), Bulgarian (🇧🇬), Chinese (🇨🇳), Dutch (🇳🇱), Italian (🇮🇹), Japanese (🇯🇵), Polish (🇵🇱), Portuguese (🇵🇹), Arabic (🇸🇦), Swahili (🇹🇿), Urdu (🇵🇰) GitHub Data
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
2022 Akan (🇬🇭), Arabic (🇸🇦), Assamese (🇮🇳), Bambara (🇲🇱), Basque (🇪🇸), Bengali (🇧🇩), Catalan (🇪🇸), Chichewa (🇲🇼), chiShona (🇿🇼), Chitumbuka (🇲🇼), English (🇬🇧), Fon (🇧🇯), French (🇫🇷), Gujarati (🇮🇳), Hindi (🇮🇳), Igbo (🇳🇬), Indonesian (🇮🇩), isiXhosa (🇿🇦), isiZulu (🇿🇦), Kannada (🇮🇳), Kikuyu (🇰🇪), Kinyarwanda (🇷🇼), Kirundi (🇧🇮), Lingala (🇨🇩), Luganda (🇺🇬), Malayalam (🇮🇳), Marathi (🇮🇳), Nepali (🇳🇵), Northern Sotho (🇿🇦), Odia (🇮🇳), Portuguese (🇵🇹), Punjabi (🇮🇳), Sesotho (🇱🇸), Setswana (🇧🇼), Simplified Chinese (🇨🇳), Spanish (🇪🇸), Swahili (🇰🇪), Tamil (🇮🇳), Telugu (🇮🇳), Traditional Chinese (🇹🇼), Twi (🇬🇭), Urdu (🇵🇰), Vietnamese (🇻🇳), Wolof (🇸🇳), Xitsonga (🇿🇦), Yoruba (🇳🇬), Programming Languages (💻) GitHub Data
GEOMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models
2022 English (🇺🇸), Chinese (🇨🇳), Hindi (🇮🇳), Persian (🇮🇷), Swahili (🇰🇪) GitHub 🔍

Models

Title Year Languages Code Demo
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
2024 Afrikaans [🇿🇦], Amharic [🇪🇹], Arabic [🇸🇦], Azerbaijani [🇦🇿], Belarusian [🇧🇾], Bengali [🇧🇩], Bulgarian [🇧🇬], Catalan [🇪🇸], Cebuano [🇵🇭], Czech [🇨🇿], Welsh [🏴], Danish [🇩🇰], German [🇩🇪], Greek [🇬🇷], English [🇬🇧], Esperanto [🇪🇸], Estonian [🇪🇪], Basque [🇪🇸], Finnish [🇫🇮], Tagalog [🇵🇭], French [🇫🇷], Western Frisian [🇳🇱], Scottish Gaelic [🏴], Irish [🇮🇪], Galician [🇪🇸], Gujarati [🇮🇳], Haitian Creole [🇭🇹], Hausa [🇳🇪], Hebrew [🇮🇱], Hindi [🇮🇳], Hungarian [🇭🇺], Armenian [🇦🇲], Igbo [🇳🇬], Indonesian [🇮🇩], Icelandic [🇮🇸], Italian [🇮🇹], Javanese [🇮🇩], Japanese [🇯🇵], Kannada [🇮🇳], Georgian [🇬🇪], Kazakh [🇰🇿], Khmer [🇰🇭], Kyrgyz [🇰🇬], Korean [🇰🇷], Kurdish [🇹🇷], Lao [🇱🇦], Latvian [🇱🇻], Latin [🇻🇦], Lithuanian [🇱🇹], Luxembourgish [🇱🇺], Malayalam [🇮🇳], Marathi [🇮🇳], Macedonian [🇲🇰], Malagasy [🇲🇬], Maltese [🇲🇹], Mongolian [🇲🇳], Maori [🇳🇿], Malay [🇲🇾], Burmese [🇲🇲], Nepali [🇳🇵], Dutch [🇳🇱], Norwegian [🇳🇴], Northern Sotho [🇿🇦], Chichewa [🇲🇼], Oriya [🇮🇳], Punjabi [🇮🇳], Persian [🇮🇷], Polish [🇵🇱], Portuguese [🇵🇹], Pashto [🇦🇫], Romanian [🇷🇴], Russian [🇷🇺], Sinhala [🇱🇰], Slovak [🇸🇰], Slovenian [🇸🇮], Samoan [🇼🇸], Shona [🇿🇼], Sindhi [🇵🇰], Somali [🇸🇴], Southern Sotho [🇱🇸], Spanish [🇪🇸], Albanian [🇦🇱], Serbian [🇷🇸], Sundanese [🇮🇩], Swahili [🇰🇪], Swedish [🇸🇪], Tamil [🇮🇳], Telugu [🇮🇳], Tajik [🇹🇯], Thai [🇹🇭], Turkish [🇹🇷], Twi [🇬🇭], Ukrainian [🇺🇦], Urdu [🇵🇰], Uzbek [🇺🇿], Vietnamese [🇻🇳], Xhosa [🇿🇦], Yiddish [🇮🇱], Yoruba [🇳🇬], Chinese [🇨🇳], Zulu [🇿🇦] Source 🤗
LANGBRIDGE: Multilingual Reasoning Without Multilingual Supervision
2024 Arabic (ar) (🇸🇦), Bengali (bn) (🇧🇩), Chinese (zh) (🇨🇳), Danish (da) (🇩🇰), Dutch (nl) (🇳🇱), English (en) (🇬🇧), French (fr) (🇫🇷), German (de) (🇩🇪), Hindi (hi) (🇮🇳), Japanese (ja) (🇯🇵), Korean (ko) (🇰🇷), Marathi (mr) (🇮🇳), Punjabi (pa) (🇮🇳), Russian (ru) (🇷🇺), Spanish (es) (🇪🇸), Swahili (sw) (🇰🇪), Telugu (te) (🇮🇳), Turkish (tr) (🇹🇷), Urdu (ur) (🇵🇰) GitHub 🤗
Orion-14B: Open-source Multilingual Large Language Models
2024 English [🇬🇧], Chinese [🇨🇳], Japanese [🇯🇵], Korean [🇰🇷], Spanish [🇪🇸], French [🇫🇷], German [🇩🇪], Arabic [🇸🇦] GitHub 🤗
Baichuan 2: Open Large-scale Language Models
2023 Arabic (ar) (🇸🇦), Chinese (zh) (🇨🇳), English (en) (🇬🇧), French (fr) (🇫🇷), Russian (ru) (🇷🇺), Spanish (es) (🇪🇸), German (de) (🇩🇪), Japanese (ja) (🇯🇵) GitHub 🤗
Larger-Scale Transformers for Multilingual Masked Language Modeling
2021 Afrikaans (🇿🇦), Albanian (🇦🇱), Amharic (🇪🇹), Arabic (🇸🇦), Armenian (🇦🇲), Assamese (🇮🇳), Azerbaijani (🇦🇿), Basque (🇪🇸), Belarusian (🇧🇾), Bengali (🇧🇩), Bengali Romanized (🇧🇩), Bosnian (🇧🇦), Breton (🏴), Bulgarian (🇧🇬), Burmese (🇲🇲), Burmese (Zawgyi font) (🇲🇲), Catalan (🇪🇸), Chinese (Simplified) (🇨🇳), Chinese (Traditional) (🇹🇼), Croatian (🇭🇷), Czech (🇨🇿), Danish (🇩🇰), Dutch (🇳🇱), English (🇬🇧), Esperanto (🏴), Estonian (🇪🇪), Filipino (🇵🇭), Finnish (🇫🇮), French (🇫🇷), Galician (🇪🇸), Georgian (🇬🇪), German (🇩🇪), Greek (🇬🇷), Gujarati (🇮🇳), Hausa (🇳🇬), Hebrew (🇮🇱), Hindi (🇮🇳), Hindi Romanized (🇮🇳), Hungarian (🇭🇺), Icelandic (🇮🇸), Indonesian (🇮🇩), Irish (🇮🇪), Italian (🇮🇹), Japanese (🇯🇵), Javanese (🇮🇩), Kannada (🇮🇳), Kazakh (🇰🇿), Khmer (🇰🇭), Korean (🇰🇷), Kurdish (Kurmanji) (🇹🇷), Kyrgyz (🇰🇬), Lao (🇱🇦), Latin (🏛️), Latvian (🇱🇻), Lithuanian (🇱🇹), Macedonian (🇲🇰), Malagasy (🇲🇬), Malay (🇲🇾), Malayalam (🇮🇳), Marathi (🇮🇳), Mongolian (🇲🇳), Nepali (🇳🇵), Norwegian (🇳🇴), Oriya (🇮🇳), Oromo (🇪🇹), Pashto (🇦🇫), Persian (🇮🇷), Polish (🇵🇱), Portuguese (🇵🇹), Punjabi (🇮🇳), Romanian (🇷🇴), Russian (🇷🇺), Sanskrit (🇮🇳), Scottish Gaelic (🏴), Serbian (🇷🇸), Sindhi (🇵🇰), Sinhala (🇱🇰), Slovak (🇸🇰), Slovenian (🇸🇮), Somali (🇸🇴), Spanish (🇪🇸), Sundanese (🇮🇩), Swahili (🇰🇪), Swedish (🇸🇪), Tamil (🇮🇳), Tamil Romanized (🇮🇳), Telugu (🇮🇳), Telugu Romanized (🇮🇳), Thai (🇹🇭), Turkish (🇹🇷), Ukrainian (🇺🇦), Urdu (🇵🇰), Urdu Romanized (🇵🇰), Uyghur (🇨🇳), Uzbek (🇺🇿), Vietnamese (🇻🇳), Welsh (🏴), Western Frisian (🇳🇱), Xhosa (🇿🇦), Yiddish (🇮🇱) GitHub 🔍
InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities
2023 English (🇺🇸), Chinese (🇨🇳) GitHub 🔍
PolyLM: An Open Source Polyglot Large Language Model
2023 English (EN) [🇬🇧], Chinese (ZH) [🇨🇳], Russian (RU) [🇷🇺], Spanish (ES) [🇪🇸], German (DE) [🇩🇪], French (FR) [🇫🇷], Italian (IT) [🇮🇹], Portuguese (PT) [🇵🇹], Japanese (JA) [🇯🇵], Vietnamese (VI) [🇻🇳], Indonesian (ID) [🇮🇩], Polish (PL) [🇵🇱], Dutch (NL) [🇳🇱], Arabic (AR) [🇦🇪], Turkish (TR) [🇹🇷], Thai (TH) [🇹🇭], Hebrew (HE) [🇮🇱], Korean (KO) [🇰🇷] Model 🔍
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
2023 Akan (🇬🇭), Arabic (🇸🇦), Assamese (🇮🇳), Bambara (🇲🇱), Basque (🇪🇸), Bengali (🇧🇩), Catalan (🇪🇸), Chichewa (🇲🇼), chiShona (🇿🇼), Chitumbuka (🇲🇼), English (🇬🇧), Fon (🇧🇯), French (🇫🇷), Gujarati (🇮🇳), Hindi (🇮🇳), Igbo (🇳🇬), Indonesian (🇮🇩), isiXhosa (🇿🇦), isiZulu (🇿🇦), Kannada (🇮🇳), Kikuyu (🇰🇪), Kinyarwanda (🇷🇼), Kirundi (🇧🇮), Lingala (🇨🇩), Luganda (🇺🇬), Malayalam (🇮🇳), Marathi (🇮🇳), Nepali (🇳🇵), Northern Sotho (🇿🇦), Odia (🇮🇳), Portuguese (🇵🇹), Punjabi (🇮🇳), Sesotho (🇱🇸), Setswana (🇧🇼), Simplified Chinese (🇨🇳), Spanish (🇪🇸), Swahili (🇰🇪), Tamil (🇮🇳), Telugu (🇮🇳), Traditional Chinese (🇹🇼), Twi (🇬🇭), Urdu (🇵🇰), Vietnamese (🇻🇳), Wolof (🇸🇳), Xitsonga (🇿🇦), Yoruba (🇳🇬), Programming Languages (💻) GitHub 🤗
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
2023 hbs_Latn (🇭🇷), mal_Mlym (🇮🇳), aze_Latn (🇦🇿), guj_Gujr (🇮🇳), ben_Beng (🇮🇳), kan_Knda (🇮🇳), tel_Telu (🇮🇳), mlt_Latn (🇲🇹), fra_Latn (🇫🇷), spa_Latn (🇪🇸), eng_Latn (🇬🇧), fil_Latn (🇵🇭), nob_Latn (🇳🇴), rus_Cyrl (🇷🇺), deu_Latn (🇩🇪), tur_Latn (🇹🇷), pan_Guru (🇮🇳), mar_Deva (🇮🇳), por_Latn (🇵🇹), nld_Latn (🇳🇱), ara_Arab (🇸🇦), zho_Hani (🇨🇳), ita_Latn (🇮🇹), ind_Latn (🇮🇩), ell_Grek (🇬🇷), bul_Cyrl (🇧🇬), swe_Latn (🇸🇪), ces_Latn (🇨🇿), isl_Latn (🇮🇸), pol_Latn (🇵🇱), ron_Latn (🇷🇴), dan_Latn (🇩🇰), hun_Latn (🇭🇺), tgk_Cyrl (🇹🇯), srp_Latn (🇷🇸), fas_Arab (🇮🇷), ceb_Latn (🇵🇭), heb_Hebr (🇮🇱), hrv_Latn (🇭🇷), glg_Latn (🇪🇸), fin_Latn (🇫🇮), slv_Latn (🇸🇮), vie_Latn (🇻🇳), mkd_Cyrl (🇲🇰), slk_Latn (🇸🇰), nor_Latn (🇳🇴), est_Latn (🇪🇪), ltz_Latn (🇱🇺), eus_Latn (🇪🇸), lit_Latn (🇱🇹), kaz_Cyrl (🇰🇿), lav_Latn (🇱🇻), bos_Latn (🇧🇦), epo_Latn (🇺🇸), cat_Latn (🇪🇸), tha_Thai (🇹🇭), ukr_Cyrl (🇺🇦), tgl_Latn (🇵🇭), sin_Sinh (🇱🇰), gle_Latn (🇮🇪), hin_Deva (🇮🇳), kor_Hang (🇰🇷), ory_Orya (🇮🇳), urd_Arab (🇵🇰), swa_Latn (🇰🇪), sqi_Latn (🇦🇱), bel_Cyrl (🇧🇾), afr_Latn (🇿🇦), nno_Latn (🇳🇴), tat_Cyrl (🇷🇺), asm_Beng (🇮🇳), hil_Latn (🇵🇭), nso_Latn (🇿🇦), ibo_Latn (🇳🇬), kin_Latn (🇷🇼), tpi_Latn (🇵🇬), twi_Latn (🇬🇭), kir_Cyrl (🇰🇬), nep_Deva (🇳🇵), azj_Latn (🇦🇿), bcl_Latn (🇵🇭), xho_Latn (🇿🇦), cym_Latn (🏴), gaa_Latn (🇬🇭), ton_Latn (🇹🇴), tah_Latn (🇵🇫), lat_Latn (🇻🇦), srn_Latn (🇸🇷), ewe_Latn (🇬🇭), bem_Latn (🇿🇲), orm_Latn (🇪🇹), haw_Latn (🇺🇸), hmo_Latn (🇵🇬), kat_Geor (🇬🇪), pag_Latn (🇵🇭), loz_Latn (🇿🇲), fry_Latn (🇳🇱), mya_Mymr (🇲🇲), nds_Latn (🇩🇪), run_Latn (🇧🇮), pnb_Arab (🇵🇰), rar_Latn (🇨🇰), fij_Latn (🇫🇯), wls_Latn (🇼🇸), ckb_Arab (🇮🇶), ven_Latn (🇿🇦), zsm_Latn (🇲🇾), chv_Cyrl (🇷🇺), lua_Latn (🇨🇩), que_Latn (🇵🇪), sag_Latn (🇨🇫), guw_Latn (🇬🇼), bre_Latn (🇫🇷), toi_Latn (🇨🇫), pus_Arab (🇦🇫), che_Cyrl (🇷🇺), pis_Latn (🇸🇧), kon_Latn (🇨🇩), oss_Cyrl (🇷🇺), hyw_Armn (🇦🇲), iso_Latn (🇻🇺), nan_Latn (🇹🇼), lub_Latn (🇨🇩), lim_Latn (🇳🇱), tuk_Latn (🇹🇲), tir_Ethi (🇪🇹), tgk_Latn (🇹🇯), yua_Latn (🇲🇽), min_Latn (🇮🇩), lue_Latn (🇨🇩), khm_Khmr (🇰🇭), tum_Latn (🇲🇼), tll_Latn (🇳🇦), ekk_Latn (🇪🇪), lug_Latn (🇺🇬), niu_Latn (🇳🇺), tzo_Latn (🇲🇽), mah_Latn (🇲🇭), tvl_Latn (🇹🇻), jav_Latn (🇮🇩), hau_Latn (🇳🇬), som_Latn (🇸🇴), uzb_Latn (🇺🇿), sot_Latn (🇿🇦), uzb_Cyrl (🇺🇿), cos_Latn (🇫🇷), als_Latn (🇦🇱), amh_Ethi (🇪🇹), sun_Latn (🇮🇩), war_Latn (🇵🇭), div_Thaa (🇲🇻), yor_Latn (🇳🇬), fao_Latn (🇫🇴), uzn_Cyrl (🇺🇿), smo_Latn (🇼🇸), bak_Cyrl (🇷🇺), ilo_Latn (🇵🇭), tso_Latn (🇿🇦), mri_Latn (🇳🇿), hmn_Latn (🇺🇸), nau_Latn (🇳🇷), asm_Beng (🇮🇳), hil_Latn (🇵🇭), nso_Latn (🇿🇦), ibo_Latn (🇳🇬), kin_Latn (🇷🇼), tpi_Latn (🇵🇬), twi_Latn (🇬🇭), kir_Cyrl (🇰🇬), pap_Latn (🇳🇱), aze_Latn (🇦🇿), qvi_Latn (🇵🇪), cak_Latn (🇬🇹), kbp_Latn (🇧🇫), kri_Latn (🇸🇱), mau_Latn (🇲🇽), scn_Latn (🇮🇹), tyv_Cyrl (🇷🇺), ina_Latn (🇧🇪), btx_Latn (🇮🇩), nch_Latn (🇲🇽), ncj_Latn (🇲🇽), pau_Latn (🇵🇼), toj_Latn (🇲🇽), pcm_Latn (🇳🇬), dyu_Latn (🇧🇫), kss_Latn (🇳🇬), afb_Arab (🇸🇦), urh_Latn (🇳🇬), quc_Latn (🇬🇹), new_Deva (🇳🇵), yao_Latn (🇲🇼), ngl_Latn (🇲🇿), nyu_Latn (🇲🇿), kab_Latn (🇩🇿), tuk_Cyrl (🇹🇲), xmf_Geor (🇬🇪), ndc_Latn (🇲🇿), san_Deva (🇮🇳), nba_Latn (🇳🇬), bpy_Beng (🇮🇳), ncx_Latn (🇲🇽), qug_Latn (🇵🇪), rmn_Latn (🇮🇳), cjk_Latn (🇬🇹), arb_Arab (🇸🇦), kea_Latn (🇨🇻), mck_Latn (🇨🇩), arn_Latn (🇨🇱), pdt_Latn (🇩🇪), her_Latn (🇳🇦), tlh_Latn (🇺🇸), suz_Deva (🇮🇳), kat_Geor (🇬🇪), kmr_Cyrl (🇷🇺), gcr_Latn (🇬🇵), jbo_Latn (🇺🇸), tbz_Latn (🇵🇼), bam_Latn (🇲🇱), prk_Latn (🇸🇮), jam_Latn (🇯🇲), twx_Latn (🇹🇼), sme_Latn (🇫🇮), gom_Latn (🇮🇳), bum_Latn (🇨🇲), mgr_Latn (🇲🇼), ahk_Latn (🇵🇰), kur_Arab (🇮🇶), bas_Latn (🇨🇲), bin_Latn (🇳🇬), tsz_Latn (🇲🇽), sid_Latn (🇪🇹), diq_Latn (🇹🇷), srd_Latn (🇮🇹), tcf_Latn (🇲🇽), bzj_Latn (🇮🇳), udm_Cyrl (🇷🇺), cce_Latn (🇨🇲), meu_Latn (🇨🇩), chw_Latn (🇨🇲), cbk_Latn (🇵🇭), ibg_Latn (🇮🇩), bhw_Latn (🇮🇩), ngu_Latn (🇲🇽), nyy_Latn (🇹🇿), szl_Latn (🇵🇱), ish_Latn (🇹🇿), naq_Latn (🇳🇦), toh_Latn (🇳🇿), ttj_Latn (🇰🇪), nse_Latn (🇳🇬), ami_Latn (🇹🇼), alz_Latn (🇸🇩), apc_Arab (🇸🇾), vls_Latn (🇳🇱), mhr_Cyrl (🇷🇺), djk_Latn (🇩🇪), prs_Arab (🇦🇫), san_Latn (🇮🇳), som_Arab (🇸🇴), uig_Latn (🇨🇳), hau_Arab (🇳🇬) GitHub 🔍
Few-shot Learning with Multilingual Generative Language Models
2022 English (🇺🇸), Russian (🇷🇺), Chinese (🇨🇳), German (🇩🇪), Spanish (🇪🇸), French (🇫🇷), Japanese (🇯🇵), Italian (🇮🇹), Portuguese (🇵🇹), Greek (🇬🇷), Romanian (🇷🇴), Ukrainian (🇺🇦), Hungarian (🇭🇺), Korean (🇰🇷), Polish (🇵🇱), Norwegian (🇳🇴), Dutch (🇳🇱), Finnish (🇫🇮), Danish (🇩🇰), Indonesian (🇮🇩), Croatian (🇭🇷), Turkish (🇹🇷), Arabic (🇸🇦), Vietnamese (🇻🇳), Thai (🇹🇭), Bulgarian (🇧🇬), Persian (🇮🇷), Swedish (🇸🇪), Malay (🇲🇾), Hebrew (🇮🇱), Czech (🇨🇿), Slovak (🇸🇰), Catalan (🇪🇸), Lithuanian (🇱🇹), Slovene (🇸🇮), Hindi (🇮🇳), Estonian (🇪🇪), Latvian (🇱🇻), Tagalog (🇵🇭), Albanian (🇦🇱), Serbian (🇷🇸), Azerbaijani (🇦🇿), Bengali (🇧🇩), Tamil (🇮🇳), Urdu (🇵🇰), Kazakh (🇰🇿), Armenian (🇦🇲), Georgian (🇬🇪), Icelandic (🇮🇸), Belarusian (🇧🇾), Bosnian (🇧🇦), Malayalam (🇮🇳), Macedonian (🇲🇰), Swahili (🇹🇿), Afrikaans (🇿🇦), Telugu (🇮🇳), Arabic Romanized (🇸🇦), Mongolian (🇲🇳), Latin (🇮🇹), Nepali (🇳🇵), Sinhalese (🇱🇰), Marathi (🇮🇳), Kannada (🇮🇳), Somali (🇸🇴), Welsh (🏴), Javanese (🇮🇩), Pashto (🇦🇫), Uzbek (🇺🇿), Gujarati (🇮🇳), Khmer (🇰🇭), Urdu Romanized (🇵🇰), Amharic (🇪🇹), Bengali Romanized (🇧🇩), Punjabi (🇮🇳), Galician (🇪🇸), Hausa (🇳🇬), Sanskrit (🇮🇳), Basque (🇪🇸), Burmese (🇲🇲), Sundanese (🇮🇩), Oriya (🇮🇳), Haitian (🇭🇹), Lao (🇱🇦), Kyrgyz (🇰🇬), Breton (🇫🇷), Irish (🇮🇪), Yoruba (🇳🇬), Esperanto (🌍), Tamil Romanized (🇮🇳), Zulu (🇿🇦), Tigrinya (🇪🇷), Telugu Romanized (🇮🇳), Kurdish (🇹🇷), Oromo (🇪🇹), Xhosa (🇿🇦), Scottish Gaelic (🇬🇧), Igbo (🇳🇬), Assamese (🇮🇳), Ganda (🇺🇬), Wolof (🇸🇳), Western Frisian (🇳🇱), Tswana (🇧🇼), Fula (🇸🇳), Guaraní (🇵🇾), Sindhi (🇵🇰), Lingala (🇨🇩), Bambara (🇲🇱), Inuktitut (🇨🇦), Kongo (🇨🇩), Quechua (🇵🇪), Swati (🇸🇿), Unassigned (🌍) GitHub 🔍
Introducing L2M3, A Multilingual Medical Large Language Model to Advance Health Equity in Low-Resource Regions
2024 English (🇺🇸), Chinese (🇨🇳), Telugu (🇮🇳), Hindi (🇮🇳), Arabic (🇸🇦), Swahili (🇹🇿), Bengali (🇧🇩) 🔍 🔍
Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning
2022 Afrikaans (🇿🇦), Amharic (🇪🇹), Hausa (🇳🇬), Igbo (🇳🇬), Malagasy (🇲🇬), Chichewa (🇲🇼), Oromo (🇪🇹), Naija (🇳🇬), Kinyarwanda (🇷🇼), Kirundi (🇧🇮), Shona (🇿🇼), Somali (🇸🇴), Sesotho (🇱🇸), Swahili (🇹🇿), isiXhosa (🇿🇦), Yoruba (🇳🇬), isiZulu (🇿🇦), English (🇬🇧), French (🇫🇷), Arabic (🇸🇦), Lingala (🇨🇩), Luganda (🇺🇬), Luo (🇰🇪), Wolof (🇸🇳) GitHub 🤗
MuRIL: Multilingual Representations for Indian Languages
2021 Assamese (🇮🇳), Bengali (🇧🇩), Gujarati (🇮🇳), Hindi (🇮🇳), Kannada (🇮🇳), Kashmiri (🇮🇳), Malayalam (🇮🇳), Marathi (🇮🇳), Nepali (🇳🇵), Oriya (🇮🇳), Punjabi (🇮🇳), Sanskrit (🇮🇳), Sindhi (🇵🇰), Tamil (🇮🇳), Telugu (🇮🇳), Urdu (🇮🇳), English (🇬🇧) 🔍 🔍
From English to Foreign Languages: Transferring Pretrained Language Models
2020 French (🇫🇷), Russian (🇷🇺), Arabic (🇦🇪), Chinese (🇨🇳), Hindi (🇮🇳), Vietnamese (🇻🇳) 🔍 🔍

Multilingual Vision Language Models

Survey / Review Papers