Japanese RoBERTa

This repository provides snippets for using RoBERTa pre-trained on a Japanese corpus. Our dataset consists of Japanese Wikipedia and web-crawled articles, 25 GB in total. The released model is built on the RoBERTa implementation from HuggingFace.

Details

Preprocessing

We used Juman++ (version 2.0.0-rc3) as the morphological analyzer and applied subword splitting with subword-nmt to divide each word into word pieces.
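For reference, below is a minimal sketch of this preprocessing pipeline. It assumes the pyknp Python binding for Juman++ and the subword-nmt package; the file name bpe.txt refers to the released BPE codes, but the exact options are illustrative and not taken from this repository's code.

from pyknp import Juman
from subword_nmt.apply_bpe import BPE

juman = Juman()  # requires a working Juman++ installation

def to_word_pieces(text, bpe):
    # 1) Morphological analysis with Juman++: split the sentence into words.
    words = [m.midasi for m in juman.analysis(text).mrph_list()]
    # 2) Subword splitting: divide each word into word pieces.
    return bpe.process_line(" ".join(words)).split(" ")

with open("bpe.txt", encoding="utf-8") as f:
    bpe = BPE(f)

print(to_word_pieces("専門的な技術集団を形成する。", bpe))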

Model

The configuration of our model is as follows. Please refer to the HuggingFace documentation for the definition of each parameter.

{
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 3,
  "classifier_dropout": null,
  "eos_token_id": 0,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 515,
  "max_seq_length": 512,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 2,
  "position_embedding_type": "absolute",
  "transformers_version": "4.16.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 32005
}
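This JSON maps directly onto a HuggingFace RobertaConfig, so it can be loaded and inspected as in the sketch below (illustrative; it assumes the file is saved locally as roberta_config.json and builds only a randomly initialized model, with the trained weights loaded later in the usage steps).

import json

from transformers import RobertaConfig, RobertaModel

with open("roberta_config.json", "r") as f:
    config = RobertaConfig.from_dict(json.load(f))

# Randomly initialized model with the dimensions above.
model = RobertaModel(config)
print(config.hidden_size, config.num_hidden_layers, config.vocab_size)
print(sum(p.numel() for p in model.parameters()))  # total parameter count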

Training

We trained our model in the same way as the original RoBERTa, optimizing it with the masked language modeling (MLM) objective. The final MLM accuracy is 72.0%.
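The repository does not spell out the masking recipe, so the sketch below only illustrates the standard BERT/RoBERTa-style masking used for the MLM objective (mask 15% of tokens; of those, replace 80% with the mask token, 10% with a random token, and leave 10% unchanged). The mask token id and vocabulary size are placeholders, not values confirmed by this model.

import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    # Standard BERT/RoBERTa-style masking for the MLM objective (illustrative).
    labels = input_ids.clone()

    # Select 15% of positions as prediction targets; the rest are ignored by the loss.
    target = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~target] = -100

    # 80% of targets are replaced with the mask token.
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & target
    input_ids[masked] = mask_token_id

    # 10% of targets are replaced with a random token; the remaining 10% stay unchanged.
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & target & ~masked
    input_ids[randomized] = torch.randint(vocab_size, labels.shape)[randomized]

    return input_ids, labels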

How to use our model

  1. Install Juman++ following the instructions in the Juman++ repository.
  2. Install PyTorch following the official installation instructions.
  3. Install the required Python packages.
pip install -r requirements.txt
  4. Download the weight files and configuration files.
File                   Description
roberta.pth            Trained weights of RobertaModel from HuggingFace
linear_word.pth        Trained weights of linear layer for MLM
roberta_config.json    Configurations for RobertaModel
bpe.txt                Rules for splitting each word into word pieces
bpe_dict.csv           Dictionary of word pieces
  5. Load weights and configurations.
    import json

    import numpy as np
    import torch
    from transformers import RobertaConfig, RobertaModel

    # TextProcessor and IFXRoberta are provided by this repository (see main.py).

    # Paths to each file
    bpe_file = <Path to the file (bpe.txt) which defines word pieces>
    count_file = <Path to the file (bpe_dict.csv) which defines ids for word pieces>
    roberta_config_path = <Path to the file (roberta_config.json) which defines the configuration of RobertaModel>
    juman_config_path = <Path to the config file for Juman++>

    roberta_weight_path = <Path to the weight file (roberta.pth) of RobertaModel>
    linear_weight_path = <Path to the weight file (linear_word.pth) of the final linear layer for MLM>

    # Run on GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the tokenizer
    processor = TextProcessor(bpe_file=bpe_file, count_file=count_file)

    # Load the pretrained RoBERTa encoder
    with open(roberta_config_path, "r") as f:
        config_dict = json.load(f)
    config_roberta = RobertaConfig.from_dict(config_dict)
    roberta = RobertaModel(config=config_roberta)
    roberta.load_state_dict(torch.load(roberta_weight_path, map_location=device))

    # Load the pretrained decoder (final linear layer for MLM)
    ifxroberta = IFXRoberta(roberta)
    ifxroberta.linear_word.load_state_dict(torch.load(linear_weight_path, map_location=device))
    ifxroberta.to(device)
    ifxroberta.eval()
  6. Encode inputs.
    # Run inference on an example sentence
    inp_text = "コンピュータ技術およびその関連技術に対する研究開発に努め、自己研鑽に励み、専門的な技術集団を形成する。"
    bpe_ids = processor.encode(inp_text)
    with torch.no_grad():
        inp_tensor = torch.LongTensor([bpe_ids])
        inp_tensor = inp_tensor.to(device)
        out_tensor = ifxroberta(inp_tensor)
  7. Decode model outputs.
    # Decode the output: take the most likely token id at each position
    out_text_code = torch.max(out_tensor, dim=-1, keepdim=True)[1][0]
    ids = out_text_code.squeeze(-1).cpu().numpy().tolist()
    # Ignore positions whose input id is below 5 (special tokens)
    ignore_idx = np.where(np.array(bpe_ids) < 5)[0]
    out_text = processor.decode(ids, ignore_idx=ignore_idx)
  8. main.py shows a complete end-to-end example of how to use our model; a short feature-extraction sketch also follows this list.
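Beyond MLM decoding, the released RobertaModel weights can also serve as a sentence encoder. The sketch below is an illustrative extension of step 5: it reuses the processor, roberta, and device objects defined there and mean-pools the final hidden states into a fixed-size sentence vector. The pooling choice is ours, not something prescribed by this repository.

import torch

roberta.to(device)
roberta.eval()

text = "コンピュータ技術およびその関連技術に対する研究開発に努める。"
ids = torch.LongTensor([processor.encode(text)]).to(device)

with torch.no_grad():
    hidden = roberta(ids).last_hidden_state   # shape: (1, sequence_length, 768)
    sentence_vec = hidden.mean(dim=1)         # mean pooling over tokens -> (1, 768)

print(sentence_vec.shape)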

License

Apache License 2.0
