Not able to train model for Chinese #54

Open
harold1505 opened this issue Dec 10, 2020 · 1 comment

harold1505 commented Dec 10, 2020

While training on a Chinese corpus, my model doesn't seem to train. It prints the following statistics at the end of each iteration:

  log_2 likelihood: nan
     cross entropy: -nan
        perplexity: -nan
      posterior p0: 0
 posterior al-feat: 0
       size counts: 33102

Some lines from training corpus:

3月 6日 , 安理会 举行 了 一 次 非公开 会议 ( 第4286 次 会议 ) 南斯拉夫 联盟 共和国 总理 佐兰 · 日日奇 参加 了 会议 。 ||| On 6 March, the Council held a private meeting (4286th) with the participation of the Prime Minister of the Federal Republic of Yugoslavia, Zoran Žižić.
为 了 实现 这个 目标 , 实现 千 年 发展 目标 , 我们 认为 拥有 资源 的 国 家 必须 努力 提供 与 此 挑战 相 适宜 的 资金 。 ||| To achieve that objective and to attain the Millennium Development Goals, we believe that the countries possessing the resources must make a financial effort commensurate with the challenge.
㈡ 在 政府 间 论坛 上 对 咨询 服务 表示 满意 的 机构 的 数目 ||| (ii) Number of institutions expressing satisfaction with advisory services in intergovernmental forums
本 文件 的 增编 详细 介绍 了 中心 在 1995 - 1996年 发挥 的 作用 。 ||| Details of the Centre's role during the period 1995-1996 is provided in an addendum to the present document.

I'm using thulac to tokenize the Chinese side of the corpus.

Potato-Shy commented

I have the same problem. Did you solve it?
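
One thing that may be worth ruling out before looking further is a malformed line somewhere in the corpus. This is a guess, not a confirmed diagnosis of the `nan` values. The sketch below assumes the trainer expects exactly one `source ||| target` pair per line, as in the sample above, and reports lines where either side is empty or the separator count is off. The function name `find_malformed_lines` is hypothetical, made up for this example.

```python
import sys
from typing import Iterable, List, Tuple

SEPARATOR = "|||"  # separator used in the corpus sample in this thread


def find_malformed_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, reason) for each line that is not `source ||| target`."""
    problems = []
    for number, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split(SEPARATOR)
        if len(parts) != 2:
            # Zero separators, or more than one, on the same line.
            problems.append((number, f"expected 1 separator, found {len(parts) - 1}"))
            continue
        source, target = (side.strip() for side in parts)
        if not source:
            problems.append((number, "empty source side"))
        if not target:
            problems.append((number, "empty target side"))
    return problems


# Only runs when a corpus path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as corpus:
        for number, reason in find_malformed_lines(corpus):
            print(f"line {number}: {reason}")
```

Running it with the path to the training file as the only argument prints one line per problem found. No output means every line has one separator and two non-empty sides, so the cause would have to be something else.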
