#112 Optimize word vectors, expand the vocabulary, speed up downloads
Hai Liang Wang committed Sep 21, 2020
1 parent 1fdfaae commit 0dbe1ec
Showing 9 changed files with 493 additions and 63 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -12,4 +12,4 @@ synonyms.egg-info
.vscode/
build/
.env
synonyms/data/words.vector
synonyms/data/words.vector*
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
Copyright (2018-2020) Hu Ying Xi<>, Hai Liang Wang<hain@chatopera.com>
Copyright (2018-2020) Chatopera Inc. <https://www.chatopera.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

48 changes: 33 additions & 15 deletions README.md
@@ -26,6 +26,8 @@ pip install -U synonyms

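Compatible with py2 and py3; the current stable version is [v3.x](https://github.com/chatopera/Synonyms/releases).

**Note: on first use after installation, the word vector file is downloaded; download speed depends on network conditions.**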

![](./assets/3.gif)

**Node.js users can now use [node-synonyms](https://www.npmjs.com/package/node-synonyms).**
@@ -80,7 +82,7 @@ synonyms.nearby(人脸, 10) = (
095, 0.525344, 0.524009, 0.523101, 0.516046])
```

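For out-of-vocabulary (OOV) words, `([], [])` is returned. Current dictionary size: 125,792
For out-of-vocabulary (OOV) words, `([], [])` is returned. Current dictionary size: 435,729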
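
A minimal sketch of guarding against the OOV case, assuming only what the README states: results come back as a `(words, scores)` pair of lists, and an OOV lookup yields `([], [])`. The helper name `best_match` and the sample values are hypothetical, and the sketch deliberately avoids importing `synonyms` so it runs without downloading the word vectors.

```python
def best_match(result):
    """Return the top (word, score) pair, or None when the result is empty."""
    words, scores = result
    if not words:
        # OOV case: the README says the library returns ([], []).
        return None
    return words[0], scores[0]


# Hypothetical inputs shaped like the tuples described in the README.
print(best_match((["直升机", "客机"], [0.84, 0.83])))
print(best_match(([], [])))
```

Checking for an empty list before indexing keeps callers from hitting an `IndexError` on words outside the vocabulary.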

### synonyms#compare

@@ -107,16 +109,16 @@ synonyms.nearby(人脸, 10) = (
```
>>> synonyms.display("飞机")
'飞机'近义词:
1. 架飞机:0.837399
2. 客机:0.764609
3. 直升机:0.762116
4. 民航机:0.750519
5. 航机:0.750116
6. 起飞:0.735736
7. 战机:0.734975
8. 飞行中:0.732649
9. 航空器:0.723945
10. 运输机:0.720578
1. 飞机:1.0
2. 直升机:0.8423391
3. 客机:0.8393003
4. 滑翔机:0.7872388
5. 军用飞机:0.7832081
6. 水上飞机:0.77857226
7. 运输机:0.7724742
8. 航机:0.7664748
9. 航空器:0.76592904
10. 民航机:0.74209654
```

`SIZE` is the number of words to print; the default is 10.
@@ -182,14 +184,20 @@ HowNet, also known as 知网, is not just a semantic dictionary, but a

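### Comparison

The Synonyms vocabulary contains 125,792 words. The comparison below shows similarity scores for several words that appear in all three of Tongyici Cilin (同义词词林), HowNet (知网), and Synonyms:
The Synonyms vocabulary contains 435,729 words. The comparison below shows similarity scores for several words that appear in all three of Tongyici Cilin (同义词词林), HowNet (知网), and Synonyms:

![](./assets/5.png)

Note: the Tongyici Cilin and HowNet data and scores come from [this source](https://github.com/yaleimeng/Final_word_Similarity). Synonyms is under continuous optimization, so newer scores may differ from the figure above.

More [comparison results](./VALUATION.md)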

## Used by

[List of dependent GitHub users](https://github.com/chatopera/Synonyms/network/dependents?package_id=UGFja2FnZS01MjY2NDc1Nw%3D%3D)

![](./assets/6.png)

## Benchmark

Test with py3, MacBook Pro.
@@ -242,7 +250,7 @@ meminfo 8GB

# Promotion

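[Chatopera Cloud Services](https://bot.chatopera.com/dashboard) is a one-stop solution for building enterprise chatbots. It combines information retrieval, machine learning, chatbot scripting syntax, and speech recognition, and is built for customized chatbots and natural language interaction!
[Chatopera Cloud Services](https://bot.chatopera.com/dashboard)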

<p align="center">
<b>Chatopera Cloud Services</b><br>
@@ -251,6 +259,8 @@ meminfo 8GB
</a>
</p>

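The Chatopera bot platform includes a knowledge base, multi-turn dialogue, intent recognition, speech recognition, and other components. It standardizes chatbot development and supports scenarios such as enterprise OA Q&A, HR Q&A, intelligent customer service, and online marketing. It is a one-stop, pay-as-you-go way to build chatbots and bring them online!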

# References

[wikidata-corpus](https://github.com/Samurais/wikidata-corpus)
@@ -273,9 +283,9 @@ [word2vec](https://code.google.com/archive/p/word2vec/), released by Google; this library

# Authors

[Hai Liang Wang](http://blog.chatbot.io/webcv/)
[Hai Liang Wang](https://pre-angel.com/peoples/hailiang-wang/)

[Hu Ying Xi](https://github.com/chatopera/)
[Hu Ying Xi](https://github.com/huyingxi)

# Give credits to

@@ -293,6 +303,14 @@ [word2vec](https://code.google.com/archive/p/word2vec/), released by Google; this library

[MIT](./LICENSE)

Copyright (2018-2020) Chatopera Inc. <https://www.chatopera.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

[![chatoper banner][co-banner-image]][co-url]

[co-banner-image]: https://user-images.githubusercontent.com/3538629/42383104-da925942-8168-11e8-8195-868d5fcec170.png
67 changes: 34 additions & 33 deletions VALUATION.md
@@ -1,33 +1,34 @@
# synonyms score evaluation [(v3.11.0)](https://pypi.python.org/pypi/synonyms/3.11.0)
| Word 1 | Word 2 | synonyms | Human rating |
| --- | --- | --- | --- |
| 轿车 | 汽车 | 0.892 | 0.98 |
| 宝石 | 宝物 | 1.0 | 0.96 |
| 旅游 | 游历 | 0.649 | 0.96 |
| 男孩子 | 小伙子 | 0.77 | 0.94 |
| 海岸 | 海滨 | 0.889 | 0.925 |
| 庇护所 | 精神病院 | 0.211 | 0.9025 |
| 魔术师 | 巫师 | 0.95 | 0.875 |
| 中午 | 正午 | 0.9 | 0.855 |
| 火炉 | 炉灶 | 0.889 | 0.7775 |
| 食物 | 水果 | 0.363 | 0.77 |
|| 公鸡 | 0.895 | 0.7625 |
||| 1.0 | 0.7425 |
| 工具 | 器械 | 0.881 | 0.7375 |
| 兄弟 | 和尚 | 0.139 | 0.705 |
| 起重机 | 器械 | 0.195 | 0.42 |
| 小伙子 | 兄弟 | 0.703 | 0.415 |
| 旅行 | 轿车 | 0.088 | 0.29 |
| 和尚 | 圣贤 | 0.222 | 0.275 |
| 墓地 | 林地 | 0.874 | 0.2375 |
| 食物 | 公鸡 | 0.151 | 0.2225 |
| 海岸 | 丘陵 | 0.248 | 0.2175 |
| 森林 | 墓地 | 0.14 | 0.21 |
| 岸边 | 林地 | 0.193 | 0.1575 |
| 和尚 | 奴隶 | 0.059 | 0.1375 |
| 海岸 | 森林 | 0.23 | 0.105 |
| 小伙子 | 巫师 | 0.182 | 0.105 |
| 琴弦 | 微笑 | 0.089 | 0.0325 |
| 玻璃 | 魔术师 | 0.02 | 0.0275 |
| 中午 | 绳子 | 0.049 | 0.02 |
| 公鸡 | 航行 | 0.0 | 0.02 |
# synonyms score evaluation [(v3.12.0)](https://pypi.python.org/pypi/synonyms/3.12.0)

| Word 1 | Word 2 | synonyms | Human rating |
| ------ | -------- | -------- | -------- |
| 轿车 | 汽车 | 0.892 | 0.98 |
| 宝石 | 宝物 | 1.0 | 0.96 |
| 旅游 | 游历 | 0.649 | 0.96 |
| 男孩子 | 小伙子 | 0.77 | 0.94 |
| 海岸 | 海滨 | 0.889 | 0.925 |
| 庇护所 | 精神病院 | 0.211 | 0.9025 |
| 魔术师 | 巫师 | 0.95 | 0.875 |
| 中午 | 正午 | 0.9 | 0.855 |
| 火炉 | 炉灶 | 0.889 | 0.7775 |
| 食物 | 水果 | 0.363 | 0.77 |
|| 公鸡 | 0.895 | 0.7625 |
||| 1.0 | 0.7425 |
| 工具 | 器械 | 0.881 | 0.7375 |
| 兄弟 | 和尚 | 0.139 | 0.705 |
| 起重机 | 器械 | 0.195 | 0.42 |
| 小伙子 | 兄弟 | 0.703 | 0.415 |
| 旅行 | 轿车 | 0.088 | 0.29 |
| 和尚 | 圣贤 | 0.222 | 0.275 |
| 墓地 | 林地 | 0.874 | 0.2375 |
| 食物 | 公鸡 | 0.151 | 0.2225 |
| 海岸 | 丘陵 | 0.248 | 0.2175 |
| 森林 | 墓地 | 0.14 | 0.21 |
| 岸边 | 林地 | 0.193 | 0.1575 |
| 和尚 | 奴隶 | 0.059 | 0.1375 |
| 海岸 | 森林 | 0.23 | 0.105 |
| 小伙子 | 巫师 | 0.182 | 0.105 |
| 琴弦 | 微笑 | 0.089 | 0.0325 |
| 玻璃 | 魔术师 | 0.02 | 0.0275 |
| 中午 | 绳子 | 0.049 | 0.02 |
| 公鸡 | 航行 | 0.0 | 0.02 |
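
One way to summarize how closely the `synonyms` column tracks the human ratings is a correlation coefficient. The sketch below is illustrative only: it computes Pearson's r over a subset of rows copied from the table above, and the choice of Pearson (rather than a rank correlation) and of these particular rows is an assumption, not something the repository specifies.

```python
import math

# A subset of (synonyms score, human rating) pairs copied from the table above.
pairs = [
    (0.892, 0.98),    # 轿车 / 汽车
    (0.649, 0.96),    # 旅游 / 游历
    (0.211, 0.9025),  # 庇护所 / 精神病院
    (0.363, 0.77),    # 食物 / 水果
    (0.874, 0.2375),  # 墓地 / 林地
    (0.23, 0.105),    # 海岸 / 森林
    (0.02, 0.0275),   # 玻璃 / 魔术师
    (0.0, 0.02),      # 公鸡 / 航行
]


def pearson(points):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    var_y = sum((y - mean_y) ** 2 for _, y in points)
    return cov / math.sqrt(var_x * var_y)


r = pearson(pairs)
print(f"Pearson r on {len(pairs)} pairs: {r:.3f}")
```

Running the same function over all rows of the table would give a fuller picture than this subset.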
Binary file added assets/6.png
2 changes: 1 addition & 1 deletion scripts/test.sh
@@ -15,7 +15,7 @@ export PATH=/opt/miniconda3/envs/venv-py3/bin:$PATH
cd $baseDir/..
if [ -f .env ]; then
echo "load env with" `pwd`"/.env"
source .env
#source .env
fi

python demo.py
13 changes: 6 additions & 7 deletions setup.py
@@ -4,20 +4,19 @@
Synonyms
=====================
Chinese Synonyms for Natural Language Processing and Understanding.
中文近义词
Welcome
-------
https://github.com/chatopera/Synonyms
"""

setup(
name='synonyms',
version='3.11.0',
description=' 中文近义词:聊天机器人,智能问答工具包;Chinese Synonyms for Natural Language Processing and Understanding',
version='3.12.0',
description='中文近义词:聊天机器人,智能问答工具包;Chinese Synonyms for Natural Language Processing and Understanding',
long_description=LONGDOC,
author='Hai Liang Wang, Hu Ying Xi',
author_email='hailiang.hl.wang@gmail.com',
author_email='hain@chatopera.com',
url='https://github.com/chatopera/Synonyms',
license="MIT",
classifiers=[
@@ -32,6 +31,7 @@
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Topic :: Text Processing',
'Topic :: Text Processing :: Indexing',
'Topic :: Text Processing :: Linguistic'],
@@ -48,5 +48,4 @@
'synonyms': [
'**/*.gz',
'**/*.txt',
'**/*.vector',
'LICENSE']})
20 changes: 15 additions & 5 deletions synonyms/synonyms.py
@@ -20,7 +20,7 @@
__copyright__ = "Copyright (c) (2017-2020) Chatopera Inc. All Rights Reserved"
__author__ = "Hu Ying Xi<>, Hai Liang Wang<[email protected]>"
__date__ = "2017-09-27"
__version__ = "3.11.0"
__version__ = "3.12.0"

import os
import sys
@@ -56,6 +56,7 @@
from .utils import is_digit
import jieba
from .jieba import posseg as _tokenizer
import wget

'''
globals
@@ -119,19 +120,28 @@ def _segment_words(sen):
word embedding
'''
# vectors
_f_model = os.path.join(curdir, 'data', 'words.vector')
_f_url = os.environ.get("SYNONYMS_WORD2VEC_BIN_URL_ZH_CN", "https://static-public.chatopera.com/ml/synonyms/words.vector.gz")
_f_model = os.path.join(curdir, 'data', 'words.vector.gz')
_download_model = not os.path.exists(_f_model)
if "SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN" in ENVIRON:
_f_model = ENVIRON["SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN"]
_download_model = False

def _load_w2v(model_file=_f_model, binary=True):
'''
load word2vec model
'''
if not os.path.exists(model_file):
print("os.path : ", os.path)
if not os.path.exists(model_file) and _download_model:
print("\n[Synonyms] downloading data from %s to %s ... \n this only happens when the model file is missing and SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN is not set, typically on the first initialization of Synonyms. \n It may take several minutes, depending on the network." % (_f_url, model_file))
wget.download(_f_url, out = model_file)
print("\n[Synonyms] download is done.\n")
elif not os.path.exists(model_file):
print("[Synonyms] os.path : ", os.path)
raise Exception("Model file [%s] does not exist." % model_file)

return KeyedVectors.load_word2vec_format(
model_file, binary=binary, unicode_errors='ignore')
print(">> Synonyms on loading vectors [%s] ..." % _f_model)
print("[Synonyms] on loading vectors [%s] ..." % _f_model)
_vectors = _load_w2v(model_file=_f_model)

def _get_wv(sentence, ignore=False):
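
The decision logic introduced in this hunk (use an explicit model path if one is configured, otherwise download the default file only when it is missing) can be sketched in isolation. The environment variable names and the default URL are taken from the diff; the function `resolve_model`, its parameters, and the injectable `exists` callback are hypothetical names added for illustration, and the sketch performs no actual download.

```python
import os

# Default URL as it appears in the diff above.
DEFAULT_URL = "https://static-public.chatopera.com/ml/synonyms/words.vector.gz"


def resolve_model(environ, default_path, exists=os.path.exists):
    """Return (model_path, url); url is None when no download is needed."""
    if "SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN" in environ:
        # An explicit model path disables downloading, as in the diff.
        return environ["SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN"], None
    if exists(default_path):
        # The bundled or previously downloaded file is already present.
        return default_path, None
    # First run: fetch from the configured URL, or the default one.
    url = environ.get("SYNONYMS_WORD2VEC_BIN_URL_ZH_CN", DEFAULT_URL)
    return default_path, url


# Simulate a first run with no file on disk and no overrides.
path, url = resolve_model({}, "data/words.vector.gz", exists=lambda p: False)
print(path, url)
```

Unlike the code in the diff, this sketch does not raise when an explicitly configured model path is missing; it only separates the "where is the model" question from the download itself so the branches can be tested without network access.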