Update README.md

chatopera · Aug 6, 2017 · 7bd35ad
1 parent 9b41a52
commit 7bd35ad
Showing 2 changed files with 8 additions and 101 deletions.
diff --git a/README.md b/README.md
@@ -19,13 +19,8 @@
 
 ## Installation
 
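-Because the data package currently exceeds the maximum size supported by pypi.python.org, it cannot be distributed through pypi.python.org.
-Download link: [Baidu Netdisk](https://pan.baidu.com/s/1i5MM6nb) Password: 8u98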
-
 ```
-tar xzf insuranceqa_data-xxx.tar.gz # xxx is the version
-cd insuranceqa_data-xxx
-python setup.py install
+pip install --upgrade insuranceqa_data
 ```
 
 ## Q&A Corpus
@@ -130,11 +125,13 @@ for x in test_data:
      (x['qid'], x['question'], x['utterance'], x['label']))
 
 vocab_data = insuranceqa.load_pairs_vocab()
-for x in vocab_data:
-    print('index %s: %s ++$++ %s' % (x, d[x]['zh'], d[x]['en']))
+vocab_data['dict_word_to_id']['UNKNOWN']
+vocab_data['dict_id_to_word'][0]
+vocab_data['tf']
+vocab_data['total']
 ```
 
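-```vocab_data``` contains ```dict_word_to_id``` (word to id), ```dict_id_to_word``` (id to word), ```tf``` (term frequency counts), and ```total``` (total number of words). The out-of-vocabulary token is ```UNKNOWN```, and its id is 0.
+```vocab_data``` contains ```dict_word_to_id``` (dict, word to id), ```dict_id_to_word``` (dict, id to word), ```tf``` (dict, term frequency counts), and ```total``` (total number of words). The out-of-vocabulary token is ```UNKNOWN```, and its id is 0.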
 
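 ```train_data```, ```test_data```, and ```valid_data``` share the same data format. ```qid``` is the question id, ```question``` is the question, ```utterance``` is the reply, and ```label``` is ```[1,0]``` when the reply is a correct answer and ```[0,1]``` when it is not, so ```utterance``` covers both positive and negative examples. Each question has 10 negative examples and 1 positive example.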
 

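Note on the README hunk above: a minimal sketch of how the documented `vocab_data` keys and the `label` encoding might be used. It assumes only what the README states (the four keys, `UNKNOWN` mapping to id 0, `[1,0]` for a correct reply and `[0,1]` for an incorrect one). The in-memory `vocab_data` is a made-up stand-in for `insuranceqa.load_pairs_vocab()`, and the helpers `encode`, `is_positive`, and `split_by_label` are hypothetical names, not part of the package.

```python
# Hypothetical stand-in for insuranceqa.load_pairs_vocab(). Only the four keys
# and the UNKNOWN -> 0 convention come from the README; the words, counts, and
# total are invented placeholders.
vocab_data = {
    "dict_word_to_id": {"UNKNOWN": 0, "保险": 1, "理赔": 2},
    "dict_id_to_word": {0: "UNKNOWN", 1: "保险", 2: "理赔"},
    "tf": {"保险": 5, "理赔": 3},
    "total": 3,
}

UNKNOWN_ID = vocab_data["dict_word_to_id"]["UNKNOWN"]


def encode(tokens, vocab):
    """Map tokens to ids, falling back to the UNKNOWN id for unseen words."""
    word_to_id = vocab["dict_word_to_id"]
    return [word_to_id.get(token, UNKNOWN_ID) for token in tokens]


def is_positive(label):
    """Return True when the label marks the utterance as a correct answer."""
    return list(label) == [1, 0]


def split_by_label(pairs):
    """Split records (dicts with a 'label' key) into positives and negatives."""
    positives = [p for p in pairs if is_positive(p["label"])]
    negatives = [p for p in pairs if not is_positive(p["label"])]
    return positives, negatives
```

With the real package, the same helpers would presumably be applied to the result of `insuranceqa.load_pairs_vocab()` and to the records in `train_data`, provided the loaded structures match the README description.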
diff --git a/pypi/setup.py b/pypi/setup.py
@@ -23,97 +23,7 @@
 
 Any ideas for further expanding this dataset are welcome.
 
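-Corpus data
------------
-
-+------------+-----------+---------+----------------------+
-| -          | Questions | Answers | Vocabulary (English) |
-+============+===========+=========+======================+
-| Train      | 12,889    | 21,325  | 107,889              |
-+------------+-----------+---------+----------------------+
-| Validation | 2,000     | 3354    | 16,931               |
-+------------+-----------+---------+----------------------+
-| Test       | 2,000     | 3308    | 16,815               |
-+------------+-----------+---------+----------------------+
-
-Each record contains the question in Chinese and in English, the positive answers, and the negative answers. Each question has at least one positive answer, typically *1-5*, all of which are correct. Each question has *200* negative answers; they are built by retrieval against the question, so they are related to the question but are not correct answers.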
-
-::
-
-    {
-        "INDEX": {
-            "zh": "Chinese",
-            "en": "English",
-            "domain": "insurance type",
-            "answers": [""] # list of positive answers
-            "negatives": [""] # list of negative answers
-        },
-        more ...
-    }
-
--  Training: ``corpus/train.json``
-
--  Validation: ``corpus/valid.json``
-
--  Test: ``corpus/test.json``
-
--  Answers: ``corpus/answers.json`` contains 27,413 answers in total, in
-   ``json`` format:
-
-   ::
-
-       {
-       "INDEX": {
-           "zh": "Chinese",
-           "en": "English"
-       },
-       more ...
-       }
-
-Chinese-English parallel files
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Question-answer pairs
-^^^^^^^^^^^^^^^^^^^^^
-
-::
-
-    Format: INDEX ++$++ insurance type ++$++ Chinese ++$++ English
-
-``corpus/train.txt``, ``corpus/valid.txt``, ``corpus/test.txt``.
-
-Answers
-^^^^^^^
-
-::
-
-    Format: INDEX ++$++ Chinese ++$++ English
-
-``corpus/answers.txt``
-
-Quick start
------------
-
-Install with pip in a Python environment
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. code:: python
-
-    pip install --upgrade insuranceqa_data
-
-    import insuranceqa_data as insuranceqa
-    train_data = insuranceqa.load_train()
-    test_data = insuranceqa.load_train()
-    valid_data = insuranceqa.load_train()
-
-    # valid_data, test_data and train_data share the same properties
-    for x in train_data:
-        print('index %s value: %s ++$++ %s ++$++ %s' %
-         (x, d[x]['zh'], d[x]['en'], d[x]['answers'], d[x]['negatives']))
-
-    answers_data = insuranceqa.load_answers()
-    for x in answers_data:
-        print('index %s: %s ++$++ %s' % (x, d[x]['zh'], d[x]['en']))
+Read the `detailed documentation <https://github.com/Samurais/insuranceqa-corpus-zh>`__
 
 Statement
 ---------
@@ -141,7 +51,7 @@
 """
 
 setup(name='insuranceqa_data',
-      version='2.0',
+      version='2.1',
       description='Insuranceqa Corpus in Chinese for Machine Learning',
       long_description=LONGDOC,
       author='Hai Liang Wang',