CONTRIBUTING.md (+8 -7)
@@ -25,7 +25,7 @@ Texthero follows an approach known as shift-left testing. According to [Wikipedi
 > Shift-left testing is an approach to software testing and system testing in which testing is performed earlier in the lifecycle.

-Shift-left testing reduces the number of bugs by attempting to solve the problem at the origin. Often many programming defects are not uncovered and fixed until after significant effort has been wasted on their implementation. Texthero's attempt to avoid this kind of issue.
+Shift-left testing reduces the number of bugs by attempting to solve the problem at the origin. Often many programming defects are not uncovered and fixed until after significant effort has been wasted on their implementation. Texthero attempts to avoid these kind of issues.


 ## Improve documentation!
## Improve documentation!
@@ -56,7 +56,7 @@ The following link gives some advice on how to submit a successful pull request.
 ## Ask questions!

-We are there for you! If everything is unclear, just ask. We will do our best to answer you quickly.
+We are there for you! If anything is unclear, just ask. We will do our best to answer you quickly.

 ## Propose new ideas!
@@ -84,15 +84,15 @@ $ cd scripts
 $ ./tests.sh
 ```

-Calling `./test.sh` is equivalent to execute form the _root_ `python3 -m unittest discover -s tests -t .`
+Calling `./tests.sh` is equivalent to executing it from the _root_ `python3 -m unittest discover -s tests -t .`


 **Important.** If you worked on a bug, you should add a test that checks the bug is not present anymore. This is extremely useful as it avoids to re-introduce the same bug again in the future.
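As an illustration of the regression-test advice above, here is a minimal sketch using the standard `unittest` module. The function `collapse_whitespace` and the bug it is supposed to have had are hypothetical and exist only for this example; they are not part of Texthero.

```python
import re
import unittest


def collapse_whitespace(text: str) -> str:
    """Hypothetical helper: collapse every run of whitespace into one space."""
    # Hypothetical bug being guarded against: tabs and newlines were once
    # left untouched because only literal spaces were matched.
    return re.sub(r"\s+", " ", text).strip()


class TestCollapseWhitespace(unittest.TestCase):
    def test_tabs_and_newlines_are_collapsed(self):
        # Regression test: fails again if the old behaviour ever comes back.
        self.assertEqual(collapse_whitespace("a\t\tb\n c"), "a b c")
```

Assuming the test lives in a file whose name matches the default discovery pattern `test*.py` inside the `tests` folder, the `python3 -m unittest discover -s tests -t .` command quoted above should pick it up.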
 ### Passing doctests

-When executing `./test.sh` it will also check that the Examples in the docstrings are correct (doctests).
+When executing `./tests.sh` it will also check that the Examples in the docstrings are correct (doctests).

 Passing doctests might be a bit annoying sometimes. Let's look at this example for instance:
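The example the file goes on to show falls outside this hunk. As a separate and purely hypothetical illustration of what a doctest is, the sketch below defines a made-up function `count_words` (not a Texthero function) whose docstring examples are compared against the real output.

```python
def count_words(text: str) -> int:
    """
    Count whitespace-separated words (hypothetical function for illustration).

    Examples
    --------
    >>> count_words("Texthero is fun")
    3
    >>> count_words("")
    0
    """
    # str.split() with no argument splits on any run of whitespace
    # and returns an empty list for an empty string.
    return len(text.split())
```

The standard `doctest` module compares the text printed after each `>>>` line with the actual result as a string, so a stray space or a different representation is enough to make an otherwise correct example fail.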
@@ -114,7 +114,7 @@ The docstring failed? Why? The reason is that somewhere in the `Example` section
 When you submit your code, all code will be tested on different operating systems using Travis CI: [TRAVIS CI texthero](https://travis-ci.com/github/jbesomi/texthero).

-Make sure you pass all your test locally before opening a pull request!
+Make sure you pass all your tests locally before opening a pull request!

 ## Formatting
@@ -182,7 +182,8 @@ $ git checkout -b new-branch
 Try to commit regularly. In addition, whenever possible, group changes into distinct commits. It will be easier for the rest of us to understand what you worked on just by reading the description of your commit.

 ```
-$ ...
+$ git add README.md
+$ git commit -m "added README.md"
 ```

 1. Test your changes
@@ -200,7 +201,7 @@ The time to submit the PR has come. Head to your forked repository on Github. Th
 - `./test.sh`
 - Execute unittests as well as test all doctests
-- `./formath.sh`
+- `./format.sh`
 - format all code with [black](https://github.com/psf/black)
PURPOSE.md (+6 -6)
@@ -1,6 +1,6 @@
 # PURPOSE

-This document attempt at defining the purpose of Texthero and it's futures enhancements.
+This document attempts at defining the purpose of Texthero and it's future enhancements.

 ### Motivation
@@ -14,7 +14,7 @@ We can decompose the objective of Texthero in two parts:
 1.** Offer an efficient tool to deal with text-based datasets (The texthero python package). Texthero is mainly a teaching tool and therefore easy to use and understand, but at the same time quite efficient and should be able to handle large quantities of data.

-2.** Provide a sustain to newcomers in the NLP word to efficiently learn all the main core topics (tf-idf, text cleaning, regular expression, etc). As there are many other tutorials, the main approach is to redirect users to valuable resources and explain better any missing point. This part is done mainly through the *tutorials* on texthero.org.
+2.** Provide a sustain to newcomers in the NLP world to efficiently learn all the main core topics (tf-idf, text cleaning, regular expression, etc). As there are many other tutorials, the main approach is to redirect users to valuable resources and explain better any missing point. This part is done mainly through the *tutorials* on texthero.org.


 ### Channels
@@ -33,23 +33,23 @@ We can decompose the objective of Texthero in two parts:
 ### Python package

-For future development, is important to have a clear idea in mind of the purpose of Texthero as a python package.
+For future development, it is important to have a clear idea in mind of the purpose of Texthero as a python package.


 **Package core purpose**

 The goal is to extract insights from the whole corpora, i.e collection of document and not from the single element.

-Generally, the corpora are composed of a __long__ collection of documents and therefore the require techniques need to be efficient to deal with a large amount of text.
+Generally, the corpora are composed of a __long__ collection of documents and therefore the required techniques need to be efficient to deal with a large amount of text.

 **Neural network**

 Texthero function (as of now) does not make use of a neural network solution. The main reason is that there is no need for that as there are mature libraries (PyTorch and Tensorflow to name a few).

-What Texthero offers is a tool to be used in addition to any other machine learning libraries. Ideally, texthero should be used before applying any "sophisticated" approach to the dataset; to first better understand the underline data before applying any complex model.
+What Texthero offers is a tool to be used in addition to any other machine learning libraries. Ideally, texthero should be used before applying any "sophisticated" approach to the dataset; to first better understand the underlying data before applying any complex model.


-Note: a text corpus or collection of documents need always to be in form of a Pandas Series. "do that on a text corpus" or "do that on a Pandas Series" refers to the same act.
+Note: a text corpus or collection of documents need to be always in form of a Pandas Series. "do that on a text corpus" or "do that on a Pandas Series" refers to the same act.
README.md (+13 -12)
@@ -46,13 +46,13 @@
 Texthero is a python toolkit to work with text-based dataset quickly and effortlessly. Texthero is very simple to learn and designed to be used on top of Pandas. Texthero has the same expressiveness and power of Pandas and is extensively documented. Texthero is modern and conceived for programmers of the 2020 decade with little knowledge if any in linguistic.

-You can think of Texthero as a tool to help you _understand_ and work with text-based dataset. Given a tabular dataset, it's easy to _grasp the main concept_. Instead, given a text dataset, it's harder to have quick insights into the underline data. With Texthero, preprocessing text data, map it into vectors, and visualize the obtained vector space takes just a couple of lines.
+You can think of Texthero as a tool to help you _understand_ and work with text-based dataset. Given a tabular dataset, it's easy to _grasp the main concept_. Instead, given a text dataset, it's harder to have quick insights into the underline data. With Texthero, preprocessing text data, mapping it into vectors, and visualizing the obtained vector space takes just a couple of lines.

 Texthero include tools for:
 * Preprocess text data: it offers both out-of-the-box solutions but it's also flexible for custom-solutions.
 * Natural Language Processing: keyphrases and keywords extraction, and named entity recognition.
 * Text representation: TF-IDF, term frequency, and custom word-embeddings (wip)
-* Vector space analysis: clustering (K-means, Meanshift, DBSAN and Hierarchical), topic modeling (wip) and interpretation.
+* Vector space analysis: clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modeling (wip) and interpretation.
 * Text visualization: vector space visualization, place localization on maps (wip).

 Texthero is free, open-source and [well documented](https://texthero.org/docs) (and that's what we love most by the way!).
@@ -61,9 +61,9 @@ We hope you will find pleasure working with Texthero as we had during his develo
 <h2 align="center">Hablas español? क्या आप हिंदी बोलते हैं? 日本語が話せるのか?</h2>

-Texthero has been developed for the whole NLP community. We know how hard is to deal with different NLP tools (NLTK, SpaCy, Gensim, TextBlob, Sklearn): that's why we developed Texthero, to simplify things.
+Texthero has been developed for the whole NLP community. We know how hard it is to deal with different NLP tools (NLTK, SpaCy, Gensim, TextBlob, Sklearn): that's why we developed Texthero, to simplify things.

-Now, the next main milestone is to provide *multilingual support* and for this big step, we need the help of all of you. ¿Hablas español? Sie sprechen Deutsch? 你会说中文? 日本語が話せるのか? Fala português? Parli Italiano? Вы говорите по-русски? If yes or you speak another language not mentioned, then you might help us develop multilingual support! Even if you haven't contributed before or you just started with NLP contact us or open a Github issue, there is always a first time :) We promise you will learn a lot, and, ... who knows? It might help you find your new job as an NLP-developer!
+Now, the next main milestone is to provide *multilingual support* and for this big step, we need the help of all of you. ¿Hablas español? Sie sprechen Deutsch? 你会说中文? 日本語が話せるのか? Fala português? Parli Italiano? Вы говорите по-русски? If yes or you speak another language not mentioned here, then you might help us develop multilingual support! Even if you haven't contributed before or you just started with NLP, contact us or open a Github issue, there is always a first time :) We promise you will learn a lot, and, ... who knows? It might help you find your new job as an NLP-developer!

 For improving the python toolkit and provide an even better experience, your aid and feedback are crucial. If you have any problem or suggestion please open a Github [issue](https://github.com/jbesomi/texthero/issues), we will be glad to support you and help you.
@@ -72,11 +72,11 @@ For improving the python toolkit and provide an even better experience, your aid
 Texthero's community is growing fast. Texthero though is still in a beta version; soon, a faster and better version will be released and it will bring some major changes.

-For instance, to give a more granular control over the pipeline, starting from the next version on, all `preprocessing` functions will require as argument an already tokenized text. This will be a major changes.
+For instance, to give a more granular control over the pipeline, starting from the next version on, all `preprocessing` functions will require as argument an already tokenized text. This will be a major change.

 Once released the stable version (Texthero 2.0), backward compatibility will be respected. Until this point, backward compatibility will be present but it will be weaker.

-If you want to be part of this fast-growing movements, do not hesitate to contribute: [CONTRIBUTING](blob/master/CONTRIBUTING.md)!
+If you want to be part of this fast-growing movements, do not hesitate to contribute: [CONTRIBUTING](./CONTRIBUTING.md)!

 <h2 align="center">Installation</h2>
@@ -88,7 +88,7 @@ pip install texthero
 > ☝️Under the hoods, Texthero makes use of multiple NLP and machine learning toolkits such as Gensim, NLTK, SpaCy and scikit-learn. You don't need to install them all separately, pip will take care of that.

-> For fast performance, make sure you have installed Spacy version >= 2.2. Also, make sure you have a recent version of python, the higher, the best.
+> For faster performance, make sure you have installed Spacy version >= 2.2. Also, make sure you have a recent version of python, the higher, the best.

 <h2 align="center">Getting started</h2>
@@ -98,7 +98,7 @@ In case you are an advanced python user, then `help(texthero)` should do the wor
 <h2 align="center">Examples</h2>

-<h3>1. Text cleaning, TF-IDF representation and visualization</h3>
+<h3>1. Text cleaning, TF-IDF representation and Visualization</h3>


 ```python
@@ -122,7 +122,7 @@ hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news")
-<h3>2. Text preprocessing, TF-IDF, K-means and visualization</h3>
+<h3>2. Text preprocessing, TF-IDF, K-means and Visualization</h3>

 ```python
 import texthero as hero
@@ -174,7 +174,7 @@ Remove all digits:
 dtype: object
 ```

-> Remove digits replace only blocks of digits. The digits in the string "hello123" will not be removed. If we want to remove all digits, you need to set only_blocks to false.
+> Remove digits replaces only blocks of digits. The digits in the string "hello123" will not be removed. If we want to remove all digits, you need to set only_blocks to false.

 Remove all types of brackets and their content.
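To make the `only_blocks` note in the hunk above more concrete, here is a rough standard-library sketch of the distinction it describes. The function `remove_digits_sketch` is a hypothetical stand-in and not Texthero's implementation. It assumes that a "block" means a run of digits delimited by word boundaries, and that matched digits are simply dropped.

```python
import re


def remove_digits_sketch(text: str, only_blocks: bool = True) -> str:
    """Hypothetical stand-in illustrating the note, not Texthero's code."""
    # Assumption: a "block" is a run of digits between word boundaries,
    # so the digits glued to letters in "hello123" do not count as a block.
    pattern = r"\b\d+\b" if only_blocks else r"\d+"
    # Assumption: matched digits are replaced by an empty string.
    return re.sub(pattern, "", text)


# Only the standalone number is removed; "hello123" is left untouched.
print(remove_digits_sketch("hello123 costs 42"))

# Every digit is removed, including those attached to letters.
print(remove_digits_sketch("hello123 costs 42", only_blocks=False))
```

In Python the flag is spelled `False` with a capital F, which is presumably what "set only_blocks to false" refers to.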
@@ -272,7 +272,7 @@ Full documentation: [visualization](https://texthero.org/docs/api-visualization)
 <h5>Why Texthero</h5>

-Sometimes we just want things done, right? Texthero help with that. It helps make things easier and give the developer more time to focus on his custom requirements. We believe that start cleaning text should just take a minute. Same for finding the most important part of a text and the same for representing it.
+Sometimes we just want things done, right? Texthero helps with that. It helps make things easier and give the developer more time to focus on his custom requirements. We believe that cleaning text should just take a minute. Same for finding the most important part of a text and the same for representing it.

 In a very pragmatic way, texthero has just one goal: make the developer spare time. Working with text data can be a pain and in most cases, a default pipeline can be quite good to start. There is always time to come back and improve previous work.
@@ -283,7 +283,7 @@ In a very pragmatic way, texthero has just one goal: make the developer spare ti
 Texthero is for all of us NLP-developers and it can continue to exist with the precious contribution of the community.

-Your level of expertise of python and NLP does not matter, anyone can help and anyone is more than welcomed to contribute!
+Your level of expertise of python and NLP does not matter, anyone can help and anyone is more than welcome to contribute!

 **Are you an NLP expert?**
@@ -313,6 +313,7 @@ If you have just other questions or inquiry drop me a line at jonathanbesomi__AT