````

Everything seems good for now. Say you're about to train a classifier to predict whether a person has diabetes.
But there is an obstacle: how can you use this data in mathematical equations when some of the columns (`Gender`,
`Country`) contain string values? And things get even worse because of the empty (N/A) value in the `Diabetes` column.
There should be a way to convert this data to a valid numerical representation. This is where data preprocessing
techniques come into play. You should decide how to convert string data (aka *categorical data*) to numbers and how to
treat empty values. Of course, you can come up with your own algorithms for all of these operations, but there are
plenty of well-known techniques for doing the conversions.

The aim of the library is to give data scientists who are interested in the Dart programming language these
preprocessing techniques.

## Prerequisites
Add the [ml_dataframe](https://pub.dev/packages/ml_dataframe) package to your project's dependencies before doing
preprocessing. An example with a part of pubspec.yaml:

```
dependencies:
  ...
  ml_dataframe: ^1.0.0
  ...
```
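To make the examples below concrete, here is a minimal sketch of a dataframe similar to the one described above. All
the values, as well as the `featureNames` list, are purely hypothetical and serve only as an illustration; the
following sections assume `dataFrame` and `featureNames` variables like these are in scope:

```dart
import 'package:ml_dataframe/ml_dataframe.dart';

void main() {
  // A hypothetical dataset with two categorical columns and a missing (N/A) value
  final dataFrame = DataFrame([
    ['Gender', 'Country', 'Age', 'Diabetes'],
    ['Male',   'Germany', 43,    'No'],
    ['Female', 'France',  31,    'Yes'],
    ['Female', 'Italy',   28,    'N/A'],
  ]);

  // The categorical (string) columns that should be encoded later on
  final featureNames = ['Gender', 'Country'];

  print(dataFrame.header);
  print(featureNames);
}
```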
### One-hot encoding

Why should we fit the encoder? Categorical data encoder fitting is a process during which all the unique category
values are searched for in order to create an encoded labels list. After the fitting is complete, one may use the
fitted encoder for new data from the same source.

To fit the encoder, create the entity and pass the fitting data as an argument to the constructor, along with the
names of the columns to be encoded:

```dart
final encoder = Encoder.oneHot(
  dataFrame,
  columnNames: featureNames,
);
```
Let's encode the features:

```dart
final encoded = encoder.process(dataFrame);
```
We used the same dataframe here - that's absolutely normal: when we created the encoder, we just fit it with the
dataframe, and now we apply that very dataframe to the fitted encoder.

Let's take a look at the processed data:

```dart
final data = encoded.toMatrix();

print(data);
```

In the output we will see only numerical data - exactly what we wanted.
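To see which columns the one-hot encoding produced, one can also inspect the header of the processed dataframe (the
generated column names depend on the category values found during fitting):

```dart
print(encoded.header); // e.g. one new column per `Country` value and one per `Gender` value
```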
### Label encoding

Another well-known encoding method. The technique is the same: first, we fit the encoder, and after that we may use
this "trained" encoder in our applications:

```dart
// fit the encoder
final encoder = Encoder.label(
  dataFrame,
  columnNames: featureNames,
);

// apply the fitted encoder to the data
final encoded = encoder.process(dataFrame);
```
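The difference between the two encoders can be illustrated with a hypothetical `Country` column; the concrete numbers
and the column order below are made up, since they depend on the data the encoder was fitted with:

```dart
// One-hot encoding: one new column per category, with a single 1 per row
final oneHotEncoded = {
  'Germany': [1, 0, 0],
  'France':  [0, 1, 0],
  'Italy':   [0, 0, 1],
};

// Label encoding: each category is replaced with a single integer label
final labelEncoded = {
  'Germany': 0,
  'France':  1,
  'Italy':   2,
};
```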
### Numerical data normalization

Sometimes we need our numerical features to be normalized, which means we treat every dataframe row as a vector and
divide that vector element-wise by its norm (Euclidean, Manhattan, etc.). For this purpose the library exposes the
`Normalizer` class:

```dart
final normalizer = Normalizer(); // by default the Euclidean norm is used
final transformed = normalizer.process(dataFrame);
```
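For clarity, here is what this row-wise operation amounts to, sketched without the library (using the Euclidean norm):

```dart
import 'dart:math';

void main() {
  // Treat a row as a vector and divide it element-wise by its Euclidean norm
  final row = [3.0, 4.0];
  final norm = sqrt(row.fold<double>(0, (sum, x) => sum + x * x)); // 5.0
  final normalized = row.map((x) => x / norm).toList();

  print(normalized); // [0.6, 0.8]
}
```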
Please note that if your data contains raw categorical values, normalization will fail, since it requires only
numerical values. In this case you should encode the data (e.g. using one-hot encoding) before normalization.
### Data standardization

A lot of machine learning algorithms require normally distributed data as their input. Normally distributed data means
that every column in the data has zero mean and unit variance. One may meet this requirement using the `Standardizer`
class. When an instance of the class is created, the mean and deviation values of all the columns are extracted from
the passed data and stored as fields of the instance, so that they can later be applied to standardize other data (or
the same data that was used to create the Standardizer):

```dart
final dataFrame = DataFrame([
  // ... numerical data ...
]);

final standardizer = Standardizer(dataFrame); // column means and deviations are extracted here
final transformed = standardizer.process(dataFrame);
```
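As with normalization, the underlying per-column operation can be sketched without the library: subtract the column
mean from every value and divide by the standard deviation (population variance is used here for simplicity; the
library's exact formula may differ):

```dart
import 'dart:math';

void main() {
  final column = [10.0, 20.0, 30.0];

  final mean = column.reduce((a, b) => a + b) / column.length;
  final squaredDiffs = column.map((x) => (x - mean) * (x - mean));
  final std = sqrt(squaredDiffs.reduce((a, b) => a + b) / column.length);

  // The resulting column has zero mean and unit variance
  final standardized = column.map((x) => (x - mean) / std).toList();

  print(standardized); // approximately [-1.22, 0.0, 1.22]
}
```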
### Pipeline

There is a convenient way to organize a sequence of data preprocessing operations - `Pipeline`:

```dart
final pipeline = Pipeline(dataFrame, [
  // a list of pipeable operators, e.g. encoding, normalization, standardization
]);
```
Once you create (or rather fit) a pipeline, you may use it further in your application:

```dart
final processed = pipeline.process(dataFrame);
```

`encodeAsOneHotLabels`, `encodeAsIntegerLabels`, `normalize` and `standardize` are pipeable operator functions. A
pipeable operator function is a factory that takes the fitting data and creates a fitted pipeable entity (e.g. a
`Normalizer` instance).
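Putting it all together, a pipeline might look roughly like the sketch below. The parameter name passed to the encoding
operator is an assumption (it mirrors the `columnNames` parameter of `Encoder`), so check the library's API
documentation for the exact signatures of the operator functions:

```dart
final pipeline = Pipeline(dataFrame, [
  encodeAsOneHotLabels(columnNames: featureNames), // parameter name assumed
  normalize(),
  standardize(),
]);

final processed = pipeline.process(dataFrame);
```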