
Commit f440141

'ml_dataframe' dependency updated, public API changed (#30)
1 parent da14a20 commit f440141

38 files changed: +1362 -562 lines

CHANGELOG.md (+7)

@@ -1,5 +1,12 @@
 # Changelog

+## 7.0.0
+- `ml_datframe` 1.0.0 supported
+- `featureNames` parameter renamed to `columnNames`
+- `featureIds` parameter renamed to `columnIndices`
+- `encodeAsIntegerLabels` renamed to `toIntegerLabels`
+- `encodeAsOneHotLabels` renamed to `toOneHotLabels`
+
 ## 6.0.1
 - `pubspec.yaml`: `ml_dataframe` dependency updated
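To make the renames above concrete, here is a minimal migration sketch (an editor's illustration, not part of the commit); it reuses the `example/dataset.csv` file and the `position` column that appear in `example/main.dart` later in this diff:

````dart
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';

Future<void> main() async {
  // Load the sample dataset shipped with the package examples.
  final dataFrame = await fromCsv('example/dataset.csv', columns: [0, 1, 2, 3]);

  // Before 7.0.0 this parameter was called `featureNames`;
  // after this commit it is `columnNames`.
  final encoder = Encoder.oneHot(
    dataFrame,
    columnNames: ['position'],
  );

  // Fitting happens in the constructor; `process` applies the fitted encoder.
  final encoded = encoder.process(dataFrame);

  print(encoded.toMatrix());
}
````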

README.md (+25 -25)

@@ -27,14 +27,14 @@ Let's say, you have a dataset:
 ````

 Everything seems good for now. Say, you're about to train a classifier to predict if a person has diabetes.
-But there is an obstacle - how can it possible to use the data in mathematical equations with string-value columns
-(`Gender`, `Country`)? And things are getting even worse because of an empty (N/A) value in `Diabetes` column. There
+But there is an obstacle - how can it be possible to use the data in mathematical equations with string-value columns
+(`Gender`, `Country`)? And things are getting even worse because of an empty (N/A) value in the `Diabetes` column. There
 should be a way to convert this data to a valid numerical representation. Here data preprocessing techniques come to play.
 You should decide, how to convert string data (aka *categorical data*) to numbers and how to treat empty values. Of
-course, you can come up with your own unique algorithms to do all of these operations, but, actually, there are a
-bunch of well-known well-performed techniques for doing all the conversions.
+course, you can come up with your unique algorithms to do all of these operations, but there are a lot of well-known
+techniques for doing all the conversions.

-The aim of the library - to give data scientists, who are interested in Dart programming language, these preprocessing
+The aim of the library is to give data scientists, who are interested in Dart programming language, these preprocessing
 techniques.

 ## Prerequisites
@@ -47,7 +47,7 @@ before doing preprocessing. An example with a part of pubspec.yaml:
 ````
 dependencies:
   ...
-  ml_dataframe: ^0.0.11
+  ml_dataframe: ^1.0.0
   ...
 ````

@@ -90,14 +90,14 @@ Why should we fit it? Categorical data encoder fitting - a process, when all the
 searched for in order to create an encoded labels list. After the fitting is complete, one may use the fitted encoder for
 the new data of the same source.

-In order to fit the encoder it's needed to create the entity and pass the fitting data as an argument to the
+In order to fit the encoder, it's needed to create the entity and pass the fitting data as an argument to the
 constructor, along with the features to be encoded:


 ````dart
 final encoder = Encoder.oneHot(
   dataFrame,
-  featureNames: featureNames,
+  columnNames: featureNames,
 );

 ````
@@ -108,56 +108,56 @@ Let's encode the features:
 final encoded = encoder.process(dataFrame);
 ````

-We used the same dataframe here - it's absolutely normal, since when we created the encoder, we just fit it with the
+We used the same dataframe here - it's absolutely normal since when we created the encoder, we just fit it with the
 dataframe, and now is the time to apply the dataframe to the fitted encoder.

-It's time to take a look at our processed data! Let's read it:
+It's time to take a look at our processed data. Let's read it:

 ````dart
 final data = encoded.toMatrix();

 print(data);
 ````

-In the output we will see just numerical data, that's exactly we wanted to reach.
+In the output we will see just numerical data, that's exactly what we wanted to reach.

 ### Label encoding

-Another one well-known encoding method. The technique is the same - first, we should fit the encoder and after that we
+Another well-known encoding method. The technique is the same - first, we should fit the encoder and after that, we
 may use this "trained" encoder in some applications:

 ````dart
 // fit encoder
 final encoder = Encoder.label(
   dataFrame,
-  featureNames: featureNames,
+  columnNames: featureNames,
 );

 // apply fitted encoder to data
 final encoded = encoder.process(dataFrame);
 ````

-### Numerical data normalizing
+### Numerical data normalization

-Sometimes we need to have our numerical features normalized, that means we need to treat every dataframe row as a
+Sometimes we need to have our numerical features normalized, which means we need to treat every dataframe row as a
 vector and divide this vector element-wise by its norm (Euclidean, Manhattan, etc.). To do so the library exposes
-`Normalizer` entity:
+`Normalizer` class:

 ````dart
 final normalizer = Normalizer(); // by default Euclidean norm will be used
 final transformed = normalizer.process(dataFrame);
 ````

-Please, notice, if your data has raw categorical values, the normalization will fail as it requires only numerical
-values. In this case you should encode data (e.g. using one-hot encoding) before normalization.
+Please, notice, that if your data has raw categorical values, the normalization will fail as it requires only numerical
+values. In this case, you should encode data (e.g. using one-hot encoding) before normalization.

 ### Data standardization

 A lot of machine learning algorithms require normally distributed data as their input. Normally distributed data
-means that every dedicated to a feature column in the data has zero mean and unit variance. One may reach this
-requirement using `Standardizer` class. During creation of the entity all the columns mean values and deviation values
-are being extracted from the passed data and stored as fields of the class, in order to apply them to standardize the
-other (or the same that was used for creation of the Standardizer) data:
+means that every column in the data has zero mean and unit variance. One may reach this requirement using the
+`Standardizer` class. During the creation of the class instance, all the columns' mean values and deviation values are
+being extracted from the passed data and stored as fields of the class, in order to apply them to standardize the
+other (or the same that was used for the creation of the Standardizer) data:

 ````dart
 final dataFrame = DataFrame([
@@ -175,7 +175,7 @@ final transformed = standardizer.process(dataFrame);

 ### Pipeline

-There is a convenient way to organize a bunch of data preprocessing operations - `Pipeline`:
+There is a convenient way to organize a sequence of data preprocessing operations - `Pipeline`:

 ````dart
 final pipeline = Pipeline(dataFrame, [
@@ -186,12 +186,12 @@ final pipeline = Pipeline(dataFrame, [
 ]);
 ````

-Once you create (or rather fit) a pipeline, you may use it farther in your application:
+Once you create (or rather fit) a pipeline, you may use it further in your application:

 ````dart
 final processed = pipeline.process(dataFrame);
 ````

 `encodeAsOneHotLabels`, `encodeAsIntegerLabels`, `normalize` and `standardize` are pipeable operator functions.
-Pipeable operator function is a factory, that takes fitting data and creates a fitted pipeable entity (e.g.,
+The pipeable operator function is a factory that takes fitting data and creates a fitted pipeable entity (e.g.,
 `Normalizer` instance)
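As a quick illustration of the renamed pipeable operators, here is a minimal sketch (editor's addition, not part of the README) built from the `example/main.dart` changes shown below; the CSV path and column names are taken from that example:

````dart
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';

Future<void> main() async {
  final dataFrame = await fromCsv('example/dataset.csv', columns: [0, 1, 2, 3]);

  // `toOneHotLabels` and `toIntegerLabels` replace the old
  // `encodeAsOneHotLabels` / `encodeAsIntegerLabels` operators.
  final pipeline = Pipeline(dataFrame, [
    toOneHotLabels(
      columnNames: ['position'],
      headerPostfix: '_position',
    ),
    toIntegerLabels(
      columnNames: ['country'],
    ),
  ]);

  // Apply the fitted pipeline to the same (or new) data.
  final processed = pipeline.process(dataFrame);

  print(processed.toMatrix());
}
````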

benchmark/main.dart (+1)

@@ -0,0 +1 @@
+
example/black_friday/black_friday.dart (+20 -11)

@@ -2,26 +2,35 @@ import 'package:ml_dataframe/ml_dataframe.dart';
 import 'package:ml_preprocessing/ml_preprocessing.dart';

 Future processDataSetWithCategoricalData() async {
-  final dataFrame = await fromCsv('example/black_friday/black_friday.csv',
-      columnNames: ['Gender', 'Age', 'City_Category',
-        'Stay_In_Current_City_Years', 'Marital_Status'],
+  final dataFrame = await fromCsv(
+    'example/black_friday/black_friday.csv',
+    columnNames: [
+      'Gender',
+      'Age',
+      'City_Category',
+      'Stay_In_Current_City_Years',
+      'Marital_Status'
+    ],
   );

   final encoded = Encoder.oneHot(
     dataFrame,
-    featureNames: ['Gender', 'Age', 'City_Category',
-      'Stay_In_Current_City_Years', 'Marital_Status'],
+    columnNames: [
+      'Gender',
+      'Age',
+      'City_Category',
+      'Stay_In_Current_City_Years',
+      'Marital_Status'
+    ],
   ).process(dataFrame);

   final observations = encoded.toMatrix();
   final genderEncoded = observations.sample(columnIndices: [0, 1]);
   final ageEncoded = observations.sample(columnIndices: [2, 3, 4, 5, 6, 7, 8]);
-  final cityCategoryEncoded = observations
-      .sample(columnIndices: [9, 10, 11]);
-  final stayInCityEncoded = observations
-      .sample(columnIndices: [12, 13, 14, 15, 16]);
-  final maritalStatusEncoded = observations
-      .sample(columnIndices: [17, 18]);
+  final cityCategoryEncoded = observations.sample(columnIndices: [9, 10, 11]);
+  final stayInCityEncoded =
+      observations.sample(columnIndices: [12, 13, 14, 15, 16]);
+  final maritalStatusEncoded = observations.sample(columnIndices: [17, 18]);

   print('Features:');

example/main.dart (+5 -9)

@@ -1,20 +1,16 @@
 import 'package:ml_dataframe/ml_dataframe.dart';
 import 'package:ml_preprocessing/ml_preprocessing.dart';
-import 'package:ml_preprocessing/src/encoder/encode_as_integer_labels.dart';
-import 'package:ml_preprocessing/src/encoder/encode_as_one_hot_labels.dart';
-import 'package:ml_preprocessing/src/pipeline/pipeline.dart';

 Future main() async {
-  final dataFrame = await fromCsv('example/dataset.csv',
-      columns: [0, 1, 2, 3]);
+  final dataFrame = await fromCsv('example/dataset.csv', columns: [0, 1, 2, 3]);

   final pipeline = Pipeline(dataFrame, [
-    encodeAsOneHotLabels(
-      featureNames: ['position'],
+    toOneHotLabels(
+      columnNames: ['position'],
       headerPostfix: '_position',
     ),
-    encodeAsIntegerLabels(
-      featureNames: ['country'],
+    toIntegerLabels(
+      columnNames: ['country'],
     ),
   ]);

lib/ml_preprocessing.dart (+2 -2)

@@ -1,7 +1,7 @@
 export 'package:ml_linalg/norm.dart';
-export 'package:ml_preprocessing/src/encoder/encode_as_integer_labels.dart';
-export 'package:ml_preprocessing/src/encoder/encode_as_one_hot_labels.dart';
 export 'package:ml_preprocessing/src/encoder/encoder.dart';
+export 'package:ml_preprocessing/src/encoder/to_integer_labels.dart';
+export 'package:ml_preprocessing/src/encoder/to_one_hot_labels.dart';
 export 'package:ml_preprocessing/src/encoder/unknown_value_handling_type.dart';
 export 'package:ml_preprocessing/src/normalizer/normalize.dart';
 export 'package:ml_preprocessing/src/normalizer/normalizer.dart';

lib/src/encoder/encode_as_integer_labels.dart (-24)
This file was deleted.

lib/src/encoder/encode_as_one_hot_labels.dart (-24)
This file was deleted.

lib/src/encoder/encoder.dart (+26 -22)

@@ -7,31 +7,35 @@ import 'package:ml_preprocessing/src/pipeline/pipeable.dart';

 /// Categorical data encoder factory
 abstract class Encoder implements Pipeable {
-  factory Encoder.oneHot(DataFrame fittingData, {
-    Iterable<int>? featureIds,
-    Iterable<String>? featureNames,
+  factory Encoder.oneHot(
+    DataFrame fittingData, {
+    Iterable<int>? columnIndices,
+    Iterable<String>? columnNames,
     UnknownValueHandlingType unknownValueHandlingType =
         defaultUnknownValueHandlingType,
-  }) => EncoderImpl(
-    fittingData,
-    EncoderType.oneHot,
-    const SeriesEncoderFactoryImpl(),
-    featureNames: featureNames,
-    featureIds: featureIds,
-    unknownValueHandlingType: unknownValueHandlingType,
-  );
+  }) =>
+      EncoderImpl(
+        fittingData,
+        EncoderType.oneHot,
+        const SeriesEncoderFactoryImpl(),
+        columnNames: columnNames,
+        columnIndices: columnIndices,
+        unknownValueHandlingType: unknownValueHandlingType,
+      );

-  factory Encoder.label(DataFrame fittingData, {
-    Iterable<int>? featureIds,
-    Iterable<String>? featureNames,
+  factory Encoder.label(
+    DataFrame fittingData, {
+    Iterable<int>? columnIndices,
+    Iterable<String>? columnNames,
     UnknownValueHandlingType unknownValueHandlingType =
         defaultUnknownValueHandlingType,
-  }) => EncoderImpl(
-    fittingData,
-    EncoderType.label,
-    const SeriesEncoderFactoryImpl(),
-    featureNames: featureNames,
-    featureIds: featureIds,
-    unknownValueHandlingType: unknownValueHandlingType,
-  );
+  }) =>
+      EncoderImpl(
+        fittingData,
+        EncoderType.label,
+        const SeriesEncoderFactoryImpl(),
+        columnNames: columnNames,
+        columnIndices: columnIndices,
+        unknownValueHandlingType: unknownValueHandlingType,
+      );
 }
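For completeness, a hypothetical sketch of calling the updated factory with the new `columnIndices` parameter (editor's illustration; the column index used here is arbitrary and not taken from the commit):

````dart
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';

Future<void> main() async {
  final dataFrame = await fromCsv('example/dataset.csv', columns: [0, 1, 2, 3]);

  // Columns may now be selected by index (`columnIndices`) instead of by
  // name (`columnNames`); index 3 here is purely illustrative.
  final encoder = Encoder.label(
    dataFrame,
    columnIndices: [3],
  );

  final encoded = encoder.process(dataFrame);
  print(encoded.toMatrix());
}
````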

0 commit comments
