Commit 920b394

Fix CI pipeline, add documentation for 'Encoder' (#31)
1 parent f440141 commit 920b394

8 files changed, +91 -19 lines changed

.github/workflows/ci_pipeline.yml

+4 -3
@@ -10,18 +10,19 @@ jobs:
 build:
 runs-on: ubuntu-latest

-container:
-image: google/dart:beta
-
 steps:
 - uses: actions/checkout@v2
+- uses: dart-lang/setup-dart@v1

 - name: Print Dart SDK version
 run: dart --version

 - name: Install dependencies
 run: dart pub get

+- name: Verify formatting
+run: dart format --output=none --set-exit-if-changed .
+
 - name: Analyze project source
 run: dart analyze --fatal-infos

CHANGELOG.md

+5
@@ -1,5 +1,10 @@
 # Changelog

+## 7.0.1
+- Added code formatting checking step to CI pipline
+- Corrected `README` examples
+- Added documentation to `Encoder` factory
+
 ## 7.0.0
 - `ml_datframe` 1.0.0 supported
 - `featureNames` parameter renamed to `columnNames`

README.md

+3 -3
@@ -179,8 +179,8 @@ There is a convenient way to organize a sequence of data preprocessing operation

 ````dart
 final pipeline = Pipeline(dataFrame, [
-encodeAsOneHotLabels(featureNames: ['Gender', 'Age', 'City_Category']),
-encodeAsIntegerLabels(featureNames: ['Stay_In_Current_City_Years', 'Marital_Status']),
+toOneHotLabels(columnNames: ['Gender', 'Age', 'City_Category']),
+toIntegerLabels(columnNames: ['Stay_In_Current_City_Years', 'Marital_Status']),
 normalize(),
 standardize(),
 ]);
@@ -192,6 +192,6 @@ Once you create (or rather fit) a pipeline, you may use it further in your appli
 final processed = pipeline.process(dataFrame);
 ````

-`encodeAsOneHotLabels`, `encodeAsIntegerLabels`, `normalize` and `standardize` are pipeable operator functions.
+`toOneHotLabels`, `toIntegerLabels`, `normalize` and `standardize` are pipeable operator functions.
 The pipeable operator function is a factory that takes fitting data and creates a fitted pipeable entity (e.g.,
 `Normalizer` instance)
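
The "pipeable operator function" idea described in the README text above (a factory that takes fitting data and returns a fitted entity) can be modelled in plain Dart. This is only a simplified sketch, not the library's implementation: `Step`, `StepFactory`, `MaxAbsScaler`, `maxAbsScale` and `SimplePipeline` are hypothetical names, rows are plain nested lists rather than a `DataFrame`, and fitting every step on the same fitting data is an assumption.

```dart
import 'dart:math' as math;

/// Hypothetical stand-in for a fitted pipeline step.
abstract class Step {
  List<List<double>> process(List<List<double>> rows);
}

/// Hypothetical operator signature: fitting data in, fitted step out.
typedef StepFactory = Step Function(List<List<double>> fittingData);

/// Divides every value by the largest absolute value seen while fitting.
/// Assumes non-empty fitting data with at least one non-zero value.
class MaxAbsScaler implements Step {
  MaxAbsScaler(List<List<double>> fittingData)
      : _scale = fittingData
            .expand((row) => row)
            .map((value) => value.abs())
            .reduce(math.max);

  final double _scale;

  @override
  List<List<double>> process(List<List<double>> rows) => [
        for (final row in rows) [for (final value in row) value / _scale],
      ];
}

/// The operator function: calling it yields a factory, not a fitted step.
StepFactory maxAbsScale() => (fittingData) => MaxAbsScaler(fittingData);

/// Fits every step once at construction time, then applies them in order.
class SimplePipeline {
  SimplePipeline(List<List<double>> fittingData, List<StepFactory> factories)
      : _steps = [for (final create in factories) create(fittingData)];

  final List<Step> _steps;

  List<List<double>> process(List<List<double>> rows) =>
      _steps.fold(rows, (current, step) => step.process(current));
}

void main() {
  final fittingData = [
    [2.0, -4.0],
    [1.0, 3.0],
  ];
  final pipeline = SimplePipeline(fittingData, [maxAbsScale()]);

  // The scale (4.0) was learned from fittingData, not from the new rows.
  print(pipeline.process([
    [4.0, 2.0],
  ])); // [[1.0, 0.5]]
}
```

The point of the two-stage design is that the list passed to the pipeline holds factories, so the pipeline itself decides when and on which data each step is fitted.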

lib/src/encoder/encoder.dart

+67 -1
@@ -5,8 +5,47 @@ import 'package:ml_preprocessing/src/encoder/series_encoder/series_encoder_facto
 import 'package:ml_preprocessing/src/encoder/unknown_value_handling_type.dart';
 import 'package:ml_preprocessing/src/pipeline/pipeable.dart';

-/// Categorical data encoder factory
+/// Categorical data encoder factory.
+///
+/// Algorithms that process data to create prediction models can't handle
+/// categorical data, since they are based on mathematical equations and work
+/// only with bare numbers. That means that the categorical data should be
+/// converted to numbers.
+///
+/// The factory exposes different ways to convert categorical data into numbers.
 abstract class Encoder implements Pipeable {
+/// Gets columns by [columnIndices] or [columnNames] ([columnIndices] has a
+/// precedence over [columnNames]) from [fittingData], collects all unique
+/// values from the columns and builds a map `raw value` => `encoded value`.
+/// Once one calls the [process] method, the mapping will be applied.
+///
+/// The mapping is built according to the following rules:
+///
+/// Let's say, one has a list of values denoting a level of education:
+///
+/// ```
+/// ['BSc', 'BSc', 'PhD', 'High School', 'PhD']
+/// ```
+///
+/// After applying the encoder, the source sequence will be looking
+/// like this:
+///
+/// ```
+/// [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
+/// ```
+///
+/// In other words, the `one-hot` encoder created the following mapping:
+///
+/// `BSc` => [1, 0, 0]
+///
+/// `PhD` => [0, 1, 0]
+///
+/// `High School` => [0, 0, 1]
+///
+/// Keep in mind that if you apply the [process] method to your data, the
+/// number of columns will be increased since one categorical value in the
+/// case of one-hot encoding requires several cells. Headers for the new
+/// columns will be autogenerated from the categorical values.
 factory Encoder.oneHot(
 DataFrame fittingData, {
 Iterable<int>? columnIndices,
@@ -23,6 +62,33 @@ abstract class Encoder implements Pipeable {
 unknownValueHandlingType: unknownValueHandlingType,
 );

+/// Gets columns by [columnIndices] or [columnNames] ([columnIndices] has a
+/// precedence over [columnNames]) from [fittingData], collects all unique
+/// values from the columns and builds a map `raw value` => `encoded value`.
+/// Once one calls the [process] method, the mapping will be applied.
+///
+/// The mapping is built according to the following rules:
+///
+/// Let's say, one has a list of values denoting a level of education:
+///
+/// ```
+/// ['BSc', 'BSc', 'PhD', 'High School', 'PhD']
+/// ```
+///
+/// After applying the encoder, the source list will be looking
+/// like this:
+///
+/// ```
+/// [0, 0, 1, 2, 1]
+/// ```
+///
+/// In other words, the `label` encoder created the following mapping:
+///
+/// `BSc` => 0
+///
+/// `PhD` => 1
+///
+/// `High School` => 2
 factory Encoder.label(
 DataFrame fittingData, {
 Iterable<int>? columnIndices,
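
The two mappings documented in the comments above can be reproduced with a few lines of plain Dart. This is a minimal sketch of the idea only, not the `Encoder` implementation: `fitLabels` and `toOneHot` are hypothetical helpers, and numbering the unique values in order of first appearance is an assumption chosen because it matches the documented example.

```dart
/// Builds a `raw value => integer label` map, numbering unique values in
/// order of first appearance (an assumption made for this sketch).
Map<String, int> fitLabels(Iterable<String> values) {
  final labels = <String, int>{};
  for (final value in values) {
    // The next free label equals the number of values seen so far.
    labels.putIfAbsent(value, () => labels.length);
  }
  return labels;
}

/// Turns an integer label into a one-hot row of the given length.
List<int> toOneHot(int label, int length) =>
    List<int>.generate(length, (i) => i == label ? 1 : 0);

void main() {
  const education = ['BSc', 'BSc', 'PhD', 'High School', 'PhD'];

  // "Fitting": collect the unique values and assign a code to each one.
  final labels = fitLabels(education);

  // "Processing": apply the fitted mapping to every value.
  final labelEncoded = [for (final value in education) labels[value]!];
  final oneHotEncoded = [
    for (final value in education) toOneHot(labels[value]!, labels.length),
  ];

  print(labelEncoded); // [0, 0, 1, 2, 1]
  print(oneHotEncoded);
  // [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
}
```

Each one-hot row has as many cells as there are unique values, which is why the documentation warns that the number of columns grows after processing.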

lib/src/encoder/encoder_type.dart

+6 -6
@@ -1,6 +1,6 @@
 /// A type of categorical data encoding
 ///
-/// Algorithms that process data to create prediction models can't accept
+/// Algorithms that process data to create prediction models can't handle
 /// categorical data, since they are based on mathematical equations and work
 /// only with bare numbers. That means that the categorical data should be
 /// converted to numbers.
@@ -27,8 +27,8 @@
 ///
 /// `High School` => 2
 ///
-/// [EncoderType.oneHot] converts categorical values into binary sequences. Let's
-/// say, one has a list of values denoting a level of education:
+/// [EncoderType.oneHot] converts categorical values into binary sequences.
+/// Let's say, one has a list of values denoting a level of education:
 ///
 /// ```
 /// ['BSc', 'BSc', 'PhD', 'High School', 'PhD']
@@ -38,16 +38,16 @@
 /// like this:
 ///
 /// ```
-/// [[0, 0, 1], [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
+/// [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
 /// ```
 ///
 /// In other words, the `one-hot` encoder created the following mapping:
 ///
-/// `BSc` => [0, 0, 1]
+/// `BSc` => [1, 0, 0]
 ///
 /// `PhD` => [0, 1, 0]
 ///
-/// `High School` => [1, 0, 0]
+/// `High School` => [0, 0, 1]
 enum EncoderType {
 oneHot,
 label,

lib/src/normalizer/normalizer_impl.dart

+1 -1
@@ -12,7 +12,7 @@ class NormalizerImpl implements Normalizer {
 @override
 DataFrame process(DataFrame input) {
 final transformed =
-input.toMatrix(_dtype).mapRows((row) => row.normalize(_norm));
+input.toMatrix(_dtype).mapRows((row) => row.normalize(_norm));

 return DataFrame.fromMatrix(transformed, header: input.header);
 }

lib/src/standardizer/standardizer_impl.dart

+4 -4
@@ -5,9 +5,9 @@ import 'package:ml_preprocessing/src/standardizer/standardizer.dart';

 class StandardizerImpl implements Standardizer {
 StandardizerImpl(
-DataFrame fittingData, {
-DType dtype = DType.float32,
-}) : _dtype = dtype,
+DataFrame fittingData, {
+DType dtype = DType.float32,
+}) : _dtype = dtype,
 _mean = fittingData.toMatrix(dtype).mean(),
 _deviation = Vector.fromList(
 // TODO: Consider SIMD-aware mapping
@@ -40,7 +40,7 @@ class StandardizerImpl implements Standardizer {
 }

 final processedMatrix =
-inputAsMatrix.mapRows((row) => (row - _mean) / _deviation);
+inputAsMatrix.mapRows((row) => (row - _mean) / _deviation);
 final discreteColumnNames = input.series
 .where((series) => series.isDiscrete)
 .map((series) => series.name);
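
The expression `(row - _mean) / _deviation` in the hunk above is element-wise standardization of a row against statistics learned from the fitting data. A minimal plain-Dart sketch of that arithmetic follows. It is not the `StandardizerImpl` code: `columnMeans`, `columnDeviations` and `standardizeRow` are hypothetical helpers, the use of the population standard deviation is an assumption (the hunk does not show how `_deviation` is computed), and zero deviations are not handled.

```dart
import 'dart:math' as math;

/// Column-wise means of a non-empty, rectangular matrix.
List<double> columnMeans(List<List<double>> rows) => [
      for (var j = 0; j < rows.first.length; j++)
        rows.map((row) => row[j]).reduce((a, b) => a + b) / rows.length,
    ];

/// Column-wise population standard deviations (assumed for this sketch).
List<double> columnDeviations(List<List<double>> rows, List<double> means) => [
      for (var j = 0; j < means.length; j++)
        math.sqrt(rows
                .map((row) => (row[j] - means[j]) * (row[j] - means[j]))
                .reduce((a, b) => a + b) /
            rows.length),
    ];

/// Element-wise `(row - mean) / deviation`.
List<double> standardizeRow(
  List<double> row,
  List<double> means,
  List<double> deviations,
) =>
    [
      for (var j = 0; j < row.length; j++) (row[j] - means[j]) / deviations[j],
    ];

void main() {
  final fittingData = [
    [1.0, 10.0],
    [3.0, 30.0],
  ];

  final means = columnMeans(fittingData); // [2.0, 20.0]
  final deviations = columnDeviations(fittingData, means); // [1.0, 10.0]

  final processed = [
    for (final row in fittingData) standardizeRow(row, means, deviations),
  ];
  print(processed); // [[-1.0, -1.0], [1.0, 1.0]]
}
```

After the transformation both columns have mean 0 and unit deviation, even though their original scales differed by a factor of ten.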

pubspec.yaml

+1 -1
@@ -1,6 +1,6 @@
 name: ml_preprocessing
 description: Popular data preprocessing algorithms for machine learning
-version: 7.0.0
+version: 7.0.1
 homepage: https://github.com/gyrdym/ml_preprocessing

 environment:
