We have moved to https://codeberg.org/KOLANICH-ML/datag.py, grab new versions there.
Under the guise of "better security", Micro$oft-owned GitHub has discriminated against users of 1FA passwords, while having a commercial interest in the success and wide adoption of the FIDO 1FA specifications and the Windows Hello implementation it promotes as a replacement for passwords. This will have dire consequences and is completely unacceptable, read why.
If you don't want to participate in harming yourself, it is recommended to follow the lead and migrate away from GitHub and Micro$oft. Here is the list of alternatives and reasons to do so. If they delete the discussion, there are certain well-known places where you can get a copy of it. Read why you should also leave GitHub.
This is a data cleansing, standardization and aggregation framework.
Assume you have a few noisy, low-quality data tables produced by people who don't care about their quality. These datasets exist just so their publishers can say "we support open data", but in fact they have multiple issues, and we need to train a model on them anyway. Before that is possible, the data has to be cleaned up; the typical issues and their fixes are listed below (a rough sketch of such fixes follows the table).
issue | fix |
---|---|
data contains typos; even identifiers meant to uniquely identify things contain typos | custom function fixing the typo |
data is in different units, even within the same column | determine the unit for each value in the dataset and validate it |
some data is complete junk, for example an atom containing 1000 protons or a mass in coulombs | detect junk by incorporating domain knowledge and discard it |
column names are semantically incorrect and different datasets use different columns | rename the columns |
some columns contain multiple values encoded with a hand-crafted format | expand them into separate columns and delete the original column |
some data field is repeated, but with different values | compute an estimate from the available values or discard the field |
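The fixes in the table are generic data-cleaning steps. As a minimal illustration, not using this framework's API, here is what they might look like as plain pandas operations on a made-up table; all column names and values below are assumptions for the example:

```python
import pandas as pd

# Hypothetical raw table; the column names and values are made up for illustration.
raw = pd.DataFrame({
    "symbol": ["H", "Hee", "Xx"],                          # "Hee" is a typo for "He"
    "weight": [1.008, 4.0026, 1e6],                        # the last value is obvious junk
    "weight_unit": ["Da", "g/mol", "Da"],                  # mixed units in the same column
    "packed": ["1;hydrogen", "2;helium", "1000;junkium"],  # multiple values in a hand-crafted format
})

# fix semantically wrong column names
df = raw.rename(columns={"weight": "mass", "weight_unit": "mass_unit"})

# custom fix for a typo in an identifier
df["symbol"] = df["symbol"].replace({"Hee": "He"})

# expand the hand-crafted packed column into separate columns, then drop the original
df[["atomic_number", "name"]] = df["packed"].str.split(";", expand=True)
df["atomic_number"] = df["atomic_number"].astype(int)
df = df.drop(columns=["packed"])

# standardize units: for molar mass, Da and g/mol are numerically equivalent here
df["mass_unit"] = df["mass_unit"].replace({"g/mol": "Da"})

# discard junk using domain knowledge: no known element has 1000 protons
df = df[df["atomic_number"].between(1, 118)]
```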
This framework itself does not do:
- Imputation
- (Re)balancing
- Encoding
- anything involving machine learning (but you can implement it yourself on top)
The processing roughly follows these steps (a sketch follows the list):
- get a formal description of what you want the data to be:
  - unit
  - constraints
- for each source:
  - get a raw record from the source
  - apply a transformation
  - apply in-source validation
- do intersource:
  - validation and consistency checks
  - merging and estimation
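A rough sketch of this workflow in plain Python; the function names and the record representation below are assumptions made for illustration, not the actual API of this framework:

```python
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]  # assumption: a record is just a dict

def process_source(
    raw_records: Iterable[Record],
    transform: Callable[[Record], Record],
    validate: Callable[[Record], bool],
) -> List[Record]:
    """Per-source stage: take raw records, apply a transformation, then in-source validation."""
    return [rec for rec in (transform(r) for r in raw_records) if validate(rec)]

def merge_sources(per_source: Dict[str, List[Record]], key: str = "id") -> Dict[Any, Record]:
    """Intersource stage: group records describing the same thing and merge them.

    A real merger would also run cross-source consistency checks and estimate
    a value when sources disagree; here later sources simply override earlier ones.
    """
    merged: Dict[Any, Record] = {}
    for source_name, records in per_source.items():
        for rec in records:
            merged.setdefault(rec[key], {}).update(rec)
    return merged
```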
The framework is built around the following concepts (a hypothetical sketch of how they fit together follows the list):

- `Spec`: a way to encode requirements for our data.
- `Record`: just a dict with some additional properties.
- `Source`: gets the records by their identifiers. Has a `priority`, a `spec` and an `entity`.
- `Entity`: a way to discover `Source`s providing us with `Record`s of the same kind. Acts as a namespace and as a final validator. Has a `spec`.
- `Rule`: transforms the data, detects errors and recovers missing values.
- `Disambiguator`: uses a dictionary to standardize identifiers.
- `Merger`: combines different datasets into a composite one.
- `Pipeline`: a `Source` of the resulting dataset. Because it is itself a `Source`, it can be plugged into further processing.
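To make the relationships between a few of these concepts concrete, here is a minimal hypothetical sketch covering `Spec`, `Source` and `Pipeline`: the class names follow the glossary above, but every signature and attribute is an assumption rather than this library's real API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]  # per the glossary: just a dict with some additional properties

@dataclass
class Spec:
    """Encodes the requirements our data must satisfy (units, constraints, ...)."""
    constraints: List[Callable[[Record], bool]] = field(default_factory=list)

    def validate(self, record: Record) -> bool:
        return all(check(record) for check in self.constraints)

@dataclass
class Source:
    """Gets records by their identifiers; has a priority and a spec."""
    fetch: Callable[[Any], Record]
    spec: Spec
    priority: int = 0

    def __getitem__(self, identifier: Any) -> Record:
        record = self.fetch(identifier)
        if not self.spec.validate(record):
            raise ValueError(f"record {identifier!r} violates the source spec")
        return record

@dataclass
class Pipeline:
    """A Source of the resulting dataset, built from other Sources,
    so it can itself be plugged into further processing."""
    sources: List[Source]
    spec: Spec

    def __getitem__(self, identifier: Any) -> Record:
        merged: Record = {}
        for source in sorted(self.sources, key=lambda s: s.priority):
            merged.update(source[identifier])  # higher-priority sources override lower ones
        if not self.spec.validate(merged):
            raise ValueError(f"merged record {identifier!r} violates the final spec")
        return merged
```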