Contents
ptype is a probabilistic approach to type inference, which is the task of identifying the data type (e.g. Boolean, date, integer or string) of a given column of data.
Existing approaches often fail on type inference for messy datasets where data is missing or anomalous. With ptype, our goal is to develop a robust method that can deal with such data.
Normal, missing and anomalous values are denoted by green, yellow and red, respectively in the right hand figure.
ptype uses Probabilistic Finite-State Machines (PFSMs) to model known data types, missing and anomalous data. Given a column of data, we can infer a plausible column type, and also identify any values which (conditional on that type) are deemed missing or anomalous. In contrast to more familiar finite-state machines, such as regular expressions, that either accept or reject a given data value, PFSMs assign probabilities to different values. They therefore offer the advantage of generating weighted predictions when a column of messy data is consistent with more than one type assignment.
If you use this package, please cite the ptype paper, using the following BibTeX entry:
@article{ceritli2020ptype, title={ptype: probabilistic type inference}, author={Ceritli, Taha and Williams, Christopher KI and Geddes, James}, journal={Data Mining and Knowledge Discovery}, year={2020}, volume = {34}, number = {3}, pages={870–-904}, doi = {10.1007/s10618-020-00680-1}, }
You can simply install ptype from PyPI:
pip install ptype
See demo notebooks in notebooks
folder. View them online via Binder.