---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# Error localization
Find errors in data given a set of validation rules.
The `errorlocate` package helps to identify obvious errors in raw datasets.
It works in tandem with the `validate` package.
With `validate` you formulate the validation rules with which the data must comply.
For example:
- "age cannot be negative": `age >= 0`.
- "if a person is married, that person must be older than 16 years": `if (married == TRUE) age > 16`.
- "Profit is turnover minus cost": `profit == turnover - cost`.
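
As a minimal sketch, the three rules above can be written with `validate` and checked against a small dataset. The data frame `people` and the rule names are hypothetical, made up for this illustration.

```r
library(validate)

# Hypothetical records, made up for this illustration.
people <- data.frame(
  age      = c(25, -3, 14),
  married  = c(TRUE, FALSE, TRUE),
  profit   = c(100, 50, 10),
  turnover = c(300, 80, 40),
  cost     = c(200, 30, 20)
)

# The three example rules, each given a (hypothetical) name.
rules <- validator(
  age_nonnegative = age >= 0,
  married_adult   = if (married == TRUE) age > 16,
  profit_balance  = profit == turnover - cost
)

# Confront the data with the rules and inspect the outcome.
checks <- confront(people, rules)
summary(checks)

# One row per record, one column per rule; TRUE means the rule holds.
passed <- values(checks)
passed
```

The result says which rules each record breaks, but not which of the variables involved holds the wrong value.
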
While `validate` can check whether a record is valid, it does not identify
which variables are responsible for the violation. This may seem a simple task,
but it is actually quite tricky: a set of validation rules forms a web
of dependent variables, so changing a value to repair a record for rule 1 may invalidate
the record for rule 2.
`errorlocate` provides a small framework for record-based error detection and implements the Fellegi-Holt
algorithm. This algorithm assumes that no information is available other than the values of a record
and a set of validation rules. It minimizes the (weighted) number of values that need
to be adjusted to make the record satisfy all rules.
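
The minimization principle can be illustrated with a brute-force search in base R. This is only a sketch for a toy problem with finite domains, not the way `errorlocate` solves it; the variables, rules, and helper functions (`is_valid`, `can_repair`, `locate_error`) are all hypothetical.

```r
# Finite domains for each variable (hypothetical example).
domains <- list(
  age      = c("child", "adult"),
  married  = c(TRUE, FALSE),
  employed = c(TRUE, FALSE)
)

# Rules as functions of a record (a named list); TRUE means the rule holds.
rules <- list(
  married_implies_adult = function(r) !r$married || r$age == "adult",
  child_not_employed    = function(r) r$age != "child" || !r$employed
)

is_valid <- function(r) all(vapply(rules, function(f) f(r), logical(1)))

# Can the record be made valid by changing only the variables in `vars`?
can_repair <- function(record, vars) {
  grid <- expand.grid(domains[vars], stringsAsFactors = FALSE)
  for (i in seq_len(nrow(grid))) {
    candidate <- record
    candidate[vars] <- as.list(grid[i, , drop = FALSE])
    if (is_valid(candidate)) return(TRUE)
  }
  FALSE
}

# Return the cheapest set of variables whose adjustment can repair the record.
locate_error <- function(record, weight) {
  if (is_valid(record)) return(character(0))
  vars <- names(record)
  # All non-empty subsets of the variables.
  subsets <- unlist(
    lapply(seq_along(vars), function(k) combn(vars, k, simplify = FALSE)),
    recursive = FALSE
  )
  cost <- vapply(subsets, function(s) sum(weight[s]), numeric(1))
  # Try the subsets from cheapest to most expensive.
  for (s in subsets[order(cost)]) {
    if (can_repair(record, s)) return(s)
  }
  NULL
}

# A married, employed child breaks both rules.
record <- list(age = "child", married = TRUE, employed = TRUE)

equal_weights <- locate_error(record, c(age = 1, married = 1, employed = 1))
trusted_age   <- locate_error(record, c(age = 3, married = 1, employed = 1))
```

With equal weights, changing `age` alone repairs both rules, so it is the minimal solution. Giving `age` a higher weight (meaning it is trusted more) makes adjusting both `married` and `employed` the cheaper repair. Enumerating subsets grows exponentially with the number of variables, so this approach only suits toy examples.
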
# Installation
`errorlocate` can be installed from CRAN:
```r
install.packages("errorlocate")
```
Beta versions can be installed with `drat`:
```r
drat::addRepo("data-cleaning")
install.packages("errorlocate")
```
The latest development version of `errorlocate` can be installed from GitHub with `devtools`:
```r
devtools::install_github("data-cleaning/errorlocate")
```
# Usage
```{r}
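library(errorlocate)

# Validation rules the data must satisfy.
rules <- validator(
  profit == turnover - cost,
  cost >= 0.6 * turnover,
  turnover >= 0,
  cost >= 0  # implied by the two rules above
)

# A record that violates the first rule: 750 != 200 - 125.
data <- data.frame(profit = 750, cost = 125, turnover = 200)

# Replace the values identified as erroneous with NA.
data_no_error <- replace_errors(data, rules)
print(data_no_error)

# Inspect which values were removed.
er <- errors_removed(data_no_error)
print(er)
summary(er)
er$errors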
```
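
To see why a single variable can be enough to explain the violation in this record, the rules can be checked by hand in base R. This is a sketch that does not use the package; the helper `check` and the rule labels are hypothetical.

```r
record <- data.frame(profit = 750, cost = 125, turnover = 200)

# Evaluate the four rules from the example on a one-row data frame.
check <- function(r) with(r, c(
  balance         = profit == turnover - cost,
  cost_share      = cost >= 0.6 * turnover,
  turnover_nonneg = turnover >= 0,
  cost_nonneg     = cost >= 0
))

# Only the balance rule fails for the original record.
check(record)

# Adjusting profit alone makes every rule hold.
fixed <- transform(record, profit = turnover - cost)
check(fixed)
```

Adjusting `cost` alone would require `cost = -550`, which breaks `cost >= 0`. Adjusting `turnover` alone would require `turnover = 875`, which breaks `cost >= 0.6 * turnover`. So `profit` is the only single value whose change can repair the record.
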