Resolution strategy configuration
Resolution strategy defines how to resolve data conflicts that arise when integrating data from multiple sources. The strategy can be defined in the <ResolutionStrategy> and <DefaultStrategy> tags in an LD-FusionTool configuration file (see examples).
Resolution strategy is defined by three configurable options: conflict resolution function, expected value cardinality, and aggregation error strategy.
Conflict resolution function resolves conflicting values. For example, multiple sources can contain information about a product price; a resolution function can resolve price values by averaging them, selecting maximum, latest value, or including all prices in the result.
Deciding resolution functions choose one or more values from the input. They cannot produce any value that is not present in the input.
- ALL - all distinct values are included in the output
- ALLBEST - returns the value with the highest quality score; if multiple values have the same score, returns all the top values
- ANY - returns a single arbitrary value
- BEST - returns the value with the highest quality score; if multiple values have the same score, returns the first top value
- BEST_SOURCE - returns the value from the named graph with the highest source quality score
- CERTAIN - if there is a single distinct value, returns the value; otherwise returns no values
- FILTER - returns numerical values falling into the given range; the range can be specified by the optional min and max parameters
- LONGEST - returns the value with the longest lexical representation
- MAX - returns the maximum literal value (see comparing values)
- MAX_SOURCE_METADATA - returns a value from the named graph with the maximal value of a given property; the property is specified by the predicate parameter
- MIN - returns the minimum literal value (see comparing values)
- MIN_SOURCE_METADATA - returns a value from the named graph with the minimal value of a given property; the property is specified by the predicate parameter
- NONE - returns all values; unlike ALL, preserves duplicate values
- CHOOSE_SOURCE - returns values from the given named graph; the named graph is specified by the source parameter
- SHORTEST - returns the value with the shortest lexical representation
- TOPN - returns n values with the highest quality score; n is specified by the n parameter
- THRESHOLD - returns values with a quality score above the given threshold; the threshold is specified by the threshold parameter
- VOTE - returns the most frequently occurring value
- WEIGHTED_VOTE - returns the most frequently occurring value, where occurrences are weighted by source quality scores
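To make the differences in tie-breaking and duplicate handling concrete, here is a minimal Python sketch of a few deciding functions. It is illustrative only and is not LD-FusionTool's implementation: the representation of the input as (value, quality score) pairs and all function names are assumptions made for this example.

```python
from collections import Counter

# Hypothetical input model: a list of (value, quality_score) pairs.

def resolve_all(scored):
    """ALL: every distinct value, in first-seen order."""
    return list(dict.fromkeys(value for value, _ in scored))

def resolve_best(scored):
    """BEST: the first value among those with the highest score."""
    if not scored:
        return []
    # max() returns the first maximal element, which matches "first top value".
    return [max(scored, key=lambda pair: pair[1])[0]]

def resolve_allbest(scored):
    """ALLBEST: all distinct values tied for the highest score."""
    if not scored:
        return []
    top = max(score for _, score in scored)
    return list(dict.fromkeys(value for value, score in scored if score == top))

def resolve_certain(scored):
    """CERTAIN: the value if exactly one distinct value exists, else nothing."""
    distinct = resolve_all(scored)
    return distinct if len(distinct) == 1 else []

def resolve_vote(scored):
    """VOTE: the most frequently occurring value."""
    counts = Counter(value for value, _ in scored)
    return [counts.most_common(1)[0][0]] if counts else []

# Three sources report a product price; two of them agree on 10.0.
prices = [(10.0, 0.9), (12.0, 0.9), (10.0, 0.4)]
print(resolve_all(prices))      # [10.0, 12.0]
print(resolve_best(prices))     # [10.0]
print(resolve_allbest(prices))  # [10.0, 12.0]
print(resolve_certain(prices))  # []
print(resolve_vote(prices))     # [10.0]
```

Every result is drawn from the input values, which is the defining property of a deciding function.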
Mediating resolution functions may produce values not included in the input.
- AVG - returns the numerical average of values
- SUM - returns the sum of values
- CONCAT - returns the concatenation of lexical representations of values, separated by the value given in the optional separator parameter (defaults to ;)
- MEDIAN - returns the median value
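The mediating functions can be sketched in a few lines of Python. This is a hedged illustration rather than the tool's own code: the function names are hypothetical, and the behavior of the median for an even number of values (averaging the two middle values, as Python's `statistics.median` does) is an assumption that the description above does not confirm.

```python
from statistics import mean, median

def resolve_avg(values):
    """AVG: numerical average of the values."""
    return mean(values)

def resolve_sum(values):
    """SUM: sum of the values."""
    return sum(values)

def resolve_median(values):
    """MEDIAN: middle value (assumed: mean of the two middle values if even)."""
    return median(values)

def resolve_concat(values, separator=";"):
    """CONCAT: lexical representations joined by the separator."""
    return separator.join(str(value) for value in values)

prices = [10.0, 12.0, 20.0]
print(resolve_avg(prices))     # 14.0
print(resolve_sum(prices))     # 42.0
print(resolve_median(prices))  # 12.0
print(resolve_concat(prices))  # 10.0;12.0;20.0
```

Note that 14.0 and 42.0 do not occur in the input, which is what distinguishes mediating functions from deciding ones.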
Special resolution functions.
- DEPENDENT_RESOURCE - expects the values to be resources and resolves their properties recursively (currently only resolution to depth 1 is supported); more details
- Source quality score. The source quality score is determined from the named graph quality score and the average quality of the corresponding data publisher. By default, the properties specifying scores are odcs:score for named graph quality and odcs:publisherScore for publisher quality, and the publisher of a named graph is specified by odcs:publishedBy, where the odcs: prefix stands for http://opendata.cz/infrastructure/odcleanstore/.
- Parameters. Some resolution functions have parameters. These are specified by the <Param> element in the configuration XML.
- Comparing values. Resolution functions that compare literal values use different kinds of comparison depending on the data types of the literals. LD-FusionTool supports comparison for strings (xsd:string or no datatype; lexicographical), numerical values (xsd:int, xsd:long, ...), time (xsd:time), date (xsd:dateTime, xsd:gYearMonth, ...), and booleans (xsd:boolean). If multiple types of values are present, the most frequent type is used and the rest of the values are processed according to the aggregation error strategy.
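The rule for mixed datatypes can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: literals are modeled as hypothetical (lexical value, datatype) pairs, the function name is invented for this example, and ties between equally frequent types are broken by first occurrence, which the description above does not specify.

```python
from collections import Counter

def split_by_dominant_type(literals):
    """Split literals into those of the most frequent datatype and the rest.

    literals: list of (lexical_value, datatype) pairs (hypothetical model).
    Returns (dominant_datatype, comparable, rest).
    """
    if not literals:
        return None, [], []
    type_counts = Counter(datatype for _, datatype in literals)
    # Assumption: on a tie, the type seen first wins.
    dominant = type_counts.most_common(1)[0][0]
    comparable = [lit for lit in literals if lit[1] == dominant]
    rest = [lit for lit in literals if lit[1] != dominant]
    return dominant, comparable, rest

literals = [("5", "xsd:int"), ("12", "xsd:int"), ("n/a", "xsd:string")]
dominant, comparable, rest = split_by_dominant_type(literals)

# MAX compares only the dominant type, numerically rather than lexically.
maximum = max(int(value) for value, _ in comparable)
print(dominant, maximum)  # xsd:int 12
print(rest)               # [('n/a', 'xsd:string')]
```

The values left in `rest` are the ones that would be handed to the aggregation error strategy.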
Expected value cardinality affects how quality is computed for resolved quads produced by LD-FusionTool.
- MANYVALUED - it is valid for the respective property to have multiple values (e.g., authors of a paper)
- SINGLEVALUED - the respective property should have a single value (e.g., a birth date); the quality value will be decreased if multiple values are present (depending on their similarity/difference)
Aggregation error strategy determines how values that cannot be processed by the given conflict resolution function should be treated. For example, string values cannot be averaged using the AVG resolution function. Such values can be either discarded or included in the result.
- IGNORE - discard values that are not accepted by the resolution function
- RETURN_ALL - include all values that are not accepted by the resolution function in the result
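The effect of the two strategies on an AVG resolution can be sketched as follows. This is a minimal Python illustration under assumptions, not the tool's implementation: the function name, the use of plain Python values instead of RDF literals, and the ordering of the output are all hypothetical.

```python
def resolve_avg_with_strategy(values, error_strategy):
    """Average the numeric values; treat the rest per the error strategy.

    error_strategy: "IGNORE" or "RETURN_ALL" (names taken from the list above).
    """
    numeric, rejected = [], []
    for value in values:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            numeric.append(value)
        else:
            rejected.append(value)

    result = [sum(numeric) / len(numeric)] if numeric else []
    if error_strategy == "RETURN_ALL":
        result.extend(rejected)  # pass unprocessable values through unchanged
    elif error_strategy != "IGNORE":
        raise ValueError(f"unknown aggregation error strategy: {error_strategy}")
    return result

values = [10, 20, "unknown"]
print(resolve_avg_with_strategy(values, "IGNORE"))      # [15.0]
print(resolve_avg_with_strategy(values, "RETURN_ALL"))  # [15.0, 'unknown']
```

With IGNORE the string is silently dropped, while RETURN_ALL keeps it next to the computed average so that no input information is lost.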