Should the result of merging multiple records be a single record? #51
Comments
I prepared some examples to illustrate the implications of the different approaches (see below). @stevesong, I would need some more information on the use cases that you're thinking about to assess the approaches fully. Still, hopefully, the examples and accompanying notes help to progress the discussion.

Looking back at the scoping work, the overarching goal was to "build a tool to consolidate and deduplicate multiple OFDS datasets" so I think that a 'single-record' output is more in line with the expectations for the tool. I don't see any references to disaggregation in the use cases, but I think it can be achieved even with a single-record output by expanding the

My main concerns with the multi-record approach are
Regarding canonical node or span identifiers, I think that the only identifiers that we could reasonably expect to be canonical are those assigned by the operators themselves (though we did not see much evidence of that in the supply-side research) rather than any that are assigned either by a tool or a regulator and passed back to operators.

If preserving the original identifiers and values is a priority I would be more inclined to investigate providing an output that preserves the source data in full and annotates it to link alternative representations of the same feature, as I've sketched out in the 3rd option below. That type of approach allows full flexibility in what users can do with the output, but the output would require further processing by users to achieve the goal of the tool so I don't think it is something we should switch to this late in the development.

Those are my initial thoughts - happy to discuss further!

Worked examples

These examples are in the canonical OFDS JSON format, rather than GeoJSON, for readability.

Source data

Two representations of the same network, with differing properties and geometries:

{
"id": "A",
"nodes": [
{
"id": "1",
"name": "Manchester",
"location": {
"type": "Point",
"coordinates": [
0,
0
]
},
"physicalInfrastructureProvider": {
"id": "1",
"name": "Open Reach"
},
"networkProviders": [
{
"id": "2",
"name": "BT"
}
]
}
]
}

{
"id": "B",
"nodes": [
{
"id": "1",
"name": "Greater Manchester",
"location": {
"type": "Point",
"coordinates": [
1,
1
]
},
"physicalInfrastructureProvider": {
"id": "1",
"name": "Talk Talk"
},
"networkProviders": [
{
"id": "2",
"name": "Talk Talk"
}
]
}
]
}

Single 'record' (current approach)

Consolidated with network A as the primary network:

{
"id": "C", // A new network identifier (actually a UUID)
"nodes": [
{
"id": "1", // A new node identifier to avoid clashes (actually a UUID, but could equally be incremental)
"name": "Manchester", // The name from the primary network
"location": { // The location from the primary network
"type": "Point",
"coordinates": [
0,
0
]
},
"physicalInfrastructureProvider": { // The physical infrastructure provider from the primary network
"id": "A-1",
"name": "Open Reach"
},
"networkProviders": [ // The network providers from both networks
{
"id": "A-2",
"name": "BT"
},
{
"id": "B-2",
"name": "Talk Talk"
}
],
"provenance": {
// see https://github.com/Open-Telecoms-Data/ofds_consolidation_tool/blob/main/docs/howto.md#output
}
}
]
}

Pros:
Cons:
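
To make the single-record behaviour concrete, here is a minimal Python sketch of how two matched nodes could be collapsed into one. It is not the tool's actual code: the function names `prefix_ids` and `consolidate_node` are hypothetical, the `<network>-<id>` prefixing of organisation identifiers is inferred from the example above, and the `provenance` object is left out.

```python
import uuid
from copy import deepcopy


def prefix_ids(orgs, network_id):
    """Return copies of organisation references with the source network id prepended."""
    return [{**org, "id": f"{network_id}-{org['id']}"} for org in orgs]


def consolidate_node(primary, secondary, primary_net, secondary_net):
    """Merge two matched nodes into one record (hypothetical helper).

    Scalar fields and the location come from the primary node; network
    providers from both nodes are kept, with ids prefixed to avoid clashes.
    """
    merged = deepcopy(primary)
    merged["id"] = str(uuid.uuid4())  # new identifier, as in the example above
    provider = primary.get("physicalInfrastructureProvider")
    if provider is not None:
        merged["physicalInfrastructureProvider"] = prefix_ids([provider], primary_net)[0]
    merged["networkProviders"] = (
        prefix_ids(primary.get("networkProviders", []), primary_net)
        + prefix_ids(secondary.get("networkProviders", []), secondary_net)
    )
    return merged


node_a = {
    "id": "1",
    "name": "Manchester",
    "location": {"type": "Point", "coordinates": [0, 0]},
    "physicalInfrastructureProvider": {"id": "1", "name": "Open Reach"},
    "networkProviders": [{"id": "2", "name": "BT"}],
}
node_b = {
    "id": "1",
    "name": "Greater Manchester",
    "location": {"type": "Point", "coordinates": [1, 1]},
    "physicalInfrastructureProvider": {"id": "1", "name": "Talk Talk"},
    "networkProviders": [{"id": "2", "name": "Talk Talk"}],
}

merged = consolidate_node(node_a, node_b, "A", "B")
```

Note that in this sketch the secondary node's name, location and physical infrastructure provider are discarded, which is the information loss the multi-record proposal is trying to avoid.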
One network, multiple 'records' (Steve's proposed approach)

Consolidated with network A as the primary network using Steve's proposed approach:
{
"id": "C", // A new network identifier (actually a UUID)
"nodes": [
{
"id": "A-1", // A new node identifier
"name": "Manchester", // The name from the primary network
"location": { // The location from the primary network
"type": "Point",
"coordinates": [
0,
0
]
},
"physicalInfrastructureProvider": { // From the primary network
"id": "1",
"name": "Open Reach"
},
"networkProviders": [ // From the primary network
{
"id": "2",
"name": "BT"
}
],
"provenance": {
...
}
},
{
"id": "A-1", // Same as the above node identifier
"name": "Greater Manchester", // The name from the secondary network
"location": { // The location from the *primary* network
"type": "Point",
"coordinates": [
0,
0
]
},
"physicalInfrastructureProvider": { // From the secondary network
"id": "1",
"name": "Talk Talk"
},
"networkProviders": [ // From the secondary network
{
"id": "2",
"name": "Talk Talk"
}
],
"provenance": {
...
}
}
]
}

Pros:
Cons:
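
For comparison, a minimal Python sketch of the multi-record output, under the same assumptions as the example above: each source node is kept as its own record, but all records share one identifier and the primary node's location. The function name `multi_record_merge` and the `<network>-<id>` form of the shared identifier are illustrative only, and `provenance` is again omitted.

```python
from copy import deepcopy


def multi_record_merge(primary, secondary, primary_net):
    """Keep one record per source node, sharing an id and location (hypothetical helper)."""
    shared_id = f"{primary_net}-{primary['id']}"
    records = []
    for node in (primary, secondary):
        record = deepcopy(node)
        record["id"] = shared_id  # same identifier on every record
        record["location"] = deepcopy(primary["location"])  # primary geometry wins
        records.append(record)
    return records


node_a = {
    "id": "1",
    "name": "Manchester",
    "location": {"type": "Point", "coordinates": [0, 0]},
    "networkProviders": [{"id": "2", "name": "BT"}],
}
node_b = {
    "id": "1",
    "name": "Greater Manchester",
    "location": {"type": "Point", "coordinates": [1, 1]},
    "networkProviders": [{"id": "2", "name": "Talk Talk"}],
}

records = multi_record_merge(node_a, node_b, "A")
# Two entries in `nodes` now carry the same id, so a consumer can no longer
# treat node ids as unique within the network.
```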
Multiple networks, linked features (another alternative!)

Annotate the features in the source networks with links to their alternative representations.

{
"id": "A",
"nodes": [
{
"id": "1",
"name": "Manchester",
"location": {
"type": "Point",
"coordinates": [
0,
0
]
},
"physicalInfrastructureProvider": {
"id": "1",
"name": "Open Reach"
},
"networkProviders": [
{
"id": "2",
"name": "BT"
}
],
"sameAs": {
"network": "B",
"node": "1"
// Could also include `confidence`, `similarFields` and `manual` properties from existing `provenance` object
}
}
]
}

{
"id": "B",
"nodes": [
{
"id": "1",
"name": "Greater Manchester",
"location": {
"type": "Point",
"coordinates": [
1,
1
]
},
"physicalInfrastructureProvider": {
"id": "1",
"name": "Talk Talk"
},
"networkProviders": [
{
"id": "2",
"name": "Talk Talk"
}
],
"sameAs": {
"network": "A",
"node": "1"
}
}
]
}

Pros:
Cons:
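
A minimal Python sketch of the linking step, assuming the `sameAs` annotation sits on the node it describes. The helper names `find_node` and `link_same_as` are hypothetical, and the optional `confidence`, `similarFields` and `manual` properties are not modelled.

```python
def find_node(network, node_id):
    """Return the node with the given id, or raise KeyError if it is absent."""
    for node in network.get("nodes", []):
        if node["id"] == node_id:
            return node
    raise KeyError(f"node {node_id!r} not found in network {network['id']!r}")


def link_same_as(net_a, node_id_a, net_b, node_id_b):
    """Annotate both nodes in place with reciprocal `sameAs` links."""
    node_a = find_node(net_a, node_id_a)
    node_b = find_node(net_b, node_id_b)
    node_a["sameAs"] = {"network": net_b["id"], "node": node_id_b}
    node_b["sameAs"] = {"network": net_a["id"], "node": node_id_a}


network_a = {"id": "A", "nodes": [{"id": "1", "name": "Manchester"}]}
network_b = {"id": "B", "nodes": [{"id": "1", "name": "Greater Manchester"}]}

link_same_as(network_a, "1", network_b, "1")
```

Both source networks keep their original identifiers and values; producing a consolidated view is left to whoever consumes the linked output.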
Thanks Duncan, for taking the time to think through this. I take your point about the resulting non-standard implementation of OFDS if we take the multiple records approach. And thanks too for considering an alternative to both. I suggest we carry on with the current single record approach. I was worried about regulators being able to add value to the aggregated data source and feed it back to operators but I am sure there are effective, non-automated ways of doing this.

Regarding canonical records, I fully agree that operators would/should be the authority regarding IDs for spans and nodes. In a future where there is some kind of version control for network records, perhaps permanent IDs may be possible, or perhaps it isn't important.

Looking at your examples, one area where I can see it being important is at the level of operators and operator IDs. We would want to be able to aggregate operator networks across networks and countries. Doing that would require consistent use of a network operator ID.
Thanks for the feedback 🙂

👍 to using consistent organisation identifiers. I omitted the
{
"id": "01599423",
"scheme": "GB-COH" // Companies House in this example
}

Based on experience from other standards, this area needs to be flagged quite early with implementers, as they might not be collecting legal identifiers for the organisations in their data as a matter of course.
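
As a rough illustration of why consistent identifiers matter for aggregation, the Python sketch below groups network providers across networks by a `scheme` plus `id` key instead of by their local ids. It assumes each organisation carries the object above under a field named `identifier`; that field name, the helper names and the "Example Operator" records are assumptions made for the example.

```python
from collections import defaultdict


def org_key(org):
    """Build a globally comparable key from an organisation's legal identifier."""
    ident = org.get("identifier") or {}
    if not ident.get("scheme") or not ident.get("id"):
        return None  # no legal identifier collected, so it cannot be matched safely
    return f"{ident['scheme']}-{ident['id']}"


def networks_by_provider(networks):
    """Map each identified network provider to the ids of the networks it appears in."""
    groups = defaultdict(set)
    for network in networks:
        for node in network.get("nodes", []):
            for org in node.get("networkProviders", []):
                key = org_key(org)
                if key is not None:
                    groups[key].add(network["id"])
    return {key: sorted(ids) for key, ids in groups.items()}


# Hypothetical data: the same operator appears under different local ids and names.
identifier = {"id": "01599423", "scheme": "GB-COH"}
networks = [
    {"id": "A", "nodes": [{"id": "1", "networkProviders": [
        {"id": "2", "name": "Example Operator", "identifier": identifier}]}]},
    {"id": "B", "nodes": [{"id": "1", "networkProviders": [
        {"id": "7", "name": "Example Operator Ltd", "identifier": identifier},
        {"id": "8", "name": "Unidentified Operator"}]}]},
]

grouped = networks_by_provider(networks)
```

Organisations without a legal identifier are skipped, not guessed at by name, which is why implementers need to know early that these identifiers should be collected.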
I am wondering whether the result of merging a node or span should be a single record, as currently implemented, with multiple network operators recorded in that one record, or multiple individual records that share a common node ID (or IDs) and latitude/longitude. In particular, I am thinking about the ability to disaggregate the data, and about the point at which the node or span ID becomes canonical for a given operator over a given infrastructure.