
Support geospatial data for modeling electoral district boundaries #412

Closed
jswiesner opened this issue Mar 25, 2021 · 26 comments

@jswiesner
Collaborator

Background

The VIP spec should allow for the use of geospatial data to define the boundaries of electoral districts.

StreetSegments currently provide the primary mechanism to associate the location of a registered voter with the polling location(s) they are eligible to vote at. The StreetSegment approach has proven difficult to use on both the producer and consumer side, but geospatial data can offer a scalable alternative. StreetSegments effectively amount to a point cloud of registered voters that index into polling locations. With geospatial data, on the other hand, a single polygon/shape can be used to encompass all registered voters within that district, replacing potentially thousands of StreetSegments.

Use case

A publisher of a VIP feed should be able to define the geographic boundary of an electoral district. For example, the publisher may have a polygon/shape represented by a series of lat/long points defining the boundary of the district. The publisher should then be able to define polling locations in the feed, and associate these locations with their corresponding electoral district(s). In most cases the electoral district would be a precinct, but could also be a precinct split, locality or state. If the producer provides a geographic boundary for the district, StreetSegments are no longer needed to map voters to their eligible polling locations - rather, if a voter's registered address is located within one of these geographic boundaries, then that voter is eligible to vote at any of the polling locations associated with that district.

One possible solution

To kick off the discussion on this topic, one possible solution would be to add element(s) to the ElectoralDistrict element to capture the spatial extent of the district. In the simplest form, this may just be an unbounded number of lat/long points. With this extension, you could think of an ElectoralDistrict object as the pairing of an OCD ID with the geographic extent of its boundary.

This solution would plug in well to the Precinct element as-is, which already contains a reference to one or more ElectoralDistricts. In the simplest example, a Precinct element would refer to a single ElectoralDistrict. That ElectoralDistrict would contain info about the OCD ID for the precinct, as well as a spatial extent definition to define the boundary of the precinct. That's all the information that would need to be provided in the feed for consumers to determine voter eligibility for this polling location.

The Locality and State elements, however, don't presently contain a reference to an ElectoralDistrict, so we would likely need to add this for easier modeling of county- and state-wide polling locations.
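
To make that pairing concrete, here is a minimal consumer-side sketch in Python; the field names (`ocd_id`, `boundary`, `electoral_district_ids`) and the coordinates are illustrative assumptions, not anything defined in the current spec.

```python
from dataclasses import dataclass

# Hypothetical consumer-side model of the proposed pairing: an ElectoralDistrict
# carries its OCD ID plus the lat/long ring that defines its boundary, and a
# Precinct references one or more districts (field names are illustrative only).
@dataclass
class ElectoralDistrict:
    ocd_id: str
    boundary: list[tuple[float, float]]  # ordered (lat, lng) vertices of the polygon

@dataclass
class Precinct:
    precinct_id: str
    electoral_district_ids: list[str]

districts = {
    "ed-001": ElectoralDistrict(
        ocd_id="ocd-division/country:us/state:va/county:fairfax/precinct:101",
        boundary=[(38.85, -77.30), (38.85, -77.25), (38.80, -77.25), (38.80, -77.30)],
    )
}
precinct = Precinct(precinct_id="p-101", electoral_district_ids=["ed-001"])
```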

@afsmythe

afsmythe commented Mar 26, 2021

We'll need input from both data providers and publishers, but it's exciting to get the conversation going. Thanks for posting!

Will we need to add geography attributes to ElectoralDistrict, Locality and State? We don't directly assign geography data to those elements now. Instead, I believe the geography is generally defined by the Precinct > StreetSegment link, via StreetSegment.PrecinctId. All other jurisdictions within a VIP feed (including electoral districts, locality and state) can be derived from that base assignment of Precinct to StreetSegment.

What are the drawbacks and possible complexities of solely assigning spatial data to Precinct, with each of the other jurisdictional boundaries (electoral districts, locality, state) derived from their relationships to those associated precincts? It feels like the current relationship between Precinct, ElectoralDistrict, Locality and State already aligns well if spatial data were only assigned to Precinct, replacing the existing street libraries. It feels less redundant, but maybe that's not best practice and we'll need redundancy by having each jurisdiction assigned its own spatial data record.

EDIT: I think I'm using "jurisdiction" to describe any map boundary within a VIP feed including precinct, locality, electoral district and state.

@cjerdonek
Contributor

If the producer provides a geographic boundary for the district, StreetSegments are no longer needed to map voters to their eligible polling locations - rather, if a voter's registered address is located within one of these geographic boundaries, then that voter is eligible to vote at any of the polling locations associated with that district.

If this is being proposed as a replacement, how could a consumer reliably tell if a street address falls within a polygon that's defined using lat/long? It seems like this would require the consumer to use an external service (e.g. external data set and geometry library) and could result in incorrect answers depending on the service, as opposed to being self-contained.

@jswiesner
Collaborator Author

Will we need to add geography attributes to ElectoralDistrict, Locality and State?

As far as my initial proposal goes, I was thinking we simply add geospatial attributes to the ElectoralDistrict element. To define the geographic boundary of a precinct, a publisher would need to provide geospatial attributes on an ElectoralDistrict element, and then reference that district from the corresponding Precinct.ElectoralDistrictIds field. In order to follow suit for Locality and State, we'd need to add an ElectoralDistrictIds field to those elements, similar to what already exists on the Precinct. Using this approach to model state-wide voting locations, a producer would need to create an ElectoralDistrict which contains geospatial attributes to define the perimeter of the entire state.

What are the drawbacks and possible complexities with solely assigning spatial data to Precinct, having each of the other jurisdictional boundaries (electoral districts, locality, state) derived from their relationships to those associated precincts?

Only defining geospatial attributes for Precincts, so that Localities and States are effectively just collections of Precincts, is certainly another viable solution. As you mentioned, this approach has the advantage of fitting well into the current spec. One downside, however, is that a publisher always has to provide precinct-level geospatial attributes, even if the feed only contains county- or state-wide voting locations.

If this is being proposed as a replacement, how could a consumer reliably tell if a street address falls within a polygon that's defined using lat/long?

Given a voter's location (lat/long), determining their precinct is just a matter of performing a "point in polygon" evaluation (see https://en.wikipedia.org/wiki/Point_in_polygon). There are a number of common algorithms for doing this evaluation, and I presume a number of open source implementations of these as well.
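
For illustration, a minimal ray-casting sketch of that point-in-polygon test (in practice a consumer would more likely reach for an existing geometry library such as Shapely rather than hand-roll this):

```python
def point_in_polygon(lat: float, lng: float, polygon: list[tuple[float, float]]) -> bool:
    """Ray-casting test: count how many polygon edges a horizontal ray from the
    point crosses; an odd count means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lng1 = polygon[i]
        lat2, lng2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses this latitude.
            cross_lng = lng1 + (lat - lat1) * (lng2 - lng1) / (lat2 - lat1)
            if lng < cross_lng:
                inside = not inside
    return inside

# Example: a voter geocoded to (38.82, -77.27) inside a made-up rectangular precinct.
boundary = [(38.85, -77.30), (38.85, -77.25), (38.80, -77.25), (38.80, -77.30)]
print(point_in_polygon(38.82, -77.27, boundary))  # True
```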

@afsmythe afsmythe added this to the Version 6.0 milestone Mar 29, 2021
@afsmythe

afsmythe commented Mar 29, 2021

To define the geographic boundary of a precinct, a publisher would need to provide geospatial attributes on an ElectoralDistrict element, and then reference that district from the corresponding Precinct.ElectoralDistrictIds field.

I believe we see in VIP feeds that precincts (and precinct splits) are generally smaller pieces of the larger electoral district they construct. Taking an exaggerated example of an ElectoralDistrict for a State, that geography on its own would not be small enough to identify the geography of each Precinct in the State. Wouldn't we need the ElectoralDistrict to derive its geography from Precinct, rather than the other way around? Of course we could provide geographies for both data types as well.

@drrwebber

drrwebber commented Mar 29, 2021 via email

@jswiesner
Collaborator Author

I believe we see in VIP feeds that precincts (and precinct splits) are generally smaller pieces of the larger electoral district they construct. Taking an exaggerated example of an ElectoralDistrict for a State, that geography on its own would not be small enough to identify the geography of each Precinct in the State. Wouldn't we need the ElectoralDistrict to derive its geography from Precinct, rather than the other way around? Of course we could provide geographies for both data types as well.

I was thinking that if a feed only provides state-wide voting locations, for example, then the more granular precinct-level geography wouldn't be necessary. That way it would give publishers the flexibility to define geospatial boundaries at the precinct, locality or state level, whichever is most convenient. But on the other hand, if the more detailed precinct-level geographies are always readily available, then it may just be easier to collect geography at the precinct level only.

Just FYI - the gerrymandering is what you are up against. In places like Pittsburgh and San Francisco they have it to another whole level - where odd/even houses on sides of streets vote differently. This is why the specifications for OASIS EML support that - amongst other nuances. Enjoy.

This is an interesting point. For the foreseeable future, the idea is that feed publishers would have the option of modeling voter locations either using street segments or geospatial data, whichever provides a better solution for each state's needs. In the fullness of time, however, the hope is that all states would migrate to the more precise geospatial approach.

The point also gives rise to the need to be able to define a precinct with one or more shapes. It's possible that a precinct may require more than one polygon to fully encapsulate its jurisdiction.

@cjerdonek
Contributor

Given a voter's location (lat/long), determining their precinct is just a matter of performing a "point in polygon" evaluation (see https://en.wikipedia.org/wiki/Point_in_polygon). There are a number of common algorithms for doing this evaluation, and I presume a number of open source implementations of these as well.

This was only half of my comment (I said "external data set and geometry library"). If it's being proposed as a replacement, consumers would now also have to depend on an external service or data set in order to get the lat/long for a street address. In addition, an error in that external source could result in sending a voter to the wrong precinct. With explicit street segments, one doesn't have any external data dependencies and isn't limited to software languages with a point-in-polygon implementation.

Thus, it seems like this would raise the bar for consuming and using the data, so I would suggest that polygons be permitted only in addition to the current approach.

Yet another issue is that multi-level apartments may have a vertical component to their coordinates, so I'm not sure that lat/long would even be sufficient to differentiate apartment numbers located on different floors. You would need three-dimensional polyhedra in space (possibly disconnected?). In contrast, the explicit StreetSegment has an optional UnitNumber to handle this case, so such ambiguities go away.

@jdmgoogle
Contributor

@cjerdonek Thanks for flagging both the usability and the 3-D nature of some precincts.

On the flip side, I will say that in many cases the StreetSegment approach already requires a geocoder, especially in places with multiple ways to specify an address[1], common variants of street names (First vs 1st), or places where the mailing address isn't the street address (most commonly college campuses). In these cases there's a non-trivial risk that clients of StreetSegment-based feeds are already placing voters on the wrong segment and in the wrong precinct.

Also, in states where they've already moved to a Geo-based solution on the backend, the way to create StreetSegment entries is to take the address of every registered voter and create one StreetSegment per voter. If we're lucky the state or county will de-dupe the addresses. In any event, we end up with files that run to a few gigabytes even while compressed. They also suffer from coverage gaps, since a new voter who doesn't make it into the VIP system will not have a point in the cloud of addresses, and there's no address range on which to place them. This problem is only going to get worse as more states move to GIS backends for their EMSs, and more VIP files move to the segment-per-voter approach.

So I agree that this will raise the bar in the ways that you point out. The question is to what extent the current system is failing clients and voters -- in silent and hard-to-debug ways, in our experience -- and what we can do about it.

[1] All of these addresses are the same place:
32 71 31st St, Astoria
32-71 31st St, Queens, NY
32-71 31st St, New York
3271 31st Street, 11106

@cjerdonek
Contributor

With street segments, when given an address, one at least has a list of the possible targets to match against. In particular, if someone is attempting to find a match and they don't find an exact match in the list, it won't be a silent failure. If street segments are taken away, though, and someone is given an address, the process of getting a lat/long back can be a black box if an external service is used. So someone won't necessarily be alerted if the address should really be improved in some way.

I think providing geospatial data in addition to street segments is a good improvement. It would address the new voter issue you mention by providing a fallback, and it will also provide more info to match against in cases where the address provided by a voter has more than one candidate match if using the street address alone.

@cjerdonek
Contributor

Btw, I think the questions of what to do about states that are providing a segment-per-voter as well as how things can be improved otherwise would probably be better discussed in separate tickets.

@jdmgoogle
Contributor

the questions of what to do about states that are providing a segment-per-voter as well as
how things can be improved otherwise would probably be better discussed in separate tickets

The feedback we've gotten from those states is "please let us provide the information in geospatial formats", so I imagine that issue would just get marked as a duplicate of this one. :)

if someone is attempting to find a match and they don't find an exact match in the list, it won't be a silent failure

What you're describing is a false negative. I'm discussing both those and false positives.

I can walk you through, in practice, what it's like to try and match user-provided addresses to VIP-provided segments without using a geocoder. This is my experience from several years ago so it may be slightly outdated but I believe the numbers are directionally correct. If anyone here has any experience trying to use VIP files to set up a polling place lookup tool without relying on a geocoder, I'd love to hear it and learn about your precision and recall (and how you're detecting false positives).

  1. Rely only on straight street and city name matches, ignoring case. This will get you probably somewhere around 20-30% recall (i.e., about 70-80% of lookups will fail). This is assuming you're able to correctly tokenize the address down to the right street and city.
  2. Add in a hardcoded list of substitutions (e.g., "st" ~> "street", "dr" ~> "drive", "1st" ~> "first"). This may get you another 10-15%, but there's now a small risk (0.1%? 0.01%?) that one of these substitutions will result in a false positive.
  3. Start allowing matches on some combination of terms but not all of them. If the street name (which you're normalizing, per (2)) and ZIP code match but the city name doesn't ... maybe allow that? After all, some people refer to their mailing address by neighborhood instead of by city name, but the ZIP code and street should be stable. Coverage is up another ~10% (not all states provide ZIP codes with their street segments and not all users put in their ZIP code) but now you're looking at a higher rate of false positives, maybe starting to approach 0.5-1.0%?
  4. Create a "match score" for each user-provided address and street segment index, using some combination of terms which are present in the user address and the street segments. Doing this right is a challenge because addresses are weird and complex (not all of the oddities apply to the US, but the majority do in some form). At this point it's effectively a basic geocoder which is trying to tune the "match score" for precision and recall. And likely you're looking at about 50-70% success and ~1-2% false positives. (A rough sketch of this kind of normalize-and-score matching follows after this list.)

Again, if anyone is aware of more recent experiences trying to use these files without an external geocoding service, I'd love to hear them. But in practice the system would fail on about a third of lookups, and of the ones where it succeeded, about one in 40 would match to the wrong segment.
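
For concreteness, here is a rough Python sketch of the normalize-and-score matching described in steps 1-4 above; the substitution table, weights, and segment fields are made-up illustrations, not anything from the VIP spec.

```python
import re

# Illustrative only: normalize tokens, then score each segment against the
# user-provided address. Weights favor street and ZIP over city name, since
# users often give a neighborhood instead of the official city name.
SUBSTITUTIONS = {"st": "street", "dr": "drive", "ave": "avenue", "1st": "first"}

def normalize(text: str) -> list[str]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [SUBSTITUTIONS.get(t, t) for t in tokens]

def match_score(user_address: str, segment: dict) -> float:
    user_tokens = set(normalize(user_address))
    score = 0.0
    if set(normalize(segment["street"])) <= user_tokens:
        score += 0.6
    if segment.get("zip") and segment["zip"] in user_tokens:
        score += 0.3
    if set(normalize(segment["city"])) <= user_tokens:
        score += 0.1
    return score

segments = [
    {"street": "31st St", "city": "Queens", "zip": "11106", "precinct_id": "p-42"},
    {"street": "Main St", "city": "Astoria", "zip": "11105", "precinct_id": "p-07"},
]
best = max(segments, key=lambda s: match_score("32-71 31st Street, Astoria 11106", s))
print(best["precinct_id"])  # p-42, matched on street + ZIP despite the city mismatch
```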

@jdmgoogle
Contributor

All of that said, the issues you raised about 3-D modeling and odd exceptions are good ones. I could certainly see a case where the state provides geospatial data for the common case, but also a set of "override" street segments to handle these kinds of corner cases. So the serving end would check the user address against an override street segment first and return that if there's a match; barring that, it could fall back on the geospatial data.

Also I know that Ohio is one of the larger "point cloud" providers, so if anyone from Ohio, or any former Ohio election official, is lurking on this thread, I'd love to get their Ohio-based opinion about the challenges of using a geospatial-based feed.

@JDziurlaj

I understand address standardization isn’t infallible, but it does allow EOs to directly specify the address-to-precinct mapping. Adding spatial data does not alleviate the need to standardize addresses; it just adds additional tasks to the pipeline:

(Street Range) Input Address -> Standardize Address -> Lookup Range
(Geospatial) Input Address -> Standardize Address -> Geocode -> Point-in-polygon

Because it is unlikely for third party geocodes to be effectively validated by EOs, I would suggest requiring (requesting?) downstream consumers show the geocoded point to the voter for validation. This would make the geocoding operation less of a black box, and possibly provide a feedback loop for geocoding improvement.

Another option would be for EOs to provide their own geocodes, which would be in addition to single point addresses. This of course would result in an even larger VIP feed, but would ensure that voters get trusted info.

I would also mention that some states may not have access to election district layers, so street ranges will need to be supported for at least the near future.

@jdmgoogle
Contributor

Just to make it clear, we're not saying that VIP6 will remove the ability to specify street segments in the usual way. What we're proposing is an alternate way in addition to the existing method to specify a precinct or precinct split boundary. Election officials would have the following options to specify how voters are associated with polling places:

  1. Purely street segments;
  2. Purely GIS shapefiles; and
  3. Some combination of shapefiles plus segments (the "override" option tossed out in my previous comment).

Any of these would be considered valid methods of saying "the person/people at this location are associated with this precinct".

What geospatial data gives the jurisdictions that choose to use it is:

  1. The ability to represent precincts in VIP in the same format they're already representing them in their EMS;
  2. 100% coverage of all addresses within a precinct. Right now the "point cloud" doesn't allow them to say that if 4 E Main Street and 8 E Main Street are in the same precinct, then 6 E Main Street is too, and consumers can't assume it, because interpolation assumes sane-ish city boundaries, which is not always true; and
  3. Freedom from clients having to geocode and normalize literally millions of point-cloud street segments on dataset ingestion.

A strong +1 to showing the voter how the lookup tool interpreted their address, regardless of whether that's done via street segments or geospatial normalization. We've seen a number of issues where addresses in Queens, Hawaii, college campuses and dorms, tribal reservations, or other addresses outside the common "123 Main Street, Sometown, State, 12345" format get misinterpreted. Showing how the lookup tool parsed/modified the address is an excellent usability provision (although not something the spec can enforce).

As for EO-provided geocodes, we've had mixed success with that. In many cases they're of decent quality, but in many other cases they disagree with our geocoding infrastructure. It requires a lot of effort to cross-check those.

And I completely agree with your last point. We don't see street segments going away anytime in the near future. This is simply trying to find another tool for the toolbox, and address a workflow (EMSes using native geospatial layers) which some states have migrated to, and we expect more to migrate to before 2024.

At this point VIP5 is six years old (!!!), so we should develop VIP6 with the mindset that it should meet publisher and client needs through at least the 2026 election, if not 2028.

@afsmythe

afsmythe commented Apr 5, 2021

+1 to building VIP6 for the future.

If states and localities are already building out their geo-spatial systems, we'll need a specification that can guide and support that work. Without that, we'll end up with a scenario where states and localities are building their spatial systems in unique ways. Better if we can lead on the format, rather than play catch up.

@jswiesner
Collaborator Author

I wanted to add a few more thoughts to the point of introducing a dependency on an external geocoding service. While it is theoretically possible to work with VIP 3.x/5.x feeds without using a geocoding service, based on my experience working with VIP feeds and the Civic Information API, I strongly doubt that this is actually achievable in a way that would provide a good user experience.

As @JDziurlaj correctly pointed out above, both the street segment and geospatial based approaches require an "address standardization" step. Parsing, understanding and tokenizing a user-provided address is one of the primary utilities that a geocoding service provides, and a functionality that would be very difficult to replicate independent of a geocoder. It wouldn't be hard to get a decent level of coverage with a few heuristics like @jdmgoogle outlined above, but to do this in a reliable way that yields good coverage across the myriad of US addresses essentially amounts to building your own geocoding service.

One possible path forward is that we formally acknowledge the silent, inescapable need to use a geocoding service in order to effectively consume and serve VIP feeds. The next version of the VIP specification could be a good opportunity to bring more transparency around this aspect of making voting information useful.

I also want to consolidate the few key points of discussion so far in this thread to see where preferences lie, and to also welcome other ideas and feedback.

  1. Modeling geospatial data at precinct, locality and state levels
    One option is to only model geospatial data at the precinct level. Therefore the geographic shapes for localities and states are represented by the aggregate shapes of all the precincts contained within. Another option is that the precinct, locality and state all have the ability to be modeled with geospatial data.

  2. Hybrid approach for combining street segments and geospatial data
    In order to leverage the benefits of street segments to model edge cases like apartment numbers, we could allow for a hybrid approach where both street segments and geospatial data can be provided in the feed. If a voter address maps to a street segment, the street segment takes precedence. Otherwise, geospatial data is used to identify which precinct the voter is located within. (A rough sketch of this precedence logic follows below.)
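
A minimal sketch of that precedence rule, assuming a caller-supplied geocoder and the third-party Shapely library for containment; helper names and data shapes are illustrative only, not part of the spec.

```python
from shapely.geometry import Point, Polygon

def find_precinct(address, street_segment_index, district_shapes, geocode):
    """Hybrid lookup sketch: an explicit street-segment match wins; geospatial
    containment of the geocoded point is the fallback."""
    # 1. Street segments take precedence (covers edge cases like unit numbers).
    precinct_id = street_segment_index.get(address)
    if precinct_id is not None:
        return precinct_id
    # 2. Otherwise geocode the address and test which district shape contains it.
    lat, lng = geocode(address)
    point = Point(lng, lat)  # shapely uses (x, y) = (lng, lat)
    for district_id, ring in district_shapes.items():
        if Polygon([(lng_, lat_) for lat_, lng_ in ring]).contains(point):
            return district_id
    return None
```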

Thoughts?

@afsmythe

Providing some thoughts below. Thanks for synthesizing the conversation thus far.

  1. Modeling geospatial data at precinct, locality and state levels

This will be discussed with data providers in the coming months. I'll say that supporting spatial data on all jurisdictional records would provide some additional QA capability (since we can compare the 1 jurisdictional shape against the aggregate shape of the Precinct pieces). But that also adds additional "sources of truth", so it might not be ideal for a data consumer.

  2. Hybrid approach for combining street segments and geospatial data

Especially in the early days of v6, I could see a lot of value in collecting both types of data. It would provide some additional QA opportunity to cross check a geocoded StreetSegment point with the spatial polygon it should be contained within (per the spatial assignment on Precinct). That said, I believe one of the reasons for moving to spatial data is to shrink the size of these XML files.

Somewhat separate, there's been a couple of reasons offered as to why a geocoding service is a necessary tool for implementations of VIP data. With that in mind, is there any reason to also add a spatial attribute to StreetSegment? Such a record could provide a coordinate for the "point cloud" VIP feeds and a polygon-like record for a traditional street segment. Could see these potential records also serving some QA capabilities as well.
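
One way that kind of cross-check could look, sketched with the third-party Shapely library; the field names, IDs, and coordinates are made up for the example, and the geocoded point for each segment is assumed to come from whatever source the provider trusts.

```python
from shapely.geometry import Point, Polygon

# Illustrative QA pass: flag street segments whose geocoded point does not fall
# inside the polygon of the precinct they claim to belong to.
precinct_shapes = {
    "p-101": Polygon([(-77.30, 38.85), (-77.25, 38.85), (-77.25, 38.80), (-77.30, 38.80)]),
}
segments = [
    {"id": "ss-1", "precinct_id": "p-101", "lat": 38.82, "lng": -77.27},
    {"id": "ss-2", "precinct_id": "p-101", "lat": 38.90, "lng": -77.27},  # outside: should be flagged
]

for seg in segments:
    shape = precinct_shapes[seg["precinct_id"]]
    if not shape.contains(Point(seg["lng"], seg["lat"])):  # shapely uses (x, y) = (lng, lat)
        print(f"segment {seg['id']} falls outside precinct {seg['precinct_id']}")
```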

@jdmgoogle
Contributor

is there any reason to also add a spatial attribute to StreetSegment? Such a record
could provide a coordinate for the "point cloud" VIP feeds and a polygon-like
record for a traditional street segment.

Are there any localities which represent their StreetSegments as GIS shapes? I thought one of the motivating factors for them was that you didn't need shapes and it was a straightforward port of the old "street listing" precinct mapping books.

@JDziurlaj

In the county I worked in, we would take the street segments (lines) from the GIS system, and transform them into tabular structures for the street directory within the VRDB. This is because the GIS system and VRDB were not integrated.

@jdmgoogle
Contributor

Thanks for that context. Did the lines from the GIS system contain within their boundaries only the road surface itself, or the road surface and the parcels?

@JDziurlaj

The street centerlines and parcels were separate layers. We spent months reconciling the TIGER lines to our parcels, adjusting the high and low ranges on each side of the street for each segment.

@jswiesner
Collaborator Author

This will be discussed with data providers in the coming months. I'll say that supporting spatial data on all jurisdictional records would provide some additional QA capability (since we can compare the 1 jurisdictional shape against the aggregate shape of the Precinct pieces). But that also adds additional "sources of truth", so it might not be ideal for a data consumer.

Definitely happy to defer to whichever option is easier for data providers. The main motivation for specifying a single county-level shape versus an aggregate of precinct-level ones is that it could make life easier for the data provider.

Somewhat separate, there's been a couple of reasons offered as to why a geocoding service is a necessary tool for implementations of VIP data. With that in mind, is there any reason to also add a spatial attribute to StreetSegment? Such a record could provide a coordinate for the "point cloud" VIP feeds and a polygon-like record for a traditional street segment. Could see these potential records also serving some QA capabilities as well.

Where would the StreetSegment's coordinates come from? Is each data provider able to geocode these to a lat/lng?

@jswiesner
Collaborator Author

I think we've reached a good point in the discussion to consider concrete ideas for modeling geospatial data in a VIP feed. I'd like to start by proposing the following.

TL;DR

See the proposed change in jswiesner#1.

Goal

Model precinct boundaries with geospatial data.

General approach

The most straightforward way to model a precinct boundary in a VIP feed is to extend the ElectoralDistrict element to include a spatial boundary. This would make the geospatial shape a property of the district itself, which leaves flexibility to specify the boundary of other types of ElectoralDistricts in the future (i.e. Locality, State, special districts for contests). A Precinct element in a VIP feed can reference multiple ElectoralDistricts, so the boundary of a precinct would be represented by the composite of the boundaries for all of its electoral districts.
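
As an illustration of the "composite of boundaries" idea, a consumer could derive a precinct's shape by unioning the shapes of the ElectoralDistricts it references; a small sketch with the third-party Shapely library (coordinates and IDs are made up):

```python
from shapely.geometry import Polygon
from shapely.ops import unary_union

# Two adjacent ElectoralDistrict shapes referenced by the same Precinct.
district_shapes = {
    "ed-001": Polygon([(-77.30, 38.80), (-77.27, 38.80), (-77.27, 38.85), (-77.30, 38.85)]),
    "ed-002": Polygon([(-77.27, 38.80), (-77.25, 38.80), (-77.25, 38.85), (-77.27, 38.85)]),
}
precinct_district_ids = ["ed-001", "ed-002"]

# The precinct boundary is the union of its districts' shapes.
precinct_boundary = unary_union([district_shapes[d] for d in precinct_district_ids])
print(precinct_boundary.bounds)  # (-77.30, 38.80, -77.25, 38.85)
```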

Specific approaches considered

In thinking about how to model geospatial data in a VIP feed, the following two questions are top of mind.

  • How do we represent the geometry of the boundary (i.e. geospatial format)?
    There are two main options:

    • Open format: An existing industry standard for modeling geospatial data. Examples include ESRI Shapefile and GeoJSON.
    • Native format: A new proprietary format for modeling geospatial data using XML which would be part of the VIP specification.
  • Where do we store geospatial coordinates?
    There are two main options:

    • In the feed: The geospatial coordinates (i.e. all the points of a polygon) are embedded directly in the feed. If using an open format, this would likely amount to a very long string field with geospatial coordinates in a format that could be deserialized by a VIP feed parser. If using a native format, this would include XML elements to capture the collections of points that make up each shape.
    • External to the XML: The geospatial coordinates are represented in separate files external to the VIP XML feed. This approach only makes sense when using an open format, so the external file would simply be the raw ESRI shapefile, for example.

The summary below compares the most reasonable approaches considered, grouped by whether the geospatial coordinates are embedded within the XML or external to it.

Geospatial coordinates embedded within the XML

Option 1: Open format geospatial data embedded in the feed as a plaintext string

  Pros:
  • Geospatial coordinates are self-contained in a single feed. No need to distribute and join multiple files.

  Cons:
  • Embedding geospatial data makes the XML file size large.
  • Feed consumer has to understand the open format (i.e. ESRI shapefile).
  • A long string in the XML feed could make the file difficult to manually browse/search, and extremely long lines may cause performance issues for text editors.
  • No XML schema validation possible on shape data; the coordinates are simply contained in a string.
  • Any change to any part of the XML feed would require reprocessing of all geospatial data.
  • Difficult to QA; would require building custom tooling to extract the shape data from the XML feed to visualize.

Option 2: Open format geospatial data embedded in the feed as a base64-encoded string

  Pros:
  • Most space-efficient option to embed geospatial data in the XML.
  • Geospatial coordinates are self-contained in a single feed. No need to distribute and join multiple files.

  Cons:
  • Embedding geospatial data makes the XML file size large.
  • Feed consumer has to understand the open format (i.e. ESRI shapefile).
  • A long string in the XML feed could make the file difficult to manually browse/search, and extremely long lines may cause performance issues for text editors.
  • Requires more work for feed consumers to decode the base64 string.
  • No XML validation possible on shape data; the coordinates are simply contained in a string.
  • Any change to any part of the XML feed would require reprocessing of all geospatial data.
  • Difficult to QA; would require building custom tooling to extract the shape data from the XML feed to visualize.

Option 3: Native format geospatial data modeled in the feed as XML elements

  Pros:
  • Geospatial coordinates are self-contained in a single feed. No need to distribute and join multiple files.
  • Keeps the XML easier to manually browse and open in an editor since we could model one lat/lng point per line.
  • No special parsing required to understand the shape coordinates (i.e. consumer doesn’t need to be able to parse an ESRI shapefile, for example).
  • XML schema could enforce type constraints on shape coordinates, since each lat/lng value would be in its own element.

  Cons:
  • Embedding geospatial data makes the XML file size large.
  • Most verbose approach due to repeated XML element tags wrapping each lat/lng; will result in the largest XML feed size.
  • Cumbersome for data providers to create an XML feed, as it would require transforming geospatial data from a GIS tool into the proprietary XML shape format.
  • Any change to any part of the XML feed would require reprocessing of all geospatial data.
  • Difficult to QA; would require building custom tooling to extract and transform the data into a format that is visualizable.

Geospatial coordinates external to the XML

Option 4: Open format geospatial data in external files, with external file references verified by checksum

  Pros:
  • Keeps the XML feed itself extremely slim by fully externalizing geospatial coordinates. Ensures we don’t run into XML file size constraints in the future by avoiding geospatial data embedded in the XML.
  • Easiest option for data providers since they can directly use files exported from GIS tools. No need to transform shape data into a plaintext or encoded string.
  • Supports geospatial data in a robust, open format in its original file format, which would provide maximum flexibility for geospatial modeling and allow us to leverage open source libraries for parsing (i.e. GeoTools: https://geotools.org/).
  • Simplified QA of geospatial data since shapefiles in an open format could be viewed/inspected with open source tooling (i.e. https://mapshaper.org/).
  • Allows for downloading and reprocessing XML and shapefiles independently. If an XML feed has been updated, but the corresponding shapes have not, it allows for optimization of only reprocessing the XML contents.
  • Similarly, if just one of many shapefiles has been updated in a feed, it allows for optimization to only reprocess the modified one. This could have significant implications for QA processes to avoid redundant manual work.

  Cons:
  • Feed consumer has to understand the open format (i.e. ESRI shapefile).
  • Requires packaging and distributing multiple files (feed + attachments).
  • Slightly burdensome for consumers to parse and join multiple files in code.
  • No XML validation possible on shape data. A valid XML feed according to the XSD may still have a dangling reference that cannot be determined by XSD validation alone.
Proposed approach

I propose we opt for the approach of externalizing shapefiles from the XML feed and referencing these files by name and checksum. While this approach is not the simplest of all the approaches, it offers important benefits that clearly outweigh the downsides. Of all the approaches considered, this approach provides the best solution to ensure that publishing VIP feeds remains as easy as possible for data providers, XML feeds remain manageable in size, shapefiles can easily be inspected for QA, and that feed consumers can seamlessly build optimization into ingestion infrastructure to avoid unnecessary reprocessing.

Schema changes

The schema changes for the proposed approach are staged in jswiesner#1.

It would be great to get feedback on the proposed approach, as well as to hear any other proposals.
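
To illustrate what "referenced by name and checksum" could mean on the consumer side, here is a hedged Python sketch of verifying an external shapefile before ingesting it; the element names (`ExternalFile`, `FileName`, `Checksum`) and the sample values are placeholders, not necessarily what jswiesner#1 actually defines.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical feed snippet: a district that points at an external shapefile
# archive by name plus a SHA-256 checksum. Element names are placeholders only.
feed_xml = """
<ElectoralDistrict id="ed-001">
  <ExternalFile>
    <FileName>precinct_101.zip</FileName>
    <Checksum type="sha256">9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08</Checksum>
  </ExternalFile>
</ElectoralDistrict>
"""

district = ET.fromstring(feed_xml)
file_name = district.findtext("ExternalFile/FileName")
expected = district.findtext("ExternalFile/Checksum")

# The shapefile is expected to ship alongside the XML feed as an attachment.
with open(file_name, "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()

if actual != expected:
    raise ValueError(f"checksum mismatch for {file_name}: feed and attachment disagree")
```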


@afsmythe

Fine with me.

@jswiesner
Collaborator Author

Closing this issue as a result of #412 (comment).
