Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Station Look-Up #19

Open
MoKob opened this issue Mar 3, 2017 · 7 comments
Open

Implement Station Look-Up #19

MoKob opened this issue Mar 3, 2017 · 7 comments

Comments

@MoKob
Copy link
Contributor

MoKob commented Mar 3, 2017

To actually route in a public transportation network, we require the translation from location / query to stations.

I see a few different scenarios that we might want to support.

Name To Station

Similarity measures

Entering the name of a station, or parts of it, we should be able to find the best match in the data.

For any string s and station S, we need a function that determines the likelihood of s referring to S. We could use different synonyms (e.g. S+U Stationname and Stationname) to refer to the same station.

Similarity could further use techniques like stemming, phonetics matching or similar to distinguish between possible candidates.

A possible hint to further strengthen the look-up could be the distance to the GPS position.

Selecting stations

User enters name 'n'. If name 'n' can be matched with a high enough certainty, we directly select the station. If the likelihood of many stations is high enough, we should output a sorted list of possible candidates.

Possible Features

By supporting multiple likely results (staring at maybe like 2/3 characters), we could implement autocomplete.

Location to Station

For the look-up location to station we would need integration with walking durations. What I imagine is a preselection phase to select a Pareto-set of possible connections. For every line we can select the nearest (maybe two nearest/maybe per direction?) stations.
Within a maximal walking radius, we can pre-filter a set of candidates.

These candidates then need to be checked for their walking distances. This could be done with a walking table query to OSRM/mapbox-directions.

This gives a set of initial data. Next to this initial data, we also require a check for the direct connection (if it is close enough) in walking.

After a search for the best connections, we can then augment our response by querying the walk-api for an actual set of walking instructions to the stations/from the stations.

/ cc @daniel-j-h

@daniel-j-h
Copy link
Contributor

I'd like to give the Name->Station task a shot when I can make some time for it.

I think all that's needed for this service is the stop.proto messages. The lookup will then work based on name and location and return matching ids (and maybe matching metadata such as scores, corrections for the match when we fuzzy match etc.).

I want to have at least scalable and fast Fuzzy Matching of some sort.

If we want to be fancy with NLP and do stemming, etc. I'd consider writing this service in Python and using existing libraries. Also we should think about if our matcher should work worldwide or (for now) in Europe/US only.

@daniel-j-h
Copy link
Contributor

daniel-j-h commented Mar 5, 2017

Sooo.. taking the train to Bavaria I had some time to think about this.

Here's a first prototype for a fuzzy autocompleter:

https://gist.github.com/daniel-j-h/c626acfd32b51e7e1fe9e0aedd557514

It's a stupid simple gram-based inverted index and sorting by Jaccard similarity between the gram sets.

It's super inefficient doing full sorts where nth_element-equivalents would do, does not do any NTP pre-processing other than lower-casing and other atrocities (I blame the train ride).

But: it already works amazingly well! Give it a try e.g. on Berlin!

@daniel-j-h
Copy link
Contributor

There's another point I totally forgot to mention: Unicode awareness. We should not limit us to ASCII in 2017. The prototype already handles bigrams based on Unicode code points (and not bytes), but we probably should do some NFKC/NFKD normalization before indexing and querying, too.

@MoKob
Copy link
Contributor Author

MoKob commented Mar 8, 2017

@daniel-j-h if you manage to do emoji look-up, that would be THE addition to the framework :)

🌞 🗼 🚄 🚄 🗻 : how do I get from tokyo by train to mount fuji around noon

@daniel-j-h
Copy link
Contributor

Capturing from chat: what I can imagine for a first prototype is having

  • Stop names
  • Stop locations
  • (Normalized) stop "importance" (e.g. degree in the transit network)

then we can build a service for doing fuzzy auto completion:

  • User request: name string, optional location hint (derived in client from GPS / map view, etc.)
  • Service response: list of (name, confidence) pairs, can be empty

The results should then be ranked by a linear combination of

  • Matching grams
  • Weights
  • Proximity to location hint

If you check the FindProtobuf CMake module there is a CMake function to generate Python stubs based on the .proto schemas. This should allow us to build a Python3 based service based on the techniques outlined above.

@MoKob
Copy link
Contributor Author

MoKob commented Mar 10, 2017

From first impressions of the data. We should be able to handle a set of situations for the look-ups:

  • street corners: find stops of the form missing and first / mission & first
  • handle written numbers / non-written numbers: 1st / first

In addition, we should find ways to pre-filter data for unexpected characters. The Berlin feed for example lists (Berlin) on all stops.

@daniel-j-h
Copy link
Contributor

Client side example of how this would look and feel like

🚆 https://daniel-j-h.github.io/berlin-u-stops/

Should probably be done on the backend except we know the client's position and can prefetch stops and do it on a constrained area locally. Then it can be done even when there is no connectivity.

Not sure about performance, though. This is only on hundreds of U stops. It will probably not scale to more than some thousands. At least not this unoptimized fuzzy string matching running in the browser.

@MoKob MoKob added this to the 1.0 Release milestone Mar 31, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants