Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace geolix with locus #2362

Merged
merged 11 commits into from
Jan 17, 2023
Merged

replace geolix with locus #2362

merged 11 commits into from
Jan 17, 2023

Conversation

ruslandoga
Copy link
Contributor

@ruslandoga ruslandoga commented Oct 22, 2022

Changes

This PR replaces geolix with locus to simplify self-hosted setup. locus can auto-update maxmind dbs which are recommended for self-hosters if they want city-level geolocation. locus is also a bit faster.

This PR also uses a test mmdb file from https://github.com/maxmind/MaxMind-DB for e2e geolocation tests without stubs.

Tests

  • Automated tests have been updated

Changelog

  • Entry has been added to changelog

Documentation

  • Docs have been updated

Dark mode

  • This PR does not change the UI

nil

{:error, reason} ->
raise "failed to lookup ip address #{inspect(ip_address)}: " <> inspect(reason)
Copy link
Contributor Author

@ruslandoga ruslandoga Oct 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raising on invalid ip address or any other kind of unexpected error

@bundlemon
Copy link

bundlemon bot commented Oct 22, 2022

BundleMon

Unchanged files (8)
Status Path Size Limits
static/css/app.css
515.19KB -
static/js/dashboard.js
297.45KB -
static/js/sentry-bundle-5.9.1-min.js
15.7KB -
static/js/app.js
12.13KB -
static/js/embed.host.js
5.58KB -
static/js/embed.content.js
5.06KB -
tracker/js/plausible.js
748B -
static/js/applyTheme.js
314B -

No change in files bundle size

Final result: ✅

View report in BundleMon website ➡️


Current branch size history | Target branch size history


unless opts[:async] do
{:ok, _version} = :locus.await_loader(@db)
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waiting for db to be loaded by default, blocks root supervisor startup

:ok = :locus.start_loader(@db, {:maxmind, edition}, license_key: license_key)

path = opts[:path] ->
:ok = :locus.start_loader(@db, path)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if locus:start_loader returns an error, this function crashes with a match error, I think it's good enough as the error reason would still be easily visible in logs

:ok = :locus.start_loader(@db, path)

true ->
raise "failed to load geolocation db: need :path or :license_key to be provided"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raising here too in case not enough info was provided

test/support/geo_stub.ex Outdated Show resolved Hide resolved
Copy link
Contributor

@vinibrsl vinibrsl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job on this!

config/runtime.exs Outdated Show resolved Hide resolved
lib/plausible/geo/adapter.ex Outdated Show resolved Hide resolved
@vinibrsl vinibrsl added deploy-to-staging Special label that triggers a deploy to a staging environment and removed deploy-to-staging Special label that triggers a deploy to a staging environment labels Oct 28, 2022
@vinibrsl
Copy link
Contributor

I pushed your changes to staging, and it's looking good! Could you please help me doing some manual testing? https://staging.srv.plausible.io/

I believe we can merge and deploy it to production on Monday.

@ruslandoga
Copy link
Contributor Author

ruslandoga commented Oct 29, 2022

I sent some events to dummy.site with "forwarded" headers to simulate different ip addresses.

One thing I noticed now and also before these changes is that for some locations regions seem to be showing cities and cities seem to be showing city districts. Like Region=Tokyo, City=Shinjuku I think it's a dbip quirk (UPDATE: and maxmind as well...)

As a user, for the Shinjuku example above I'd expect to see
Region=Kanto, City=Tokyo[, District=Shinjuku]

@@ -34,7 +34,7 @@ defmodule Mix.Tasks.DownloadCountryDatabase do

if res.status_code == 200 do
File.mkdir("priv/geodb")
File.write!("priv/geodb/dbip-country.mmdb", res.body)
File.write!("priv/geodb/dbip-country.mmdb.gz", res.body)
Copy link
Contributor Author

@ruslandoga ruslandoga Oct 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

locus trusts the file extension.
locus didn't try to gunzip dbip-country.mmdb because it didn't have a .gz suffix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, the error that locus logged was

[error] [locus] [geolocation] database failed to load (:filesystem): {:unpack_database_from, :mmdb_blob,
 {:bad_metadata,
  {:marker_not_found,
   <<171, 205, 239, 77, 97, 120, 77, 105, 110, 100, 46, 99, 111, 109>>}}}

@metmarkosaric
Copy link
Contributor

I sent some events to dummy.site with "forwarded" headers to simulate different ip addresses.

One thing I noticed now and also before these changes is that for some locations regions seem to be showing cities and cities seem to be showing city districts. Like Region=Tokyo, City=Shinjuku I think it's a dbip quirk.

As a user, for the Shinjuku example above I'd expect to see Region=Kanto, City=Tokyo[, District=Shinjuku]

yeah, that's not ideal. we've been manually changing some of these districts into cities ourselves to make it better when people report it but it's difficult to do the whole world like that. see https://github.com/plausible/analytics/blob/master/lib/plausible/ingestion/city_overrides.ex. is there a better way to do it?

@ruslandoga
Copy link
Contributor Author

ruslandoga commented Oct 29, 2022

Before I go into speculation on how to work around this, I'll post some example responses from dbip-city-lite and maxmind's geolite2-city for two Shinjuku ip addresses to better understand what we are dealing with:

ip lookup response from dbip-city-lite
iex(6)> :locus.lookup(:dbip, "124.150.157.214")
{:ok,
 %{
   # The full dbip response from the paid version would also contain geonames ids.
   "city" => %{"names" => %{"en" => "Shinjuku"}},
   "continent" => %{
     "code" => "AS",
     "geoname_id" => 6255147,
     "names" => %{
       "de" => "Asien",
       "en" => "Asia",
       "es" => "Asia",
       "fa" => " آسیا",
       "fr" => "Asie",
       "ja" => "アジア大陸",
       "ko" => "아시아",
       "pt-BR" => "Ásia",
       "ru" => "Азия",
       "zh-CN" => "亚洲"
     }
   },
   "country" => %{
     "geoname_id" => 1861060,
     "is_in_european_union" => false,
     "iso_code" => "JP",
     "names" => %{
       "de" => "Japan",
       "en" => "Japan",
       "es" => "Japón",
       "fa" => "ژاپن",
       "fr" => "Japon",
       "ja" => "日本",
       "ko" => "일본",
       "pt-BR" => "Japão",
       "ru" => "Япония",
       "zh-CN" => "日本"
     }
   },
   "location" => %{"latitude" => 35.6967, "longitude" => 139.709},
   "subdivisions" => [%{"names" => %{"en" => "Tokyo"}}]
 }}

The full dbip response from the paid version would also contain geonames ids like in this maxmind lookup example:

ip lookup response from maxmind's geolite2-city, contains geoname ids
iex(10)> :locus.lookup(:maxmind, "122.216.10.1")
{:ok,
 %{
   "city" => %{
     "geoname_id" => 1852083, # but https://www.geonames.org/1852083 can be used to double check
     "names" => %{"en" => "Shinjuku", "ja" => "新宿区"} # same problem, Shinjuku is listed as city
   },
   "continent" => %{
     "code" => "AS",
     "geoname_id" => 6255147,
     "names" => %{
       "de" => "Asien",
       "en" => "Asia",
       "es" => "Asia",
       "fr" => "Asie",
       "ja" => "アジア",
       "pt-BR" => "Ásia",
       "ru" => "Азия",
       "zh-CN" => "亚洲"
     }
   },
   "country" => %{
     "geoname_id" => 1861060,
     "iso_code" => "JP",
     "names" => %{
       "de" => "Japan",
       "en" => "Japan",
       "es" => "Japón",
       "fr" => "Japon",
       "ja" => "日本",
       "pt-BR" => "Japão",
       "ru" => "Япония",
       "zh-CN" => "日本"
     }
   },
   "location" => %{
     "accuracy_radius" => 5,
     "latitude" => 35.6948,
     "longitude" => 139.6976,
     "time_zone" => "Asia/Tokyo"
   },
   "postal" => %{"code" => "160-0021"},
   "registered_country" => %{
     "geoname_id" => 1861060,
     "iso_code" => "JP",
     "names" => %{
       "de" => "Japan",
       "en" => "Japan",
       "es" => "Japón",
       "fr" => "Japon",
       "ja" => "日本",
       "pt-BR" => "Japão",
       "ru" => "Япония",
       "zh-CN" => "日本"
     }
   },
   "subdivisions" => [
     %{
       "geoname_id" => 1850144,
       "iso_code" => "13",
       "names" => %{
         "en" => "Tokyo",
         "fr" => "Préfecture de Tokyo",
         "ja" => "東京都"
       }
     }
   ]
 }}

@ruslandoga
Copy link
Contributor Author

ruslandoga commented Oct 29, 2022

So what we have is basically wrong data, in my opinion. DB-IP thinks Shinjuku is a city, there is no way around it just by using the data we have from this response. But since the paid version includes geoname id, it can be used to lookup the geoname info by the id and check what kind of place it is: https://www.geonames.org/1852083/shinjuku-ku.html

Shinjuku is listed as a second-order administrative division, under Tokyo (JP (independent political entity) -> Tokyo (first-order administrative division) -> Shinjuku (second-order administrative division)), maybe that information can be used to automatically rename the "city_geoname_id" key into something like "district_geoname_id" before inserting into the DB. Something similar is already being done in https://github.com/plausible/location, but on the "presenting" side, when turning geoname ids into city names before rendering the dashboard.

@ruslandoga
Copy link
Contributor Author

ruslandoga commented Oct 29, 2022

It might have been too soon for me to claim that the data is wrong 🙈 (The suggested workaround might still work, though, I've checked it on Bangkok / Watthana responses as well, I'll need to check more cities)

So what we have is basically wrong data, in my opinion. DB-IP thinks Shinjuku is a city [...]

Shinjuku is listed as a second-order administrative division, under Tokyo [...]

I tried to research a bit what a second-order administrative division is, and it usually means something like a city in most countries. But from https://en.wikipedia.org/wiki/Municipalities_of_Japan

Each of the 23 special wards of Tokyo is legally equivalent to a city, though sometimes the 23 special wards as a whole are regarded as one city.

So maybe in this particular case, Shinjuku is indeed a city-like administrative division (a special ward) even though it sounds a bit strange.

@metmarkosaric
Copy link
Contributor

would be great if we can improve this somehow. we get three types of feedback for cities (everyone seems to be happy with countries and regions). here they are ranked by how often we hear about them:

  1. you show my city as my ISP location or some big city a few hundreds kilometers away from me (similar to this IP Geo Location showing ISP HQ location #2166). not sure if there's something to do here other than update the database

  2. this is a district and not a city (we manually override these now if/when people report them)

  3. you don't show any city data for a big part of my visitors (some people expect to see cities for all their visitors). in our testing before we released this, db-ip actually showed city data for more visitors than other databases did

@vinibrsl
Copy link
Contributor

Hey 👋! I pushed some commits to this branch and merged with master. We're also ready to give MaxMind in Plausible Cloud a try after some benchmarking.

lib/plausible/geo.ex Outdated Show resolved Hide resolved
Copy link
Contributor

@ukutaht ukutaht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just one comment about adding geolocation readiness status to our healthcheck endpoint.

Once this is tested and we've settled on using Locus+Maxmind, we also need to document this for self-hosted and include it in the release notes for the next release (cc @ruslandoga)


defp setup_geolocation do
opts = Application.fetch_env!(:plausible, Plausible.Geo)
:ok = Plausible.Geo.load_db(opts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this seems to be async by default, I think it would also be a good idea to add this loading step to the /api/health endpoint like we do with the sites cache. This way the load balancer will not route traffic to a newly restarted node until the geo database is loaded.

Side question: how long does it take to download and be ready to run geolocation lookups with the Maxmind license key?

Side note: not a concern for this PR but I think we're getting to a point where we should really separate liveness vs readiness probes as suggested by @cnkk. Loading the geolocation DB is required for the app to be 'ready' but it's independent from it being 'live'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's sync by default in this implementation.

Copy link
Contributor Author

@ruslandoga ruslandoga Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side question: how long does it take to download and be ready to run geolocation lookups with the Maxmind license key?

On my laptop it takes six seconds. I don't know how much bigger the paid version is though. I think most of the time is spent on IO, decoding (or rather, verification) is fast.

Mix.install [:locus]

:timer.tc fn ->
  Application.put_env(:locus, :license_key, "XXdDMc5OhchOTazu") # I'll delete it in a few days
  :ok = :locus.start_loader(:city, {:maxmind, "GeoLite2-City"}, [:no_cache])
  {:ok, _} = :locus.await_loader(:city)
end

#=> {6204296, {:ok, {{2023, 1, 10}, {15, 33, 48}}}}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Seeing as this is sync and called before Location.load, I assume this will slow down the whole startup process. If this loading was async, it could run in parallel and probably finish before Location.load is done if I understand right. That seems desirable so the whole startup process is only limited by the slowest piece which would be Location.load. What do you think @ruslandoga @vinibrsl ?

Copy link
Contributor Author

@ruslandoga ruslandoga Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, maybe we can fire it off before Location.load, and wait for it with something like Plausible.Geo.await_load() after Location.load (just in case)? await_load would call :locus.await_loader(:city)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a good idea

Copy link
Contributor

@vinibrsl vinibrsl Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using :locus.await_load but I couldn't know when the task was completed. The alternative I considered is to use Task.await_many/2 to start Location and Geolix in parallel. This ensures the app will never be in an invalid state where Geolix or Location haven't yet finished booting up. Changed in 35393b2. Let me know what you think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using :locus.await_load but I couldn't know when the task was completed.

What was the problem?

Copy link
Contributor

@vinibrsl vinibrsl Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem was that I was using it wrong 😅. Fixed in c775070. Let me know what you think :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants