Description
At Intercom we test our app thousands of times per month, utilising 500 parallel jobs, each of which runs a subset of our tests and all of which run the same container. The container is a plain Docker one that we build and push to a private registry. For historical/performance reasons this image is fat, i.e. while its base image is Ruby, we install MySQL, Elasticsearch, Redis etc. on it too. The Elasticsearch we install is the official tar.gz release.
In Elasticsearch 7.14 a feature was added which would automatically download the MaxMind GeoIP database. Specifically, it downloads the database from Google Cloud after fetching this JSON object. This is enabled by default.
We discovered that this feature has been responsible for around $20,000 on our AWS bill since we upgraded last March. Based on a rough approximation of the traffic we were charged for, it probably cost Elastic about $5,500 💸 Investigation was slow and required enabling AWS VPC logs to assess the NAT gateway traffic source. Because the source was so widespread (i.e. all our CI machines) we had to use tcpdump, which was of limited value due to HTTPS and only provided us with a generic hostname. Eventually some close observation led us to the cause.
This feature in general seems a bit weird and very surprising. Here are a couple of observations:
- Anyone starting a basic Elasticsearch process is now downloading 40MB of a GeoIP database entirely in the background, potentially for a feature (GeoIP processing) they do not use.
- The persistence of the database to `/tmp` is fundamentally incompatible with Docker and drives up traffic:
  - If I start/stop my container several times I'm downloading that database several times.
  - If I decide to start Elasticsearch during my image build (to download the file once, and put it into my image) I need to be cognisant that I've to change the path too, because `/tmp` is wiped.
- Not fronting the URLs the database is fetched from and using a generic Google URL made this tricky to find.
- Support for other cloud providers (e.g. an AWS S3 mirror) and some intelligence around picking which URLs to give to an ES server requesting the database could mitigate a lot of customer cost. This is a bit of a can of worms: who do you support, how do you detect them, etc. However, it would have mitigated the cost impact for both Intercom and Elastic, and potentially others (e.g. folks using CircleCI + ES).
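To give a feel for how a 40MB download per process start adds up across ephemeral containers, here is a rough back-of-the-envelope sketch. The function name, the number of container starts, and the per-GB price are all illustrative assumptions, not figures from our bill; plug in your own values.

```python
def estimate_geoip_traffic(container_starts: int,
                           db_size_mb: float = 40.0,
                           price_per_gb: float = 0.045) -> tuple[float, float]:
    """Return (total_gb, estimated_cost) for repeated GeoIP database downloads.

    container_starts: how many times Elasticsearch starts with an empty /tmp
                      (hypothetical input, depends on your CI setup).
    db_size_mb:       approximate database size; 40MB is the figure quoted above.
    price_per_gb:     ASSUMED data-processing price per GB; check your own
                      provider's pricing before relying on it.
    """
    total_gb = container_starts * db_size_mb / 1024  # MB -> GB
    return total_gb, total_gb * price_per_gb


if __name__ == "__main__":
    # Hypothetical example: 500 parallel jobs, each started 1,000 times a month.
    gb, cost = estimate_geoip_traffic(500 * 1000)
    print(f"{gb:,.0f} GB downloaded, roughly ${cost:,.2f} at the assumed price")
```

The point is only that the cost scales linearly with the number of cold starts, which is exactly what a large CI fleet produces.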
Above all, in order to do something like disable the updater, change the storage path for the database, or change the endpoint used to fetch it, I need to be aware that this is happening, and I'm not sure that people who do not use the GeoIP features are aware.
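For anyone who lands here and wants to turn the downloader off in an image build, a minimal sketch follows. It assumes the `ingest.geoip.downloader.enabled` node setting and a flat-key `elasticsearch.yml`; the helper name and the config path in the usage line are hypothetical, and the script does not handle the nested YAML form of the setting.

```python
from pathlib import Path

SETTING = "ingest.geoip.downloader.enabled"


def disable_geoip_downloader(config_path: str) -> bool:
    """Ensure the config file disables the GeoIP downloader.

    Returns True if the file was changed, False if it was already correct.
    Only flat "key: value" lines are recognised (an assumption of this sketch).
    """
    path = Path(config_path)
    lines = path.read_text().splitlines() if path.exists() else []
    desired = f"{SETTING}: false"

    # Lines that already set this key (commented-out lines are ignored).
    existing = [ln for ln in lines if ln.strip().startswith(SETTING + ":")]
    if existing == [desired]:
        return False

    # Drop any conflicting entries and append the desired one.
    others = [ln for ln in lines if not ln.strip().startswith(SETTING + ":")]
    others.append(desired)
    path.write_text("\n".join(others) + "\n")
    return True


if __name__ == "__main__":
    # Hypothetical path; adjust to wherever the tar.gz release was unpacked.
    changed = disable_geoip_downloader("elasticsearch/config/elasticsearch.yml")
    print("updated config" if changed else "already disabled")
```

Running something like this once during the image build means every container started from that image skips the background download.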