
Question: Relational DB VS. Index (like Elasticsearch or Solr) #317

Open
kyleobrien91 opened this issue Oct 26, 2016 · 6 comments

@kyleobrien91

I was just wondering about this.

Thinking

In its current form, we're storing very non-fluid data. It doesn't seem to me like we're really storing any new data between quarters. At most, I imagine we might have to update a couple of numbers here and there (though I could be grossly misunderstanding this :)).

Potential Suggestion

So, with that in mind, I wonder whether an index like Solr or Elasticsearch isn't better suited to our needs. They're both persistent, extremely fast, and scalable in that we can run multiple nodes, and both have a long history in search. If we're honest, this is essentially fancy big(ish) data search.

Options:

This also falls nicely into the so-called 'big data' pattern. We could introduce a DAG (like Airbnb's Airflow) to run the steps in the Standard Operating Procedures in the README of this repo and then have it automatically index the data into one of the tools above.
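As a very rough sketch of what that could look like (the DAG name, task names and callables below are all hypothetical, assuming Airflow's Python API; the real steps would come from the SOP in the README):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract():
    # Placeholder for the SOP step that pulls the raw quarterly data.
    pass


def transform():
    # Placeholder for the SOP steps that clean and reshape the data.
    pass


def index_documents():
    # Placeholder for pushing the transformed records into Elasticsearch/Solr.
    pass


dag = DAG(
    'quarterly_ingest',
    start_date=datetime(2016, 10, 1),
    schedule_interval=None,  # triggered manually when a new quarter lands
)

extract_task = PythonOperator(task_id='extract', python_callable=extract, dag=dag)
transform_task = PythonOperator(task_id='transform', python_callable=transform, dag=dag)
index_task = PythonOperator(task_id='index', python_callable=index_documents, dag=dag)

extract_task.set_downstream(transform_task)
transform_task.set_downstream(index_task)
```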

The frontend is developed agnostic of any of that, and we can even do cool stuff like indexing multiple quarters and running comparisons, etc.

Thoughts?

@kyleobrien91 kyleobrien91 changed the title Question: MySQL versus an index like Elasticsearch or Solr Question: Relational DB VS. Index (like Elasticsearch or Solr) Oct 26, 2016
@longhotsummer
Contributor

Hi @kyleobrien91 and thanks for the thoughts. I'd like to understand this more. What benefits would using a non-relational datastore like ElasticSearch or Solr provide for the end user, or for us as admins? Do you see this being just a change in backend, or would that interface be exposed publicly?

@kyleobrien91
Author

@longhotsummer - so, at their core, things like ElasticSearch and Solr aren't really datastores in the traditional NoSQL sense.

One very typical use case for indexes (although there are many more) is search. A common real-world pattern for an index is:

  • Process big data (from something like Google BigQuery)
  • Run it through some transformations (using Airflow or Pinball)
  • Index the transformed data into an index so that it can be searched quickly

This is sort of the typical implementation I've come across a hundred times.

However, I've also seen many statistics interfaces built on the back of an index. A graph-heavy data presentation is usually far more comfortable pulling a large dataset from an index than it is pulling the same data from a relational data store (even a heavily cached one).

Primarily, this works well because indexes store large amounts of data as 'documents' - self-contained groupings of related information. Indexes are extremely fast, and they work well for data that doesn't change often (tracking banking transactions, for example, is typically not a good fit for indexes).
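To make the 'document' idea concrete, here's a minimal sketch using the elasticsearch-py client (the index name and field names are invented for illustration; the doc_type argument reflects the Elasticsearch 2.x-era API):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local node on the default port

# One 'document' groups the related pieces of information for a single fact.
doc = {
    "municipality": "Cape Town",
    "financial_period": "2016Q1",
    "amount_type": "Audited Actual",
    "line_item": "Staff expenditure",
    "government_function": "Water",
    "amount": 1234567.0,
}

es.index(index="municipal-finance", doc_type="fact", body=doc)
```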

This provides, primarily, a good user experience.

Some considerations

Operations

Indexing means running a separate service on a server to hold the indexed data. This is not really different from running MySQL as a service, but it may be new territory.

User Experience

This kind of implementation primarily benefits the end user of both the frontend and the API.

Future Implementation of a 'Site Search'

Getting an index up and running that serves the data to the frontend in its current form makes the implementation of a site search down the line a trivial task.
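For a sense of how little that would take, a basic site search over documents indexed as above would be a single query - a sketch, again with invented index and field names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Full-text search across the indexed documents.
results = es.search(index="municipal-finance", body={
    "query": {"match": {"line_item": "staff"}}
})
```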

Resources

@jbothma
Contributor

jbothma commented Nov 1, 2016

It's an interesting idea to use index systems like you mentioned for this. Have you seen them used in OLAP frameworks? I had a quick look the other day and it didn't seem like they can be used in something like Python Cubes yet.

I guess the 'facts' view in our API (and I guess typical OLAP systems) would be the documents.

I'm curious what the performance would be for the 'aggregate' view implemented on top of an index - how good are these index/document-stores at aggregating numbers on the fly? We only use the sum and count aggregates so far.

A use case we rely on a lot is showing the values for a given line item, financial period, amount type and municipality. Most of the facts are at this level, so it's just a matter of joining in the labels we want in addition to the base financial data. Two very commonly-used tables are disaggregated on government function, so to show total expenditure on staff, for example, one has to sum the staff expenditure across government functions. Relational databases have been used for this for years, and it happens behind the scenes as the API library (babbage) generates the appropriate query. Is this something these indexes can do given an appropriate query? Or would you have to store a set of documents pre-aggregated this way?

So far we've gotten performance to an acceptable level with appropriate indexes, but our indexes do end up being around 80% of our disk space, last time I checked, which means they're so far from fitting in RAM it's not funny.

I doubt we'd like to lose flexibility with the data - we're using aggregation on line item and on government function at the moment, and we're yet to see how others really use it once they get going. At the same time, this data will only grow and we'd like to keep decent performance.

@kyleobrien91
Author

@jbothma - to the overarching point about aggregation:

Both Solr and ElasticSearch provide query-time aggregation. So I imagine this would work pretty well in a scenario in which pre-aggregating the data isn't desirable - indeed, it also adds a lot of flexibility in the kind of interface one can provide via the API. I don't think pre-aggregated data works well there.
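As a rough illustration of what that looks like in ElasticSearch (index and field names invented, and assuming those fields are mapped as not_analyzed so the term filters match exactly) - summing staff expenditure across government functions on the fly for one municipality, period and amount type, roughly the use case described above:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

response = es.search(index="municipal-finance", body={
    "size": 0,  # we only want the aggregation, not the matching documents
    "query": {
        "bool": {
            "filter": [
                {"term": {"municipality": "Cape Town"}},
                {"term": {"financial_period": "2016Q1"}},
                {"term": {"amount_type": "Audited Actual"}},
                {"term": {"line_item": "Staff expenditure"}},
            ]
        }
    },
    "aggs": {
        # Sum 'amount' across all government functions - no pre-aggregation needed.
        "total_staff_expenditure": {"sum": {"field": "amount"}},
    },
})

total = response["aggregations"]["total_staff_expenditure"]["value"]
```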

So, as resources to that point:

@kyleobrien91
Author

@jbothma - then, on cubes (I've never worked with the concept, though perhaps I have under a different name): based on very limited exposure to OLAP, I think ES and Solr replace cubes more than they seek to interface with them.

There are a couple of resources on replacing typical OLAP cubes, and on their supposed lack of a future - although I have no strong feelings on the latter.

That said, it seems that ES interfaces better with OLAP frameworks, if that's truly required to make development easier.

@jbothma
Contributor

jbothma commented Nov 1, 2016

I think we went with the OLAP model mainly because that's how people tend to look at data like this, and we want the interface to be familiar - and also to avoid reimplementing the API concepts. So I don't think we'd want to stray too far from these ideas. Another consideration was that Postgres has proven to be a low-buzz, high-performance database over the last few years, so we went with what we know and had in place.

That said, I really like considering not-so-obvious implementation options like these that show promise.

Someone else asked why we didn't use any of the existing open data platforms - freedom to customise the platform as needed was one of the considerations, but that's also something we should reconsider if we need to change.

I don't think we'll move forward on alternative backends like Solr or ES just yet, but thanks so much for raising this and running through some of the plusses and minuses. If we re-evaluate the backend we're on for performance or maintenance we'll definitely keep this in mind.

If you're keen on benchmarking some of these options we can talk through more of the details of the queries that are taxing the database heavily and the sort of characteristics and use cases we prioritise. Since this isn't a pressing issue for us right now we don't have time to really get our hands dirty with alternatives just yet.
