This project is based on the Apache 2.0-licensed elasticsearch-sql project. Thank you eliranmoyal, shi-yuan, ansjsun and everyone else who contributed great code to that project.
The Open Distro for Elaticsearch SQL plugin launched early this year which lets you write queries in SQL rather than the Elasticsearch query domain-specific language (DSL). While the majority of our codebase is on top of ES-SQL initially, there are a lot of new features and bug fixes introduced in our implementation. And in the following releases this year, we keep improving and refactoring our code as well as maintaining version currency of Elasticsearch. Basically OpenDistro SQL is superset of ES-SQL and it’s more reliable and up-to-date.
The ES-SQL codebase has clear architecture and abstraction for a basic query engine, such as SQL parser, DSL generator and domain model. However as we dived deep, we identified the following major problems and resolved them before launch:
- The JDBC driver used Elasticsearch proprietary API
PreBuiltXPackTransportClient
. - JOIN capability matters but the Hash JOIN implementation was not scalable for production use because it loads all data into memory.
- For complex queries like JOIN or Multi-query, there was no concept of a query plan for planning and optimizing.
- ES-SQL uses a handwritten SQL parser Druid which is not extensible, so many queries that not supported or has semantical error can still pass the parsing but ended up throwing runtime error.
Apart from the problems we identified earlier, we made significant improvement in terms of functionality and reliability:
- Integration Test: We migrated all integrate tests to standard Elasticsearch IT framework which spins up in-memory cluster for testing. Now all test cases treat plugin code as blackbox and verify functionality from externally.
- New JDBC Driver: We developed our own JDBC driver without any dependency on Elasticsearch proprietary code. sql-jdbc
- Better Hash JOIN: OpenDistro SQL launched with Block Hash Join implementation with circuit break mechanism to protect your Elasticsearch memory. Performance testing showed our implementation is 1.5 ~ 2x better than old hash join in terms of throughput and latency and much lower error rate under heavy pressure.
- Query Planner: Logical and physical planner was added to support JOIN query in efficient and extendible way.
- PartiQL Compatibility: we are partially compatible with PartiQL specification which allows for query involved in nested JSON documents.
- New ANTLR Parser: A new ANTLR4 parser was generated from grammar based on what we support along with a new semantic analyzer to perform scope and type checking.