Originally a side project, I've decided to put this out there in case someone else may find this useful.
The LuceneSqlParser
supports a subset of standard sql. Here are some examples:
Select last_name from '/path/to/index/' where first_name='Foo' and age <=30 and city='Boston' limit 25
Select * from 'path/index/' where age in (31, 30, 50)
Select first_name, last_name from '/path/index/' where city in ('Cincinatti', 'New York', 'Boyds')
Select first_name from '/path/index/' where age between 35 and 50 and first_name like 'Br*'
-- Also takes paths from Windows OS
Select first_name from 'C:/path/index/' where first_name='John' and (age<=45 and city not in ('New York', 'Boston', 'Atlanta'))
The LuceneSqlParser
returns a BooleanQuery. The BooleanQuery
will contain different types of lucene query objects depending on the predicates used. There is a class Searcher
avaiable for use with the LuceneSqlParser
. The Searcher
abstracts away the opening of a lucene IndexSearcher, iterating over the ScoreDoc array and extracting results. Next, we'll take a look at the rules used to parse the sql.
At high level a sql statement is broken down and parsed in the following manner:
- The 'Select' statement contains a comma sparated list of fields stored in a Lucene index. The parser stores fields in a
Set<String>
for use by theSearcher
. To retreive all fields we can specify a '*' operator, or omit the 'select' clause altogether. - The 'From' clause takes a path in single quotes representing the location of a lucene index.
- The 'Where' clause contains the predicates for searching the data.
- The parser analyzes values used for searching in a similar fashion as the StandardAnalyzer (lower cased, whitespace and special characters removed). There are exceptions to this rule. The PrefixQuery, RegexQuery and the WildcardQuery are special cases and only removes characters not defined as special characters used by Lucene.
- Predicates can be nested to an arbitrary depth. For example:
where field='1' and (field2='2' and field3='3' and (field4='4' and (field5='5' and field='6')))
- The 'Select' and 'From' clauses are optional.
The LuceneQueryParser
defines two static methods parseQuery
and parseToBooleanQuery
. The parseQuery
method is intended to be used in conjuction with the Searcher
. The parseQuery
returns a QueryParseResults
object that contains the path for the index, the set of fields to retrieve and the BooleanQuery
to execute. The parseToBooleanQuery
is intended to be used for parsing only and returns a BooleanQuery
.
We now will list the supported Lucene query objects and how they are mapped from the input sql.
<field name> ='Foo'
converts to a TermQuery.<field name> ='Bar Baz'
converts to a PhraseQuery<field name> like 'Fo*'
converts to aPrefixQuery
<field name> like 'B?l?'
converts to aWildcardQuery
<field name> matches('[Bb].*[hH]?')
converts to aRegexpQuery
<field name> in ('foo', 'bar', 'baz')
converts to aBooleanQuery
consisting of 3 BooleanClause objects. Each clause is aTermQuery
with a BooleanClause.Occur ofSHOULD
.PhraseQuery
andTermQuery
objects can be combined in the 'in' operator. For examplecity in ('New York', 'Boston', 'Los Angeles')
between 'Foo' AND 'Bar'
converts to a TermRangeQuery with both items being inclusive.between 25 and 40
converts to a NumericRangeQuery again inclusive.- The >, <, >= <= operators are converted to either
TermRangeQuery
orNumericRangeQuery
objects with one side of the range being unbounded. The < and > operators are exclusive. The <= and >= operators are inclusive.
The AND,OR & NOT operators are mappped in the following manner.
- AND converts to
BooleanClause.Occur.MUST
- OR converts to
BooleanClause.Occur.SHOULD
- NOT converts to
BooleanClause.Occur.MUST_NOT
In two cases the query is converted to something different from the mappings shown previously. The first case is a query that contains a single predicate that must not match. For example:
Select * from /index/path/ where last_name != 'Smith'
Select first_name, last_name where city not in ('Foo', 'Bar')
A query submitted in this format will not work in lucene. This fix for this query is simple. The parser takes the original BooleanQuery
and adds an addtional clause. The underlying query object in the new clause is a MatchAllDocsQuery. The MatchAllDocsQuery
returns all documents in the index and the orginal predicate will fiter out the unwanted results. The second case is when searching for a numeric field in the TermQuery
format. For example:
Select * from '/index/path' where age = 49
Normally a 'field=value' or 'field != value' predicate is converted to a TermQuery
. But the way lucene searches for values it will not find a field if it is searching for a number versus a string. In this case the parser constructs a NumericRangeQuery
where the low and high value are equal and inclusive.
There are a several limitations at this point.
- Converting to a
PhraseQuery
does not allow for specifying any slop. For a match with aPhraseQuery
all terms must be located adjacent to each other. - Numeric type queries only support the Int type for now.
- Range querries are inclusive when both a high and low value are specified.
- If no limit clause is specified in the query, a default limit of 10,000 records is used.
The second component of sql for lucene is the Searcher
class. The Searcher
could be thought of as a convenience method for performing a lucene search and extracting the results. The Searcher
has one method search
that takes a sql query and returns a List<Map<String,Object>>
containing the search results. Each map in the list represents a document with the keys being the field names and the values are the values stored in the retrieved fields.
String query = "Select age,city from '/path/to/index/' where first_name='john' and age='50'";
List<Map<String,Object>> results = searcher.search(query);
It's worth noting the list and map returned from the searcher are of type ImmutableList and ImmutableMap respectively. The Searcher
has 4 constructors:
public Searcher() {}
public Searcher(String indexPath){
Preconditions.checkArgument(indexPath != null && !indexPath.trim().isEmpty(), "Index Path is can't be null or empty");
this.indexSearcher = createIndexSearcherFromStringPath.apply(indexPath);
}
public Searcher(Path indexPath){
Preconditions.checkNotNull(indexPath,"Index Path can't be null");
this.indexSearcher = createIndexSearcherFromPath.apply(indexPath);
}
public Searcher(Directory directory) {
Preconditions.checkNotNull(directory,"Directory can't be null");
this.indexSearcher = fromDirectoryToIndexSearcher.apply(directory);
}
If the searcher is instantiated with the no-arg constructor, then the path for the index will be extracted from the query and used to open the IndexSearcher
. All subsequent queries can safely omit the from clause. If the searcher is instantiated with any of the other 3 constructors, the 'from' clause will be ignored and can be omitted from the query.
- A JDBC Driver.
- Insert, Update and Delete support.
- Support all the numeric types supported by Lucene.
- Syntax to support
FuzzyQuery
searches. - Support for filters.
- Ability to specify slop for
PhraseQuery
searches.