Implement a custom search query language using a subclass of QueryParser.
At first, our query language will support only simple term queries and phrases delimited by double quotes. For simplicity’s sake, it will not support parenthetical groupings, boolean operators, or prepended plus/minus. The results for all subqueries will be unioned together – i.e. joined using an OR – which is usually the best approach for small-to-medium-sized document collections.
Later, we’ll add support for trailing wildcards.
Our initial parser implementation will generate queries against a single fixed field, “content”, and it will analyze text using a fixed English EasyAnalyzer. We won’t subclass Lucy::Search::QueryParser just yet.
Code example for C is missing
Some private helper subs for creating TermQuery and PhraseQuery objects will help keep the size of our main parse() subroutine down:
Code example for C is missing
Our private _tokenize() method treats double-quote delimited material as a single token and splits on whitespace everywhere else.
Code example for C is missing
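As a library-independent illustration of the tokenizing rule just described, here is a minimal sketch in plain C. The names tokenize(), copy_span() and free_tokens() are hypothetical and are not part of Lucy's C API. The sketch also assumes that a quoted token keeps its surrounding quotes and that an unterminated quote runs to the end of the string.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy `len` bytes starting at `start` into a fresh
 * NUL-terminated string. */
static char *copy_span(const char *start, size_t len) {
    char *out = malloc(len + 1);
    assert(out != NULL);
    memcpy(out, start, len);
    out[len] = '\0';
    return out;
}

/* Split `query` into tokens. A double-quoted run (quotes included) is a
 * single token; everything else is split on whitespace. An unterminated
 * quote runs to the end of the string. Stores at most `max_tokens`
 * heap-allocated strings in `tokens` and returns the number stored. */
size_t tokenize(const char *query, char **tokens, size_t max_tokens) {
    size_t count = 0;
    const char *p = query;
    assert(query != NULL && tokens != NULL);

    while (*p != '\0' && count < max_tokens) {
        if (isspace((unsigned char)*p)) {
            p++;                       /* skip whitespace between tokens */
            continue;
        }
        const char *start = p;
        if (*p == '"') {
            p++;                       /* consume opening quote */
            while (*p != '\0' && *p != '"') p++;
            if (*p == '"') p++;        /* include closing quote if present */
        } else {
            while (*p != '\0' && !isspace((unsigned char)*p)) p++;
        }
        tokens[count++] = copy_span(start, (size_t)(p - start));
    }
    return count;
}

/* Release the strings produced by tokenize(). */
void free_tokens(char **tokens, size_t count) {
    for (size_t i = 0; i < count; i++) free(tokens[i]);
}
```

Given the input `foo "bar baz" qux`, this sketch yields three tokens, with the quoted phrase kept together as the second one. Stripping the quotes is left to the caller.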
The main parsing routine creates an array of tokens by calling _tokenize(), runs the tokens through the EasyAnalyzer, creates TermQuery or PhraseQuery objects according to how many tokens emerge from the EasyAnalyzer’s split() method, and adds each of the sub-queries to the primary ORQuery.
Code example for C is missing
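The term-versus-phrase decision can be sketched without Lucy at all. The plain C below is only an illustration: count_words() is a crude, hypothetical stand-in for the EasyAnalyzer (it counts runs of ASCII alphanumeric characters), and QueryKind and classify_token() are likewise made-up names. Skipping tokens that analyze to nothing is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result type: what kind of sub-query a token should become. */
typedef enum { QUERY_NONE, QUERY_TERM, QUERY_PHRASE } QueryKind;

/* Crude stand-in for the analyzer's split step: count runs of ASCII
 * alphanumeric characters. Quotes and punctuation act as separators. */
size_t count_words(const char *text) {
    size_t words = 0;
    int in_word = 0;
    assert(text != NULL);
    for (; *text != '\0'; text++) {
        if (isalnum((unsigned char)*text)) {
            if (!in_word) {
                words++;               /* first character of a new word */
                in_word = 1;
            }
        } else {
            in_word = 0;
        }
    }
    return words;
}

/* One word becomes a term query, several become a phrase query, and a
 * token with no words at all is skipped. */
QueryKind classify_token(const char *token) {
    size_t words = count_words(token);
    if (words == 0) return QUERY_NONE;
    return words > 1 ? QUERY_PHRASE : QUERY_TERM;
}
```

A caller would loop over the tokens, call classify_token() on each, and add the resulting sub-query to the enclosing OR query whenever the result is not QUERY_NONE.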
Most often, the end user will want their search query to match not only a single ‘content’ field, but also ‘title’ and so on. To make that happen, we have to turn queries such as this…
foo AND NOT bar
… into the logical equivalent of this:
(title:foo OR content:foo) AND NOT (title:bar OR content:bar)
Rather than continue with our own from-scratch parser class and write the routines to accomplish that expansion, we’re now going to subclass Lucy::Search::QueryParser and take advantage of some of its existing methods.
Our first parser implementation had the “content” field name and the choice of English EasyAnalyzer hard-coded for simplicity, but we don’t need to do that once we subclass Lucy::Search::QueryParser. QueryParser’s constructor – which we will inherit, allowing us to eliminate our own constructor – requires a Schema which conveys field and Analyzer information, so we can just defer to that.
Code example for C is missing
We’re also going to jettison our _make_term_query() and _make_phrase_query() helper subs and chop our parse() subroutine way down. Our revised parse() routine will generate Lucy::Search::LeafQuery objects instead of TermQueries and PhraseQueries:
Code example for C is missing
The magic happens in QueryParser’s expand() method, which walks the ORQuery object we supply to it looking for LeafQuery objects, and calls expand_leaf() for each one it finds. expand_leaf() performs field-specific analysis, decides whether each query should be a TermQuery or a PhraseQuery, and if multiple fields are required, creates an ORQuery which multiplies out e.g. foo into (title:foo OR content:foo).
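To make the multiplying-out step concrete, here is a minimal sketch in plain C that builds the expanded form as a string. It only illustrates the shape of the result; expand_term() is a hypothetical name, and Lucy's expand_leaf() builds query objects rather than strings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write the expansion of `term` across `fields` into `out`, for example
 * "(title:foo OR content:foo)". A single field yields "title:foo" with
 * no parentheses. Returns 0 on success, or -1 if no fields were given
 * or `out` is too small. */
int expand_term(const char *term, const char *const *fields,
                size_t num_fields, char *out, size_t out_size) {
    size_t used = 0;
    assert(term != NULL && fields != NULL && out != NULL);
    if (num_fields == 0 || out_size == 0) return -1;
    out[0] = '\0';

    for (size_t i = 0; i < num_fields; i++) {
        const char *lead;
        if (i == 0) {
            lead = num_fields > 1 ? "(" : "";
        } else {
            lead = " OR ";
        }
        int n = snprintf(out + used, out_size - used, "%s%s:%s",
                         lead, fields[i], term);
        /* snprintf reports the length it wanted; treat truncation as failure. */
        if (n < 0 || (size_t)n >= out_size - used) return -1;
        used += (size_t)n;
    }
    if (num_fields > 1) {
        if (used + 2 > out_size) return -1;  /* room for ')' and NUL */
        out[used++] = ')';
        out[used] = '\0';
    }
    return 0;
}
```

Calling expand_term() with the term foo and the fields title and content produces the string `(title:foo OR content:foo)`, matching the expansion described above.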
To add support for trailing wildcards to our query language, we need to override expand_leaf() to accommodate PrefixQuery, while deferring to the parent class implementation on TermQuery and PhraseQuery.
Code example for C is missing
Ordinarily, those asterisks would have been stripped when running tokens through the EasyAnalyzer – query strings containing “foo*” would produce TermQueries for the term “foo”. Our override intercepts tokens with trailing asterisks and processes them as PrefixQueries before SUPER::expand_leaf can discard them, so that a search for “foo*” can match “food”, “foosball”, and so on.
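The interception logic itself is small. The following plain C sketch shows one way to detect a trailing asterisk and test a term against the resulting prefix. wildcard_prefix() and has_prefix() are hypothetical helpers, not Lucy functions, and the sketch assumes the asterisk is the very last character of the token.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* If `token` ends in '*' and has at least one character before it, copy
 * the part before the asterisk into `prefix` and return true. Otherwise
 * return false and leave `prefix` untouched. */
bool wildcard_prefix(const char *token, char *prefix, size_t prefix_size) {
    assert(token != NULL && prefix != NULL);
    size_t len = strlen(token);
    if (len < 2 || token[len - 1] != '*') return false;
    if (len > prefix_size) return false;   /* len - 1 chars plus NUL must fit */
    memcpy(prefix, token, len - 1);
    prefix[len - 1] = '\0';
    return true;
}

/* True when `term` begins with `prefix`. */
bool has_prefix(const char *term, const char *prefix) {
    assert(term != NULL && prefix != NULL);
    return strncmp(term, prefix, strlen(prefix)) == 0;
}
```

With the token `foo*`, wildcard_prefix() extracts `foo`, and has_prefix() then accepts both `food` and `foosball`. Tokens without a trailing asterisk return false and would be handed on to the default expansion.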
Insert our custom parser into the search.cgi sample app to get a feel for how it behaves:
Code example for C is missing
Copyright © 2010-2015 The Apache Software Foundation, Licensed under the
Apache License, Version 2.0.
Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The
Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their
respective owners.