QPL Query Parser
The QPL Query Parser takes in a string, as entered by the user, and returns a fully parsed hierarchical query expression.
For example, the user enters:
QUERY: (george washington) or "Thomas Jefferson"
And the query parser will return:
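Using the operator notation that appears later in this document (term(), phrase(), and(), or()), the returned expression would look something like the following sketch (illustrative only, not exact parser output):

```
or(
  and(term("george"), term("washington")),
  phrase(term("Thomas", "Jefferson"))
)
```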
Once the query parser is built, you can further manipulate the query in QPL. For example:
- Add a security filter to the user's query expression. For example:
  - and(securityFilterQuery, parse('<user query string here>'))
- Transform the query using the transformer
There are several ways to call the query parser.
def myQuery = parseQuery("<query string>");
It returns an Operator object.
This parser uses all defaults for parsing:
- Simple (not extended) queries
- No custom operators
- With wildcards
- Simple white-space tokenizer
- No virtual fields
The parseQuery method is part of QPLUtilities and is therefore available in any QPL script without further imports.
You can create a re-usable parser as follows:
def myParser = makeParser(extended:true, tokenizer:solr.makeTokenizer("content"), customOps:true);
def myQuery = myParser.parse("<query string>");
Parser options include:
- extended - 'true' to use standard boolean expressions (and, or, not, before, etc.). 'false' for simple query syntax (field:, "-" for not, "+" for boost, and phrases)
- See below for more details
- customOps - Allow custom operators.
- These are specified in the query like function calls, as in: |george or exact(washington)|
- tokenizer - The tokenizer to use to tokenize content.
- Note that this is not the query tokenizer (see below) but a search-engine-compatible content tokenizer used to parse the lists of terms which are sent to the search engine.
- In general, this is probably only needed for Solr/Lucene. Other search engines will provide a second layer of parsing of the query expression.
- If not specified, uses a simple white-space tokenizer.
- tokenizerFlags - You can specify the flags for a new tokenizer (See Creating Tokenizers section). NOTE: If the tokenizer is provided, this parameter is ignored.
- wildcard - Look for wildcard patterns ('*' and '?' only) and turn them into wildcard() operators
- virtualFields - Map of virtual fields which are automatically turned into custom operators.
- For example, the query |exact:"Washington DC"| would be turned into |exact(phrase(term("Washington","DC")))|
Parsing Token Lists
The query parser can parse a list of tokens. For now, the list of tokens needs to be a QueryTokenList object. For example:
def solrTokenizer = solr.tokenizer("body")
def parser = makeParser(tokenizer:solrTokenizer, wildcard:true)
def tokens = parser.tokenize(queryString);
def booleanQ = parser.parse(tokens);
In the example above we are using the Solr tokenizer and sending it to the makeParser method to instantiate a new parser.
Note that if there is a single String argument, then the parser will automatically tokenize it. If there are multiple arguments (or the argument is a list), the parser will assume that all tokens are fully pre-processed. For instance, in the following example a non-tokenized query string is passed to the parser. It will automatically analyze the string and create the proper tokens.
def booleanQ = parser.parse("string query example");
You can also use the query parser to tokenize text:
def queryTokens = parserTokenizer("(My Query String)");
queryTokens ==> ["(", "My", "Query", "String", ")"];
def myParser = makeParser(extended:true, tokenizer:solr.makeTokenizer("content"), customOps:true);
def queryTokens = myParser.tokenize("(My Query String)");
The query tokenizer will return a list of tokens that can be pre-processed before being sent on to the parser.
The returned object is an instance of QueryTokenList.
Thesaurus Expansion + Query Parsing
One use for pre-tokenized queries is to perform thesaurus expansion:
def thesaurus = Thesaurus.load("dict/synonyms_v3.xml");
def userTokens = parserTokenizer("(My Query String for San Francisco)");
def finalQuery = parse(thesaurus.expand(1.0f, solr.makeTokenizer("content"), userTokens));
Query Language Definition
This section defines the query language supported by the query parser.
The following types of operands are supported by the query parser.
Simple tokens are specified as simple words within the query expression.
field:token field:(sub expression)
Tags the token or sub-expression with the specified field. Queries must match within the specified field.
Note that fields can be "virtual fields", which get converted to custom operators instead of QPL field("") expressions.
Sub expressions are enclosed with parenthesis.
"this is a phrase"
Phrases are surrounded with double-quotes.
-token -(sub expression) not token not (sub expression)
Tokens can be prefixed with "-" or "not" to indicate not(token).
+token +(sub expression)
Tokens prefixed with "+" indicate a boost on that token or sub expression. The default boost factor is "1.1", or a 10% boost. Boosts stack, so +++token will boost the token by 33.1% (1.1 * 1.1 * 1.1 = 1.331).
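The stacking arithmetic can be sketched as follows (a Python illustration; the default factor of 1.1 is from the text above, and configurable factors are not covered here):

```python
# Each "+" multiplies the boost by the default factor of 1.1.
DEFAULT_BOOST = 1.1

def boost_for(prefix: str) -> float:
    """Return the cumulative boost for a run of '+' characters."""
    return DEFAULT_BOOST ** prefix.count("+")

print(round(boost_for("+++"), 3))  # 1.331, i.e. a 33.1% boost
```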
example(sub expression1, sub expression2)
Custom operators are expressed like function calls within the query language. They are called "custom" because they have no pre-defined purpose, and the name of the function does not need to be pre-defined.
Typically, custom operators are modified / interpreted by search engine builder extensions and/or additional QPL processing.
Note that operators are case insensitive.
Two or more tokens in sequence are automatically joined into an implied AND operator. Requires that both tokens be in matching documents.
token and token
AND operator requires that both tokens be in matching documents.
washington or jefferson
OR operator finds documents which contain either token.
george adj washington
The adjacency operator (ADJ) finds documents where the first token comes directly before the second token. It is often used with wildcard or other expressions, such as:
(george or thomas) adj (washington or jefferson)
Note that this operator is the same as "before/0".
declaration near/10 washington
Finds documents where the first token occurs within 10 words of the second token, in either order. Note that the number "10" is typically taken to mean that ten non-matching words can occur between the two operands.
george near/0 washington
Represents two words with no non-matching words in-between, i.e. the two words are right next to each other. This will match either |george washington| or |washington george|.
Note: If the "/" is missing, the parser will assume the user is looking for the word "near".
president before/5 washington
Finds documents where the first token occurs before the second word and within 5 tokens. Compared to "near", the before operator requires that the words appear in the same order as specified in the query. Note that the number "5" is typically taken to mean that five non-matching words can occur between the two operands.
president before/0 washington
Requires that the words be right next to each other, in the specified order. This is the same as the "adj" operator.
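The positional semantics described above can be sketched in Python (an illustration of the word-counting rule only; the actual matching is done by the search engine):

```python
# near/N: at most N non-matching words between the operands, either order.
# before/N: at most N non-matching words between, operands in query order.
# pos_a / pos_b are the word positions of each operand within a document.

def near_match(pos_a, pos_b, n):
    return any(abs(i - j) - 1 <= n for i in pos_a for j in pos_b if i != j)

def before_match(pos_a, pos_b, n):
    return any(j - i - 1 <= n for i in pos_a for j in pos_b if i < j)

# "george near/0 washington" matches both orders:
print(near_match([5], [6], 0), near_match([6], [5], 0))      # True True
# "president before/0 washington" requires the stated order:
print(before_match([5], [6], 0), before_match([6], [5], 0))  # True False
```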
You can perform range searches over numeric, string, or date fields.
A range expression without a field creates a range query on the default field; you can also specify a field for your query.
Note that the field type must match the data type you are attempting to search. Otherwise you will get an error.
For date ranges, the expected format is YYYY-MM-DD. However, you can omit the day or month and the parser will autocomplete:
- YYYY-MM: As the lower bound of a range, assumes the first day of the specified month; as the upper bound, the last day of that month.
- YYYY: As the lower bound, assumes January 1st of the specified year; as the upper bound, December 31st.
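The autocompletion rules above can be sketched as follows (a Python illustration of the rule, not the parser's actual code):

```python
# A partial date is completed to the earliest matching day when used as a
# lower bound, and to the latest matching day when used as an upper bound.
import calendar

def complete_date(s, upper=False):
    parts = [int(p) for p in s.split("-")]
    if len(parts) == 3:
        return s  # already a full YYYY-MM-DD date
    year = parts[0]
    month = parts[1] if len(parts) == 2 else (12 if upper else 1)
    day = calendar.monthrange(year, month)[1] if upper else 1
    return f"{year:04d}-{month:02d}-{day:02d}"

print(complete_date("2024-02"))              # 2024-02-01
print(complete_date("2024-02", upper=True))  # 2024-02-29 (leap year)
print(complete_date("2024", upper=True))     # 2024-12-31
```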
Wildcard searches are allowed.
Operator precedence is as follows. Operators higher in the list are coalesced into expressions before operators lower in the list.
- "-" (unary not)
- "+" (unary boost)
- field: (field tagging)
- token token (implied and)
- not
- adj
- before
- near
- and
- or
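How precedence determines grouping can be sketched with a minimal precedence-climbing parser (Python, not QPL; unary "+", field:, and the implied-and rule are omitted for brevity, and the string output format is only illustrative):

```python
# Higher numbers bind tighter, following the precedence order above.
PREC = {"adj": 5, "before": 4, "near": 3, "and": 2, "or": 1}

def parse(tokens):
    pos = 0

    def parse_expr(min_prec):
        nonlocal pos
        left = parse_unary()
        while pos < len(tokens) and PREC.get(tokens[pos], 0) >= min_prec:
            op = tokens[pos]
            pos += 1
            right = parse_expr(PREC[op] + 1)  # tighter ops grab operands first
            left = f"{op}({left},{right})"
        return left

    def parse_unary():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok in ("not", "-"):
            return f"not({parse_unary()})"
        if tok == "(":
            node = parse_expr(1)
            pos += 1  # consume ")"
            return node
        return f"term({tok})"

    return parse_expr(1)

# "and" binds tighter than "or":
print(parse("george or thomas and jefferson".split()))
# -> or(term(george),and(term(thomas),term(jefferson)))
```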