QPL Query Parser

From wiki.searchtechnologies.com
Revision as of 17:09, 3 March 2017 by Sdenny (talk | contribs)


This Page is Locked - Content no longer maintained - See QPL for the latest information.
Enterprise Add-On Feature

Introduction

The QPL Query Parser takes in a string, as entered by the user, and returns a fully parsed hierarchical query expression.

For example, the user enters:

QUERY:  (george washington) or "Thomas Jefferson"

And the query parser will return:

or(and(term("george"),term("washington")), phrase("Thomas","Jefferson"))

Once the query parser is built, you can further manipulate the query in QPL. For example:

  • Add a security filter to the user's query expression
    For example:
    and(securityFilterQuery, parse('<user query string here>'))
  • Transform the query using the transformer

Usage

Simple Usage

There are several ways to call the query parser.

 def myQuery = parseQuery("<query string>");

It returns an Operator object.

This parser uses all defaults for parsing:

  • Simple (not extended) queries
  • No custom operators
  • With wildcards
  • Simple white-space tokenizer
  • No virtual fields

The parseQuery method is part of QPLUtilities, so it is available in any QPL script without any further imports.
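For example, parsing the query from the introduction with all defaults returns the same operator tree shown above:

 def myQuery = parseQuery('(george washington) or "Thomas Jefferson"');
 
 myQuery ==> or(and(term("george"),term("washington")), phrase("Thomas","Jefferson"))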

Reusable Parser

You can create a reusable parser as follows:

 def myParser = makeParser(extended:true, tokenizer:solr.makeTokenizer("content"), 
                           customOps:true);
 
 def myQuery = myParser.parse("<query string>");

Parser options include:

  • extended - 'true' to use standard boolean expressions (and, or, not, before, etc.). 'false' for simple query syntax (field:, "-" for not, "+" for enhance, and phrases)
    See below for more details
  • customOps - Allow custom operators.
    These are specified in the query like function calls, as in: |george or exact(washington)|
  • tokenizer - The tokenizer to use to tokenize content.
    Note that this is not the query tokenizer (see below), but a search-engine-compatible content tokenizer used to parse the lists of terms which are sent to the search engine.
    In general, this is probably only needed for Solr/Lucene; other search engines provide their own second layer of query parsing.
    If not specified, uses a simple white-space tokenizer.
  • tokenizerFlags - You can specify the flags for a new tokenizer (See Creating Tokenizers section). NOTE: If the tokenizer is provided, this parameter is ignored.
  • wildcard - Look for wildcard patterns ('*' and '?' only) and turn them into wildcard() operators
  • virtualFields - Map of virtual fields which are automatically turned into custom operators.
    For example, the query |exact:"Washington DC"| would be turned into |exact(phrase("Washington","DC"))|
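As a sketch of how these options combine, the parser below enables extended syntax, wildcards, and a virtual "exact" field. Note that the map value used for virtualFields here is an assumption; consult the QPL documentation for the exact expected type.

 def myParser = makeParser(extended:true, wildcard:true,
                           virtualFields:["exact":"exact"]);
 
 def myQuery = myParser.parse('exact:"Washington DC"');
 
 myQuery ==> exact(phrase("Washington","DC"))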

Parsing Token Lists

The query parser can parse a list of tokens. For now, the list of tokens needs to be a QueryTokenList object. For example:

 def solrTokenizer = solr.makeTokenizer("body")
 def parser = makeParser(tokenizer:solrTokenizer, wildcard:true)
 def tokens = parser.tokenize(queryString);
 def booleanQ = parser.parse(tokens);

In the example above we create a Solr tokenizer and pass it to the makeParser method to instantiate a new parser.

Note that if there is a single String argument, then the parser will automatically tokenize it. If there are multiple arguments (or the argument is a list), the parser will assume that all tokens are fully pre-processed. For instance, in the following example a non-tokenized query string is passed to the parser. It will automatically analyze the string and create the proper tokens.

 def booleanQ = parser.parse("string query example");

Query Tokenizer

You can also use the query parser to tokenize text:

 def queryTokens = parserTokenizer("(My Query String)");
 
 queryTokens ==>  ["(", "My", "Query", "String", ")"];

Or:

 def myParser = makeParser(extended:true, tokenizer:solr.makeTokenizer("content"), 
                           customOps:true);
 
 def queryTokens = myParser.tokenize("(My Query String)");

The query tokenizer returns a list of tokens that can be pre-processed before being sent to the parser itself.

The returned object is an instance of QueryTokenList.

Thesaurus Expansion + Query Parsing

One use for pre-tokenized queries is to perform thesaurus expansion:

def thesaurus = Thesaurus.load("dict/synonyms_v3.xml");
def userTokens = parserTokenizer("(My Query String for San Francisco)");
def finalQuery = parse(thesaurus.expand(1.0f, solr.makeTokenizer("content"), userTokens));

Query Language Definition

This section defines the query language supported by the query parser.

Operands

The following types of operands are supported by the query parser.

token

Simple tokens are specified as simple words within the query expression.

Unary Operators

Fields

field:token
field:(sub expression)

Tags the token or sub-expression with the specified field. Queries must match within the specified field.

Note that fields can be "virtual fields", which get converted to custom operators instead of QPL field("") expressions.

Sub-Expressions

(sub expression)

Sub-expressions are enclosed in parentheses.

Phrases

 "this is a phrase"

Phrases are surrounded with double-quotes.

Not

-token
-(sub expression)
not token
not (sub expression)

Tokens can be prefixed with "-" or "not" to indicate not(token).

Boost

+token
+(sub expression)

Tokens prefixed with "+" indicate a boost on that token or sub expression. The default boost factor is "1.1" or a 10% boost. Therefore

+++token

will boost the token by 33.1% (1.1 * 1.1 * 1.1).

Custom Operators

example(sub expression1, sub expression2)

Custom operators are expressed like function calls within the query language. They are called "custom" because they have no pre-defined purpose, and the name of the function does not need to be pre-defined.

Typically, custom operators are modified / interpreted by search engine builder extensions and/or additional QPL processing.

Binary Operators

Note that operators are case insensitive.

Implied "and"

token token

Two or more tokens in sequence are automatically joined into an implied AND operator. Requires that both tokens be in matching documents.
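Using the notation from the introduction, the implied AND parses like this:

 parseQuery("george washington");
 
 ==> and(term("george"),term("washington"))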

and

token and token

AND operator requires that both tokens be in matching documents.

or

washington or jefferson

OR operator finds documents which contain either token.

adj

george adj washington

The adjacency operator (ADJ) finds documents where the first token comes directly before the second. It is often used with wildcards or other expressions, such as:

(george or thomas) adj (washington or jefferson)

Note that this operator is the same as "before/0".

near

declaration near/10 washington

Finds documents where the first token occurs within 10 words of the second token, in either order. Note that the number "10" is typically taken to mean that ten non-matching words can occur between the two operands.

Therefore:

george near/0 washington

Represents two words with no non-matching words in-between, i.e. the two words are right next to each other. This will match either |george washington| or |washington george|.

Note: If the "/" is missing, the parser will assume the user is looking for the word "near".

before

president before/5 washington

Finds documents where the first token occurs before the second token and within 5 tokens. Compared to "near", the before operator requires that the words be in the same order as specified in the query. Note that the number "5" is typically taken to mean that five non-matching words can occur between the two operands.

Therefore:

 president before/0 washington

Requires that the words be right next to each other, in the specified order. This is the same as the "adj" operator.

Ranges

You can perform range searches over numeric, string, or date fields.

range("paul", "john")

That would create a range query on the default field. You can also specify a field for your query:

date:range(2013-01-12,2013-01-20)

Note that the field type must match the data type you are attempting to search. Otherwise you will get an error.

For date ranges, the expected format is YYYY-MM-DD. However, you can omit the day, or both the day and month, and the parser will fill in the missing parts.

Lower Limit

  • YYYY-MM: Assumes the first day of the specified month.
  • YYYY: Assumes January 1st of the specified year.

Upper Limit

  • YYYY-MM: Assumes the last day of the specified month.
  • YYYY: Assumes December 31st of the specified year.
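Applying these rules, a year-only range such as:

 date:range(2013,2014)

is interpreted as the range 2013-01-01 through 2014-12-31.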

Wildcards

Wildcard searches are allowed.

corp*
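When the parser is created with |wildcard:true|, a pattern such as the one above is converted into a wildcard() operator. The exact argument form shown is illustrative:

 parseQuery("corp*");
 
 ==> wildcard("corp*")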

Operator Precedence

Operator precedence is as follows. Operators higher in the list are coalesced into expressions before operators lower in the list.

  • "-"          (unary not)
  • "+"          (unary boost)
  • field:       (field tagging)
  • token token  (implied and)
  • not
  • adj
  • before
  • near
  • and
  • or