Elasticsearch extensions for QPL - Query Time Analysis

This Page is Locked - Content no longer maintained - See QPL for the latest information.
Enterprise Add-On Feature

This section covers the analysis of query strings to obtain the tokens that are used for search.

Note that this is primarily for use by the QPL query parser, which must analyze query strings to produce correct query expressions.

Additional Methods for Tokenizing Queries

Solution #1: Use match() and matchPhrase()

These special operators are available in elasticsearch to perform server-side analysis of query strings.

match() will analyze the query content and create an AND of all tokens after tokenizing the content:

 match("hello-world") --> converted_to --> and(term("hello"),term("world"))

Similarly, matchPhrase() will analyze the query content and create a phrase of all tokens after tokenizing the content:

 matchPhrase("hello-world") --> converted_to --> phrase(term("hello"),term("world"))

match() and matchPhrase() are the preferred methods where possible.
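
The rewrite described above can be pictured with a small stand-alone Java sketch. It is only an illustration: the class name MatchRewriteSketch and its naive tokenizer (lower-case, split on anything that is not a letter or digit) are assumptions made for this example, whereas the real analysis is performed server-side by whichever analyzer elasticsearch has configured for the field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: mimics the match()/matchPhrase() rewrite with a naive tokenizer.
// The real tokens come from the analyzer configured in elasticsearch.
class MatchRewriteSketch {

    // Hypothetical stand-in for an analyzer: lower-case, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Wraps each token in term("...") and joins the terms under the given operator.
    static String rewrite(String operator, String text) {
        List<String> terms = new ArrayList<>();
        for (String token : tokenize(text)) {
            terms.add("term(\"" + token + "\")");
        }
        return operator + "(" + String.join(",", terms) + ")";
    }

    static String match(String text) {
        return rewrite("and", text);
    }

    static String matchPhrase(String text) {
        return rewrite("phrase", text);
    }

    public static void main(String[] args) {
        System.out.println(match("hello-world"));
        System.out.println(matchPhrase("hello-world"));
    }
}
```

With a different analyzer (for example one that keeps hyphenated words together) the same input would yield different terms, which is why the analysis is delegated to the server.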

Limitations

The limitation of this approach is that match() and matchPhrase() cannot be used inside proximity expressions such as near() or before(). Unfortunately, this also rules out XML searching expressions, which require between().

Solution #2: Specifying Tokenizers for Fields

QPL provides a second solution: specify either well-known elasticsearch analyzers (in place of custom analyzers) for fields, or use a QPL Tokenizer to analyze fields.

The ESQPLUtilities class can also tokenize strings using the default query-time analysis rules provided by elasticsearch itself. An analyzer, whether a default or one set by the user, that is defined at the index, document type, or field level can be reused for tokenization when the es.tokenize method is called. The user therefore does not have to configure or supply a tokenizer unless they choose to do so.

In both of these cases, the settings apply to query parsing only (not indexing).

Note that all of the methods for this solution are applied to the current session.

The following methods are available on the "es." variable. Note that the methods are shown in precedence order. If multiple methods apply to a particular situation, the method which is shown first below will take precedence.

es.setTokenizerForField(String type, String field, Tokenizer tokenizer)
Specify a QPL tokenizer to use for a field within a specified elasticsearch document type. If the user searches over such a field (for example, searching over the "play.text_entry" field), then this tokenizer will be used to parse that query string.


es.clearTokenizerForField(String type, String field)
Clear a tokenizer previously specified for a field within an elasticsearch document type.


es.setTokenizerForField(String field, Tokenizer tokenizer)
Specify the tokenizer for a particular field, independent of document type. Whenever the user searches over the field (for example, text_entry:"hello world") then the specified tokenizer will be used to tokenize the text in the query.
Note that this applies across all types, unless a tokenizer is specified for a particular type.field combination with the method above.


es.clearTokenizerForField(String field)
Clear a tokenizer previously set for the specified field.


es.setAnalyzerForField(String field, String analyzer)
Specify a named analyzer to use to tokenize text in the specified field across all types.
The analyzer name should be either a standard analyzer name (such as "standard", "simple", "english", "french", etc.) or one of the registered analyzers (see below under solution #3).


es.setAnalyzerForField(String type, String field, String analyzer)
Set the analyzer to be used for a field in a specific elasticsearch document type. For example, if the user searches on field "play.text_entry", then a specific analyzer can be specified for this case.
The analyzer name should be either a standard analyzer name (such as "standard", "simple", "english", "french", etc.) or one of the registered analyzers (see below under solution #3).


es.clearFieldAnalyzersToDefault()
Clear all field analyzers set with the above methods.


es.setDefaultTokenizer(String type, Tokenizer tokenizer)
Set the default QPL tokenizer to use for all fields within a specified document type. This will only be used if the field itself does not have an analyzer or tokenizer specified with the above methods.


es.setDefaultTokenizer(Tokenizer tokenizer)
Specify a QPL tokenizer to use to tokenize anything for which an analyzer is not otherwise specified. This overrides use of the "standard" analyzer which elasticsearch uses as its final default.
Use setDefaultTokenizer(null) to clear the default for this index.


NOT_ANALYZED_TOKENIZER
A special QPL tokenizer which does no analysis on its input.
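
The precedence order of the methods above can be read as a lookup chain. The following Java sketch models that chain under stated assumptions: the class TokenizerPrecedenceSketch and its resolve() method are hypothetical and are not part of ESQPLUtilities, tokenizers and analyzers are represented by plain strings, and the order simply follows the listing above literally (including the field-wide analyzer being checked before the type-specific one).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the documented lookup order; not the ESQPLUtilities implementation.
// Tokenizers and analyzers are represented by plain names to keep the sketch self-contained.
class TokenizerPrecedenceSketch {
    private final Map<String, String> typeFieldTokenizers = new HashMap<>();
    private final Map<String, String> fieldTokenizers = new HashMap<>();
    private final Map<String, String> fieldAnalyzers = new HashMap<>();
    private final Map<String, String> typeFieldAnalyzers = new HashMap<>();
    private final Map<String, String> typeDefaults = new HashMap<>();
    private String globalDefault; // null means no default tokenizer is set

    void setTokenizerForField(String type, String field, String tokenizer) {
        typeFieldTokenizers.put(type + "." + field, tokenizer);
    }

    void setTokenizerForField(String field, String tokenizer) {
        fieldTokenizers.put(field, tokenizer);
    }

    void setAnalyzerForField(String field, String analyzer) {
        fieldAnalyzers.put(field, analyzer);
    }

    void setAnalyzerForField(String type, String field, String analyzer) {
        typeFieldAnalyzers.put(type + "." + field, analyzer);
    }

    void setDefaultTokenizer(String type, String tokenizer) {
        typeDefaults.put(type, tokenizer);
    }

    void setDefaultTokenizer(String tokenizer) {
        globalDefault = tokenizer;
    }

    // Walks the settings in the order the methods are listed; the first hit wins.
    // A null type means the query did not name a document type.
    String resolve(String type, String field) {
        String key = type + "." + field;
        if (type != null && typeFieldTokenizers.containsKey(key)) {
            return "tokenizer:" + typeFieldTokenizers.get(key);
        }
        if (fieldTokenizers.containsKey(field)) {
            return "tokenizer:" + fieldTokenizers.get(field);
        }
        if (fieldAnalyzers.containsKey(field)) {
            return "analyzer:" + fieldAnalyzers.get(field);
        }
        if (type != null && typeFieldAnalyzers.containsKey(key)) {
            return "analyzer:" + typeFieldAnalyzers.get(key);
        }
        if (type != null && typeDefaults.containsKey(type)) {
            return "tokenizer:" + typeDefaults.get(type);
        }
        if (globalDefault != null) {
            return "tokenizer:" + globalDefault;
        }
        return "analyzer:standard"; // final default used by elasticsearch
    }

    public static void main(String[] args) {
        TokenizerPrecedenceSketch es = new TokenizerPrecedenceSketch();
        es.setDefaultTokenizer("play", "NOT_ANALYZED_TOKENIZER");
        es.setAnalyzerForField("speaker", "english");
        es.setTokenizerForField("play", "text_entry", "typeFieldTokenizer");
        System.out.println(es.resolve("play", "text_entry"));
        System.out.println(es.resolve("play", "speaker"));
        System.out.println(es.resolve("play", "line_id"));
        System.out.println(es.resolve(null, "line_id"));
    }
}
```

In this model a setting for a type.field combination always shadows a field-wide tokenizer, and the default tokenizers are consulted only when nothing field-specific has been set.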

Examples

Note that all examples only apply when parsing query expressions.

Specify the default tokenizer:

es.setDefaultTokenizer(tokenizer(DQ_PHRASES+TO_LOWER+PUNCT));

Set the default tokenizer (to no analysis) for all fields within the "play" type, which applies when the user explicitly names that type in the query:

es.setDefaultTokenizer("play", NOT_ANALYZED_TOKENIZER);

Set the tokenizer for whenever the "text_entry" field is used:

es.setTokenizerForField("text_entry", tokenizer(DQ_PHRASES+TO_LOWER+PUNCT+CASE_CHANGE+ALNUM_CHANGE));

Set the tokenizer for whenever "text_entry" field is used within the "play" type:

es.setTokenizerForField("play", "text_entry", tokenizer(DQ_PHRASES+TO_LOWER+PUNCT));

Set the analyzer to use for the "speaker" field: (must be one of the well-known or registered analyzers, see below)

es.setAnalyzerForField("speaker", "english");

Set the analyzer to use whenever the user searches over "play_name" within the "play" type:

es.setAnalyzerForField("play", "play_name", "simple");

Solution #3: Analyzer Overrides

The third solution is to use one of the QPL analysis overrides. These overrides specify the analyzer which will be used at query time by the QPL query parser.

The following methods can be used in conjunction with the field-based tokenization methods specified above. Where there is a conflict, the field-based methods above will take priority over the analyzer methods below.

es.registerTokenizerForAnalyzer(String analyzer, Tokenizer tokenizer)
Specify a QPL Tokenizer to use to tokenize the text for the specified analyzer name.


es.removeTokenizerForAnalyzer(String analyzerName)
Remove the tokenizer previously registered for the specified analyzer.


es.registerAnalyzerClass(String analyzerName, Class<? extends org.apache.lucene.analysis.Analyzer> analyzer)
Register a Lucene analyzer class to be used to analyze text for the specified analyzer name. Note that the analyzer class specified must have either a no-argument constructor, or a constructor with a single argument of the type "org.apache.lucene.util.Version".
If a tokenizer and an analyzer class are specified for the same analyzer name, then the QPL Tokenizer will take precedence.


es.clearAnalyzerClass(String analyzerName)
Clear the analyzer class for the specified analyzer to the default class provided by elasticsearch (if the default is a standard analyzer and not a custom analyzer).


Class<? extends org.apache.lucene.analysis.Analyzer> getAnalyzerClass(String analyzerName)
Get an analyzer class for the specified analyzer name. This can be used to copy analyzer classes from one analyzer to another.
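
The constructor requirement stated for registerAnalyzerClass() suggests that the class is instantiated reflectively. How QPL actually does this is not documented here, so the following Java sketch is an assumption: it tries the no-argument constructor first and falls back to a single-argument Version constructor. The nested Version, NoArgAnalyzer and VersionedAnalyzer classes are stand-ins invented for this example so that it compiles without Lucene on the classpath.

```java
// Illustrative only: one plausible way to instantiate a registered analyzer class.
// Version, NoArgAnalyzer and VersionedAnalyzer are stand-ins, not Lucene classes.
class AnalyzerInstantiationSketch {

    public static class Version {
        public Version() {}
    }

    public static class NoArgAnalyzer {
        public NoArgAnalyzer() {}
    }

    public static class VersionedAnalyzer {
        public final Version version;

        public VersionedAnalyzer(Version version) {
            this.version = version;
        }
    }

    // Tries the no-argument constructor, then a constructor taking a single Version.
    // Any other constructor shape is rejected with an IllegalArgumentException.
    static <T> T instantiate(Class<T> analyzerClass, Version version) {
        try {
            try {
                return analyzerClass.getConstructor().newInstance();
            } catch (NoSuchMethodException noDefaultConstructor) {
                return analyzerClass.getConstructor(Version.class).newInstance(version);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + analyzerClass.getName(), e);
        }
    }

    public static void main(String[] args) {
        Version version = new Version();
        System.out.println(instantiate(NoArgAnalyzer.class, version).getClass().getSimpleName());
        System.out.println(instantiate(VersionedAnalyzer.class, version).getClass().getSimpleName());
    }
}
```

A class offering neither constructor shape cannot be created this way, which is consistent with the restriction noted above.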


Examples

Specify a simple tokenizer for a custom XML analyzer: (to be used when parsing queries)

es.registerTokenizerForAnalyzer("custom_xml_analyzer", tokenizer(TO_LOWER+PUNCT+DQ_PHRASES));

Fetch the "simple" analyzer and use it for the custom analyzer: (to be used when parsing queries)

es.registerAnalyzerClass("myCustomAnalyzer", es.getAnalyzerClass("simple"));
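
The interaction between registered tokenizers and registered analyzer classes can be sketched as a small registry. This is a hypothetical model written for illustration, not the QPL implementation: the class AnalyzerOverrideSketch, its resolve() method and the SimpleAnalyzerStandIn class are assumptions, and tokenizers are represented by plain strings. It shows a registered QPL tokenizer winning over an analyzer class for the same analyzer name, and an analyzer class being copied from one name to another.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry modelling the override rules described for Solution #3.
class AnalyzerOverrideSketch {

    // Stand-in for a Lucene analyzer class such as the one behind "simple".
    public static class SimpleAnalyzerStandIn {}

    private final Map<String, String> tokenizers = new HashMap<>();
    private final Map<String, Class<?>> analyzerClasses = new HashMap<>();

    void registerTokenizerForAnalyzer(String analyzerName, String tokenizer) {
        tokenizers.put(analyzerName, tokenizer);
    }

    void removeTokenizerForAnalyzer(String analyzerName) {
        tokenizers.remove(analyzerName);
    }

    void registerAnalyzerClass(String analyzerName, Class<?> analyzerClass) {
        analyzerClasses.put(analyzerName, analyzerClass);
    }

    Class<?> getAnalyzerClass(String analyzerName) {
        return analyzerClasses.get(analyzerName);
    }

    // A registered tokenizer is preferred over a registered analyzer class.
    String resolve(String analyzerName) {
        if (tokenizers.containsKey(analyzerName)) {
            return "tokenizer:" + tokenizers.get(analyzerName);
        }
        if (analyzerClasses.containsKey(analyzerName)) {
            return "class:" + analyzerClasses.get(analyzerName).getSimpleName();
        }
        return "default";
    }

    public static void main(String[] args) {
        AnalyzerOverrideSketch es = new AnalyzerOverrideSketch();
        es.registerAnalyzerClass("simple", SimpleAnalyzerStandIn.class);
        es.registerAnalyzerClass("myCustomAnalyzer", es.getAnalyzerClass("simple"));
        System.out.println(es.resolve("myCustomAnalyzer"));
        es.registerTokenizerForAnalyzer("myCustomAnalyzer", "TO_LOWER+PUNCT+DQ_PHRASES");
        System.out.println(es.resolve("myCustomAnalyzer"));
    }
}
```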