QPL Tokenizers

This Page is Locked - Content no longer maintained - See QPL for the latest information.
Enterprise Add-On Feature

QPL contains a number of methods for tokenizing query strings.

Search Engine Tokenizers

Note that each search engine implementation may have its own tokenizers. For example, the Solr plug-in has its own methods for creating Solr tokenizers, which use the Solr schema for tokenization. The Elasticsearch plug-in can automatically handle standard (non-custom) analyzers.

The methods described in the remainder of this wiki page are for generic (not search-engine-specific) tokenization. These methods may be needed when you have special tokenization requirements, or when you use the Elasticsearch plug-in, which (at the time of this writing) cannot automatically create tokenizers for custom analyzers.

Tokenization Flags

You can customize the behavior of your tokenizer using the following flags:

  • Convert all the tokens into lower case (TO_LOWER)
  • Convert all the tokens into upper case (TO_UPPER)
  • Split token on punctuation (PUNCT)
  • Split token on change from lower case to upper case (CASE_CHANGE)
  • Split token on change from letter to digit (ALNUM_CHANGE)
  • Combine double-quoted sequences into a nested phrase (DQ_PHRASES)

Note that QPL tokenizers will automatically split on whitespace.
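
To make the splitting flags concrete, the following Java sketch approximates what PUNCT, CASE_CHANGE, ALNUM_CHANGE and TO_LOWER could do when applied together. It is not QPL code: the class SplitSketch and its method are hypothetical, and it assumes that "punctuation" means any character that is neither a letter nor a digit. QPL's handling of edge cases may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of QPL.
final class SplitSketch {

    // Splits on whitespace and punctuation, on lower-to-upper case changes,
    // and on letter-to-digit changes, then lower-cases every token.
    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char prev = 0;
        for (char c : input.toCharArray()) {
            // Assumption: anything that is not a letter or digit separates tokens.
            boolean separator = !Character.isLetterOrDigit(c);
            boolean caseChange = Character.isLowerCase(prev) && Character.isUpperCase(c);
            boolean alnumChange = Character.isLetter(prev) && Character.isDigit(c);
            if (separator || caseChange || alnumChange) {
                flush(current, tokens);
            }
            if (!separator) {
                current.append(c);
            }
            prev = c;
        }
        flush(current, tokens);
        return tokens;
    }

    // Emits the pending token, if any, in lower case.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString().toLowerCase(Locale.ROOT));
            current.setLength(0);
        }
    }
}
```

With this sketch, SplitSketch.split("wi-fi PowerShot sx40") yields [wi, fi, power, shot, sx, 40]: the hyphen and spaces act as separators, "PowerShot" splits at the lower-to-upper change, and "sx40" splits at the letter-to-digit change.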

Tokenization Methods

Returns a list of QPL operators

List<Operator> tokenize(String str)
Simple whitespace tokenizer which keeps double-quoted strings together.


List<Operator> tokenize(int flags, String str)
Tokenizer customized with flags (see above).
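
The examples on this page combine flags with +, which suggests that the flags are integer bit flags. The Java sketch below assumes that each flag is a distinct power of two. The numeric values and the class FlagSketch are hypothetical; the real QPL constants are not documented on this page and may differ.

```java
// Hypothetical flag values for illustration; the real QPL constants may differ.
final class FlagSketch {
    static final int TO_LOWER     = 1;
    static final int TO_UPPER     = 1 << 1;
    static final int PUNCT        = 1 << 2;
    static final int CASE_CHANGE  = 1 << 3;
    static final int ALNUM_CHANGE = 1 << 4;
    static final int DQ_PHRASES   = 1 << 5;

    // True when the given flag bit is set in the combined value.
    static boolean isSet(int flags, int flag) {
        return (flags & flag) != 0;
    }
}
```

Under that assumption, DQ_PHRASES + TO_LOWER and DQ_PHRASES | TO_LOWER produce the same value as long as each flag appears only once. Adding the same flag twice would silently produce a different flag (here TO_LOWER + TO_LOWER equals TO_UPPER), so bitwise OR is the more defensive way to combine flags.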


Returns a String array

String[] tokenizeToArray(String str)
Simple whitespace tokenizer which keeps double-quoted strings together and returns a String array.
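
As a rough illustration of "whitespace tokenization that keeps double-quoted strings together", here is a self-contained Java sketch. The class QuoteAwareSplitter is a hypothetical stand-in, not the QPL implementation, and the way it treats unterminated quotes is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not the QPL implementation.
final class QuoteAwareSplitter {

    // Splits on whitespace but keeps text inside double quotes together.
    // The quote characters themselves are dropped from the output.
    static String[] split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : input.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;       // toggle on every double quote
            } else if (Character.isWhitespace(c) && !inQuotes) {
                flush(current, tokens);     // token boundary outside quotes
            } else {
                current.append(c);          // includes spaces inside quotes
            }
        }
        flush(current, tokens);             // assumption: an unterminated quote runs to the end
        return tokens.toArray(new String[0]);
    }

    // Emits the pending token, if any.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }
}
```

For the input president "george washington", this sketch returns a two-element array: president and george washington.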


String[] tokenizeToArray(int flags, String str)
Tokenizer which returns a String array, customized with the specified flags (see above).


String[] getTokensArray(List<Operator> tokens)
Converts a list of QPL Operator(s) into a String array.


Returns a re-usable Tokenizer class

Tokenizer tokenizer(int flags)
Returns a re-usable Tokenizer class, customized with the specified flags, which can be passed to other methods to do tokenization.


FieldTermTokenizer genericFieldTokenizer(int flags)
Returns a generic field term tokenizer that accepts all fields and treats every field as a STRING type field, processed as specified by the flags.


Tokenization Examples

Simple tokenization:

 def tokens = tokenize('president "george washington"')

Returns:

 [term(president), phrase(term(george),term(washington))]


Tokenization to array:

 def simple = tokenizeToArray('president "george washington"')
 def flagged = tokenizeToArray(DQ_PHRASES + TO_LOWER, 'president "george washington"')

Both examples return:

[president, george washington]


Reusable tokenizers can be created and used as follows (note that flags are always required):

 def t = tokenizer(TO_UPPER+PUNCT+CASE_CHANGE+ALNUM_CHANGE+DQ_PHRASES);
 def tokens = t.tokenize('president "george washington"');

The tokens returned:

 [term(PRESIDENT), phrase(term(GEORGE),term(WASHINGTON))]


Using the generic field tokenizer (normally this would be passed to another method, such as a query parser):

 def fieldTokenizer = genericFieldTokenizer(DQ_PHRASES+TO_LOWER+PUNCT);
 def tokens = fieldTokenizer.tokenize("any-field", 'Ambassador "Thomas Jefferson"');

Returns:

 [term(ambassador), phrase(term(thomas),term(jefferson))]


Available Java Interfaces

If you have complex tokenization needs, you may need to write your tokenizer in Groovy or Java. If this is the case, create a new class and implement one of the following Java interfaces:

Tokenizer

Defines an object that tokenizes text. This interface is engine independent.

QPLUtilities provides an instance that implements this interface, so it is available to QPL scripts. The instance is configured with the tokenization flags described above.

 def t = tokenizer(DQ_PHRASES)
 def tokens = t.tokenize(query)

The example above is the equivalent of doing this:

 tokenize(DQ_PHRASES, query)

Holding a tokenizer instance is useful when you need to apply the same configuration several times. Otherwise, you can call the tokenize method directly and pass the flags with each call.

FieldTermTokenizer

Defines a tokenizer that is field sensitive. You will need to implement the following methods:

 public String[] tokenize(String field, String str) throws QPLException;
Tokenize the string as appropriate for the specified field.


 public FieldType getFieldType(String field);
Get the type of the field. Types are defined in the FieldTermTokenizer.FieldType enum, and include: DATE, STRING, INTEGER, FLOAT, XML, UNKNOWN, BOOLEAN.


 public boolean validField(String field);
Tests whether the field is a valid field. The query parser will return an error if this returns 'false' for a field that the user attempts to query.
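
The following Java sketch shows what a minimal implementation of the three methods above might look like. So that it compiles on its own, it declares stand-in versions of FieldTermTokenizer and QPLException; in a real project these come from the QPL library, and only the method signatures and FieldType values are taken from this page. The class SimpleFieldTokenizer, its field names and its tokenization rules are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Stand-in declarations so the sketch is self-contained (assumptions, not QPL source).
class QPLException extends Exception {
    QPLException(String message) { super(message); }
}

interface FieldTermTokenizer {
    enum FieldType { DATE, STRING, INTEGER, FLOAT, XML, UNKNOWN, BOOLEAN }
    String[] tokenize(String field, String str) throws QPLException;
    FieldType getFieldType(String field);
    boolean validField(String field);
}

// Hypothetical implementation backed by a fixed field-to-type map.
final class SimpleFieldTokenizer implements FieldTermTokenizer {
    private final Map<String, FieldType> fields = new HashMap<>();

    SimpleFieldTokenizer() {
        fields.put("title", FieldType.STRING);   // hypothetical field names
        fields.put("year", FieldType.INTEGER);
    }

    @Override
    public String[] tokenize(String field, String str) throws QPLException {
        if (!validField(field)) {
            throw new QPLException("Unknown field: " + field);
        }
        String trimmed = str.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        if (getFieldType(field) == FieldType.STRING) {
            // STRING fields: lower-case, then split on runs of whitespace.
            return trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        }
        // Other field types: keep the whole value as a single token.
        return new String[] { trimmed };
    }

    @Override
    public FieldType getFieldType(String field) {
        return fields.getOrDefault(field, FieldType.UNKNOWN);
    }

    @Override
    public boolean validField(String field) {
        return fields.containsKey(field);
    }
}
```

In this sketch, validField drives the error behavior: tokenize throws for any field missing from the map, and getFieldType reports UNKNOWN for such fields instead of failing.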