QPL Tokenizers
Revision as of 17:07, 3 March 2017
QPL contains a number of methods for tokenizing query strings.
Search Engine Tokenizers
Note that each search engine implementation may have its own tokenizers. For example, the Solr plug-in has its own methods for creating Solr tokenizers which use the Solr Schema for tokenization. The elasticsearch plug-in can automatically handle standard (non-custom) analyzers.
The methods described in the remainder of this wiki page are for generic (non-search engine specific) tokenization. These methods may be needed when you have special tokenization requirements, or for the elasticsearch plug-in which (at the time of this writing) cannot automatically create tokenizers for custom analyzers.
You can customize the behavior of your tokenizer using the following flags:
- Convert all the tokens into lower case (TO_LOWER)
- Convert all the tokens into upper case (TO_UPPER)
- Split token on punctuation (PUNCT)
- Split token on change from lower case to upper case (CASE_CHANGE)
- Split token on change from letter to digit (ALNUM_CHANGE)
- Combine double-quoted sequences into a nested phrase (DQ_PHRASES)
Note that QPL tokenizers will automatically split on whitespace.
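QPL's internal implementation is not shown here, but the effect of the flags can be sketched in plain Java. The class, method names, and flag values below are illustrative stand-ins written for this example, not the QPL API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in for QPL's flag-driven tokenization (flag values are made up).
public class FlagDemo {
    public static final int TO_LOWER = 1, TO_UPPER = 2, PUNCT = 4,
                            CASE_CHANGE = 8, ALNUM_CHANGE = 16, DQ_PHRASES = 32;

    // Split on whitespace (keeping double-quoted sequences together when
    // DQ_PHRASES is set), then post-process each token according to the flags.
    public static List<String> tokenize(int flags, String str) {
        String regex = (flags & DQ_PHRASES) != 0 ? "\"([^\"]*)\"|\\S+" : "\\S+";
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(str);
        while (m.find()) {
            // Quoted sequences become a single (phrase) token when DQ_PHRASES is set.
            String tok = (m.groupCount() > 0 && m.group(1) != null) ? m.group(1) : m.group();
            for (String piece : split(flags, tok)) {
                if ((flags & TO_LOWER) != 0) piece = piece.toLowerCase();
                if ((flags & TO_UPPER) != 0) piece = piece.toUpperCase();
                out.add(piece);
            }
        }
        return out;
    }

    // Apply the PUNCT, CASE_CHANGE, and ALNUM_CHANGE splitting rules to one token.
    public static List<String> split(int flags, String tok) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < tok.length(); i++) {
            char c = tok.charAt(i);
            if ((flags & PUNCT) != 0 && !Character.isLetterOrDigit(c)
                    && !Character.isWhitespace(c)) {
                flush(parts, cur);   // split on punctuation; drop the punctuation itself
                continue;
            }
            if (cur.length() > 0) {
                char p = cur.charAt(cur.length() - 1);
                boolean caseBoundary = (flags & CASE_CHANGE) != 0
                        && Character.isLowerCase(p) && Character.isUpperCase(c);
                boolean alnumBoundary = (flags & ALNUM_CHANGE) != 0
                        && Character.isLetter(p) != Character.isLetter(c)
                        && Character.isLetterOrDigit(p) && Character.isLetterOrDigit(c);
                if (caseBoundary || alnumBoundary) flush(parts, cur);
            }
            cur.append(c);
        }
        flush(parts, cur);
        return parts;
    }

    private static void flush(List<String> parts, StringBuilder cur) {
        if (cur.length() > 0) { parts.add(cur.toString()); cur.setLength(0); }
    }

    public static void main(String[] args) {
        System.out.println(tokenize(DQ_PHRASES | TO_LOWER, "President \"George Washington\""));
        // → [president, george washington]
        System.out.println(tokenize(PUNCT | CASE_CHANGE | ALNUM_CHANGE, "getFieldType v2.0"));
        // → [get, Field, Type, v, 2, 0]
    }
}
```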
Returns a list of QPL operators
List<Operator> tokenize(String str)
- Simple whitespace tokenizer which keeps double-quoted strings together.
List<Operator> tokenize(int flags, String str)
- Tokenizer customized with flags (see above).
Returns a String array
String[] tokenizeToArray(String str)
- Simple whitespace tokenizer which keeps double-quoted strings together and returns a String array.
String[] tokenizeToArray(int flags, String str)
- Tokenizer which returns a String array, customized with the specified flags (see above).
String[] getTokensArray(List<Operator> tokens)
- Converts a list of QPL Operator(s) into a String array.
Returns a re-usable Tokenizer class
Tokenizer tokenizer(int flags)
- Returns a re-usable Tokenizer class, customized with the specified flags, which can be passed to other methods to do tokenization.
FieldTermTokenizer genericFieldTokenizer(int flags)
- Returns a generic field term tokenizer, which accepts all fields and assumes that all fields are STRING-type fields, processed as specified by the flags.
Tokenization to a list of operators:
def tokens = tokenize('president "george washington"')
Tokenization to array:
def tokens = tokenizeToArray('president "george washington"')
def tokens = tokenizeToArray(DQ_PHRASES + TO_LOWER, 'president "george washington"')
Both examples return:
[president, george washington]
New re-usable Tokenizers can be created and used as follows (note that flags are always required):
def t = tokenizer(TO_UPPER+PUNCT+CASE_CHANGE+ALNUM_CHANGE+DQ_PHRASES);
def tokens = t.tokenize('president "george washington"');
The tokens returned:
[PRESIDENT, GEORGE WASHINGTON]
Using the generic field tokenizer (normally this would be passed to another method, such as a query parser):
def tokenizer = genericFieldTokenizer(DQ_PHRASES+TO_LOWER+PUNCT);
def tokens = tokenizer.tokenize("any-field", 'Ambassador "Thomas Jefferson"');
Available Java Interfaces
If you have complex tokenization needs, you may need to write your tokenizer in Groovy or Java. If this is the case, create a new class and implement one of the following Java interfaces:
Tokenizer
Defines an object that tokenizes text. This interface is engine independent.
An instance that implements this interface is available in QPLUtilities and therefore available for QPL scripts. It accepts several options to configure the tokenizer.
def tokenizer = tokenizer(DQ_PHRASES)
tokens = tokenizer.tokenize(query)
The example above is the equivalent of doing this:
tokens = tokenize(DQ_PHRASES, query)
Having an instance of the tokenizer is very useful if you need to reuse a tokenizer several times. If that is not the case, you can call the tokenize method directly and pass the input parameter each time.
FieldTermTokenizer
Defines a tokenizer that is field sensitive. You will need to implement the following methods:
public String tokenize(String field, String str) throws QPLException;
- Tokenize the string as appropriate for the specified field.
public FieldType getFieldType(String field);
- Get the type of the field. Types are defined in the FieldTermTokenizer.FieldType enum, and include: DATE, STRING, INTEGER, FLOAT, XML, UNKNOWN, BOOLEAN.
public boolean validField(String field);
- Tests whether the field is a valid field. The query parser will return an error if this returns 'false' for a field the user is attempting to query.
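A field-sensitive tokenizer might be implemented as sketched below. Note that the FieldTermTokenizer interface, FieldType enum, and QPLException shown here are minimal local stand-ins written for this example (the real types come from the QPL libraries), and the field names 'title' and 'year' are invented:

```java
import java.util.Locale;

// Local stand-in; the real QPLException comes from QPL and may be a checked
// exception -- it is declared unchecked here only to keep the sketch short.
class QPLException extends RuntimeException {
    QPLException(String msg) { super(msg); }
}

// Local stand-in for the QPL interface described above.
interface FieldTermTokenizer {
    enum FieldType { DATE, STRING, INTEGER, FLOAT, XML, UNKNOWN, BOOLEAN }
    String tokenize(String field, String str) throws QPLException;
    FieldType getFieldType(String field);
    boolean validField(String field);
}

// Example implementation: knows two hypothetical fields; lower-cases STRING
// fields and passes INTEGER fields through after a sanity check.
public class MyFieldTokenizer implements FieldTermTokenizer {
    @Override
    public String tokenize(String field, String str) throws QPLException {
        if (!validField(field)) throw new QPLException("Unknown field: " + field);
        if (getFieldType(field) == FieldType.INTEGER) {
            if (!str.matches("-?\\d+")) throw new QPLException("Not an integer: " + str);
            return str;
        }
        return str.toLowerCase(Locale.ROOT);
    }

    @Override
    public FieldType getFieldType(String field) {
        switch (field) {
            case "title": return FieldType.STRING;
            case "year":  return FieldType.INTEGER;
            default:      return FieldType.UNKNOWN;
        }
    }

    @Override
    public boolean validField(String field) {
        // Unknown fields are rejected, so the query parser can report an error.
        return getFieldType(field) != FieldType.UNKNOWN;
    }
}
```

An instance of such a class would then be passed wherever a FieldTermTokenizer is accepted, in the same way the genericFieldTokenizer result is used above.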