QPL Operator and Function Library

From wiki.searchtechnologies.com
Revision as of 17:08, 3 March 2017 by Sdenny (talk | contribs) (Protected "QPL Operator and Function Library" ([Edit=Allow only administrators] (indefinite) [Move=Allow only administrators] (indefinite)))


Enterprise Add-On Feature

Introduction

Standard Method Arguments

The following "Standard Method Arguments" conventions are supported by most QPL functions. Arguments can be specified in a wide variety of ways so that QPL is as flexible as possible.

Arguments which are Strings are Automatically Converted to Term()

For example:

 and("george", "washington")

The above is automatically converted to

 and(term("george"), term("washington"))

Any Number of Arguments are Allowed

For example:

 and("george", "washington")
 and("william", "jefferson", "clinton", "the", "third")

Lists are Allowed

This is especially useful when lists are returned by other methods, such as split() or synonym expansion.

 and(["george", "washington"])

The above is the same as:

 and("george", "washington")

Nested Lists are Flattened

 and([["george", "martha"], "washington"])

The above is the same as:

 and("george", "martha", "washington")
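
The argument conventions above (strings become terms, any number of arguments, nested lists are flattened) can be illustrated with a small stand-alone Java model. This is only a sketch: ArgNormalizer and its string output are hypothetical and are not part of QPL.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of QPL argument handling; ArgNormalizer is not a QPL class.
final class ArgNormalizer {

    /** Flattens nested lists and wraps bare strings as term("..."). */
    static List<String> normalize(Object... args) {
        List<String> out = new ArrayList<>();
        for (Object arg : args) {
            collect(arg, out);
        }
        return out;
    }

    private static void collect(Object arg, List<String> out) {
        if (arg instanceof List<?>) {
            for (Object item : (List<?>) arg) {
                collect(item, out);               // recurse so any nesting depth is flattened
            }
        } else if (arg instanceof String) {
            out.add("term(\"" + arg + "\")");     // bare strings become terms
        } else {
            out.add(String.valueOf(arg));         // anything else is kept as its string form
        }
    }

    public static void main(String[] unused) {
        // Prints [term("george"), term("martha"), term("washington")]
        System.out.println(normalize(List.of(List.of("george", "martha"), "washington")));
    }
}
```

Recursing on every list element is what makes the nesting depth irrelevant, which is why a list returned by another method can be passed straight through.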

opTrue and null

In QPL:

  • opTrue => "true"
    • opTrue is a special Operator object which represents "true" in QPL.
  • null => "false"
    • Any expression which returns null is automatically assumed to match zero documents.

Most operators have special handling for opTrue and null, as described below.

null Safe

In QPL you will rarely (if ever) need to check for 'null'.

All operators will check for null and do sensible things. Since a null operand is treated as "false", it is automatically removed from or() operators. Any and() operand which is null will cause the and() to return null as a whole.

In this way, nulls are propagated up the query expression, and removed when they become unnecessary.

The same is true for empty lists, which are also automatically removed and treated like null.

NOTE: Sometimes you will want an empty list to return all documents instead of zero documents. In that case, you will need to check for the empty list explicitly in your QPL and return opTrue (or otherwise modify the expression).

Automatic Optimizations

... Or "Where Did My Operator Go?"

QPL has some built-in optimizations, described for each operator below under "Special Cases". In general:

  • Most operators return "null" (i.e. false) if they contain no operands.
  • Most operators are automatically removed if they contain only a single operand.
    For example, phrase(term("george")) becomes just term("george").
  • The and() operator returns null if any of its operands are null.
  • The or() operator returns opTrue if any of its operands are opTrue.

In this way, the structures built by QPL are automatically optimized as they are built. This eliminates the need for optimization after the query is built.

However, you will likely encounter situations where you ask "Hey! Where did my Operator go?" In most cases, it was automatically optimized away.
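
The reductions described here, combined with the and() and or() special cases listed in this document, can be modelled in a few lines of Java. This is a hypothetical sketch, not the QPL implementation: BoolSimplifier, OP_TRUE and the string rendering are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the simplification rules; not the QPL implementation.
final class BoolSimplifier {

    /** Sentinel standing in for QPL's opTrue. */
    static final Object OP_TRUE = new Object() {
        @Override
        public String toString() {
            return "opTrue";
        }
    };

    static Object and(Object... operands) {
        List<Object> kept = new ArrayList<>();
        for (Object op : operands) {
            if (op == null) {
                return null;                 // any null operand makes the whole and() null
            }
            if (op != OP_TRUE) {
                kept.add(op);                // opTrue adds no constraint, so it is dropped
            }
        }
        if (kept.isEmpty()) {
            return OP_TRUE;                  // all operands were opTrue, or there were none
        }
        return kept.size() == 1 ? kept.get(0) : render("and", kept);
    }

    static Object or(Object... operands) {
        List<Object> kept = new ArrayList<>();
        for (Object op : operands) {
            if (op == OP_TRUE) {
                return OP_TRUE;              // any opTrue operand makes the whole or() opTrue
            }
            if (op != null) {
                kept.add(op);                // null operands are dropped
            }
        }
        if (kept.isEmpty()) {
            return null;                     // all operands were null, or there were none
        }
        return kept.size() == 1 ? kept.get(0) : render("or", kept);
    }

    private static String render(String name, List<Object> operands) {
        List<String> parts = new ArrayList<>();
        for (Object op : operands) {
            parts.add(String.valueOf(op));
        }
        return name + "(" + String.join(", ", parts) + ")";
    }

    public static void main(String[] unused) {
        // The inner or() loses its null operand and collapses to "martha".
        System.out.println(and("george", or("martha", null), "washington"));
    }
}
```

In this model the inner or() disappears entirely from the result, which is the "Where did my Operator go?" effect.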

Meta-Operators

"Meta-Operators" are special methods built into QPL which produce query structures made up of simpler operators (typically and() and or()).

The best examples are compositeOr() and compositeMax(), which create complex or() or max() expressions based on the inputs.

Once QPL has finished creating the query expression, there will be no "Meta-Operators" in the resulting query tree. All Meta-Operators will be converted to simpler query expressions.

Basic Operators

AND

Prefix operator:

 and("george", "washington")

Infix operator:

 term("george") & term("washington")

Special Cases:

  • If there is only one operand, the one operand is returned (no and() operator is created)
  • If any of the operands are null, returns null.
  • Operands which are opTrue are removed
  • If all operands are opTrue (or there are no operands), returns opTrue

OR

Prefix operator:

 or("george", "washington")

Infix operator:

 term("george") | term("washington")

Special Cases:

  • If there is only one operand, the one operand is returned (no or() operator is created)
  • If any of the operands are opTrue, returns opTrue.
  • Operands which are null are removed
  • If all operands are null (or there are no operands), returns null

OR MIN

orMin() is just like or(), but it will require a minimum number of clauses to match before it returns "true".

 orMin(<minimum-clauses>, <operands>)

For example:

 orMin(3, "jefferson", "washington", "adams", "madison", "monroe")

This will match documents which contain at least 3 of the 5 presidents listed.

Special Cases:

  • If there is only one operand, the one operand is returned (no orMin() operator is created)
  • If any of the operands are opTrue, returns opTrue.
  • Operands which are null are removed
  • If all operands are null (or there are no operands), returns null
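
The matching rule for orMin() amounts to counting how many operands match and comparing that count with the minimum. The Java sketch below assumes plain term operands and a document represented as a set of tokens; OrMinDemo is a hypothetical name, not a QPL class.

```java
import java.util.List;
import java.util.Set;

// Hypothetical model of orMin() matching over a bag of document tokens.
final class OrMinDemo {

    /** True when at least `minimum` of the terms occur in the document. */
    static boolean orMinMatches(int minimum, List<String> terms, Set<String> docTokens) {
        int matched = 0;
        for (String term : terms) {
            if (docTokens.contains(term)) {
                matched++;
            }
        }
        return matched >= minimum;
    }

    public static void main(String[] unused) {
        List<String> presidents = List.of("jefferson", "washington", "adams", "madison", "monroe");
        Set<String> doc = Set.of("adams", "and", "jefferson", "met", "madison");
        System.out.println(orMinMatches(3, presidents, doc)); // three presidents occur in doc
    }
}
```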

FIELD (Meta-Operator)

The field operator attaches a field name to other operators. Field names are automatically propagated down to all nested queries but only when the final query expression is built.

The following expression will attach the "title" field name to the or() operator:

 field("title", or("george", "washington"))

Note that field() attaches the field name only to the operator it is given; nested operators are left unmodified at this point. Therefore, the following expression will work as expected:

 myNestedQuery = or("george", "washington")
 
 query1 = field("title", and(myNestedQuery, "president"))
 query2 = field("text", and(myNestedQuery, "actor"))

In the above example, the field names are only attached to the and() operators. The nested or() operators (and nested terms) are unaffected. Only when the result is built (i.e. built for a specific search engine) will the field name be propagated down throughout the nested elements of the expression, if needed.
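
The deferred propagation can be pictured with a small tree model in Java. Everything in this sketch is an assumption made for illustration: the Node class, the field:term output syntax, and the rule that the nearest enclosing field wins are not taken from the QPL source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model; Node and the field:term output syntax are assumptions.
final class FieldBuildDemo {

    static final class Node {
        final String op;            // "term", "and" or "or"
        final String text;          // term text, null for operators
        final String field;         // attached field name, or null
        final List<Node> children;

        Node(String op, String text, String field, List<Node> children) {
            this.op = op;
            this.text = text;
            this.field = field;
            this.children = children;
        }
    }

    static Node term(String text) {
        return new Node("term", text, null, List.of());
    }

    static Node op(String name, Node... children) {
        return new Node(name, null, null, Arrays.asList(children));
    }

    /** Attaches the field to a copy of this node only; the shared children are untouched. */
    static Node field(String name, Node node) {
        return new Node(node.op, node.text, name, node.children);
    }

    /** Build step: the nearest enclosing field name is pushed down to every term. */
    static String build(Node node, String inherited) {
        String active = node.field != null ? node.field : inherited;
        if (node.op.equals("term")) {
            return active == null ? node.text : active + ":" + node.text;
        }
        List<String> parts = new ArrayList<>();
        for (Node child : node.children) {
            parts.add(build(child, active));
        }
        return node.op + "(" + String.join(", ", parts) + ")";
    }

    public static void main(String[] unused) {
        Node nested = op("or", term("george"), term("washington"));
        System.out.println(build(field("title", op("and", nested, term("president"))), null));
        System.out.println(build(field("text", op("and", nested, term("actor"))), null));
    }
}
```

Because field() never modifies the shared nested node, the same sub-expression can be reused under "title" and "text" without one query affecting the other.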

List Handling

When there are multiple operands or nested lists, field() will return a list of the modified operators. So, for example, the following expression:

 field("title", "george", "washington")

Is the same as:

 field("title", ["george", "washington"])

Which returns a list of modified items, as in:

 [ field("title", term("george")),  field("title", term("washington")) ]

Special Cases:

  • Nested lists are flattened (see Standard Arguments above)
  • opTrue is returned unmodified
  • Arguments which are null are included in the list
  • Empty strings are changed to null

PHRASE

Use phrase() when looking for a sequence of terms in a document which must occur in order.

 phrase("george", "washington")  -->  Finds documents where the word "george" is found right before the word "washington"

The contents of the phrase() operator can be a list (or embedded lists of lists, see Standard Arguments above), which amounts to the same thing:

 phrase(["george", "washington"])

In addition, phrases can contain other phrases. For example:

 phrase(phrase("george", "washington"), phrase("thomas", "jefferson"))

is perfectly okay. These nested phrases are preserved by QPL, but are usually converted into a single phrase when they are built for executing on search engines.

In some cases, a nested or() of terms (and only terms) is also allowed:

 phrase(or("thomas","tom","t"), or("jefferson","jeffers","jeffersonian"))

Whether this is supported or not will depend on the search engine that you are using. It is currently supported by the Lucene query builder.

Special Cases:

  • Items which are not TERM, PHRASE, or OR are simply ignored.
  • null and opTrue are simply ignored.
  • If the phrase has only a single operand, that one operand is returned (no phrase() operator is constructed).

NOT

Prefix version:

 not('anarchy')

Infix version:

 ~term('anarchy')

If the not contains multiple operands, it is assumed that an OR of the operands is performed. For example:

 not('anarchy', 'evil')

The above is the same as:

 not(or('anarchy', 'evil'))

Note that QPL places no restrictions on the not() operator: it can occur almost anywhere, even as the root of a query expression. Whether a search engine can execute these queries (and execute them efficiently!) depends on the associated builder and the engine in question.

So use not() expressions sparingly and, whenever you can, and() them with something else.

Special Cases:

  • If all arguments are null, returns opTrue
  • If any argument is opTrue, returns null
  • If there are no arguments, returns opTrue

WILDCARD

(under construction)

Relevancy Ranking Operators

BOOST (Meta-Operator)

Use the boost operator to set the relevancy boost weight of a term or query expression.

Prefix operator:

 boost(1.5, and("thomas", "jefferson"))

Infix operator: (only if the left-hand operand is a QPL Operator object)

 and("thomas", "jefferson")^1.5

List Handling

If boost has multiple arguments or an embedded list, the result will be a list of the operands. For example:

 boost(1.5, ["thomas", "jefferson"])

The above is equivalent to:

 boost(1.5, "thomas", "jefferson")

Which returns a list of boosted items:

 [term("thomas")^1.5, term("jefferson")^1.5]

Special Cases:

  • Nested lists are flattened (see Standard Arguments above)
  • opTrue is returned unmodified
  • Arguments which are null are included in the list
  • Empty strings are changed to null

MAX

Max behaves like or(), in that if either sub-expression is found in the document, then the document will be returned from the search.

The difference is that the relevancy ranking of or() will be a combination of all of the scores of the sub-expressions, whereas the relevancy ranking of max() will take the score of the best matching sub-expression.

Many search engines do not have the concept of max(), however, and for those engines it will be interpreted as a simple or().

 max("george", "martha")

Special Cases:

  • If there is only one operand, the one operand is returned (no max() operator is created)
  • If any of the operands are opTrue, returns opTrue.
  • Operands which are null are removed
  • If all operands are null (or there are no operands), returns null
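
The scoring difference between or() and max() can be shown with simple arithmetic. The sketch below assumes that or() combines scores by summing them, which is only one possible "combination"; actual scoring depends on the search engine. MaxVsOrDemo is a hypothetical name.

```java
// Hypothetical scoring model; real engines may combine or() scores differently.
final class MaxVsOrDemo {

    /** or(): assumed here to sum the scores of the matching sub-expressions. */
    static double orScore(double... matchedScores) {
        double total = 0.0;
        for (double score : matchedScores) {
            total += score;
        }
        return total;
    }

    /** max(): keeps only the score of the best matching sub-expression. */
    static double maxScore(double... matchedScores) {
        double best = 0.0;
        for (double score : matchedScores) {
            best = Math.max(best, score);
        }
        return best;
    }

    public static void main(String[] unused) {
        // A document matching "george" with score 0.5 and "martha" with score 0.25.
        System.out.println(orScore(0.5, 0.25));   // 0.75
        System.out.println(maxScore(0.5, 0.25));  // 0.5
    }
}
```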

CONSTANT

Constant applies a constant relevancy score if any of the arguments are matched.

The following expression will return a constant relevancy score of 1.5 for any document that contains either "thomas" OR "jefferson" (or both):

 constant(1.5, "thomas", "jefferson")

The above expression is equivalent to:

 constant(1.5, or("thomas", "jefferson"))

Special Cases:

  • If there is only one operand (after the constant value), then the constant() operator is applied to that single operand.
  • If any of the operands are opTrue, returns opTrue.
  • Operands which are null are removed
  • If all operands are null (or there are no operands), returns null

BOOST PLUS

boostPlus() will return all documents which contain the first argument, and then will increase the rank of those documents if any of the other arguments (2-n) occur. Ideally, the relevancy score should be the relevancy of the first term added to the relevancy of all of the other matching operands.

 boostPlus(and("thomas", "jefferson"), constant(0.5, field("source", "presidentialPapers")))

In the above example, all documents which contain "thomas" and "jefferson" are returned. Documents where "presidentialPapers" is in the "source" field will be boosted by the constant value of 0.5.

Special Cases:

  • If any argument is opTrue, the entire expression is reduced to opTrue.
  • If there is only a single argument, then that single argument is returned.
  • Operands which are null are removed
  • If all operands are null (or there are no operands), returns null
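
The "ideal" boostPlus() score described above is the main score plus the scores of whichever additional operands match. A minimal Java sketch of that arithmetic follows, assuming the document has already matched the first argument; BoostPlusDemo is a hypothetical name.

```java
// Hypothetical model of the ideal boostPlus() score for a document that matches the first argument.
final class BoostPlusDemo {

    /** Adds the scores of the matching additional operands to the main score. */
    static double boostPlusScore(double mainScore, double... matchedExtraScores) {
        double total = mainScore;
        for (double extra : matchedExtraScores) {
            total += extra;
        }
        return total;
    }

    public static void main(String[] unused) {
        // Main query scores 2.0; the constant(0.5, ...) clause also matches.
        System.out.println(boostPlusScore(2.0, 0.5)); // 2.5
        // Same document score when the boosting clause does not match.
        System.out.println(boostPlusScore(2.0));      // 2.0
    }
}
```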

BOOST MUL

boostMul() boosts (or reduces) the relevancy of a query by a multiplier, if and only if any of a set of additional expressions are matched.

The following expression finds all documents which contain "thomas" and "jefferson". If those documents contain either "movie" or "actor", then the relevancy is reduced by 50%.

 boostMul(and("thomas", "jefferson"), 0.5, "movie", "actor")

This is equivalent to:

 boostMul(and("thomas", "jefferson"), 0.5, or("movie", "actor"))

Special Cases:

  • If the first argument is null, then null is returned.
  • Operands from 3-n which are null are removed
  • If any of the arguments 3-n are opTrue, OR all of these arguments are null, OR there are no arguments 3-n:
    • Then just the first argument is returned.
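
The effect of boostMul() on a matching document's score can be sketched as a conditional multiplication. The Java model below is hypothetical (BoostMulDemo is not a QPL class) and assumes the document already matches the first argument.

```java
// Hypothetical model of boostMul() scoring for a document that matches the first argument.
final class BoostMulDemo {

    /** Multiplies the base score only when one of the additional expressions matched. */
    static double boostMulScore(double baseScore, double multiplier, boolean anyExtraMatched) {
        return anyExtraMatched ? baseScore * multiplier : baseScore;
    }

    public static void main(String[] unused) {
        System.out.println(boostMulScore(2.0, 0.5, true));  // 1.0, reduced by 50%
        System.out.println(boostMulScore(2.0, 0.5, false)); // 2.0, unchanged
    }
}
```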

COMPOSITE OR (Meta-Operator)

The compositeOr() method converts operands as if they had been applied to a "composite field" which is made up of multiple sub-fields. Essentially this means taking each operand and converting it to an or() expression where the operand is duplicated across multiple fields, each field with a different boost.

The arguments for compositeOr() are:

 compositeOr( <map of field:weight pairs> ,  <operands to expand> )

All standard argument styles (see above) are accepted for the <operands to expand> above.

For example:

 compositeOr(["title":1.5, "body":0.8], "george", "washington")

Will produce:

 [or(title:term("george")^1.5, body:term("george")^0.8), or(title:term("washington")^1.5, body:term("washington")^0.8)]

Note that compositeOr() simply converts all of the input arguments into a list of or() expressions on output. You will typically want to wrap the resulting expression with an AND operator, like this:

 and(compositeOr(["title":1.5, "body":0.8], "george", "washington"))

Special Cases:

  • If the second argument is null, or there are no operands, then returns null.
  • opTrue values are returned unmodified
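
The expansion performed by compositeOr() can be reproduced with a short Java sketch that emits the same textual form as the example above. CompositeDemo and its expand() method are hypothetical; the operator name is a parameter so that passing "max" mirrors compositeMax().

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the composite expansion; output is plain text, not QPL operators.
final class CompositeDemo {

    /** Expands each operand into opName(field:term("operand")^weight, ...) across all fields. */
    static List<String> expand(String opName, Map<String, Double> fieldWeights, String... operands) {
        List<String> result = new ArrayList<>();
        for (String operand : operands) {
            List<String> clauses = new ArrayList<>();
            for (Map.Entry<String, Double> entry : fieldWeights.entrySet()) {
                clauses.add(entry.getKey() + ":term(\"" + operand + "\")^" + entry.getValue());
            }
            result.add(opName + "(" + String.join(", ", clauses) + ")");
        }
        return result;
    }

    public static void main(String[] unused) {
        Map<String, Double> weights = new LinkedHashMap<>(); // keeps the field order stable
        weights.put("title", 1.5);
        weights.put("body", 0.8);
        System.out.println(expand("or", weights, "george", "washington"));
    }
}
```

The result is one expression per operand, which is why the list is usually wrapped in and() afterwards.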

COMPOSITE MAX (Meta-Operator)

compositeMax() operates just like compositeOr(), but inserts max() operators instead of or() operators.

For example:

 compositeMax(["title":1.5, "body":0.8], "george", "washington")

Will produce:

 [max(title:term("george")^1.5, body:term("george")^0.8), max(title:term("washington")^1.5, body:term("washington")^0.8)]

Proximity Operators

NEAR

The NEAR operator matches terms (or sub-expressions) which are near one another. The first parameter is the maximum number of intervening unmatched positions allowed between them.

For example:

 near(3, "george", "washington")

Notes and Special Cases:

  • If there is only one operand in addition to window size, the one operand is returned (no near() operator is created)
  • If any of the operands are null, returns null.
  • Operands which are opTrue are removed
  • If all operands are opTrue (or there are no operands), returns opTrue
  • Operands can be nested expressions with AND and OR

BEFORE

The BEFORE operator matches terms (or sub-expressions) that are within a certain distance of each other and occur in order. The first parameter is the maximum number of intervening unmatched positions allowed between them.

For example:

 before(3, "george", "washington")

Notes and Special Cases:

  • If there is only one operand in addition to window size, the one operand is returned (no before() operator is created)
  • If any of the operands are null, returns null.
  • Operands which are opTrue are removed
  • If all operands are opTrue (or there are no operands), returns opTrue
  • Operands can be nested expressions with AND and OR
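
The window semantics shared by near() and before() can be sketched over a tokenized document. This Java model handles two plain terms only and reads "intervening unmatched positions" as the number of tokens strictly between the two matches; ProximityDemo is a hypothetical name, and real engines may count positions differently.

```java
import java.util.List;

// Hypothetical two-term model of near() and before() over token positions.
final class ProximityDemo {

    /** near(): at most `window` tokens lie between the two terms, in either order. */
    static boolean near(int window, String a, String b, List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(a)) {
                continue;
            }
            for (int j = 0; j < tokens.size(); j++) {
                if (i != j && tokens.get(j).equals(b) && Math.abs(i - j) - 1 <= window) {
                    return true;
                }
            }
        }
        return false;
    }

    /** before(): same window, but `a` must occur before `b`. */
    static boolean before(int window, String a, String b, List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(a)) {
                continue;
            }
            for (int j = i + 1; j < tokens.size(); j++) {
                if (tokens.get(j).equals(b) && j - i - 1 <= window) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] unused) {
        List<String> doc = List.of("washington", "was", "named", "george");
        System.out.println(near(3, "george", "washington", doc));   // true, order is ignored
        System.out.println(before(3, "george", "washington", doc)); // false, wrong order
    }
}
```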

BETWEEN

The BETWEEN operator matches a positive query, together with an optional negative query that must not match, when it occurs between a start tag (or term) and an end tag (or term).

Find documents where the term "world" is between the content tags:

 between("<content>", "</content>", "world")

Example where a complex expression must match between the content tags:

 between("<content>", "</content>", and(or("hello","hola"),or("mundo","world")))

Example which matches only when "world" is between the tags and "wide" is not.

 between("<content>", "</content>", "world", "wide")

Note that the query expression must match between the tags, and with no intervening tags. For example, the following query:

 between("<content>", "</content>", and("hello","world"))

Will match this:

 <content>
   The first thing a coder does is to create a "hello world" program. It is a rite of passage.
 </content>

But not this:

 <content>
   Let us say "hello" to all people.
 </content>
 <content>
   The world is a beautiful place.
 </content>

Notes:

  • between() only works for Solr and Elasticsearch.
    • Further, handling XML tags (such as "<content>" and "</content>") requires a custom analyzer to be installed.
  • It requires a special plug-in for some engines and is not available for others (see search engine details).
  • Previous versions of this function took a proximity window for the between() operator.
    • Those versions are officially deprecated. between() neither takes nor requires a proximity window.
    • If proximity is required, use a near() or before() clause as the positive or negative query.
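
The "no intervening tags" rule can be illustrated with a token-level Java sketch. It is a simplified, hypothetical model (BetweenDemo is not a QPL class): the positive query is reduced to a list of terms that must all occur in one tagged region, and the negative query to a list of terms that must not occur in that region.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical token-level model of between(); real matching is done by the search engine.
final class BetweenDemo {

    static boolean between(String start, String end, List<String> mustContain,
                           List<String> mustNotContain, List<String> tokens) {
        List<String> region = null;                  // null means "not inside a tagged region"
        for (String token : tokens) {
            if (token.equals(start)) {
                region = new ArrayList<>();          // a start tag always opens a fresh region
            } else if (token.equals(end)) {
                if (region != null
                        && region.containsAll(mustContain)
                        && Collections.disjoint(region, mustNotContain)) {
                    return true;                     // one region satisfied the whole query
                }
                region = null;                       // terms never carry over to the next region
            } else if (region != null) {
                region.add(token);
            }
        }
        return false;
    }

    public static void main(String[] unused) {
        List<String> split = List.of("<content>", "hello", "</content>",
                                     "<content>", "world", "</content>");
        // "hello" and "world" sit in different regions, so this prints false.
        System.out.println(between("<content>", "</content>",
                List.of("hello", "world"), List.of(), split));
    }
}
```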

Range Operators

QPL also supports the following range operators:

 range(Float from, Float to)
 range(Integer from, Integer to)
 range(String from, String to)
 range(Date from, Date to)

Note that all ranges are inclusive at both the from and to ends. Also, range operators typically must be specified with a field:

 import java.text.SimpleDateFormat;
 import java.util.Date;
 import java.util.TimeZone;
 
 // Parse the timestamps as GMT
 SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
 df.setTimeZone(TimeZone.getTimeZone("GMT"));
 Date fromDate = df.parse("2015-01-01T00:00:00Z");
 Date toDate = df.parse("2015-01-04T23:53:00Z");
 
 // Apply the (inclusive) date range to the modifiedDate field
 return field('modifiedDate', range(fromDate, toDate))

In the above example, the 'modifiedDate' is the field (specified with the standard QPL field() operator) over which the date range will be applied.

Range operators provide a Boolean response, and therefore can be included as operands of and(), or(), not(), and anything else that makes sense (which is basically everything except for phrase() and proximity operators).
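
The inclusive behaviour of range() at both ends can be expressed as a generic comparison. The Java sketch below is hypothetical (RangeDemo is not a QPL class) and works for any Comparable type, which covers the Float, Integer, String and Date variants listed above.

```java
// Hypothetical model of an inclusive range test for any Comparable type.
final class RangeDemo {

    /** True when from <= value <= to. */
    static <T extends Comparable<T>> boolean inRange(T value, T from, T to) {
        return value.compareTo(from) >= 0 && value.compareTo(to) <= 0;
    }

    public static void main(String[] unused) {
        System.out.println(inRange(5, 1, 5));       // true, the upper bound is included
        System.out.println(inRange(6, 1, 5));       // false
        System.out.println(inRange("b", "a", "c")); // true
    }
}
```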

String Utilities

split()

Splits input.

 String[] split(String str, String regex)
    -  returns null when input is null
 String[] split(String str, String regex, String defaultVal)
    -  returns default value when input is null

Both of these operate the same as the Java String.split() method, except that they return null (or the default value) when the first argument (the string to split) is null. This is unlike Java, where calling split() on a null string throws a NullPointerException.

 split("hello world!", "[\\s\\p{Punct}]+")  =>  ["hello", "world"]

join()

Join lists into a single string.

 join(":", ["hello", "dog", "world"]) -->  "hello:dog:world"

Returns null if the second argument is null (does not throw an exception).

trim()

Trims whitespace from string.

 trim("   hello  world!!    ")  -->  "hello  world!!"

Returns null if the argument is null (does not throw an exception).
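
The null handling described for split(), join() and trim() can be sketched as thin wrappers around the standard Java methods. NullSafeStrings is a hypothetical class written for illustration; it is not the QPL source. The three-argument split() is left out because the text does not say how the default value is returned.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical null-safe wrappers around the standard Java string methods.
final class NullSafeStrings {

    /** Like String.split(), but returns null instead of throwing when str is null. */
    static String[] split(String str, String regex) {
        return str == null ? null : str.split(regex);
    }

    /** Like String.join(), but returns null when the list is null. */
    static String join(String separator, List<String> items) {
        return items == null ? null : String.join(separator, items);
    }

    /** Like String.trim(), but returns null when str is null. */
    static String trim(String str) {
        return str == null ? null : str.trim();
    }

    public static void main(String[] unused) {
        System.out.println(Arrays.toString(split("hello world!", "[\\s\\p{Punct}]+"))); // [hello, world]
        System.out.println(join(":", List.of("hello", "dog", "world")));                // hello:dog:world
        System.out.println(split(null, ","));                                           // null
    }
}
```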

isEmpty() / isNotEmpty()

Checks whether a string is empty or not.

 isEmpty(null)  -->  true
 isEmpty("")    -->  true
 isEmpty("  ")  -->  true
 isEmpty(" x ") -->  false
 isNotEmpty(null)  -->  false
 isNotEmpty("")    -->  false
 isNotEmpty("  ")  -->  false
 isNotEmpty(" x ") -->  true

depunctuate()

Remove all punctuation from the beginning and the end of the string.

 depunctuate("**hello-world!!")  -->  "hello-world"

containsWildcard()

Tests whether the string contains either a "*" or "?" wildcard.

 containsWildcard("world*")       -->  true
 containsWildcard("hello")        -->  false
 containsWildcard("initiali?e")   -->  true
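
The behaviour shown for isEmpty(), isNotEmpty(), depunctuate() and containsWildcard() can be approximated in plain Java as follows. StringChecks is a hypothetical class; in particular, treating null as "no wildcard", returning null from depunctuate(null), and using the ASCII-only \p{Punct} class are assumptions, since the text above does not cover those cases.

```java
// Hypothetical stand-ins for the QPL string checks; edge cases for null are assumptions.
final class StringChecks {

    /** True for null, empty, or whitespace-only strings. */
    static boolean isEmpty(String s) {
        return s == null || s.trim().isEmpty();
    }

    static boolean isNotEmpty(String s) {
        return !isEmpty(s);
    }

    /** Strips punctuation from both ends only; inner punctuation such as '-' is kept. */
    static String depunctuate(String s) {
        return s == null ? null : s.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
    }

    /** True when the string contains a '*' or '?' wildcard. */
    static boolean containsWildcard(String s) {
        return s != null && (s.indexOf('*') >= 0 || s.indexOf('?') >= 0);
    }

    public static void main(String[] unused) {
        System.out.println(depunctuate("**hello-world!!"));  // hello-world
        System.out.println(containsWildcard("initiali?e"));  // true
        System.out.println(isEmpty("  "));                   // true
    }
}
```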