QPL Thesaurus and Synonym Expansion

From wiki.searchtechnologies.com



Latest revision as of 17:09, 3 March 2017

This Page is Locked - Content no longer maintained - See QPL for the latest information.
Enterprise Add-On Feature

QPL uses a class called Thesaurus to perform synonym expansion for any given word.

Loading data

The Thesaurus class uses an XML synonym definition file as its database for looking up the synonym list of a given word. The first step in using the Thesaurus class is therefore to load this XML file:

Thesaurus tv2 = Thesaurus.load("src/test/resources/synonyms_v2.xml");

Once the data is loaded, a Thesaurus instance is returned to the caller. The Thesaurus class reads the XML file only if that file has never been loaded, or if the file on disk is newer than the version loaded previously.

If you load the same data file twice, for example:

Thesaurus tv1 = Thesaurus.load("src/test/resources/synonyms_v2.xml");
Thesaurus tv2 = Thesaurus.load("src/test/resources/synonyms_v2.xml");

Thesaurus loads synonyms_v2.xml only once. After the code above runs, both tv1 and tv2 point to the same Thesaurus instance.
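
The load-once behavior described above can be pictured with a small cache keyed by file path. The following is only a minimal sketch under stated assumptions: the class name CachedThesaurus, its fields, and the two-argument load overload are hypothetical and are not part of the QPL API. The actual Thesaurus implementation may differ.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of load-once caching; not the QPL Thesaurus source. */
final class CachedThesaurus {
    // One cached instance per file path.
    private static final Map<String, CachedThesaurus> CACHE = new HashMap<>();

    final String path;
    final long loadedTimestamp;

    private CachedThesaurus(String path, long loadedTimestamp) {
        this.path = path;
        this.loadedTimestamp = loadedTimestamp;
        // A real implementation would parse the XML file here.
    }

    /** Loads using the file's last-modified time on disk (0L if the file is missing). */
    static CachedThesaurus load(String path) {
        File file = new File(path);
        return load(file.getAbsolutePath(), file.lastModified());
    }

    /** Reuses the cached instance unless the file is new or has a newer timestamp. */
    static synchronized CachedThesaurus load(String path, long lastModified) {
        CachedThesaurus cached = CACHE.get(path);
        if (cached == null || lastModified > cached.loadedTimestamp) {
            cached = new CachedThesaurus(path, lastModified);
            CACHE.put(path, cached);
        }
        return cached;
    }
}
```

In this sketch, two calls with the same path and timestamp return the same object, while a newer timestamp produces a fresh instance.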

You can also load multiple XML files:

Thesaurus tv2 = Thesaurus.load("src/test/resources/synonyms_v2.xml");
Thesaurus tv3 = Thesaurus.load("src/test/resources/synonyms_v3.xml");

In this case the two Thesaurus objects, tv2 and tv3, can return different synonym lists for the same input word.

Thesaurus XML Format

The QPL thesaurus format is an XML file with multiple entries:

<thesaurus>
  <entry>
    <both>San Francisco</both>
    <both>San Fran</both>
    <both>SF</both>
  </entry>
  <entry>
    <from>IT</from>
    <to>Information Technology</to>
  </entry>
</thesaurus>

The tags of the XML are as follows:

  • <thesaurus> - The root tag for the XML file
  • <entry> - Holds a complete "synonym set": words and phrases which all represent the same item
  • <from> - Specifies a word or phrase which can be a source of an expansion
    • These are the words and phrases as they are specified in the user's query
  • <to> - Specifies a word or phrase which can be a destination of an expansion
    • This is the word or phrase which will be added to the query
  • <both> - Specifies a word which can be both the source and destination of an expansion

The format allows any number of <from>, <to>, and <both> tags inside an entry. When an entry contains multiple tags, any word or phrase in the user's query that matches one of its <from> or <both> tags is expanded to all of the words and phrases in its <to> and <both> tags (see the example under Expansion Results).
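
To make the direction of the tags concrete, the sketch below reads a thesaurus document with the standard Java DOM parser and builds a map from each source phrase to its expansions. It is an illustration only: the class ThesaurusFormatSketch and its parse method are hypothetical, lower-casing the source phrases is an assumption, and the real Thesaurus class may store its data quite differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical reader for the thesaurus format; not the QPL implementation. */
final class ThesaurusFormatSketch {
    /** Maps each lower-cased source phrase to the phrases it expands to. */
    static Map<String, List<String>> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, List<String>> expansions = new LinkedHashMap<>();
            NodeList entries = doc.getElementsByTagName("entry");
            for (int i = 0; i < entries.getLength(); i++) {
                List<String> sources = new ArrayList<>(); // <from> and <both>
                List<String> targets = new ArrayList<>(); // <to> and <both>
                NodeList children = entries.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (child.getNodeType() != Node.ELEMENT_NODE) continue;
                    String tag = child.getNodeName();
                    String text = child.getTextContent().trim();
                    if (tag.equals("from") || tag.equals("both")) sources.add(text);
                    if (tag.equals("to") || tag.equals("both")) targets.add(text);
                }
                // Every source in the entry expands to every target in the entry.
                for (String source : sources) {
                    expansions.computeIfAbsent(source.toLowerCase(), k -> new ArrayList<>())
                              .addAll(targets);
                }
            }
            return expansions;
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid thesaurus XML", e);
        }
    }
}
```

With the sample file shown earlier, this sketch maps "sf" to San Francisco, San Fran, and SF, and maps "it" to Information Technology. There is no key for "information technology", because a <to> phrase is only a destination.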

Retrieving Synonym List

After the data file is loaded, one can use it to obtain the synonym list like this:

thesaurus.expand(1.0f, tokenizer, args)
  • The first argument is the boost value the expand method will assign to the generated synonyms when they are built into Operator objects.
  • The second argument is a Tokenizer used to tokenize the generated synonym list.
    • Matching is done on the original tokens.
    • The resulting tokens are tokenized using the supplied Tokenizer.
    • You can create tokenizers using QPLUtilities (see below) or via the Search Engine specific extensions.
  • The third argument is a string array of tokens, produced by splitting the original text on whitespace or punctuation and converting each token to lower case. When the token array is matched against the Thesaurus XML data, the expand method always matches the longest phrase found.
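
The "always match the largest phrase" rule can be sketched as a greedy left-to-right scan that tries the longest candidate phrase first. This is a rough illustration only: LongestMatchSketch, segment, and the maxPhraseLength parameter are hypothetical names, and the matching logic inside the real expand method is not documented here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** Hypothetical greedy longest-phrase matcher; not the QPL implementation. */
final class LongestMatchSketch {
    /**
     * Groups tokens into the longest known phrases, scanning left to right.
     * Tokens that start no known phrase are returned unchanged.
     */
    static List<String> segment(String[] tokens, Set<String> knownPhrases, int maxPhraseLength) {
        List<String> segments = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 1;            // fall back to the single token
            String segment = tokens[i];
            // Try the longest window first, then shrink it.
            for (int len = Math.min(maxPhraseLength, tokens.length - i); len > 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                if (knownPhrases.contains(candidate)) {
                    matched = len;
                    segment = candidate;
                    break;
                }
            }
            segments.add(segment);
            i += matched;
        }
        return segments;
    }
}
```

Given the tokens i, like, san, francisco, city and a phrase set containing both "san francisco" and "san francisco city", this sketch groups the last three tokens into the single phrase "san francisco city" instead of stopping at "san francisco".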

Expansion Results

The expand method returns a list of Operator objects.

Suppose you want the synonym list for the input string "I like San Francisco City" and the Thesaurus XML file contains data like this:

<thesaurus>
  <entry>
    <both>San Francisco</both>
    <both>San Fran</both>
    <both>SF</both>
  </entry>
  <entry>
    <both>San Francisco City</both>
    <both>SF City</both>
  </entry>
</thesaurus>

You will get a list of Operators like:

[term(i), term(like), or(phrase(term(SF), term(city)),phrase(term(san),term(francisco),term(city)))]

Note that the result is a list of operators in which the matched tokens have been replaced by their expansions. You will therefore probably want to wrap the expansion in an and() operator:

myQuery = and(thesaurus.expand(1.0f, tokenizer, ["i","like","san", "francisco"]));


Advanced Expansion Options

The full expand() method takes the following arguments:

 thesaurus.expand(float expansionFactor, Tokenizer tokenizer, OperatorType opType, Boolean requireFullMatch, Object... args)
  • expansionFactor - (float) The weighting of the synonyms relative to the original word
  • tokenizer - The tokenizer used to process the expansion text
  • opType - Specifies whether MAX or OR is used to combine the synonyms
    • Note: Use the ALL CAPS version of the operator (to indicate an operator type, instead of the operator itself)
  • requireFullMatch - (Boolean) If true, only match if all of the tokens match the thesaurus entry
  • args - The list of query tokens to be processed

Example:

 def expansions = multiWordThesaurus.expand(0.8, tokenizer(TO_LOWER+PUNCT), MAX, /*RequireFullMatch?*/false, tokens);
 println "MULTI-WORD EXPAND: " + expansions

More Examples

In summary, a complete code snippet for synonym expansion looks like this:

def queryS = "I like San Francisco City";
def thesaurus = Thesaurus.load("dict/synonyms_v3.xml");
// Lower-case the query, then split it on whitespace or punctuation
def userTokens = split(queryS.toLowerCase(), "[\\s]|[\\p{Punct}]")

def finalQuery = and(thesaurus.expand(1.0f, tokenizer(TO_LOWER), userTokens));

Note: See QPL Tokenizers for more information on creating and using tokenizers.
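
To see what the split pattern used above produces, the sketch below applies the same regular expression with the standard java.lang.String.split method. This is a stand-in chosen for illustration: the QPL split() helper may treat empty tokens differently, and the class TokenSplitSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demonstration of the split pattern using plain Java. */
final class TokenSplitSketch {
    /** Lower-cases the query and splits on any whitespace or punctuation character. */
    static String[] userTokens(String query) {
        return query.toLowerCase().split("[\\s]|[\\p{Punct}]");
    }

    /** Same as userTokens, but drops the empty strings left by adjacent delimiters. */
    static List<String> nonEmptyTokens(String query) {
        List<String> tokens = new ArrayList<>();
        for (String token : userTokens(query)) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

With String.split, "I like San Francisco City" yields the five tokens i, like, san, francisco, city. An input such as "SF, CA" yields an empty string between "sf" and "ca", because the comma and the space each count as a separate delimiter. Filtering empty tokens, as nonEmptyTokens does, may be worth considering before passing the array to expand.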