Difference between revisions of "Lemmatizer"

From wiki.searchtechnologies.com

Revision as of 17:11, 3 March 2017

Installation

Lemmatizer is not bundled with QPL automatically; you need to build the jar file and copy it to the Solr lib directory. Get the latest source from https://subversion.searchtechnologies.com/svn/linguistics/trunk/lemmatizer/ and run mvn package, which should produce a lemmatizer-x.x-SNAPSHOT.jar file.


To install the Lemmatizer in a single-core Solr (see the note about multi-core below), do the following:

  • Create a "lib" directory (if it doesn't already exist) inside your "solr" directory.
    This should be at the same level as the Solr "conf" directory.
  • Copy the lemmatizer-x.x-SNAPSHOT.jar file into your new lib directory:

<SolrDeployPath>/example/solr-webapp/webapp/WEB-INF/lib

Notes

  • For a multi-core implementation, you need to deploy the files to the application server instead.
  • Your jar file may have a different x.x version number, depending on the pom.xml of the lemmatizer project.

Lemmatizer

The Lemmatizer reduces a term to its root word or expands it to create new, relevant variants. The following code creates a Lemmatizer:

 Lemmatizer lemmatizer = LemmatizerFactory.createLemmatizer();

Notes:

  • You can reuse the same lemmatizer object, or call LemmatizerFactory.createLemmatizer(), which is a singleton, to get a reference to the lemmatizer in memory.

Creating Stem Words

Once the lemmatizer is initialized, you can call public String stem(String term) to get the stem (the lemmatized form) of the word passed in.

lemmatizer.stem("babies") 

produces the stem word "baby".


How does it work?

1) The word "babies" is processed through the reduction rules. Each reduction rule has a "remove part", an "add part", and the parts of speech for which it is allowed.


The default reduction rules are:

short allowPos = 0;
// #1
allowPos = DictEntry.NOUN_POS_BIT + DictEntry.VERB_POS_BIT;
ReductionRule ruleEntry = new ReductionRule("ies", "y", allowPos);
ruleTable.add(ruleEntry);

// #3
ruleEntry = new ReductionRule("s", "", allowPos);
ruleTable.add(ruleEntry);

// #2
ruleEntry = new ReductionRule("es", "e", allowPos);
ruleTable.add(ruleEntry);

// #2'
ruleEntry = new ReductionRule("es", "", allowPos);
ruleTable.add(ruleEntry);

// #4
allowPos = DictEntry.ADJECTIVE_POS_BIT;
ruleEntry = new ReductionRule("iest", "y", allowPos);
ruleTable.add(ruleEntry);

// #5
ruleEntry = new ReductionRule("est", "", allowPos);
ruleTable.add(ruleEntry);

// #6
ruleEntry = new ReductionRule("ier", "y", allowPos);
ruleTable.add(ruleEntry);

// #7
ruleEntry = new ReductionRule("er", "", allowPos);
ruleTable.add(ruleEntry);


2) The word "babies" is run through each of the above rules. Each produced word is checked against the dictionary, and the rule is also checked against the "pos" recorded for that word in the dictionary.


3) How does it work for "babies"?


The rule applied to "babies" is rule #1: "ies" is removed and "y" is added as the replacement, producing the word "baby". The dictionary defines "baby" as a noun and a verb, so it passes both tests: 1) "baby" exists in the dictionary, and 2) the rule's allowPos = DictEntry.NOUN_POS_BIT + DictEntry.VERB_POS_BIT; matches its parts of speech. The stem word is therefore "baby".


<word text="baby" pos="n,v">
 <variants>
   <variant text="babies" pos="v"/>
   <variant text="babied" pos="v"/>
 </variants>
</word>


4) If we do the same thing for the word "focus", step 1 reduces it to "focu", as if "focus" were a plural word (which it is not). Step 2 then rejects it because there is no word "focu" in the dictionary, so "focus" is not stemmed to "focu", which is the desired result.
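
The steps above can be sketched as a small self-contained toy. This is not the Lemmatizer's actual source: the class name ToyStemmer, the POS bit values (1, 2, 4), and the three-word dictionary are assumptions made for illustration, and the real DictEntry constants and dictionary format may differ. It only shows the idea of applying each rule, then accepting the result when the candidate exists in the dictionary and its parts of speech overlap with the rule's allowed ones.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of rule-based stemming with a dictionary check.
// All names and bit values are illustrative, not the library's.
class ToyStemmer {
    // Hypothetical part-of-speech bits (the real DictEntry values may differ).
    static final short NOUN = 1;
    static final short VERB = 2;
    static final short ADJECTIVE = 4;

    // One reduction rule: suffix to remove, suffix to add, allowed POS bits.
    static final class Rule {
        final String remove;
        final String add;
        final short allowPos;

        Rule(String remove, String add, short allowPos) {
            this.remove = remove;
            this.add = add;
            this.allowPos = allowPos;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final Map<String, Short> dictionary = new HashMap<>();

    ToyStemmer() {
        short nounVerb = (short) (NOUN | VERB);
        // Same suffix pairs, in the same order, as the default rules listed above.
        rules.add(new Rule("ies", "y", nounVerb));
        rules.add(new Rule("s", "", nounVerb));
        rules.add(new Rule("es", "e", nounVerb));
        rules.add(new Rule("es", "", nounVerb));
        rules.add(new Rule("iest", "y", ADJECTIVE));
        rules.add(new Rule("est", "", ADJECTIVE));
        rules.add(new Rule("ier", "y", ADJECTIVE));
        rules.add(new Rule("er", "", ADJECTIVE));

        // Tiny stand-in dictionary: word -> POS bits.
        dictionary.put("baby", nounVerb);
        dictionary.put("focus", nounVerb);
        dictionary.put("happy", ADJECTIVE);
    }

    // Returns the first candidate that is in the dictionary with a matching
    // part of speech, or the term unchanged when no rule produces one.
    String stem(String term) {
        for (Rule rule : rules) {
            if (!term.endsWith(rule.remove)) {
                continue;
            }
            String candidate =
                term.substring(0, term.length() - rule.remove.length()) + rule.add;
            Short pos = dictionary.get(candidate);
            if (pos != null && (pos & rule.allowPos) != 0) {
                return candidate;
            }
        }
        return term;
    }
}
```

In this toy, new ToyStemmer().stem("babies") returns "baby", while stem("focus") returns "focus" unchanged because "focu" is not in the dictionary. stem("babiest") is also left unchanged: the "iest" rule produces "baby", but that rule is allowed only for adjectives and the toy dictionary lists "baby" as a noun and verb.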


Creating Expand Words

Once the lemmatizer is initialized, you can call public List<String> expand(String term) to reduce a word to its stem and then expand it to its variations.

lemmatizer.expand("babies")

produces the expanded list of words "[baby, babys, babyes]".

You can optionally choose not to reduce (stem) the word before expanding by passing a boolean to public List<String> expand(String term, boolean reduceVariants). This method first reduces the word to its root word if reduceVariants is true, or to its dictionary word if reduceVariants is false. It then applies all suffix rules to create new words and returns the list of variations.

lemmatizer.expand("plumbers",true)
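
As a rough illustration of the reduce-then-expand flow, the toy below assumes that expansion simply appends each suffix to the base word. That assumption matches the "[baby, babys, babyes]" output shown earlier but is not confirmed by the library's source. The class ToyExpander, its lookup table, and its suffix list are hypothetical, and the reduceVariants=false branch is simplified to "use the term as passed in".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of expand(): optionally reduce to a stem, then append suffixes.
// Not the library's implementation; names and data are illustrative only.
class ToyExpander {
    // Hypothetical variant -> stem lookup standing in for the real reduction step.
    private static final Map<String, String> STEMS = new HashMap<>();
    // Hypothetical suffixes appended during expansion.
    private static final List<String> SUFFIXES = Arrays.asList("s", "es");

    static {
        STEMS.put("babies", "baby");
        STEMS.put("plumbers", "plumber");
    }

    // Reduces first (as the one-argument method is described to do), then expands.
    static List<String> expand(String term) {
        return expand(term, true);
    }

    static List<String> expand(String term, boolean reduceVariants) {
        // Assumption: when reduceVariants is false, the term is used as given.
        String base = reduceVariants ? STEMS.getOrDefault(term, term) : term;
        List<String> result = new ArrayList<>();
        result.add(base);
        for (String suffix : SUFFIXES) {
            result.add(base + suffix);
        }
        return result;
    }
}
```

With this sketch, ToyExpander.expand("babies") yields [baby, babys, babyes], and ToyExpander.expand("babies", false) yields [babies, babiess, babieses], which shows why reducing before expanding usually gives more useful variants.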


Adding and Resetting Custom Rules

You can add custom rules to the lemmatization process by calling public void addCustomRule(String suffixToCheck, String suffixToAdd, short allowedPOS), where:

  • suffixToCheck: the suffix to check for in the word during reduction
  • suffixToAdd: the suffix to add to the word during reduction
  • allowedPOS: the allowed parts of speech

lemmatizer.addCustomRule("er", "or", (short)3);
lemmatizer.addCustomRule("er", "ors", (short)3);

You can clear rules by calling public void clearRules(boolean clearAll). If clearAll is true, it clears all rules from the current in-memory instance; if clearAll is false, it removes only the custom rules that were set through addCustomRule(...).

lemmatizer.clearRules(false);
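
One way to picture the difference between the two clearRules modes is a rule table that remembers which entries were added as custom rules. The sketch below is only an illustration under that assumption: ToyRuleTable, its custom flag, and its two seeded default rules are hypothetical, and the library may track custom rules differently.

```java
import java.util.ArrayList;
import java.util.List;

// Toy rule table that distinguishes default rules from custom ones.
// Illustrative only; not the Lemmatizer's internal data structure.
class ToyRuleTable {
    static final class Rule {
        final String suffixToCheck;
        final String suffixToAdd;
        final short allowedPOS;
        final boolean custom; // hypothetical marker for rules added by the caller

        Rule(String suffixToCheck, String suffixToAdd, short allowedPOS, boolean custom) {
            this.suffixToCheck = suffixToCheck;
            this.suffixToAdd = suffixToAdd;
            this.allowedPOS = allowedPOS;
            this.custom = custom;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    ToyRuleTable() {
        // Two default rules for the demo; the POS value 3 is a placeholder.
        rules.add(new Rule("ies", "y", (short) 3, false));
        rules.add(new Rule("s", "", (short) 3, false));
    }

    void addCustomRule(String suffixToCheck, String suffixToAdd, short allowedPOS) {
        rules.add(new Rule(suffixToCheck, suffixToAdd, allowedPOS, true));
    }

    void clearRules(boolean clearAll) {
        if (clearAll) {
            rules.clear();                  // drop default and custom rules
        } else {
            rules.removeIf(r -> r.custom);  // drop only the custom rules
        }
    }

    int size() {
        return rules.size();
    }
}
```

After two addCustomRule calls this table holds four rules; clearRules(false) brings it back to the two defaults, and clearRules(true) empties it.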