Tagger 0.4

From wiki.searchtechnologies.com



Text Tagger
Description: Scans the document for occurrences of words and phrases and adds counts of the phrases to the document's <tags> element.

Uses a number of "tag files", each containing a list of phrases to match and, optionally, their synonyms. The tagger counts the occurrences of the tags and synonyms and outputs the counts as XML. Output is kept separate per tag file, so the tagger can tag documents against different lists without mixing the results. The tagger tags the <content> element by default, but can tag multiple fields from the Aspire document. Optionally, for a field that is marked as the document body, the tagger separately counts occurrences of words within a certain distance of the start of the text, allowing subsequent stages to bias on proximity to the start of the text.

Inputs: <content> by default, but potentially any field from the Aspire document, plus one or more tag files
Outputs: <tags>
Factory: aspire-tag-text
Sub Type: default
Object Type: AspireDocument


Configuration

Element | Type | Default | Description
output | String | tags | The base output element for the tags in the Aspire document


Configuration of Content to Tag

By default, the tagger processes the <content> tag from the Aspire document. However, it is possible to configure it to tag other fields.

Element | Type | Default | Description
tagFields/tagField/@field | String | None (must be specified) | The field to process
tagFields/tagField/@isBody | boolean | false | Flag to indicate this field contains the document body
tagFields/tagField/@startTokens | int | 0 | For the document body only, the number of tokens from the start of the text to consider separately as being near the document start

Note:

  • If no <tagField> tags exist, the tagger defaults to processing the <content> element as the document body.
  • More than one <tagField> element may be used, to tag more than one field.
  • If any <tagField> tags are used, you MUST specify ALL fields to tag, including the <content> if required.
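
The effect of isBody and startTokens can be illustrated with a small sketch. This is not the tagger's implementation: the names tokenize and count_phrase are hypothetical, the regular expression is only a rough stand-in for the Lucene tokenizer and lowercase filter, and the sketch assumes that a match is classified by the position of its first token and that matches near the start are counted separately from the rest of the body.

```python
import re


def tokenize(text):
    # Rough stand-in for StandardTokenizer + LowerCaseFilter (assumption,
    # not equivalent to the Lucene classes).
    return re.findall(r"\w+", text.lower())


def count_phrase(text, phrase, start_tokens=0):
    """Return (near_start, rest) counts for phrase in text.

    Assumption: a match whose first token index is below start_tokens
    counts as "near the start"; every other match counts as "rest".
    """
    tokens = tokenize(text)
    target = tokenize(phrase)
    near_start = rest = 0
    if not target:
        return near_start, rest
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            if i < start_tokens:
                near_start += 1
            else:
                rest += 1
    return near_start, rest


body = ("Senior manager responsible for managing the management team. "
        "Management reports weekly.")
management = count_phrase(body, "management", start_tokens=7)
responsible = count_phrase(body, "responsible for", start_tokens=3)
no_window = count_phrase(body, "management", start_tokens=0)
```

With startTokens left at its default of 0, no match falls in the "near the start" bucket, which is why no_window reports every occurrence in the second count.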


Tag List Configuration

The tagger requires at least one tag list in order to tag documents. Each tag list is a text file containing a list of phrases and their synonyms. Each phrase should appear on a new line, and any synonyms should appear on the lines that follow it, each preceded by a + symbol.

Element | Type | Default | Description
tagLists/tagList/@id | String | None (must be specified) | An identifier for the tags. This will be output in the Aspire document against any tags from this file that are identified.
tagLists/tagList/@tagFile | String | None (must be specified) | The path to the file containing the tags. Relative to $ASPIRE_HOME

Note:

  • More than one <tagList> may be specified.


Example tag file

UK
	+United Kingdom
	+Great Britain
	+Wales
	+Scotland
	+England
	+English
	+British
	+Briton
	+Scottish
	+Welsh

USA
	+United States of America
	+United States
	+America
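
To make the file format concrete, the following is a minimal sketch of how such a file could be read into a phrase-to-synonyms mapping. It is not the tagger's own parser; the function name parse_tag_file is hypothetical, and the sketch assumes that blank lines only separate entries and that whitespace before the + symbol is not significant.

```python
def parse_tag_file(text):
    """Parse tag-file text into {phrase: [synonyms]} (illustrative only)."""
    tags = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines only separate entries
        if line.startswith("+"):
            if current is None:
                raise ValueError("synonym line before any phrase: %r" % raw)
            # Drop the leading + and attach the synonym to the last phrase.
            tags[current].append(line[1:].strip())
        else:
            current = line
            tags[current] = []
    return tags


sample = "UK\n\t+United Kingdom\n\t+Great Britain\n\nUSA\n\t+United States\n"
parsed = parse_tag_file(sample)
```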

Tokeniser Configuration

By default, the tagger uses the classes org.apache.lucene.analysis.standard.StandardTokenizer and org.apache.lucene.analysis.LowerCaseFilter to tokenize and lowercase the document text and the phrases from the tag files. However, these may be overridden if required.

Element | Type | Default | Description
tokenProcessing/tokenizer/@class | String | org.apache.lucene.analysis.standard.StandardTokenizer | String representing the class to use for the tokeniser. Must conform to the parameters/return type of org.apache.lucene.analysis.standard.StandardTokenizer
tokenProcessing/tokenizer/@jar | String | None (built in) | Jar file the tokenizer class file exists in. Relative to $ASPIRE_HOME
tokenProcessing/tokenFilter/@class | String | org.apache.lucene.analysis.LowerCaseFilter | String representing the class to use for the token filter. Must conform to the parameters/return type of org.apache.lucene.analysis.LowerCaseFilter
tokenProcessing/tokenFilter/@jar | String | None (built in) | Jar file the token filter class file exists in. Relative to $ASPIRE_HOME

Note:

  • More than one token filter may be used
  • The <tokenizer> and <tokenFilter> elements may contain further attributes. If the configured class implements the AspireInitializer interface, it is called with the config element of the component. This allows the class to be initialised with any required information.


Example Configurations

Simple

 <component name="tagger" subType="default" factoryName="aspire-tag-text">
   <config>
     <tagLists>
       <tagList id="geo" tagFile="data/tagFiles/geo.txt"/>
     </tagLists>
   </config>
 </component>

Complex

 <component name="tagger" subType="default" factoryName="aspire-tag-text">
   <config>
     <output>tags</output>
     <tagFields>
       <tagField field="title"/>
       <tagField field="content" isBody="true" startTokens="20"/>
     </tagFields>
     <tagLists>
       <tagList id="geo" tagFile="data/tagFiles/geo.txt"/>
       <tagList id="hr" tagFile="data/tagFiles/hr.txt"/>
       <tagList id="sport" tagFile="data/tagFiles/sports.txt"/>
     </tagLists>
     <tokenProcessing>
       <tokenizer class="org.apache.lucene.analysis.standard.StandardTokenizer"/>
       <tokenFilter class="org.apache.lucene.analysis.LowerCaseFilter"/>
       <tokenFilter jar="lib/aspire-lemmatizer.jar" class="org.apache.lucene.analysis.LemmatizerFilter" dictionary="data/dict/gcide_out.xml"/>
     </tokenProcessing>
   </config>
 </component>

Example Output

The following is a sample output from the tagger. Note: this sample does not show synonyms or sub-tags (an example of those will be added later).

 <doc>
   .
   .
   .
   <tags source="textTagger">
       <category name="responsibility">
           <tag body="1" name="administrator"/>
           <tag body="1" name="responsible for" topBody="1"/>
           <tag name="senior" topBody="1"/>
           <tag body="1" name="managing"/>
           <tag body="3" name="management" topBody="2"/>
       </category>
   </tags>
 </doc>
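
A later stage that wants to use these counts has to read them back out of the <tags> element. The following is a minimal sketch using Python's standard xml.etree.ElementTree module on a trimmed copy of the sample above. The function name read_tag_counts is hypothetical, and treating a missing body or topBody attribute as zero is an assumption, not documented behaviour.

```python
import xml.etree.ElementTree as ET

sample = """<doc>
  <tags source="textTagger">
    <category name="responsibility">
      <tag body="1" name="administrator"/>
      <tag name="senior" topBody="1"/>
      <tag body="3" name="management" topBody="2"/>
    </category>
  </tags>
</doc>"""


def read_tag_counts(doc_xml, output="tags"):
    """Return {(category, tag): {"body": n, "topBody": m}} from a document.

    output is the base output element name (the "output" config setting).
    Assumption: an absent attribute means a count of zero.
    """
    root = ET.fromstring(doc_xml)
    counts = {}
    for category in root.findall("./%s/category" % output):
        for tag in category.findall("tag"):
            key = (category.get("name"), tag.get("name"))
            counts[key] = {
                "body": int(tag.get("body", "0")),
                "topBody": int(tag.get("topBody", "0")),
            }
    return counts


counts = read_tag_counts(sample)
```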