Entity Extractor (Aspire 2)

For Information on Aspire 3.1 Click Here

Factory Name  com.searchtechnologies.aspire:aspire-tokenizer
subType  extractor
Inputs  InputStream (set in the contentStream or contentBytes variable) which contains the content to be parsed.
Outputs  Sets doc.content to the input text, marked up in-line with the extracted matches.

Takes the input stream and looks up token patterns from the stream in a given dictionary. The output is the content with matching phrases gathered into single tokens. The matches are also tagged with TokenFlags.WHOLE_TOKEN.


Configuration

Element | Type | Default | Description
dictionaryFile | String | none | Required. The location of the dictionary file, which contains the list of entries to be matched against the token stream (see below). Multiple files may be added using the config options shown below.
dictionaryOffset | Integer | 0 | The number of lines (entries) to skip from the beginning of the dictionary. This setting is mostly used to avoid loading all of a very large dictionary.
dictionaryEntries | Integer | 0 | The number of lines (entries) to load from the dictionary, starting from dictionaryOffset. Zero means load the entire file. This setting is mostly used to avoid loading all of a very large dictionary.
extractorName | String | "Extractor" | This name is logged with each hit, so that hits from multiple Extractors can be differentiated.
normalize | boolean | false | If true, when a hit is found, the term is changed to the target text, if any.
debug | boolean | false | When true, a number of debug printouts are activated.

Example Configurations

<component name="MainDictLookup" subType="extractor" factoryName="aspire-tokenizer">
  <extractorName>Main Terms</extractorName>
  <dictionaryFile>testdata/nse.txt</dictionaryFile>
</component>
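
The optional elements from the Configuration section can be combined in the same way. The following is only a sketch: the component name, extractor name, file path, and numeric values are hypothetical and chosen for illustration, and the elements are assumed to be direct children of the component, as in the example above. Only the element names themselves come from the configuration reference.

```xml
<!-- Hypothetical names, path, and values; only the element names come from the Configuration section -->
<component name="LargeDictLookup" subType="extractor" factoryName="aspire-tokenizer">
  <extractorName>Large Terms</extractorName>
  <dictionaryFile>testdata/large-terms.txt</dictionaryFile>
  <!-- Skip the first 1000 entries of the dictionary -->
  <dictionaryOffset>1000</dictionaryOffset>
  <!-- Load 5000 entries starting from dictionaryOffset (0 means load the entire file) -->
  <dictionaryEntries>5000</dictionaryEntries>
  <!-- When a hit is found, change the term to the target text, if any -->
  <normalize>true</normalize>
  <!-- Activate the extra printouts, e.g. while testing a new dictionary -->
  <debug>true</debug>
</component>
```

Giving each extractor a distinct extractorName, as here, keeps the logged hits distinguishable when several extractors run in the same pipeline.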