Entity Extractor 0.4

From wiki.searchtechnologies.com

For Information on Aspire 3.1 Click Here


Entity Extractor Stage
Description: Takes the input stream and looks up token patterns from the stream in a given dictionary.
Inputs: object['contentStream'] or object['contentBytes'] to fetch the content to be parsed.
Outputs: The content with matching phrases gathered into single tokens. The matches are also tagged with TokenFlags.WHOLE_TOKEN.
Factory: aspire-tokenizer
Sub Type: extractor
Object Type: AspireDocument

Configuration

Element | Type | Default | Description
dictionaryFile | String | none | Location of the dictionary file, which contains the list of entries to be matched against the token stream (see below).
dictionaryEntries | Integer | 0 | The number of lines (entries) to load from the dictionary. Zero means load the entire file.
extractorName | String | "Extractor" | This name is logged with each hit, so that hits from multiple Extractors can be differentiated.


Example Configurations

 <component name="MainDictLookup" subType="extractor" factoryName="aspire-tokenizer">
   <config>
     <extractorName>Main Terms</extractorName>
     <dictionaryFile>testdata/nse.txt</dictionaryFile>
   </config>
 </component>
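
As a further sketch, the following shows how two extractors might be configured side by side, using only the elements listed in the Configuration table. The component name "SecondaryDictLookup", the extractor name "Secondary Terms", the file testdata/secondary.txt, and the entry limit of 1000 are hypothetical values chosen for illustration; they do not come from the Aspire distribution.

 <!-- Loads the whole dictionary (dictionaryEntries defaults to 0) -->
 <component name="MainDictLookup" subType="extractor" factoryName="aspire-tokenizer">
   <config>
     <extractorName>Main Terms</extractorName>
     <dictionaryFile>testdata/nse.txt</dictionaryFile>
   </config>
 </component>

 <!-- Hypothetical second extractor: loads only the first 1000 entries -->
 <component name="SecondaryDictLookup" subType="extractor" factoryName="aspire-tokenizer">
   <config>
     <extractorName>Secondary Terms</extractorName>
     <dictionaryFile>testdata/secondary.txt</dictionaryFile>
     <dictionaryEntries>1000</dictionaryEntries>
   </config>
 </component>

Giving each extractor its own extractorName means the hits logged by each one can be told apart.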

Example Output, Dictionary File Format, etc.

See Extractor for the Token Filter version of this component.