Tokenization Library (Aspire 2)

From wiki.searchtechnologies.com

For information on Aspire 3.1, see the Aspire 3.1 documentation.

The following are the token processors currently available:

Tokenizers

Token Filters

  • Acronym Combiner - Converts acronyms written with periods (e.g. N.A.S.A.) into combined acronyms (e.g. NASA)
  • Case Filter - Converts tokens to any of lower-, upper-, or title-case
  • Case Recorder - Sets flags to keep track of the cases of terms
  • Character Change Splitter - Divides tokens when the character type changes
  • Contains Filter - Not yet implemented - Filters all tokens containing a string or regular expression
  • Flags Filter - Removes tokens which have particular flags set and/or are matches
  • HTML Entity Decoder - Converts entities (e.g. &gt;) into the actual characters (e.g. >)
  • Lower Case Filter - Converts tokens to lower case
  • Numbers Filter - Removes tokens which are entirely digits
  • Paragraph Filter - Removes whole paragraphs, based on hash code
  • Punctuation Filter - Removes tokens which are entirely punctuation
  • Single Character Filter - Removes tokens that are only one character long
  • Stop Words Filter - Removes tokens from a stop words list
  • Tags Filter - Removes HTML/XML-type tags
  • Token Length Filter - Not yet implemented - Filters all tokens less than a minimum length and/or greater than a maximum length
  • Tokens And Pairs - Converts a token stream into a stream of tokens and token pairs
  • Token Combinations - Converts a token stream into a stream of token pairs/triples/quads/etc.
  • Type Filter - Removes all tokens of particular type(s), as specified by an Extractor
  • Window Filter - Removes all tokens except those within particular windows
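
The filters listed above are designed to be chained, each one consuming the token stream produced by the one before it. The following is a minimal Python sketch of that idea, not Aspire's actual API or configuration format: every function name here (combine_acronyms, drop_punctuation, run_pipeline, and so on) is hypothetical, punctuation is assumed to mean ASCII punctuation, and the order in which Tokens And Pairs emits pairs is also an assumption.

```python
import re
import string

# Hypothetical stand-ins for a few of the filters listed above.
# Each filter takes an iterable of tokens and yields tokens.

_ACRONYM = re.compile(r"(?:[A-Za-z]\.){2,}")

def combine_acronyms(tokens):
    """Acronym Combiner: 'N.A.S.A.' becomes 'NASA'."""
    for tok in tokens:
        yield tok.replace(".", "") if _ACRONYM.fullmatch(tok) else tok

def lower_case(tokens):
    """Lower Case Filter."""
    for tok in tokens:
        yield tok.lower()

def drop_punctuation(tokens):
    """Punctuation Filter: drop tokens that are entirely ASCII punctuation."""
    for tok in tokens:
        if not (tok and all(ch in string.punctuation for ch in tok)):
            yield tok

def drop_numbers(tokens):
    """Numbers Filter: drop tokens that are entirely digits."""
    for tok in tokens:
        if not tok.isdigit():
            yield tok

def drop_single_characters(tokens):
    """Single Character Filter: drop tokens of length one."""
    for tok in tokens:
        if len(tok) != 1:
            yield tok

def make_stop_words_filter(stop_words):
    """Stop Words Filter: build a filter from a stop words list."""
    stop = frozenset(stop_words)
    def stop_words_filter(tokens):
        for tok in tokens:
            if tok not in stop:
                yield tok
    return stop_words_filter

def tokens_and_pairs(tokens):
    """Tokens And Pairs: emit each token, plus a pair with its predecessor."""
    previous = None
    for tok in tokens:
        yield tok
        if previous is not None:
            yield f"{previous} {tok}"
        previous = tok

def run_pipeline(tokens, filters):
    """Apply the filters in order and collect the resulting tokens."""
    stream = iter(tokens)
    for token_filter in filters:
        stream = token_filter(stream)
    return list(stream)

tokens = ["N.A.S.A.", "launched", "the", "rocket", "in", "1969", "!", "a"]
filters = [
    combine_acronyms,
    lower_case,
    drop_punctuation,
    drop_numbers,
    drop_single_characters,
    make_stop_words_filter(["the", "in"]),
]
filtered = run_pipeline(tokens, filters)
print(filtered)                          # ['nasa', 'launched', 'rocket']
print(list(tokens_and_pairs(filtered)))
```

Order matters in a chain like this: the acronym combiner runs before the case filter so that it still sees the original capitalization, and the stop words list is given in lower case because it is applied after lower-casing.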

Entity Extraction

Token Statistics

  • Count Characters - Counts a variety of character types across one or more documents
  • Count Tokens - Counts tokens across one or more documents
  • Gather Token Statistics - Computes a variety of statistics in a single pass over all unique tokens processed
  • Hash Code - Computes a hash signature for a block of text
  • Token Docs Histogram - Computes a histogram which counts the number of documents containing each unique token
  • Tokens Histogram - Computes a histogram which counts the number of occurrences of all unique tokens across all documents
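
The difference between the two histograms above is easy to miss: Tokens Histogram counts every occurrence of a token, while Token Docs Histogram counts each token at most once per document. The sketch below illustrates that distinction, along with a simple hash signature for a block of text. It is an illustration only and does not reflect Aspire's implementation; the function names are hypothetical, and SHA-1 is an assumed choice of hash, since the page does not say which algorithm Hash Code uses.

```python
import hashlib
from collections import Counter

def tokens_histogram(documents):
    """Count occurrences of each unique token across all documents."""
    counts = Counter()
    for doc_tokens in documents:
        counts.update(doc_tokens)
    return counts

def token_docs_histogram(documents):
    """Count the number of documents containing each unique token."""
    counts = Counter()
    for doc_tokens in documents:
        counts.update(set(doc_tokens))  # each token counted once per document
    return counts

def hash_code(text):
    """Hash signature for a block of text (SHA-1 is an assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

documents = [
    ["search", "engine", "search"],
    ["engine", "index"],
]
print(tokens_histogram(documents)["search"])      # 2
print(token_docs_histogram(documents)["search"])  # 1
print(hash_code("some paragraph") == hash_code("some paragraph"))  # True
```

A signature like this is also what makes a hash-based paragraph filter possible: identical paragraphs produce identical signatures, so repeated blocks can be recognized without comparing their full text.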

Document Scoring

Miscellaneous