Diverse languages (Aspire 2)

From wiki.searchtechnologies.com

For information on Aspire 3.1, see the Aspire 3.1 documentation.

Aspire can crawl content in many languages from content repositories, process it with pipelines built from content processing components, and publish it to target applications, typically search engines.

Most of Aspire is indifferent to encoding and language: it is designed as a content processor that handles all documents as UTF-8 throughout the entire stack. Language and encoding matter most in the components that actually inspect and manipulate the internal text stream.

  • Aspire's default text tokenization service uses the Lucene language analyzers. It provides simple tokenization, but not stemming, lemmatization, or other forms of text processing.
  • The tokenization pipeline stages in the Tokenization Library typically rely on externalized files, which should work the same regardless of what language they're in.
  • In addition, there is a packaged Aspire Basis Tokenizer (requires a license from Basis) that also supports many languages.
  • Aspire's open architecture is flexible enough to integrate many other language processors, both open source and commercial.
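
To illustrate why a stage driven by an externalized file can be language-independent, here is a minimal sketch in Python. It is not Aspire code and does not use any Aspire or Lucene API: the function names (`decode_utf8`, `load_word_list`, `remove_listed_tokens`) and the word-list format (one entry per line, `#` for comments) are assumptions made for the example. The idea is that bytes are decoded to UTF-8 text once at the boundary, and the stage then only compares Unicode strings against whatever the external file contains.

```python
import io
import unicodedata


def decode_utf8(raw):
    # Hypothetical ingestion step: decode once, so every later stage
    # works on Unicode text and never on raw bytes.
    return raw.decode("utf-8")


def normalize(token):
    # NFC plus case folding, so composed/decomposed and upper/lower-case
    # forms of the same word compare equal in any script.
    return unicodedata.normalize("NFC", token).casefold()


def load_word_list(stream):
    # Assumed file format: one entry per line, blank lines and lines
    # starting with '#' are ignored.
    words = set()
    for line in stream:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            words.add(normalize(entry))
    return words


def remove_listed_tokens(tokens, word_list):
    # The stage itself has no language-specific logic; its behavior
    # comes entirely from the externalized word list.
    return [t for t in tokens if normalize(t) not in word_list]


# Stand-in for an externalized file that mixes several languages.
word_list_file = io.StringIO("# mixed-language list\nthe\nder\nи\n")
word_list = load_word_list(word_list_file)

text = decode_utf8("Der Hund и the кот".encode("utf-8"))
print(remove_listed_tokens(text.split(), word_list))
```

Whitespace splitting stands in for real tokenization here and would not work for languages written without spaces; in practice that step would be handled by a language-aware analyzer.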