Semantic Co-occurrence Solution Overview (Aspire 2)


For information on Aspire 3.1, see the Aspire 3.1 documentation.


Co-occurrence or cooccurrence is a linguistics term that can either mean concurrence / coincidence or, in a more specific sense, the above-chance frequent occurrence of two terms from a text corpus alongside each other in a certain order. Co-occurrence in this linguistic sense can be interpreted as an indicator of semantic proximity or an idiomatic expression. In contrast to collocation, co-occurrence assumes interdependency of the two terms. A co-occurrence restriction is identified when linguistic elements never occur together. (Source: Wikipedia)

The co-occurrence or collocation of words into short phrases (2-4 words) is useful for tagging content and for query enhancement: attaching a level of meaning to these phrases improves the relevancy of result sets.
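As an illustration of what "above-chance" can mean in practice, the sketch below scores adjacent ordered word pairs with pointwise mutual information (PMI). PMI is a common textbook measure and is used here only as an assumption; this page does not state which statistic the Aspire components apply. The function name `ordered_pair_pmi` and the sample sentence are hypothetical.

```python
import math
from collections import Counter


def ordered_pair_pmi(tokens):
    """Return the PMI of every adjacent ordered token pair in `tokens`.

    A positive score means the pair appears together more often than
    the individual token frequencies alone would predict.
    """
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))  # (first, second): order is kept
    n_tokens = len(tokens)
    n_pairs = max(n_tokens - 1, 1)

    scores = {}
    for (first, second), count in pairs.items():
        p_pair = count / n_pairs
        p_first = unigrams[first] / n_tokens
        p_second = unigrams[second] / n_tokens
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Hypothetical sample text, for illustration only.
tokens = "new york is big and new york is busy".split()
scores = ordered_pair_pmi(tokens)
```

Because the pairs are ordered, `("new", "york")` receives a score while `("york", "new")` does not appear at all. PMI is known to favor rare pairs, so a real pipeline would normally combine it with a minimum frequency threshold.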

The components described in this section use Wikipedia as a source of phrases, and DBpedia and Wikilinks to add semantic meaning to those phrases. The basic architecture is as follows:

[Figure: Semantic Co Architectures generic.png]

Execution order

After the Aspire HDFS feed:

  1. Token Processing
  2. Token Statistics
  3. Statistical Phrases
    1. Token Merge (with content field)
    2. Document Merge
    3. Statistical Phrases Component
    4. Sort Phrases By Weight
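The statistical phrase steps above can be pictured with a small in-memory sketch: tokenize each document, merge the per-document n-gram counts into corpus totals, keep phrases that recur, and sort by weight. This is not the Aspire implementation. The names `ngrams` and `statistical_phrases`, the whitespace tokenizer, the `min_count` threshold, and the use of raw frequency as the weight are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, min_n=2, max_n=4):
    """Yield every phrase of min_n..max_n consecutive tokens."""
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def statistical_phrases(documents, min_count=2):
    """Return (phrase, weight) pairs sorted by descending weight.

    Weight is the raw corpus frequency here, which is an assumption;
    ties are broken alphabetically so the output is deterministic.
    """
    merged = Counter()
    for text in documents:
        tokens = text.lower().split()   # stand-in for Token Processing
        merged.update(ngrams(tokens))   # per-document counts merged into totals
    kept = {phrase: count for phrase, count in merged.items() if count >= min_count}
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample corpus, for illustration only.
docs = [
    "open source search engine",
    "enterprise search engine",
    "open source software",
]
ranked = statistical_phrases(docs)
```

With this sample corpus, only "open source" and "search engine" occur in more than one document, so they are the only phrases that survive the threshold.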

Generate the Master Dictionary

  1. Use Export HDFS to Redis to add the Statistical Phrases dictionary to the Redis Master Dictionary.
  2. Add any external dictionaries to the Master Dictionary.
  3. Run Redis Bitmap Calculator to prepare the Master Dictionary for Phrase Extraction.
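The merge of statistical phrases and external dictionaries into one Master Dictionary can be sketched as below. A plain Python dict stands in for Redis, and the function name `build_master_dictionary`, the lowercase normalization, and the source labels are hypothetical. The Redis Bitmap Calculator step is not modelled, since its details are not described in this section.

```python
def build_master_dictionary(statistical_phrases, external_dictionaries):
    """Map each normalized phrase to the set of sources that supplied it.

    `statistical_phrases` is an iterable of phrase strings;
    `external_dictionaries` maps a source name to its phrase list.
    A dict is used in place of Redis purely for illustration.
    """
    master = {}
    for phrase in statistical_phrases:
        master.setdefault(phrase.lower(), set()).add("statistical")
    for source, phrases in external_dictionaries.items():
        for phrase in phrases:
            master.setdefault(phrase.lower(), set()).add(source)
    return master


# Hypothetical inputs, for illustration only.
master = build_master_dictionary(
    ["search engine", "open source"],
    {"wikipedia": ["Search engine", "New York"]},
)
```

Keeping the source names alongside each phrase makes it possible to tell later whether an entry came from corpus statistics, from an external dictionary, or from both.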

Once the Master Dictionary is complete:

  1. Phrase Extraction
  2. Token Statistics
  3. Semantic Co-occurrence
    1. Token Merge (with tagged_phrases and non_tagged_tokens fields)
    2. Document Merge (using the previous token merge output)
    3. Co-occurrence Extractor
    4. Co-occurrence Merge
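The extract and merge steps above can be illustrated with a minimal sketch that counts which terms appear together in a document and then sums those counts across documents. The field names `tagged_phrases` and `non_tagged_tokens` come from the step list; everything else is an assumption, including the function names, the choice of the whole document as the co-occurrence window, and the use of unordered pairs.

```python
from collections import Counter
from itertools import combinations


def extract_cooccurrences(document):
    """Count each unordered pair of distinct terms found in one document.

    Terms are the union of the document's tagged phrases and untagged
    tokens; sorting makes every pair appear in one canonical order.
    """
    terms = sorted(set(document["tagged_phrases"]) | set(document["non_tagged_tokens"]))
    return Counter(combinations(terms, 2))


def merge_cooccurrences(per_document_counts):
    """Sum per-document pair counts into corpus-wide totals."""
    total = Counter()
    for counts in per_document_counts:
        total.update(counts)
    return total


# Hypothetical documents, for illustration only.
docs = [
    {"tagged_phrases": ["new york"], "non_tagged_tokens": ["subway", "map"]},
    {"tagged_phrases": ["new york"], "non_tagged_tokens": ["subway", "fare"]},
]
merged = merge_cooccurrences(extract_cooccurrences(doc) for doc in docs)
```

Here the pair `("new york", "subway")` is the only one seen in both documents, so it ends up with the highest merged count. Splitting extraction from merging mirrors the per-document and corpus-wide stages in the list above and lets the extraction run independently on each document.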

Other components