Natural Language Processing (NLP)



Many problems in the field of Natural Language Processing (NLP) can be addressed using Aspire. The sections below detail how to perform NLP tasks with various Aspire modules, including custom Groovy code where needed.

Generating Statistics for your data (Histograms)

Main article: Generating Statistics / Histograms

This introductory article produces statistics that are used by many of the more advanced processes below. Code contained in this article includes:

  1. A general Aspire parent/child processing pipeline for analyzing many documents at once
  2. An example tokenization pipeline in the child process that is built up further in the articles below
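As a rough feel for what the child tokenization pipeline produces, term counting can be sketched in plain Java; the regex split and lower-casing here are illustrative choices, not the module's actual configuration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TermHistogram {
    // Split on runs of non-letter characters and count lower-cased terms.
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> histogram = new LinkedHashMap<>();
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (token.isEmpty()) continue;   // leading punctuation yields an empty split
            histogram.merge(token, 1, Integer::sum);
        }
        return histogram;
    }
}
```

In the real pipeline this counting runs across many documents in the parent/child setup, with the parent aggregating per-document histograms.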

Named Entity Recognition (NER)

Main article: Named Entity Recognition (NER)

Entity extraction is the process of recognizing particular strings as being of particular significance to the application. Often this involves assigning terms to classes, such as persons, organizations, or locations.

Boundary Recognition

Main article: Determining Text Boundaries

This article covers detecting paragraph, sentence, and phrase boundaries in free text.
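Outside Aspire, the JDK's own `java.text.BreakIterator` gives a rough feel for sentence segmentation; the Aspire module is not bound to this API, so treat this as a comparison point only:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class Sentences {
    // Split free text into sentences using the JDK's locale-aware boundary analysis.
    public static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) sentences.add(s);
        }
        return sentences;
    }
}
```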

Boilerplate Removal

Main article: Text Block Removal

Recognize and remove common blocks of text across documents.

  1. Build a list of hash signatures of paragraphs
  2. Use the list in the Paragraph Filter
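Step 1 above can be sketched as follows; the whitespace normalization and the choice of SHA-1 are our own assumptions, not necessarily what the Text Block Removal module uses:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ParagraphSignature {
    // Hash a whitespace-normalized paragraph so trivially reflowed copies match.
    public static String signature(String paragraph) {
        try {
            String normalized = paragraph.trim().replaceAll("\\s+", " ");
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);   // SHA-1 is guaranteed on every JVM
        }
    }
}
```

Paragraphs whose signatures recur across many documents are boilerplate candidates, and the resulting list feeds the Paragraph Filter in step 2.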

Stemming & Lemmatization

Main article: Lemmatization

Reduce natural language terms to their base forms (e.g. "running" can be reduced to "run").

An alternative is to use Basis Technologies' Rosette Search Essentials.
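As a rough illustration only (the Lemmatization module and Rosette both do far more), a handful of suffix-stripping rules in plain Java show the general idea of reducing terms to a base form:

```java
public class ToyStemmer {
    // A deliberately tiny rule set; real stemmers (e.g. Porter) handle many more cases.
    public static String stem(String word) {
        String s = word.toLowerCase();
        if (s.endsWith("ies") && s.length() > 4) {
            return s.substring(0, s.length() - 3) + "y";       // studies -> study
        }
        if (s.endsWith("ing") && s.length() > 5) {
            return undouble(s.substring(0, s.length() - 3));   // running -> run
        }
        if (s.endsWith("ed") && s.length() > 4) {
            return undouble(s.substring(0, s.length() - 2));   // stopped -> stop
        }
        if (s.endsWith("s") && !s.endsWith("ss")) {
            return s.substring(0, s.length() - 1);             // cats -> cat
        }
        return s;
    }

    // Collapse a doubled final consonant left behind by suffix removal.
    private static String undouble(String s) {
        int n = s.length();
        if (n >= 2 && s.charAt(n - 1) == s.charAt(n - 2) && "aeiou".indexOf(s.charAt(n - 1)) < 0) {
            return s.substring(0, n - 1);
        }
        return s;
    }
}
```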

Part-of-Speech Tagging

Main article: Part-of-Speech Tagging

Identify how each token is used in sentences; e.g. as a noun, verb, adjective, etc.

An alternative is to use Basis Technologies' Rosette Search Essentials.

Building a Dictionary

Main article: Dictionary Construction
Or for the specific case of an acronym dictionary: Acronym Dictionary Construction

How to generate a list of terms and phrases that are meaningful within a given data set.

Weighted Document Vectors

Main article: Document Vectors

Documents are converted into vectors to reduce their size and complexity, and to make them easier to compare and to use in more complex algorithms.
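One common weighting scheme (an assumption here; the Document Vectors article defines Aspire's actual scheme) is TF-IDF, where a term's weight grows with its frequency in the document and shrinks the more documents it appears in:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdf {
    // weight(term, doc) = tf * ln(N / df): frequent-in-doc but rare-in-corpus terms score highest.
    public static Map<String, Double> vector(List<String> doc, List<List<String>> corpus) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) tf.merge(term, 1, Integer::sum);

        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            long df = corpus.stream().filter(d -> d.contains(e.getKey())).count();
            weights.put(e.getKey(), e.getValue() * Math.log((double) corpus.size() / df));
        }
        return weights;
    }
}
```

A term that occurs in every document gets weight zero, which is exactly the size-and-complexity reduction the vectors are for: ubiquitous terms carry no discriminating information.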

Word Sense Disambiguation

Main article: Disambiguation

Disambiguation is the process of determining which sense of a term is intended at each occurrence of that term.

Near-Duplicate Detection (NDD)

Main article: Near-Duplicate Detection (NDD)

Near-duplicates are difficult to discover (which is why there is a whole page dedicated to that topic), but true (or exact) duplicate detection is easy.

If the documents are individual files:

  • Simply compare the file sizes using an OS utility or a Java method, like File.length().

If the lengths are different, the files are not duplicates. If the lengths are the same:

  • Do a byte-level comparison using an OS utility, like diff or fc
  • OR calculate strong hash signatures for each file and do a simple compare of the signatures.
  • OR open them both as streams and compare the bytes as they are read from each

The only decision that may have to be made is whether a difference in metadata (e.g., last modified date) is meaningful in the given context. Remember that the comparison is over as soon as even one byte of difference is found, so duplicate-detecting code usually contains short-circuits and cut-outs to exit long before the ends of the records are reached.
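Putting the length check and the short-circuiting byte comparison together, a straightforward version (the stream and buffering choices are our own) looks like this:

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ExactDuplicates {
    // Cheap test first: files of different lengths can never be duplicates.
    public static boolean isExactDuplicate(File a, File b) throws IOException {
        if (a.length() != b.length()) return false;
        try (InputStream in1 = new BufferedInputStream(new FileInputStream(a));
             InputStream in2 = new BufferedInputStream(new FileInputStream(b))) {
            int c;
            while ((c = in1.read()) != -1) {
                if (c != in2.read()) return false;   // short-circuit on the first differing byte
            }
            return true;
        }
    }
}
```

Hash signatures (the second option above) trade this streaming comparison for a one-time digest per file, which pays off when each file must be compared against many others.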

If the documents are RDB records or blocks of text from some other source (or group of sources), the same general principles hold:

  • First, compare some basic facts about size or structure (do they contain the same fields?) that can quickly show that the majority of cases are not duplicates.
  • Then continue to byte-level comparisons of the contents of every field, exiting with a "false" (not a duplicate) as soon as any difference is found.
    • If the fields are short, Java's String.equals() method can be used to do the comparisons.
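For records modeled as field-name → value maps (the Map representation is an assumption for illustration), the same cheap-checks-first pattern might look like:

```java
import java.util.Map;

public class RecordDuplicates {
    // Structure first (same field names?), then field-by-field String.equals(),
    // exiting on the first mismatch.
    public static boolean isDuplicate(Map<String, String> r1, Map<String, String> r2) {
        if (r1.size() != r2.size() || !r1.keySet().equals(r2.keySet())) return false;
        for (Map.Entry<String, String> e : r1.entrySet()) {
            if (!e.getValue().equals(r2.get(e.getKey()))) return false;
        }
        return true;
    }
}
```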