Mahout Create Vector

From wiki.searchtechnologies.com



Create Mahout Vector
Description: Reads the token dictionary file and builds two hash maps, tokenDictStoreByToken and tokenDictStoreById. It then creates a mahoutDocVector for the AspireDocument.
Inputs: AspireDocument that has an attached histogram variable
Outputs: AspireDocument with an attached mahoutDocVector variable
Factory: aspire-mahout
Sub Type: default
Object Type: mahoutDocVector

Configuration

Element Type Default Description
tokenDictFile string <none> The file location of the token dictionary, which is a tab-delimited file (see the sample file below). The element also requires attributes giving the 1-based column numbers of the fields: tokenFieldNum (token), tokenIdFieldNum (token ID), documentOccurrencesFieldNum (number of documents in which the token appears), and totalOccurrencesFieldNum (total occurrences of the token across all documents; not currently used). Two further attributes are required but not currently used: minObservations and totalDocumentsCount (the number of documents processed to build the dictionary file).

Sample Token Dictionary File

57:assessment 1 1 1
reprocessed:at 2 2 2
as:standard 3 2 2
expect 4 6 14
produce:posterior 5 1 1

Where:
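column one = tokenFieldNum (the token)
column two = tokenIdFieldNum (the token ID)
column three = documentOccurrencesFieldNum (number of documents in which the token appears)
column four = totalOccurrencesFieldNum (total occurrences of the token across all documents)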

The contents of the dictionary file depend on the token processor used to create it. This example was produced with the TokenAndPairs processor.
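
To make the column layout above concrete, here is a minimal Python sketch of loading such a file into two lookup tables, one keyed by token and one keyed by token ID. It is not the component's own code: the function name `load_token_dictionary`, the entry layout, and the choice of Python are assumptions made for illustration. Only the tab-delimited format and the 1-based field numbers come from this page.

```python
import io


def load_token_dictionary(lines, token_field_num=1, token_id_field_num=2,
                          document_occurrences_field_num=3):
    """Build two lookup tables from tab-delimited dictionary lines.

    Field numbers are 1-based, matching the sample configuration.
    The entry layout (a small dict per token) is an assumption.
    """
    by_token = {}
    by_id = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        token = fields[token_field_num - 1]
        token_id = int(fields[token_id_field_num - 1])
        doc_occurrences = int(fields[document_occurrences_field_num - 1])
        entry = {
            "token": token,
            "id": token_id,
            "documentOccurrences": doc_occurrences,
        }
        by_token[token] = entry
        by_id[token_id] = entry
    return by_token, by_id


# Two rows taken from the sample dictionary file above.
sample = "57:assessment\t1\t1\t1\nexpect\t4\t6\t14\n"
by_token, by_id = load_token_dictionary(io.StringIO(sample))
print(by_token["expect"]["id"])  # 4
print(by_id[1]["token"])         # 57:assessment
```

Both tables share the same entry objects, so a lookup by token or by ID returns identical data.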

Sample Configuration

  <component name="createMahoutDocVector" subType="default" factoryName="aspire-mahout">
    <config>
      <tokenDictFile tokenFieldNum="1" tokenIdFieldNum="2" documentOccurrencesFieldNum="3" totalOccurrencesFieldNum="4"
                     minObservations="50" totalDocumentsCount="66523">testdata/token_Dictionary.txt</tokenDictFile>
    </config>
  </component>
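
As a quick way to sanity-check a configuration like the one above outside Aspire, the following Python sketch reads the tokenDictFile attributes and the file path using the standard library. The helper name `read_token_dict_settings` and the returned dictionary layout are hypothetical; this is not how Aspire itself loads component configuration.

```python
import xml.etree.ElementTree as ET

# The sample configuration from this page.
CONFIG = """
<component name="createMahoutDocVector" subType="default" factoryName="aspire-mahout">
  <config>
    <tokenDictFile tokenFieldNum="1" tokenIdFieldNum="2" documentOccurrencesFieldNum="3"
                   totalOccurrencesFieldNum="4" minObservations="50"
                   totalDocumentsCount="66523">testdata/token_Dictionary.txt</tokenDictFile>
  </config>
</component>
"""


def read_token_dict_settings(xml_text):
    """Return the tokenDictFile attributes as integers, plus the file path."""
    root = ET.fromstring(xml_text)
    element = root.find("./config/tokenDictFile")
    if element is None:
        raise ValueError("missing <tokenDictFile> element")
    # Every attribute in the sample is numeric, so convert them all to int.
    settings = {name: int(value) for name, value in element.attrib.items()}
    settings["path"] = (element.text or "").strip()
    return settings


settings = read_token_dict_settings(CONFIG)
print(settings["path"])             # testdata/token_Dictionary.txt
print(settings["tokenIdFieldNum"])  # 2
```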

Usage

Use this stage before Mahout Store Vector: this stage creates the Mahout vector that Mahout Store Vector then stores.
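
For a rough idea of what this stage produces, the Python sketch below turns a document histogram into a sparse vector indexed by token ID. Several things here are assumptions, since this page does not specify them: the histogram is taken to be a plain token-to-count mapping, raw counts are used as the vector weights, and tokens absent from the dictionary are dropped. The real mahoutDocVector may be weighted or structured differently.

```python
def histogram_to_sparse_vector(histogram, token_to_id):
    """Map token counts to a sparse {tokenId: weight} vector.

    Assumptions: raw counts are used as weights, and tokens that are
    not in the dictionary are skipped.
    """
    vector = {}
    for token, count in histogram.items():
        token_id = token_to_id.get(token)
        if token_id is None:
            continue  # token not present in the dictionary
        vector[token_id] = float(count)
    # Sort by token ID so the output is stable and easy to compare.
    return dict(sorted(vector.items()))


# Token IDs taken from the sample dictionary file; the histogram is made up.
token_to_id = {"57:assessment": 1, "expect": 4}
histogram = {"expect": 3, "57:assessment": 1, "unseen": 2}
vector = histogram_to_sparse_vector(histogram, token_to_id)
print(vector)  # {1: 1.0, 4: 3.0}
```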