Hash Table Lookup (Aspire 2)


Hash Table Lookup (Aspire 2)
Factory Name  com.searchtechnologies.aspire:aspire-hash-table
subType  default
Inputs  AspireObject (when used as a pipeline stage)
Outputs  Fields as specified by the metadata mapper (when used as a pipeline stage)

 (2.0 Release) The Hash Table Lookup stage loads an in-memory hash table so that data can be looked up very quickly and added to the document being processed. The hash table can be loaded automatically on start-up from a tabular file or from a relational database select.

When used as a pipeline stage, it takes the key from an existing XML element, looks up the entry in the hash table, and then maps the elements of the hash table's values array to fields in the document being processed.

When used as an independent component, the component supports the "com.searchtechnologies.aspire.hashtable.AspireHashTable<K, V>" interface (see AspireHashTable for more info).

The stage can also be used as a generic hash table resource, for example by other components or Groovy scripts which need to quickly look up values in the hash table.

Note that the component can be used both as a pipeline stage and as a hash table service at the same time. That is, once you've configured the component as a pipeline stage (inside a pipeline manager), you can also access the hash table directly.

Configuration

Element Type Default Description
initialSize int 10000 (10 thousand) The estimated initial size for the hash table, used to specify its initial capacity. It is best to set this value large enough to contain all of the expected entries in the hash table. This will prevent additional hash table allocations and rehashing.
initializeFromFiles boolean false Set this flag to true if you are initializing the hash table from a tabular file (for example, a comma-separated or tab-separated file).
file xml node none (requires initializeFromFiles = true) Contains the file location and separator value. Multiple file nodes can be configured to load multiple files.
file/fileName string none (requires initializeFromFiles = true) The file name where the tabular file can be located. If a relative path is specified, this is assumed to be relative to Aspire Home.
file/separator string tab (requires initializeFromFiles = true) This is either "comma", "tab" or a single character to specify the separator used for columns in the file. If a CSV file, use "comma".

The tabular files use the Microsoft-Excel standards for specifying data. Specifically, data entries with embedded commas or tabs should be surrounded by double quotes. Data entries which contain double quotes should escape the double-quote character with a pair of double quotes.

Finally, if you want some other separator (for example, the pipe character / vertical bar, |, is popular), you can specify that single character in the <separator> tag as well.

folder xml node none (requires initializeFromFiles = true) Contains the folder location and separator value. Multiple folder nodes can be configured.
folder/folderName string none (requires initializeFromFiles = true) The folder name where the tabular files can be located. If a relative path is specified, this is assumed to be relative to Aspire Home.
folder/separator string tab (requires initializeFromFiles = true) This is either "comma", "tab" or a single character to specify the separator used for columns in the file. If a CSV file, use "comma".

The tabular files use the Microsoft-Excel standards for specifying data. Specifically, data entries with embedded commas or tabs should be surrounded by double quotes. Data entries which contain double quotes should escape the double-quote character with a pair of double quotes.

Finally, if you want some other separator (for example, the pipe character / vertical bar, |, is popular), you can specify that single character in the <separator> tag as well.

hasColumnLabels boolean false (requires initializeFromFiles = true) Set this flag to true if the first row of the tabular file contains column labels.
keyColumn string column1 (requires initializeFromFiles = true) The name of the tabular file column which will be used for the hash table key.

If <hasColumnLabels> = false, then the column labels will be numbered starting with 1, as in "column1", "column2", "column3", etc.

<keyColumn> is also available when loading the hash table from the RDB. See below.

valueMap nested list of <column label=""/> tags all columns, in the order in which they occur (requires initializeFromFiles = true) The value map parent tag lets you choose exactly which columns are stored in the hash table (controlling memory usage) and the order of the columns in the value array.

Inside of <valueMap> list the columns desired with nested <column label=""> tags. Only columns specified in the value map will be stored in the hash table. The order of the values in the hash table will be the same as the order of the <column> tags inside the value map.

Column labels will be either the labels specified in the file (if <hasColumnLabels> is true) or "column1", "column2", "column3", etc. otherwise.

initializeFromSQL boolean false Set this flag to true if you are initializing the hash table from a SQL select statement.
connectionPoolName string none (requires initializeFromSQL = true) The Aspire component name of the RDBMS Connection component which maintains the pool of RDB connections for the database to be queried.
sqlQuery string none (requires initializeFromSQL = true) The SQL query to use to access the data from the RDBMS to load the hash table. The order of the columns in the SQL table will be maintained in the list of values stored in the hash table.
keyColumn string none (requires initializeFromSQL = true) The name of the SQL column from the "sqlQuery" query which will be used for the hash table key.
targetElement string none (when used as a pipeline stage) The XML element from the document being processed which will be used as the key to look up the entry in the hash table.
metadataMap Metadata Mapper none (when used as a pipeline stage) Specifies the mapping of fields or columns from the hash table entry to fields in the document being processed.

Example Configurations

Initialized from a Tabular File - where the first row has column labels

 <component name="NormalizedAssigneeHashTable" subType="default" factoryName="aspire-hash-table">
   <initializeFromFiles>true</initializeFromFiles>
   <file>
     <fileName>data/NormalizedAssigneeFile.csv</fileName>
     <separator>comma</separator>
   </file>
   <hasColumnLabels>true</hasColumnLabels>
   <keyColumn>hashName</keyColumn>
   <valueMap>
     <column label="uniqueAsgnId"/>
     <column label="name"/>
     <column label="normAsgnId"/>
     <column label="count"/>
   </valueMap>
 </component>
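
The configuration table notes that multiple <file> nodes can be configured and that <separator> accepts a single character. A minimal sketch of that combination follows. The second file name (data/ExtraAssigneeFile.txt) is hypothetical, and the sketch assumes both files share the same column labels so that a single <keyColumn> and <valueMap> apply to both.

```xml
<component name="NormalizedAssigneeHashTable" subType="default" factoryName="aspire-hash-table">
  <initializeFromFiles>true</initializeFromFiles>
  <!-- First file: comma-separated -->
  <file>
    <fileName>data/NormalizedAssigneeFile.csv</fileName>
    <separator>comma</separator>
  </file>
  <!-- Second file (hypothetical name): pipe-separated, given as a single character -->
  <file>
    <fileName>data/ExtraAssigneeFile.txt</fileName>
    <separator>|</separator>
  </file>
  <hasColumnLabels>true</hasColumnLabels>
  <keyColumn>hashName</keyColumn>
  <!-- Store only two columns to limit memory usage -->
  <valueMap>
    <column label="uniqueAsgnId"/>
    <column label="normAsgnId"/>
  </valueMap>
</component>
```

With this value map, each entry's value array would hold uniqueAsgnId at position 0 and normAsgnId at position 1, following the order of the <column> tags.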

Initialized from a Tabular File - with no column labels

 <component name="NormalizedAssigneeHashTable" subType="default" factoryName="aspire-hash-table">
   <initializeFromFiles>true</initializeFromFiles>
   <file>
     <fileName>data/NormalizedAssigneeFile.csv</fileName>
     <separator>comma</separator>
   </file>
   <keyColumn>column1</keyColumn>
   <valueMap>
     <column label="column3"/>
     <column label="column1"/>
     <column label="column8"/>
     <column label="column2"/>
   </valueMap>
 </component>
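
Loading from a folder should follow the same pattern, using the <folder> node described in the configuration table. The sketch below assumes a hypothetical folder, data/assignees (relative to Aspire Home), whose files are all tab-separated and have no column labels.

```xml
<component name="NormalizedAssigneeHashTable" subType="default" factoryName="aspire-hash-table">
  <initializeFromFiles>true</initializeFromFiles>
  <!-- Hypothetical folder; a relative path is resolved against Aspire Home -->
  <folder>
    <folderName>data/assignees</folderName>
    <separator>tab</separator>
  </folder>
  <!-- No column labels, so columns are named column1, column2, ... -->
  <keyColumn>column1</keyColumn>
  <valueMap>
    <column label="column2"/>
    <column label="column3"/>
  </valueMap>
</component>
```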

Initialized from a SQL Select Statement

 <component name="NormalizedAssigneeHashTable" subType="default" factoryName="aspire-hash-table">
   <initialSize>10000000</initialSize>
      
   <initializeFromSQL>true</initializeFromSQL>
   <connectionPoolName>/CPAAssigneeNorm/openRDBConnection</connectionPoolName>
   <sqlQuery><![CDATA[select Name, NormAsgnID
                from AssigneeNormalization.dbo.NormalizedAssignee]]></sqlQuery>
   <keyColumn>Name</keyColumn>
 </component>

Used as a pipeline stage

 <component name="NormalizedAssigneeHashTable" subType="default" factoryName="aspire-hash-table">
   <targetElement>documentId</targetElement>
      
   <metadataMap>
     <map from="name" to="title"/>
     <map from="subCategory" to="subCategory"/>
     <map from="geographicArea" to="geographicArea"/>
     <map from="searchKeywords1" to="searchKeywords1"/>
   </metadataMap>
       
   .
   .
 </component>
 

Note that in the above example, the @from attribute of each <map> tag could instead be a name such as "column1" or "column2" if the data comes from a tabular file with no column labels specified in the file.
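
As a fuller illustration, here is a sketch of a single component that both loads itself from an unlabeled tabular file and acts as a pipeline stage. The component name, the file name (data/ProductInfo.tsv), and the target field names are hypothetical; only the element names come from the configuration table above.

```xml
<component name="ProductInfoHashTable" subType="default" factoryName="aspire-hash-table">
  <!-- Load: tab-separated file with no column labels (hypothetical path) -->
  <initializeFromFiles>true</initializeFromFiles>
  <file>
    <fileName>data/ProductInfo.tsv</fileName>
    <separator>tab</separator>
  </file>
  <keyColumn>column1</keyColumn>
  <valueMap>
    <column label="column2"/>
    <column label="column3"/>
  </valueMap>

  <!-- Lookup: the document's documentId element supplies the key -->
  <targetElement>documentId</targetElement>
  <metadataMap>
    <map from="column2" to="title"/>
    <map from="column3" to="subCategory"/>
  </metadataMap>
</component>
```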

Example use from within a Groovy scripting component

Note how, in the examples below, the hash table is referenced via a "component variable" declared in Groovy. See Groovy Scripting for more details.

Reading from the hash table

 <component name="printPatentPubCount" subType="default" factoryName="aspire-groovy">
   <variable name="normalizedAssigneeHashTable" component="/CPAAssigneeNorm/NormalizedAssigneeHashTable" />
   <variable name="uniqueAssigneeHashTable" component="/CPAAssigneeNorm/UniqueAssigneeHashTable" />        
   <script>
   <![CDATA[
     println "*** Normalized Assignee hash table size: " + normalizedAssigneeHashTable.size();
     println "*** Unique Assignee hash table size: " + uniqueAssigneeHashTable.size();          
   ]]>
   </script>
 </component>
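
Looking up a single entry

A sketch of a keyed lookup, using only the contains() and get() calls that appear in the examples on this page. The component name lookupAssignee and the key "ACME CORP" are hypothetical, and the sketch assumes the stored value is an array (or list) of strings, as the read/write example suggests.

```xml
<component name="lookupAssignee" subType="default" factoryName="aspire-groovy">
  <variable name="normalizedAssigneeHashTable" component="/CPAAssigneeNorm/NormalizedAssigneeHashTable" />
  <script>
  <![CDATA[
    // Hypothetical key; in practice this would come from the document being processed
    def key = "ACME CORP";

    // Only fetch the entry if the key is present
    if (normalizedAssigneeHashTable.contains(key)) {
      def values = normalizedAssigneeHashTable.get(key);
      // Print every stored column; positions follow the configured column order
      println "*** Values for " + key + ": " + values.toList();
    }
  ]]>
  </script>
</component>
```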


Reading and writing the hash table

 <component name="update" subType="default" factoryName="aspire-groovy">
   <variable name="uniqueAssigneeHashTable" component="/CPAAssigneeNorm/UniqueAssigneeHashTable" />
   <script>
   <![CDATA[
     use(groovy.xml.dom.DOMCategory) {
       .
       .
       dom.'variants'[0].each() {
         if(uniqueAssigneeHashTable.contains(it.getAttribute("hash")))
           normAsgnId = uniqueAssigneeHashTable.get(it.getAttribute("hash"))[2];
       }
         
       dom.'variants'[0].each() {
         .
         .
           
         // update uniqueAssigneeHashTable with the above UniqueAsgnID
         String[] values = [localUniqueID, assigneeName, normAsgnId, '1', 
                            "PATENT", isDocDB, patent, '0', '0', sdf.format(date) ,lang];
           
         def returnValue = uniqueAssigneeHashTable.put(hashName, values);
           
         // Check to see if the key was already in the hash table...
         if(returnValue != null) {
           .
           .
           .
         }
       }
     }
   ]]>
   </script>
 </component>