Metadata Mapper

From wiki.searchtechnologies.com

The Metadata Mapper is a generic metadata mapping utility, which can be used to map metadata produced by a pipeline stage into fields in the Job Object XML (i.e., into elements of the AspireDocument XML passed along with the job).

The metadata mapper takes a "from-to" list of metadata pairs.

The "from" attribute identifies the metadata as returned by the pipeline stage. See each individual pipeline stage for information about the "from" metadata fields it provides.

The "to" attribute specifies the element name in the resulting XML to which the metadata will be mapped.

In addition, the metadata mapper can convert date and date/time values into standard ISO-8601 format.

Example

The following is an example of using the metadata mapper inside the RSS feeder:

    <component name="RSSFeeder" subType="default" factoryName="aspire-rssfeeder">
      <autoStart>false</autoStart>
      <loopWait>300000</loopWait>
      <metadataMap>
        <map from="category" to="category"/>
        <map from="subCategory" to="subCategory"/>
        <map from="geographicArea" to="geographicArea"/>
        <map from="searchKeywords1" to="searchKeywords1"/>
      </metadataMap>
      <ccdLocation>/systemCommon/ccd</ccdLocation>
      <branches>
        <branch event="onPublish" pipelineManager="standard-pipe-manager" />
      </branches>
    </component>

Configuration

Element | Type | Description
metadataMap | parent tag | Holds a list of <map> elements, which are checked in order from top to bottom. Order matters: a successful mapping higher in the list takes precedence over mappings lower in the list. For example, if two <map> elements both map to "title", the one that occurs higher in the list wins.
metadataMap/map | parent tag | Specifies an individual mapping. Use one <map> tag for each source field to be mapped.
map/@from | string | The field from the native application that is the source of the data. The names used here depend on the enclosing component. For example, for Fetch URL, the source fields are HTTP header names. For Extract Text, the source fields are <meta> tag names (such as "dc.title") extracted from the HTML document. The RSS Feeder uses metadata field names from the RSS feed. Other components that use the metadata mapper have their own source field names. See each component's documentation for a complete description of the source field names it supports.
map/@to | string | The AspireDocument XML element to which the source field will be mapped. This is typically something like "title" or "modificationDate". Once mapped to an AspireDocument field, the value is available for use in transforms, such as those provided by post-xml. Note that if multiple <map> tags specify the same @to field, the one that matches and is highest in the list of <map> tags takes precedence.
map/@dateFormat | string (optional) | If the source data is a date, this specifies the Java date format for the date value. Specifying the format here remaps the data into standard ISO-8601 date format. Note that date formats that fail to match are simply ignored; no error is thrown.
map/@dateTimeFormat | string (optional) | If the source data specifies a date and time, this attribute holds the Java date format for the date/time value. Specifying the format here remaps the date/time into standard ISO-8601 date/time format.
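
As a rough sketch of how the two date attributes might be used together, the fragment below maps one date-only field and one date/time field. The source field names ("DC.date", "Last-Modified"), the target name "modificationDateTime", and the sample values in the comments are illustrative assumptions, not taken from any particular component. The patterns are written on the assumption that the "Java date format" follows java.text.SimpleDateFormat syntax.

```xml
<metadataMap>
  <!-- Date only: assumes the source value looks like "2009-07-22" -->
  <map from="DC.date" to="modificationDate" dateFormat="yyyy-MM-dd"/>
  <!-- Date and time: assumes an HTTP-style value such as "Wed, 22 Jul 2009 10:15:00 GMT";
       "modificationDateTime" is a hypothetical target element name -->
  <map from="Last-Modified" to="modificationDateTime" dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss z"/>
</metadataMap>
```

Because formats that fail to match are ignored rather than reported, it is worth checking the output document to confirm that the mapped element actually appears.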

Metadata Precedence

1. All metadata maps from the custom configuration (i.e., those specified in the system configuration file) take precedence over the metadata maps in the ComponentFactory.xml file.

2. Mappings higher in the list take precedence over mappings lower in the list. For example, if both "DC.title" and "title" are available in your metadata, "DC.title" will take precedence because it is listed higher in the map:

       <map from="DC.title" to="title"/>
       <map from="title" to="title"/>

3. When there are multiple date/time formats for a single field, they are attempted from top to bottom. The first one that successfully parses the date is mapped. For example, if the "created" date is "Wed Apr 3 12:31:32 GMT 2013", the first map will fail (unable to parse), but the value will then be correctly parsed and mapped by the second entry:

       <map from="created" to="creationDateTime" dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss z"/>
       <map from="created" to="creationDateTime" dateTimeFormat="EEE MMM dd HH:mm:ss z yyyy"/>

Example Output

The output looks like the following:

<doc>
  .
  .
  .
  <title source="ExtractTextStage/DC.title">Dublin Core Metadata Initiative (DCMI) Home Page</title>
  <myDublinCoreDescription source="ExtractTextStage/DC.description">The Dublin Core Metadata Initiative is an open forum engaged ...</myDublinCoreDescription>
  <modificationDate source="ExtractTextStage/DC.date">2009-07-22</modificationDate>
  <contributor source="ExtractTextStage/DC.contributor">Dublin Core Metadata Initiative</contributor>
  <contentType source="ExtractTextStage/Content-Type">text/html; charset=iso-8859-1</contentType>
  <language source="ExtractTextStage/language">en</language>
  <extension source="ExtractTextStage">
    <field name="title">Dublin Core Metadata Initiative (DCMI)</field>
    <field name="Content-Language">en</field>
    <field name="Content-Encoding">ISO-8859-1</field>
    <field name="DC.language">en</field>
    <field name="DC.format">text/html</field>
    <field name="resourceName">http://dublincore.org/</field>
  </extension>
  .
  .
  .
</doc>

Notes:

  • Notice how the source of every field is clearly identified.
  • Notice how fields that are not mapped are put into their own <extension> sub-element so they can be used by the index transformer if necessary.
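
For illustration, a metadata map along the following lines would be consistent with the output above. It is reconstructed from the source attributes in that output and is not the actual configuration that produced it. In particular, whether "DC.date" was mapped with a dateFormat attribute is not known, so none is shown.

```xml
<metadataMap>
  <!-- Each "from" name matches the part after "ExtractTextStage/" in a source attribute above -->
  <map from="DC.title" to="title"/>
  <map from="DC.description" to="myDublinCoreDescription"/>
  <map from="DC.date" to="modificationDate"/>
  <map from="DC.contributor" to="contributor"/>
  <map from="Content-Type" to="contentType"/>
  <map from="language" to="language"/>
</metadataMap>
```

Fields with no matching <map> entry, such as "DC.format" and "resourceName", are the ones that end up in the <extension> element.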

Related tutorials

Extracting custom mapped metadata for use in Amazon CloudSearch

Tutorial for mapping custom metadata fields, extracting them using the Aspire S3 connector, and sending the data to Amazon CloudSearch.