Elasticsearch extensions for QPL - Analysis

Enterprise Add-On Feature

This section covers the Elasticsearch analysis modules installed with Search Technologies' QPL.

XML Tokenizer

The XML tokenizer tokenizes XML content so that XML searching can be accomplished using the BETWEEN operator.

This tokenizer creates tokens as follows:

  1. Split tokens on white space
    |this is some-text| => [this, is, some-text]
    Note that tokens with embedded punctuation must be further split in later analysis stages
  2. Simple XML start tags and end tags become separate tokens
    |<tag>this is some-text</tag>| => [<tag>, this, is, some-text, </tag>]
  3. XML entities are converted to their character equivalents in the token stream
    |this is &amp; some-text| => [this, is, &, some-text]
  4. XML attributes are marked with special <tag/@attribute> tags.
    |<granule type="parent">| => [<granule>, <granule/@type>, parent, </granule/@type>]
  5. XML Comments such as <!-- This is a comment --> are removed and ignored.
  6. Character content wrapped in <![CDATA[ ... ]]> sections is converted to an ordinary block of character data.

This creates a stream of tokens in which XML tags (with XML tag characters, such as <, /, and @, preserved) are sent to the index and indexed with the XML punctuation intact.

This will have the following advantages:

  1. XML tags will never be confused with user searches.
    1. Since all user searches will be completely stripped of punctuation, it will be impossible for a user to search for an XML tag without using the approved mods: field operator syntax.
    2. For example, a query on |tag| will never match |<tag>| in the XML field.
    3. Similarly, when the user searches on |<tag>|, the punctuation will be removed, and so only instances of |tag| in the text will be found, not XML tags.
    4. The only method for users to search the XML will be to use the official |mods:| field operator format.
  2. Since XML tags are indexed as part of the same token stream as the content itself, this will allow for searching of content within specified XML tags.
    1. All XML tags will be counted as words for the purpose of word position computations.
    2. This will ensure that the BETWEEN operator can correctly search for text which occurs between the specified XML tags, as sketched below.
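
For illustration, this positional behavior can be sketched directly in the Elasticsearch query DSL with a span_near query over the indexed tag tokens. This is a hand-written sketch of the underlying idea, not the query QPL actually generates; the field name "xml" and the lowercased content term are assumptions:

 {
   "span_near": {
     "clauses": [
       {"span_term": {"xml": "<hello>"}},
       {"span_term": {"xml": "hola"}},
       {"span_term": {"xml": "</hello>"}}
     ],
     "slop": 10,
     "in_order": true
   }
 }

This matches |hola| only when it occurs within a few positions between a |<hello>| start tag and its |</hello>| end tag, which is possible precisely because the tags occupy word positions in the same token stream.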

Using the Tokenizer

The tokenizer has the name "xml_tokenizer". You can test the tokenizer using the _analyze endpoint:

Send URL:

 http://MININT-8ASF6E7.search.local:9501/shakespeare/_analyze?tokenizer=xml_tokenizer

With HTTP content:

 <peace><hello>Hola Everyone!</hello><world>Planet Earth</world></peace>

Returns:

 {tokens=[
   {token=<peace>, start_offset=0, end_offset=7, type=<XMLTAG>, position=1}, 
   {token=<hello>, start_offset=7, end_offset=14, type=<XMLTAG>, position=2}, 
   {token=Hola, start_offset=14, end_offset=18, type=<ALPHANUM>, position=3}, 
   {token=Everyone!, start_offset=19, end_offset=28, type=<ALPHANUM>, position=4}, 
   {token=</hello>, start_offset=28, end_offset=36, type=<XMLTAG>, position=5}, 
   {token=<world>, start_offset=36, end_offset=43, type=<XMLTAG>, position=6}, 
   {token=Planet, start_offset=43, end_offset=49, type=<ALPHANUM>, position=7}, 
   {token=Earth, start_offset=50, end_offset=55, type=<ALPHANUM>, position=8}, 
   {token=</world>, start_offset=55, end_offset=63, type=<XMLTAG>, position=9}, 
   {token=</peace>, start_offset=63, end_offset=71, type=<XMLTAG>, position=10}
 ]}
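
For reference, the same request can be issued from the command line (the host and port are taken from the example above):

 curl -XGET 'http://MININT-8ASF6E7.search.local:9501/shakespeare/_analyze?tokenizer=xml_tokenizer' \
      -d '<peace><hello>Hola Everyone!</hello><world>Planet Earth</world></peace>'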


Settings: Depth Sensitivity

Use the "is_depth_sensitive" setting to turn on depth sensitivity. This adds a digit indicating the depth of the XML tag to every XML tag output.

This can be done by creating a custom tokenizer with settings as follows:

 index.analysis.tokenizer.xml_with_depth.type=xml_tokenizer
 index.analysis.tokenizer.xml_with_depth.is_depth_sensitive=true
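
If you prefer to configure the tokenizer when the index is created, the equivalent JSON form is shown below ({index} is a placeholder for your index name):

 curl -XPUT 'localhost:9200/{index}' -d '{
   "settings": {
     "analysis": {
       "tokenizer": {
         "xml_with_depth": {
           "type": "xml_tokenizer",
           "is_depth_sensitive": true
         }
       }
     }
   }
 }'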

URL:

 http://MININT-8ASF6E7.search.local:9501/shakespeare/_analyze?tokenizer=xml_with_depth

With sample HTTP request body:

 <peace><hello>Hola Everyone!</hello><world>Planet Earth</world></peace>

Returns:

 {tokens=[
   {token=<peace>0, start_offset=0, end_offset=7, type=<XMLTAG>, position=1}, 
   {token=<hello>1, start_offset=7, end_offset=14, type=<XMLTAG>, position=2}, 
   {token=Hola, start_offset=14, end_offset=18, type=<ALPHANUM>, position=3}, 
   {token=Everyone!, start_offset=19, end_offset=28, type=<ALPHANUM>, position=4}, 
   {token=</hello>1, start_offset=28, end_offset=36, type=<XMLTAG>, position=5}, 
   {token=<world>1, start_offset=36, end_offset=43, type=<XMLTAG>, position=6}, 
   {token=Planet, start_offset=43, end_offset=49, type=<ALPHANUM>, position=7}, 
   {token=Earth, start_offset=50, end_offset=55, type=<ALPHANUM>, position=8}, 
   {token=</world>1, start_offset=55, end_offset=63, type=<XMLTAG>, position=9}, 
   {token=</peace>0, start_offset=63, end_offset=71, type=<XMLTAG>, position=10}
 ]}
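
The appended depth digit makes it possible to constrain a search to tags at a specific nesting depth. Continuing the span-query illustration from above (the field name "xml" is again an assumption, and the token case assumes no lowercasing filter), a query matching |Planet| only inside a |<world>| element at depth 1 might look like:

 {
   "span_near": {
     "clauses": [
       {"span_term": {"xml": "<world>1"}},
       {"span_term": {"xml": "Planet"}},
       {"span_term": {"xml": "</world>1"}}
     ],
     "slop": 5,
     "in_order": true
   }
 }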


How to Update the Settings by REST

The settings above can also be updated using the Elasticsearch REST interface, as follows:

First, close the index (analysis settings are not dynamic):

 curl -XPOST 'localhost:9200/{index}/_close'

Next, PUT (i.e. use HTTP PUT method) the settings with the "_settings" endpoint:

 PUT URL:  http://localhost:9200/{index}/_settings

With Content:

 {
   "analysis": {
     "tokenizer": {
       "xml_with_depth": {
         "type":"xml_tokenizer",
         "is_depth_sensitive":true
       }
     }
   }
 }

Finally, re-open the index:

 curl -XPOST 'localhost:9200/{index}/_open'
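
For reference, the whole close / update / re-open sequence can be run as three curl commands:

 curl -XPOST 'localhost:9200/{index}/_close'
 curl -XPUT 'localhost:9200/{index}/_settings' -d '{
   "analysis": {
     "tokenizer": {
       "xml_with_depth": {
         "type": "xml_tokenizer",
         "is_depth_sensitive": true
       }
     }
   }
 }'
 curl -XPOST 'localhost:9200/{index}/_open'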

Implementation

The tokenizer is implemented with the Java JDK XMLStreamReader interface, which ensures correct handling of entities, DTDs, comments, CDATA sections, etc.

XMLTokenizer calls the XMLStreamReader.next() method as needed to fetch XML structure.

XML Token Type

XML tokens are marked with a special type, <XMLTAG>. This type distinguishes XML markup tokens from standard content text, so that the content text can be further processed (e.g. with word splitters, lower-case filters, etc.) while leaving the XML markup tokens untouched.

See the XML Filter Wrapper below for a way to wrap standard token filters so that they process content without touching XML tags.

XML Content and Attribute Content

XML content and attribute content are split on white space. No other processing is performed on this content.
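
You can verify this with the _analyze endpoint; embedded punctuation survives the tokenizer and is only split later by a (wrapped) token filter:

 curl -XGET 'localhost:9200/{index}/_analyze?tokenizer=xml_tokenizer' \
      -d '<tag attr="one two">some-text v1.2</tag>'

Per the rules above, this should return the tokens <tag>, <tag/@attr>, one, two, </tag/@attr>, some-text, v1.2, and </tag> (offsets omitted here).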

Offset Calculations

The XML Tokenizer attempts to mark the offset of all tokens to most closely match the original text.

For example:

 00000000001111111111222222222233333333334444444444
 01234567890123456789012345678901234567890123456789
 <meta name="dc.creator" content="arcuser"/>
 
 0:5   6:10        12:22     22:23        24:31          33:40  40:41           41:43
 <meta><meta/@name>dc.creator</meta/@name><meta/@content>arcuser</meta/@content></meta>

WARNINGS ON OFFSET CALCULATIONS:

Due to the limitations of the XMLStreamReader character offset calculations, offsets will only be correct if the following rules are followed:

  1. No whitespace before the root tag other than a single new line.
  2. A single space before all attributes (violations cause offset problems within that XML tag only; offsets are reset for the next tag)
  3. xmlns attributes must be first in any XML tag
  4. Only three entities are allowed: &amp; &lt; &gt;. Any content containing other entities will have offsets (within the string) computed as if the entities did not exist.
  5. There are no CDATA sections

Note that these rules only affect the offset calculations. If you do not care about offsets (i.e. you do not expect to need to highlight words in the XML), then none of these rules apply: the XML tokenizer will work (other than offsets) for any XML.

NAMESPACE HANDLING

  1. Namespace attributes (the xmlns:*= attributes) are never included in the output
  2. Attributes and elements with namespaces are indexed with their prefix values, for example: <gpo:test> or <file/@xlink:href>


XML Token Filter Wrapper: xml_filter_wrapper

The XML Filter Wrapper wraps other standard token filters so that XML tag tokens which go down the analysis chain are not modified by linguistic analysis.

For example, the XML:

 <Doc><Text-Content>Hello World!</Text-Content></Doc>

would result in the following tokens:

 <Doc>
 <Text-Content>
 Hello
 World!
 </Text-Content>
 </Doc>

The goal is to apply token filters to just the "Hello" and "World!" tokens without touching any of the XML tag tokens.

This can be accomplished with the "xml_filter_wrapper" token filter. This wrapper can wrap other token filters by specifying the class name of the filter factory as the "factory_class" parameter.

For example, wrapping the lowercase filter would be done as follows:

 index.analysis.filter.xml_wrap_lowercase.type=xml_filter_wrapper
 index.analysis.filter.xml_wrap_lowercase.factory_class=org.elasticsearch.index.analysis.LowerCaseTokenFilterFactory

If the factory has other parameters of its own, these can be included along with the "factory_class" parameter.

For example, to wrap the word delimiter filter, do the following:

 index.analysis.filter.xml_wrap_word_delim.type=xml_filter_wrapper
 index.analysis.filter.xml_wrap_word_delim.factory_class=org.elasticsearch.index.analysis.WordDelimiterTokenFilterFactory
 index.analysis.filter.xml_wrap_word_delim.generate_word_parts=true
 index.analysis.filter.xml_wrap_word_delim.generate_number_parts=true
 index.analysis.filter.xml_wrap_word_delim.catenate_words=true
 index.analysis.filter.xml_wrap_word_delim.catenate_numbers=true
 index.analysis.filter.xml_wrap_word_delim.catenate_all=true
 index.analysis.filter.xml_wrap_word_delim.split_on_case_change=true
 index.analysis.filter.xml_wrap_word_delim.split_on_numerics=true
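
Putting the pieces together, a complete custom analyzer that combines the XML tokenizer with a wrapped lowercase filter might be declared as follows (the analyzer name "xml_content" is an illustrative choice, not a predefined name):

 {
   "analysis": {
     "filter": {
       "xml_wrap_lowercase": {
         "type": "xml_filter_wrapper",
         "factory_class": "org.elasticsearch.index.analysis.LowerCaseTokenFilterFactory"
       }
     },
     "analyzer": {
       "xml_content": {
         "type": "custom",
         "tokenizer": "xml_tokenizer",
         "filter": ["xml_wrap_lowercase"]
       }
     }
   }
 }

With this analyzer, content tokens such as |Hello| are lowercased to |hello|, while tag tokens such as |<Doc>| pass through unchanged.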


Determining the Filter Factory Class

In order to use the xml_filter_wrapper token filter, you need to identify the class name of the filter factory for the token filter you want to wrap. This can be done by looking through the Elasticsearch source code.

As of Elasticsearch 1.4, the following classes were defined:

  • org.elasticsearch.index.analysis.AbstractCompoundWordTokenFilterFactory
  • org.elasticsearch.index.analysis.ApostropheFilterFactory
  • org.elasticsearch.index.analysis.ArabicNormalizationFilterFactory
  • org.elasticsearch.index.analysis.ArabicStemTokenFilterFactory
  • org.elasticsearch.index.analysis.ASCIIFoldingTokenFilterFactory
  • org.elasticsearch.index.analysis.BrazilianStemTokenFilterFactory
  • org.elasticsearch.index.analysis.CJKBigramFilterFactory
  • org.elasticsearch.index.analysis.CJKWidthFilterFactory
  • org.elasticsearch.index.analysis.ClassicFilterFactory
  • org.elasticsearch.index.analysis.CommonGramsTokenFilterFactory
  • org.elasticsearch.index.analysis.CzechStemTokenFilterFactory
  • org.elasticsearch.index.analysis.DelimitedPayloadTokenFilterFactory
  • org.elasticsearch.index.analysis.DutchStemTokenFilterFactory
  • org.elasticsearch.index.analysis.EdgeNGramTokenFilterFactory
  • org.elasticsearch.index.analysis.ElisionTokenFilterFactory
  • org.elasticsearch.index.analysis.FrenchStemTokenFilterFactory
  • org.elasticsearch.index.analysis.GermanNormalizationFilterFactory
  • org.elasticsearch.index.analysis.GermanStemTokenFilterFactory
  • org.elasticsearch.index.analysis.HindiNormalizationFilterFactory
  • org.elasticsearch.index.analysis.HunspellTokenFilterFactory
  • org.elasticsearch.index.analysis.IndicNormalizationFilterFactory
  • org.elasticsearch.index.analysis.KeepTypesFilterFactory
  • org.elasticsearch.index.analysis.KeepWordFilterFactory
  • org.elasticsearch.index.analysis.KeywordMarkerTokenFilterFactory
  • org.elasticsearch.index.analysis.KStemTokenFilterFactory
  • org.elasticsearch.index.analysis.LengthTokenFilterFactory
  • org.elasticsearch.index.analysis.LimitTokenCountFilterFactory
  • org.elasticsearch.index.analysis.LowerCaseTokenFilterFactory (TESTED, works)
  • org.elasticsearch.index.analysis.NGramTokenFilterFactory
  • org.elasticsearch.index.analysis.PatternCaptureGroupTokenFilterFactory
  • org.elasticsearch.index.analysis.PatternReplaceTokenFilterFactory
  • org.elasticsearch.index.analysis.PersianNormalizationFilterFactory
  • org.elasticsearch.index.analysis.PorterStemTokenFilterFactory
  • org.elasticsearch.index.analysis.ReverseTokenFilterFactory
  • org.elasticsearch.index.analysis.RussianStemTokenFilterFactory
  • org.elasticsearch.index.analysis.ScandinavianFoldingFilterFactory
  • org.elasticsearch.index.analysis.ScandinavianNormalizationFilterFactory
  • org.elasticsearch.index.analysis.ShingleTokenFilterFactory
  • org.elasticsearch.index.analysis.SnowballTokenFilterFactory
  • org.elasticsearch.index.analysis.SoraniNormalizationFilterFactory
  • org.elasticsearch.index.analysis.StandardTokenFilterFactory
  • org.elasticsearch.index.analysis.StemmerOverrideTokenFilterFactory
  • org.elasticsearch.index.analysis.StemmerTokenFilterFactory
  • org.elasticsearch.index.analysis.StopTokenFilterFactory
  • org.elasticsearch.index.analysis.SynonymTokenFilterFactory
  • org.elasticsearch.index.analysis.TrimTokenFilterFactory
  • org.elasticsearch.index.analysis.TruncateTokenFilterFactory
  • org.elasticsearch.index.analysis.UniqueTokenFilterFactory
  • org.elasticsearch.index.analysis.UpperCaseTokenFilterFactory
  • org.elasticsearch.index.analysis.WordDelimiterTokenFilterFactory (TESTED, works)


Not all Filters may be Wrappable

The XML filter wrapper, by necessity, makes some assumptions about the token filter which it is wrapping.

Specifically:

  1. The filter must process a single input token at a time
    • Filters which take multiple tokens and join them together, for example, will not work with the filter wrapper.
    • Filters which return multiple output tokens will work OK.
    • Filters which discard tokens will also work OK.
  2. The filter must allow incrementToken() to be called after false has been returned.
    • This typically works with token filters, but it may not work with all filter types.

The standard Lucene token stream does not support wrapping of other token filters, and so the xml_filter_wrapper creates a "pseudo stream" which is used to isolate the wrapped filter. This pseudo stream imposes the restrictions identified above.

Note that the filters which have been tested OK are identified in the list above.