AspireObject (Aspire 2)


For Information on Aspire 3.1 Click Here

As of version 0.5, Aspire no longer uses W3C DOM XML as its native method for encoding metadata. It now uses a JSON-compatible structure called "AspireObject".

History of AspireObject

Prior to version 0.5, Aspire was proudly "XML throughout": all metadata was specified in XML DOM objects.

However, over time, some weaknesses of this approach have arisen:

  • Customers have requested JSON output from Aspire
  • XML DOM objects are heavy-weight (they use more memory) compared with simpler Map and List approaches.
  • XML DOM objects are slow for simple name/value pair accesses
    • These types of accesses account for a majority of Aspire metadata manipulation
    • Simple accesses, such as accessing a child element from a parent node, are unfortunately slow in XML DOM
  • XML DOM objects take up more space when serialized and de-serialized than JSON representations
    • This is typically about 40% more bytes
  • JSON is gaining more and more industry support, and Aspire had no ready method for JSON output

All of these weaknesses are resolved with the new AspireObject container.

AspireObject is Not Thread Safe

To maintain performance and low memory usage, AspireObject is not thread safe. It is expected that AspireObject is only used by a single thread at a time. This is the same as W3C XML DOM, which was also not thread safe.
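
One common way to honour a single-thread-at-a-time rule is thread confinement: each task builds its own object and hands it over only when it is finished. The sketch below illustrates the idea with a plain HashMap standing in for an AspireObject. The ConfinementSketch class and its buildDoc and buildAll helpers are illustrative names only and are not part of Aspire.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a HashMap stands in for a non-thread-safe metadata object.
final class ConfinementSketch {

    // Builds one document entirely on the calling thread.
    static Map<String, String> buildDoc(String title) {
        Map<String, String> doc = new HashMap<>();
        doc.put("title", title);
        doc.put("length", Integer.toString(title.length()));
        return doc;
    }

    // Each task owns its map while building it. Future.get() hands the finished
    // map to the caller, so no two threads touch the same map at the same time.
    static List<Map<String, String>> buildAll(List<String> titles) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Future<Map<String, String>>> pending = new ArrayList<>();
            for (final String title : titles) {
                Callable<Map<String, String>> task = () -> buildDoc(title);
                pending.add(pool.submit(task));
            }
            List<Map<String, String>> docs = new ArrayList<>();
            for (Future<Map<String, String>> future : pending) {
                docs.add(future.get());
            }
            return docs;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        for (Map<String, String> doc : buildAll(Arrays.asList("White Fang", "Call of the Wild"))) {
            System.out.println(doc.get("title") + " -> " + doc.get("length"));
        }
    }
}
```

If an object really must be shared between threads, the alternative is for the caller to guard every access with the same lock, since the object itself provides no synchronization.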

Benefits of AspireObject

AspireObject brings substantive benefits for Aspire overall:

  • Basic object access is now simpler and faster
  • AspireObject can be written directly to XML or JSON streams or files
  • AspireObjects can be directly created from XML or JSON streams or files
  • AspireObject itself is much simpler and easier to use than the previous AspireDocument / AXML classes.
  • Multiple AspireObjects can be easily combined by reference.
    • XML DOM required that all components come from the same document owner, which made combining the results of multiple sub-jobs (for example) more expensive in memory and CPU

Challenges and Solutions

This change, however, was not without its challenges:

  • Can XPath still be used over JSON objects?
    • YES: This was accomplished with Apache JXPath. XPaths can still be used to select nodes from AspireObjects.
  • Can AspireObjects still be transformed with XSLT?
    • YES: AspireObjects are also XMLReaders. This allows them to produce SAX events which can be input to XSLT transforms.
  • How does one represent XML content with attributes?
    • Both content and attributes are separately stored in AspireObjects, allowing for content to have attributes.
    • Content with attributes is represented as the special "$" name in JSON
    • Attributes are represented as names prefixed with "@" in JSON
  • How does one represent embedded XML content?
    • Content can be an array of items, including embedded content and embedded objects.

Allows for Other Improvements

Finally, as we have gone through this process, we are happy to report that the overall design of Aspire is cleaner and more programmer-friendly throughout. Some improvements enabled by this change include:

  • Merging of "Job properties" with "document variables"
    • Both have been merged into a single list inside the job, called "Job Variables"
  • AspireObject is much simpler than the older AspireDocument or AXML
    • AXML and AspireDocument previously had about 87 Java methods.
    • This has been reduced in AspireObject to just 36 methods.
  • Code intended to encourage use of Aspire "standards" will be moved to a new "Standards" class inside the framework.
    • Much of this was originally in AspireDocument. It will be moved to Standards.
    • This further reduces the churn and size of the AspireObject implementations
  • Note that Job result data will also be represented in AspireObject
  • Access to AspireObjects in Groovy scripting will now be much simpler
    • DOMCategory is no longer needed
    • Contents of objects can be accessed directly, as if they were fields of objects

Configuration is Still XML

Note that Aspire configuration files are still represented in XML. There is no plan to change this at any time in the future.

But Component Status is Now Aspire Objects

In order to make Aspire more friendly for JavaScript-based user interfaces (such as the new Admin console), all component status is now represented in AspireObjects and can be reported by Aspire in either XML or JSON.

AspireObjects Support JSON

The new metadata holder is called an "object" because it is equivalent to a JSON object, with extensions that make it more compatible with XML.

JSON support in AspireObject now includes:

Maps: All AspireObject instances can have a nested Map of name/value pairs.
Primitive Types: Values stored in AspireObjects can be primitive types such as Integer, Float, Boolean, etc., not just String.
Lists: AspireObject values can hold lists of objects. Items in the list can be any type, from a primitive object to a full embedded map.
Created from JSON: AspireObjects can be created from strings and streams of characters which represent JSON content.
Write to JSON: AspireObjects can be written out as valid JSON.
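
To make the mapping concrete, here is a minimal sketch of how nested maps, lists and primitive values can be written out as JSON. It uses only standard Java collections. JsonSketch, toJson, quote and sample are hypothetical names for illustration, not Aspire classes or methods, and the escaping is deliberately simplified.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Aspire code: writes maps, lists and primitives as JSON.
final class JsonSketch {

    static String toJson(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof String) {
            return quote((String) value);
        }
        if (value instanceof Number || value instanceof Boolean) {
            return value.toString();
        }
        if (value instanceof Map) {
            StringBuilder sb = new StringBuilder("{");
            String sep = "";
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                sb.append(sep).append(quote(String.valueOf(e.getKey())))
                  .append(": ").append(toJson(e.getValue()));
                sep = ", ";
            }
            return sb.append("}").toString();
        }
        if (value instanceof List) {
            StringBuilder sb = new StringBuilder("[");
            String sep = "";
            for (Object item : (List<?>) value) {
                sb.append(sep).append(toJson(item));
                sep = ", ";
            }
            return sb.append("]").toString();
        }
        throw new IllegalArgumentException("Unsupported type: " + value.getClass());
    }

    // Escapes only backslash and double quote; a full writer would also
    // escape control characters.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // A named root ("doc") holding a string, a list, a boolean and a number.
    static Map<String, Object> sample() {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("title", "Call of the Wild");
        fields.put("file", Arrays.asList("a.txt", "b.txt"));
        fields.put("indexed", true);
        fields.put("score", 1.5);
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("doc", fields);
        return root;
    }

    public static void main(String[] args) {
        System.out.println(toJson(sample()));
    }
}
```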

AspireObjects Support XML

The AspireObject has extensions to simple JSON objects specifically to improve handling of XML. This is critically important because XML is used so heavily in and throughout Aspire document processing.

Therefore, several extensions to JSON were added to create an AspireObject which works well with XML:

AspireObjects are named - Every AspireObject has a name. This makes AspireObjects roughly equivalent to XML elements.
Attribute Handling - Special methods exist on AspireObject for setting and getting attributes. Attributes can be added to any value (e.g. Strings, Integers, etc.) stored in an AspireObject.
Repeated Elements - Repeated elements are automatically stored as embedded arrays in AspireObject when read from XML. These are automatically unwound when writing the object back to XML.
Mixed Content Handling - Elements which contain mixed text and embedded markup are automatically stored as arrays of content.
XML can be read directly into AspireObjects - AspireObject can be built from SAX streams, allowing XML to automatically build an AspireObject.
AspireObjects can be written to XML - AspireObjects can be used as an XMLReader which produces a SAX event stream. This can be serialized to files or transformed into W3C DOM as necessary.
AspireObjects can be transformed - Since AspireObjects can produce SAX streams, they can also be inputs to XSLT transformations.
AspireObjects can be searched with XPath - Apache JXPath has been implemented as a method for searching AspireObject trees, so that standard XPath expressions can be used to access data from AspireObjects.
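
The "object as XMLReader" idea can be illustrated with standard JAXP alone. The sketch below is not Aspire code: SaxSketch is a hypothetical reader that emits SAX events for an in-memory title instead of parsing text, and an identity Transformer serializes those events to a string. An XSLT stylesheet could be plugged in the same way by creating the Transformer from a stylesheet Source.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch: an XMLReader that produces SAX events from in-memory data.
final class SaxSketch extends XMLFilterImpl {
    private final String title;

    SaxSketch(String title) {
        this.title = title;
    }

    // This reader has no parent parser, so configuration requests are accepted and ignored.
    @Override
    public void setFeature(String name, boolean value) { }

    @Override
    public void setProperty(String name, Object value) { }

    // Instead of parsing text, emit the events for <doc><title>...</title></doc>.
    @Override
    public void parse(InputSource unused) throws SAXException {
        ContentHandler out = getContentHandler();
        AttributesImpl noAttributes = new AttributesImpl();
        out.startDocument();
        out.startElement("", "doc", "doc", noAttributes);
        out.startElement("", "title", "title", noAttributes);
        out.characters(title.toCharArray(), 0, title.length());
        out.endElement("", "title", "title");
        out.endElement("", "doc", "doc");
        out.endDocument();
    }

    // Runs the events through an identity transform and returns the XML text.
    static String toXml(String title) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter xml = new StringWriter();
            identity.transform(new SAXSource(new SaxSketch(title), new InputSource()),
                               new StreamResult(xml));
            return xml.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml("White Fang"));
    }
}
```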

Basic AspireObject Usage

Creating New Objects

Creates an empty AspireObject.

 AspireObject aspireObj = new AspireObject("doc");

JSON:

 {"doc": null}

XML:

 <doc/>

Adding Named Elements to an Object

 aspireObj.add("title", "Call of the Wild");
 aspireObj.add("author", "Jack London");

JSON:

 {
   "doc":
     {
       "title": "Call of the Wild",
       "author": "Jack London"
     }
 }

XML:

 <doc>
   <title>Call of the Wild</title>
   <author>Jack London</author>
 </doc>

Using set() to Change a Value

 aspireObj.set("title", "White Fang");

JSON:

 {
   "doc":
     {
       "title": "White Fang",
       "author": "Jack London"
     }
 }

XML:

 <doc>
   <title>White Fang</title>
   <author>Jack London</author>
 </doc>

AspireObjects Have Names

Like XML Elements, AspireObjects have names:

 AspireObject aspireObj = new AspireObject("doc", "this is the content of my doc");

These named objects are represented in JSON as a parent-map:

 {"doc": "this is the content of my doc"}

In XML:

 <doc>this is the content of my doc</doc>

In the above example, "doc" is not actually a key for any map. It is merely the name of the AspireObject.

Named Children

All children added to an Aspire object will also have names:

 AspireObject aspireObj = new AspireObject("doc");
 aspireObj.add("title","A supercalifragilistic title!");
 aspireObj.add("author","Dick Van Dyke");

These are stored in the aspireObj in a map and are represented in JSON as a nested JSON object:

 {
   "doc":
     {
       "title": "A supercalifragilistic title!",
       "author": "Dick Van Dyke"
     }
 }

And in XML:

 <doc>
   <title>A supercalifragilistic title!</title>
   <author>Dick Van Dyke</author>
 </doc>

All Named Child Elements are also AspireObjects

...even those which are just simple Strings or Integers.

When adding named elements to an AspireObject using add(), a new AspireObject is created for each element added:

 AspireObject aspireObj = new AspireObject("doc");
 AspireObject titleObj = aspireObj.add("title","Call of the Wild");
 
 titleObj instanceof AspireObject   -->  Is true

This small amount of extra overhead allows for these objects to contain attributes, an important feature of XML.

In the above example, titleObj is an AspireObject which contains a String, and not String itself.

In addition, child elements are all named:

 titleObj.getName()   -->  Returns "title"
 titleObj.getContent()  -->  Returns "Call of the Wild"

Multiple Children with the Same Name become Embedded Arrays

The following XML structure:

 <doc>
   <file>a.txt</file>
   <file>b.txt</file>
   <file>c.txt</file>
 </doc>

Will be represented in JSON as an array:

 {
   "doc":
     {
       "file": ["a.txt", "b.txt", "c.txt"]
     }
 }

This structure will be implemented automatically using the add() method of AspireObject:

 AspireObject aspireObj = new AspireObject("doc");
 aspireObj.add("file", "a.txt");
 aspireObj.add("file", "b.txt");
 aspireObj.add("file", "c.txt");

Note that get("file") will return just the first child:

 aspireObj.get("file")  -->  Returns the AspireObject which contains "a.txt"

get() therefore returns at most one AspireObject: the first child with that name, or null if no such child exists.

A special method exists to fetch all children for a node:

 List<AspireObject> allFiles = aspireObj.getAll("file");   // Always returns a list, even with just one
 
 allFiles.size()  -->  Is 3

getAll() will always return a List<AspireObject> which is never null. If there are no children, it will return an empty list.
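
The add(), get() and getAll() behaviour described above can be modelled with a map from child name to a list of children. The following is a hypothetical stand-in written for illustration (MiniObject is not the Aspire class, and the real implementation may differ); it only mirrors the contract stated in this section.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not the Aspire class: groups repeated child names
// into per-name lists.
final class MiniObject {
    private final String name;
    private final Object content;
    private final Map<String, List<MiniObject>> children = new LinkedHashMap<>();

    MiniObject(String name, Object content) {
        this.name = name;
        this.content = content;
    }

    String getName() {
        return name;
    }

    Object getContent() {
        return content;
    }

    // Wraps the value in a new named child and appends it to that name's list.
    MiniObject add(String childName, Object value) {
        MiniObject child = new MiniObject(childName, value);
        children.computeIfAbsent(childName, k -> new ArrayList<>()).add(child);
        return child;
    }

    // First child with the given name, or null when there is none.
    MiniObject get(String childName) {
        List<MiniObject> sameName = children.get(childName);
        return sameName == null ? null : sameName.get(0);
    }

    // All children with the given name, in insertion order; never null.
    List<MiniObject> getAll(String childName) {
        List<MiniObject> sameName = children.get(childName);
        return sameName == null
                ? Collections.<MiniObject>emptyList()
                : Collections.unmodifiableList(sameName);
    }

    public static void main(String[] args) {
        MiniObject doc = new MiniObject("doc", null);
        doc.add("file", "a.txt");
        doc.add("file", "b.txt");
        doc.add("file", "c.txt");
        System.out.println(doc.get("file").getContent());
        System.out.println(doc.getAll("file").size());
        System.out.println(doc.getAll("missing").isEmpty());
    }
}
```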

Fetching All Children

Because all AspireObjects are named, this makes fetching children quite useful:

 AspireObject aspireObj = new AspireObject("doc");
 aspireObj.add("title", "A clutch of files");
 aspireObj.add("file", "a.txt");
 aspireObj.add("file", "b.txt");
 aspireObj.add("author", "Rip Van Winkle");
 
 List<AspireObject> childrenList = aspireObj.getChildren();
 
 childrenList.size() -->  returns 4
 
 childrenList.get(2).getName()  -->  returns "file"
 childrenList.get(2).getContent()  -->  returns "b.txt"

In the above example, childrenList contains all children, including both instances of the "file" child.

Content Support

AspireObjects can have both content as well as name/value pairs. This is an extension over standard JSON objects to provide better XML support.

Special methods are available for accessing content

 aspireObj.setContent("This is the content of my node");
 
 aspireObj.getContent()                 // Returns "This is the content of my node"
 
 aspireObj.addContent("More Content");  // Adds more content to the existing content
                                        // (creates a list of content objects)

AspireObjects with just content are represented as just content

 AspireObject aspireObj = new AspireObject("doc");
 aspireObj.add("name", "George Washington");

JSON:

{"doc":
  {"name": "George Washington"     <<< "name" is actually an embedded AspireObject,
                                       but printed here as a simple string
  }
}

XML:

 <doc>
   <name>George Washington</name>
 </doc>

AspireObjects with content and attributes are displayed as "$" in JSON

 AspireObject aspireObj = new AspireObject("doc");
 AspireObject nameObj = aspireObj.add("name", "George Washington");
 nameObj.setAttribute("jobType", "president");
 

JSON:

{"doc":
  {"name":
    {"$": "George Washington",
     "@jobType": "president"}
  }
}

XML:

 <doc>
   <name jobType="president">George Washington</name>
 </doc>
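
The rule illustrated by the last two examples (plain value when there are no attributes, a "$" entry plus "@"-prefixed names when there are) can be sketched in a few lines. DollarSketch and its render method are hypothetical names used for illustration; this is not how Aspire itself is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the "$" / "@" convention for a single string value.
final class DollarSketch {

    // Without attributes the value is written as a plain JSON string.
    // With attributes it becomes an object: content under "$", attributes under "@name".
    static String render(String content, Map<String, String> attributes) {
        if (attributes.isEmpty()) {
            return quote(content);
        }
        StringBuilder sb = new StringBuilder("{\"$\": ").append(quote(content));
        for (Map.Entry<String, String> attr : attributes.entrySet()) {
            sb.append(", ").append(quote("@" + attr.getKey()))
              .append(": ").append(quote(attr.getValue()));
        }
        return sb.append("}").toString();
    }

    // Escapes only backslash and double quote, which is enough for this sketch.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        System.out.println(render("George Washington", attrs));
        attrs.put("jobType", "president");
        System.out.println(render("George Washington", attrs));
    }
}
```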

Multiple adds to content are stored as a list

...to handle embedded markup

XML content can have text and embedded tags mixed. When this occurs, the content will be represented in the AspireObject as a list of items. This can be implemented with addContent(), as follows:

 AspireObject aspireObj = new AspireObject("doc");
 AspireObject title = aspireObj.add("title");
 title.addContent("This is a ");
 title.addContent(new AspireObject("b", "really big"));
 title.addContent(" statement of truth!");

When written to JSON, this will look like this:

{
   "doc": {
      "title": {
        "$": ["This is a ", {"b": "really big"}, " statement of truth!"]
      }
   }
}

Note that when the content is a list, it must be put under the "$" tag, to distinguish it from situations where there are multiple XML tags with the same name.

In XML it looks like this:

 <doc>
   <title>This is a <b>really big</b> statement of truth!</title>
 </doc>

Now suppose the title has an attribute:

 title.setAttribute("type", "lovely");

The JSON will now look like this:

{
  "doc": {
      "title": {
        "$": ["This is a ", {"b": "really big"}, " statement of truth!"],
        "@type": "lovely"
      }
   }
}

In XML:

 <doc>
   <title type="lovely">This is a <b>really big</b> statement of truth!</title>
 </doc>
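
As a rough model of mixed content, an element can hold an ordered list whose items are either text or nested elements, and be written to XML by walking that list. MixedContentSketch below is a hypothetical illustration of that idea, not the Aspire implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an element whose content is an ordered list of
// strings and nested elements.
final class MixedContentSketch {
    final String name;
    final Map<String, String> attributes = new LinkedHashMap<>();
    final List<Object> content = new ArrayList<>();

    MixedContentSketch(String name) {
        this.name = name;
    }

    // Appends text or a nested element; returns this so calls can be chained.
    MixedContentSketch addContent(Object item) {
        content.add(item);
        return this;
    }

    String toXml() {
        StringBuilder sb = new StringBuilder("<").append(name);
        for (Map.Entry<String, String> a : attributes.entrySet()) {
            sb.append(' ').append(a.getKey()).append("=\"")
              .append(escape(a.getValue())).append('"');
        }
        if (content.isEmpty()) {
            return sb.append("/>").toString();
        }
        sb.append('>');
        for (Object item : content) {
            if (item instanceof MixedContentSketch) {
                sb.append(((MixedContentSketch) item).toXml());
            } else {
                sb.append(escape(String.valueOf(item)));
            }
        }
        return sb.append("</").append(name).append('>').toString();
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Rebuilds the title element from the example above.
    static MixedContentSketch sample() {
        MixedContentSketch title = new MixedContentSketch("title");
        title.attributes.put("type", "lovely");
        title.addContent("This is a ")
             .addContent(new MixedContentSketch("b").addContent("really big"))
             .addContent(" statement of truth!");
        return title;
    }

    public static void main(String[] args) {
        System.out.println(sample().toXml());
    }
}
```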

Attribute Support

AspireObjects have special built-in support for attributes to make them more compatible with XML.

Special methods are available for setting and getting attributes

 aspireObj.setAttribute("type", "human");       // Set the type attribute on the object
 
 String type = aspireObj.getAttribute("type");  // Fetches the type attribute from the AspireObject

Attribute values can only be strings

 aspireObj.setAttribute("type", 32);                        // Throws an exception
 
 aspireObj.setAttribute("type", new AspireObject("data"));  // Throws an exception

Attributes are represented in JSON as names preceded by "@"

 AspireObject aspireObj = new AspireObject("doc");
 aspireObj.setAttribute("from", "US-GPO");
 AspireObject nameObj = aspireObj.add("name", "Obama");
 nameObj.setAttribute("type", "president");

JSON:

 {
   "doc": {
     "@from": "US-GPO",
     "name": {
       "$": "Obama",
       "@type": "president"
     }
   }
 }

XML:

 <doc from="US-GPO">
   <name type="president">Obama</name>
 </doc>

Miscellaneous Notes

null Handling

It is possible to create an object with null as its content and with no named elements:

 AspireObject aspireObj = new AspireObject("doc");

In JSON this is represented as:

 {"doc": null}

In XML this is represented as:

 <doc/>

Adding named elements with no content

New named elements added to an AspireObject with no content will be assumed to be null:

 AspireObject aspireObj = new AspireObject("doc");
 aspireObj.add("title");

In JSON this is represented as:

 {
   "doc":
     {
       "title":null
     }
 }

In XML this is represented as:

 <doc>
   <title/>
 </doc>

Adding null content is generally ignored

The following are ignored (do not throw an exception):

 AspireObject aspireObj = new AspireObject("doc");
 
 aspireObj.addContent(null);  // ignored
 aspireObj.add("$");          // ignored, same as addContent(null)

What is not Supported

XML Unsupported

  • Ordered elements of varying names
    • Since elements are stored in HashMaps in AspireObject, the order of these elements is not preserved.
    • However, multiple children of the same name are stored in a list in AspireObject, and so their order IS preserved.
  • XML Namespaces
    • Uncertain what is possible here or will be ultimately supported here
  • XML Processing Instructions
    • These can not be represented in AspireObject
  • XML Comments

JSON Unsupported

  • Names must be valid XML names
    • Currently, we expect that all names in JSON name/value pairs will need to be valid XML names. For example, names must start with a letter or underscore, and can contain letters, digits, underscores, dashes or periods.
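
A name check along these lines could be written as a simple pattern match. The sketch below is an ASCII-only approximation of the rule as stated here, not the full XML Name production (which also allows many non-ASCII letters and the colon). NameCheckSketch and isValidName are hypothetical names and are not part of Aspire.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: ASCII-only approximation of the naming rule described above.
final class NameCheckSketch {

    // First character: letter or underscore.
    // Remaining characters: letters, digits, underscore, dash or period.
    private static final Pattern NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_.-]*");

    static boolean isValidName(String name) {
        return name != null && NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"title", "_id", "file-name.v2", "2fast", "has space", "@type"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + isValidName(sample));
        }
    }
}
```

Note that "@type" and "$" fail this check, which is consistent with their being reserved for attributes and content in the JSON form rather than being usable as element names.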