Groovy Scripting


Groovy Scripting
Factory  aspire-groovy
subType  default
Inputs  The AspireObject object passed down the pipeline
Outputs  As determined by the Groovy Scripting module.
View 0.4
Documentation

Groovy scripting lets you quickly create new pipeline modules without having to write Java code and build a JAR for the component. Groovy scripts can do most of the things that a regular Java pipeline stage can do.

  • For examples of various Groovy Scripts, see Groovy Scripting Examples
  • Groovy scripts are pooled and reused to minimize re-compiling of scripts
    • Therefore, all scripts are thread safe.

When Not to Use a Groovy Script

All of the following are planned as near-term enhancements.

Metadata Mapper - Mapping metadata values using the prioritized metadata mapper.

Shared Components - Currently, Groovy scripting modules cannot be shared, deployed, or distributed except within a system configuration file. Therefore, code which is to be widely shared should probably be implemented in a Java stage.

Anything Big or Complex - Large or complex logic would probably benefit from more careful unit testing with a Java-based JUnit test bench.

More Information on Groovy

Groovy is a Java-based scripting language which compiles directly into Java byte code. Its advantages are that it runs very fast in the JVM and it has full access to all Java objects, classes, and methods.

Configuration

Element | Type | Default | Description
script | string | none | The Groovy script to be executed on every job in the pipeline.
startup | string | none | The Groovy script to be executed on startup. See discussion below about what is and is not allowed inside of startup scripts.
branches | Branch Handler Configuration | none | You can add a standard Branch Handler configuration to your Groovy stage which will allow you to access the "bh" variable to create and branch sub jobs. See Branch Handler for more information.
variable | Static Variable | none | Allows you to add static variables which are initialized when the stage is initialized, and then are available for all executions of the script across all jobs. See the discussion below for more details.

Your Scripts are Inside XML

Don't forget that your scripts are embedded in XML. Therefore, you must escape the <, >, and & characters (as &lt;, &gt;, and &amp;) if you need them in your script.

Alternatively, you could enclose the content of your script in a <![CDATA[ ]]> tag, for example:

<component name="MyTest2" subType="default" factoryName="aspire-groovy">
  <script>
    <![CDATA[
      if(true && 5 == 5 && 12 > 3)
        println "TRUE!!"
      else
        println "FALSE!!"
    ]]>
  </script>
</component>
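As a rough check that the two approaches are equivalent, the sketch below parses an entity-escaped script element and a CDATA-wrapped one with the JDK's XML parser and compares the resulting text. The parseScriptText helper is a hypothetical name introduced only for this illustration; it is not part of Aspire, and Aspire's own configuration loader may differ in detail.

```groovy
import javax.xml.parsers.DocumentBuilderFactory

// Hypothetical helper: returns the text content of the root element of an XML snippet.
def parseScriptText = { String xml ->
    def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    def document = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
    return document.documentElement.textContent.trim()
}

// The same one-line script, once with escaped entities and once inside CDATA.
def escaped = '<script>if (true &amp;&amp; 5 == 5 &amp;&amp; 12 &gt; 3) println "TRUE!!"</script>'
def cdata   = '<script><![CDATA[if (true && 5 == 5 && 12 > 3) println "TRUE!!"]]></script>'

def fromEscaped = parseScriptText(escaped)
def fromCdata   = parseScriptText(cdata)

// Both forms should hand the identical Groovy source to the script engine.
assert fromEscaped == fromCdata
println fromEscaped
```

CDATA is usually the more readable choice for anything longer than a line or two, since the script can be pasted in unchanged.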

Variables

Your Groovy script will automatically have a number of variables predefined before the script is called. These variables can be used directly from your script (no special setup or imports required).

Variable When Available Description
job always Java Type = Job

References the job which is being processed. You can use this variable to create new sub-jobs, check on job status, wait for sub-jobs, etc.

doc always Java Type = AspireObject

The AspireObject which holds all of the metadata for the current document being processed. This is the same as job.get() - the job's data object.

component always Java Type = StageImpl which is derived from ComponentImpl

This variable provides access to the component itself. This can be used for a variety of useful tasks, such as logging, accessing other components, getting Aspire Home, and turning relative paths to Aspire Home into absolute paths. Note that StageImpl extends ComponentImpl, where all of the most useful methods are located.

bh If <branches> configured Java Type = BranchHandler

If a <branches> tag is defined within the <config> for your Groovy stage, then this variable will contain a reference to the loaded branch handler. You can use this to create sub-jobs with your Groovy stage and then branch those sub-jobs (with the bh.enqueue() method) to other pipeline managers and pipelines. Note: if branching the current job, use job.setBranch() instead.

contentStream After Fetch URL Java Type = java.io.InputStream

The variable contains a reference to the Java InputStream object most recently opened by the prior Fetch URL stage.

Fetch URL doesn't actually read any data from the URL, it only opens a stream to the URL. It depends on a later stage (typically Extract Text or XML Loader) to do something with that stream.

jdbc After RDBMS Connection Java Type = java.sql.Connection

Contains a pointer to a Java SQL Connection object, which can be used to execute SQL statements against the connected database. The connection is created from the connection pool by RDBMS Connection stage and then stored into the "jdbc" variable so it can be accessed by Groovy.

Job Variables

All of the variables in the AspireObject instance map are automatically available as variables in your Groovy Script. This makes it possible for you to create variables which are attached to your document which can then be used by your own script, or other Groovy scripts in other pipeline modules. See AspireObject for more details about the AspireObject instance.

Note: These variables are attached to the AspireObject instance within the job and not to the Groovy stage itself. The values for these variables are therefore passed down the pipeline with the job and are available to any later Groovy stage which processes the same job.

Within Groovy code, you can access variables just like regular variables.

Example 1: Setting a document variable (will be stored in the AspireObject)

 myVar = 12;

Example 2: Using a document variable

 print myVar + 33;

Example 3: Setting a local variable (will not be stored in the AspireObject)

 def myVar = 32;

Example 4: Setting the variable with putVariable() (works the same as example 1)

 doc.putVariable("myVar",12);
 println myVar + 32;

Note that, once the "myVar" variable is set with doc.putVariable() (directly, as in example #4, or implicitly, as in every other example here except example #3), it is attached to the document. All Groovy scripts which process the same document will have access to the "myVar" variable.

Example 5: Setting the variable with job. prefix (works the same as example 1)

 job.myVar = 12;
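The difference between examples 1 and 3 follows from how Groovy scripts treat undeclared versus def-declared variables. The sketch below shows that general Groovy behaviour using a plain groovy.lang.Binding. It is an assumption that Aspire backs the binding with the AspireObject in a comparable way; no Aspire classes are used here.

```groovy
// A plain Groovy binding stands in for the document's variable map (assumption).
def binding = new Binding()
def shell = new GroovyShell(binding)

// Undeclared assignment: the value is stored in the binding (like example 1).
shell.evaluate("myVar = 12")

// 'def' declares a script-local variable: nothing is stored in the binding (like example 3).
shell.evaluate("def localVar = 32")

// A later script sharing the same binding can read myVar (like example 2).
def result = shell.evaluate("myVar + 33")

assert binding.hasVariable("myVar")
assert !binding.hasVariable("localVar")
assert result == 45
```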

Hierarchical Job Variables

In Aspire, jobs can be split into a series of smaller jobs called "sub jobs". Every sub-job maintains a link to the parent job from which it was derived.

This forms a "job hierarchy". For example, you might have a large XML file which has multiple data records within it. Each of the records within the XML file may be processed by a sub job. Sub jobs are typically extracted with "Sub Job Extractors" such as the Tabular Files Extractor or the XML Sub Job Extractor.

When accessing a variable, the following procedure is followed:

  1. Check to see if the variable is on the current job's AspireObject
  2. If not, then check to see if the job has a parent job.
    1. If so, then look for the variable in the parent job's AspireObject
  3. Continue checking up the job hierarchy until the variable is found or you have reached a job with no parent.

What this means is that referencing a variable as follows will first check the current job, and then will automatically check the parent job and grandparent jobs (if any):

 println myVar

If no job in the hierarchy has the variable, then the reference evaluates to null.

Newly set variables are always placed in the current job, and not in the parent job. For example:

 myNewVar = 12
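The lookup procedure described in this section can be modelled in a few lines of plain Groovy. SketchJob and lookup are hypothetical names used only for illustration; they are not the Aspire Job or AspireObject classes, and the real implementation may differ.

```groovy
// Hypothetical stand-in for a job with a parent link and its own variables.
class SketchJob {
    SketchJob parent
    Map<String, Object> variables = [:]
}

// Walk up the hierarchy until the variable is found or there is no parent left.
def lookup = { SketchJob start, String name ->
    def current = start
    while (current != null) {
        if (current.variables.containsKey(name)) {
            return current.variables[name]
        }
        current = current.parent
    }
    return null   // no job in the hierarchy has the variable
}

def grandparent = new SketchJob(variables: [myVar: 12])
def parentJob   = new SketchJob(parent: grandparent)
def child       = new SketchJob(parent: parentJob)

// Reading falls through to the grandparent; writing stays on the current job.
def found = lookup(child, "myVar")
child.variables.myNewVar = 12

assert found == 12
assert lookup(child, "missing") == null
assert !parentJob.variables.containsKey("myNewVar")
```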

Static Variables

Static variables are variables which are attached to the instance of the Groovy stage. These variables are initialized when the stage is initialized, and are then available to all other jobs which pass through the stage.

Note that these variables are not attached to the job - but to the stage.

An example of a stage with static variables looks like this:

 <component name="MyTest" subType="default" factoryName="aspire-groovy">
   <variable name="test">100+1</variable>
   <variable name="test_string">"my string"</variable>
   <script>
     println "The value of test is:  " + test;
     println "The value of test_string is:  " + test_string;
   </script>
 </component>

Note that the Groovy script which is the content of the variable is only executed once, when the stage is initialized.

Static variables are initialized with groovy script, for example:

 <variable name="test">return 5+3;</variable>

Note that the "return" statement is optional:

 <variable name="test">5+3</variable>

Note that the variable script can include any other variables previously defined:

 <variable name="test">return 5+3;</variable>
 <variable name="secondtest">return test*20;</variable>

Since the contents of the variable tag is Groovy, constant strings need to be surrounded by double quotes:

 <variable name="theFileName">"data/testfile.txt"</variable>

Also, you may need import statements for classes:

 <variable name="count">
    import java.util.concurrent.atomic.AtomicInteger;
    return new AtomicInteger();
 </variable>
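To illustrate the "initialized once, shared by every job" behaviour, the sketch below evaluates a few initializer scripts a single time, in declaration order, and then runs a per-job script three times against the same binding. This is only a model built on GroovyShell and Binding; how the Groovy stage actually stores its static variables is an assumption here.

```groovy
def binding = new Binding()
def shell = new GroovyShell(binding)

// Initializer scripts, evaluated once and in order, so later ones can use earlier ones.
def initializers = [
    test      : "return 5+3;",
    secondtest: "return test*20;",
    count     : "import java.util.concurrent.atomic.AtomicInteger;\nreturn new AtomicInteger();"
]
initializers.each { name, source ->
    binding.setVariable(name, shell.evaluate(source))
}

// The "per job" script sees the same AtomicInteger instance on every run.
def jobScript = shell.parse("count.incrementAndGet()")
3.times { jobScript.run() }

def secondtest = binding.getVariable("secondtest")
def finalCount = binding.getVariable("count").get()

assert secondtest == 160
assert finalCount == 3
```

Because one instance is shared across jobs, a thread-safe type such as AtomicInteger is a sensible choice for counters.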

Component Variables

Component variables are like static variables. They are initialized with the <variable> tag, as follows:

 <variable name="namesHash" component="/MySystem/NamesHashTable"/>

Once the component variable is declared, you can then access the methods exported by that component. For example:

 <component name="MyTest" subType="default" factoryName="aspire-groovy">
   <variable name="namesHash" component="/MySystem/NamesHashTable"/>
   <script>
     println "The size of the hash table is:  " + namesHash.size();
     println "The values associated with xxnamexx are:  " + namesHash.get("xxnamexx");
   </script>
 </component>

Technical Details: When the variable is initialized, the Groovy stage will create an OSGi service tracker for the component. When each job is processed, the Groovy stage will fetch the latest object from the service tracker and assign it to the variable, so that the Groovy script will always have the latest reference to the object (even if the component is refreshed).

Accessing Another Groovy Component's Static Variables

Components can access another Groovy component's variables as long as those variables are specified in a <variable> tag. For example:

Component 1:

 <component name="MyTest" subType="default" factoryName="aspire-groovy">
   <variable name="thisIsATestVar">return 'Hello World!'</variable>
   <script> . . .  </script>
 </component>

Component 2:

 <component name="SecondTest" subType="default" factoryName="aspire-groovy">
   <variable name="otherComponent" component="MyTest"/>
   <script>
     println "Getting a variable from another component: " + otherComponent.getVariable("thisIsATestVar");
   </script>
 </component>


Notice that, in component 2, there is a variable (otherComponent) which holds a reference to the first component. This reference can now be used to access the variables of that other component.

Startup Scripts

A Groovy script which is to be executed once, when the stage starts up, can be specified with the <startup> tag. For example:

  <!-- Create an embedded database in memory -->
  <component name="EmbeddedDB" subType="default" factoryName="aspire-derby"/>
  
  <component name="InitDB" subType="default" factoryName="aspire-groovy">
    <variable name="embeddedDB" component="EmbeddedDB"/>
    <startup>
      import groovy.sql.Sql;
      def sqlConn = embeddedDB.getConnection();
      gsql = new Sql(sqlConn);
      gsql.execute('''create table PERSON (
          id integer not null primary key,
          firstname varchar(20),
          lastname varchar(20),
          location_id integer,
          location_name varchar(30)
      )''')
      embeddedDB.closeConnection(sqlConn);
    </startup>
  </component>

On-Line Script Testing

The Groovy stage provides a method to test scripts. From the status page of any running Groovy stage, the user may enter the text of a script to test along with the XML of a document to run against.

The return value is then displayed back via the interface.

The script is not added to the component; this is entirely for testing. If the tested script is to be used in an Aspire system, it must be added to the appropriate system configuration file.

NOTE: If the script writes output, be careful as to the command used.

System.out.println("hello") writes to the console the Felix container is running in, and is not reported back to the interface, while println("world") is returned to the interface, but not written to the Felix console.
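One plausible explanation for this difference is standard Groovy behaviour: a script's println writes to the "out" variable of its binding when one is set, while System.out.println always goes to the JVM's standard output. The sketch below demonstrates that mechanism with a plain GroovyShell. Whether the Aspire test interface captures output in exactly this way is an assumption.

```groovy
// Capture whatever the script prints through its binding's "out" variable.
def captured = new StringWriter()
def binding = new Binding()
binding.setVariable("out", new PrintWriter(captured))

def shell = new GroovyShell(binding)
shell.evaluate('''
println("world")              // goes to the binding's "out" writer
System.out.println("hello")   // goes straight to the JVM console
''')

binding.getVariable("out").flush()
def returnedText = captured.toString().trim()

// Only the text written with println was captured.
assert returnedText == "world"
```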


Using Groovy with the Branch Handler

The <branches> tag inside the Groovy stage configures the branch handler. Once it is configured, you can use "bh.enqueue()" to branch your new sub jobs.

<component name="SubJobBranchesExample" subType="default" factoryName="aspire-groovy">
  <script>
    <![CDATA[
 
 import com.searchtechnologies.aspire.services.AspireObject;
 import com.searchtechnologies.aspire.services.Job;
 
 for(int x = 0 ; x < 10 ; x++) {
   AspireObject subDoc = new AspireObject("doc");
   subDoc.add("X", String.valueOf(x));
   subDoc.add("OtherVar", "otherValue");
 
   Job subJob = job.createSubJob(subDoc, job.getJobId() + "-" + x);
   bh.enqueue(subJob, "onPublish");
 }
 
    ]]>
  </script>
  <branches>
     <branch event="onPublish" pipelineManager="ProcessSubJobPipelineManager" 
              pipeline="process-document" />
  </branches>
</component>

If you're unsure where to place the <branches> tag, see the branch handler for more details.

Branching the Current Job

You can also use "job.setBranch()" to branch the current job to an event on the pipeline manager.

 <component name="federate" subType="default" factoryName="aspire-groovy">
   <script>
     <![CDATA[
     .
     .
     // Set the main job to branch so we miss the unfederated query
     job.setBranch("onFederatedQuery");
   ]]>
   </script>
 </component>

If you're unsure where to place the <branches> tag, see the branch handler for more details.

Using a 3rd Party Jar from a Groovy script (using Reflection)

Below is a base/example script that will allow you to incorporate functionality from external jar files into a groovy stage in the pipeline.

Normally, 3rd Party jars must be "wrapped" to be used in Aspire. This requires a wrapper stage to be coded, but if you only want to use a small number of methods in the jar file, this Groovy stage method can be used.

The script assumes that the jar file(s) and their dependencies have been copied into the "lib" folder of the Aspire distribution. An example folder structure is shown below:

Aspire
  lib
  config
  data
  bin
  felix-cache


//=======================================================================================================
// Details of 3rd Party Jar
//=======================================================================================================

// The third party jar
def basePathLangDetect = "lib/langdetect.jar";
def basePathJSONIC = "lib/jsonic-1.3.0.jar";
def basePathCatCon = "lib/tgcatcon.jar";
def basePaths = [basePathLangDetect, basePathJSONIC, basePathCatCon] as String[];

// Classes to use
def classDetectorFactory = "com.cybozu.labs.langdetect.DetectorFactory";
def classJSON = "net.arnx.jsonic.JSON";
def classJSONException = "net.arnx.jsonic.JSONException";

//=======================================================================================================
// Base methods to load 3rd Party Classes
//=======================================================================================================
// There should be no major reason to change this code.

// Converts a path (relative to the working directory) into an absolute file: URL
def getRelativePath = { basePath ->
	return new File(basePath).toURI().toURL();
}

// Obtains the classloader consisting of the original class loader with the jar file added.
// Use this if you only have a single 3rd Party jar file.
// param: path - the path to the jar file
def getClassLoader = { path ->
	def classLoader = ClassLoader.systemClassLoader
	while (classLoader.parent) {
		classLoader = classLoader.parent
	}
	// File.toURI().toURL() yields a valid file: URL; a bare path string has no protocol
	def newClassLoader = new URLClassLoader([getRelativePath(path)] as URL[], classLoader);
	return newClassLoader;
}

// Obtains the classloader consisting of the original class loader with the jar files added.
// Use this if you have multiple 3rd Party jar files.
// param: paths - an array of the paths to the jar files
def getMultiClassLoader = { paths ->
	def classLoader = ClassLoader.systemClassLoader
	while (classLoader.parent) {
		classLoader = classLoader.parent
	}
	ArrayList<URL> urls = new ArrayList<URL>(paths.length);
	for (String path:paths) {
		urls.add(getRelativePath(path));
	}
	def newClassLoader = new URLClassLoader(urls.toArray([] as URL[]), classLoader);
	return newClassLoader;
}

// Obtains a new instance of a class using the classloader and a constructor with an empty argument list
def getClass = { cName, cLoader ->
	// load the class
	Class clazz = cLoader.loadClass(cName)
	return clazz.newInstance()
}

// Obtains a new instance of a class using the classloader and a constructor with the specified array of arguments.
// paramTypes is an array of classes representing the Constructor parameters.
// paramValues is an array of Objects representing the Constructor parameter values.
def getClassWithArgs = { cName, cLoader, paramTypes, paramValues ->
	// load the class
	Class clazz = cLoader.loadClass(cName)
	return clazz.getConstructor(paramTypes).newInstance(paramValues)
}

// Load the class using the classloader.
// Use this if you don't need an instance of the class but the script needs to be aware of it.
def loadClass = { cName, cLoader ->
	return cLoader.loadClass(cName)
}
//=======================================================================================================
// End Base methods to load 3rd Party Classes
//=======================================================================================================


//=======================================================================================================
// Example usage
//=======================================================================================================

// Get a Class Loader for all 3rd Party Jars. Need to use the same one for all classes.
classLoader = getMultiClassLoader(basePaths);

// Load the extra classes required by LangDetect
jsonException = loadClass(classJSONException, classLoader);
json = loadClass(classJSON, classLoader);

// Get a DetectorFactory and load the files from the LangDetect profiles directory
detectorFactory = loadClass(classDetectorFactory, classLoader);
detectorFactory.loadProfile("C:/Profiles");

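// Instantiate a class with no constructor parameters.
// Note: classname is not defined above; it must hold the fully qualified name of the class to instantiate.
catHandle = getClass(classname, classLoader);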

// Call a method
catHandle.addServer("127.0.0.1", 6500);

// Example of instantiating a class with constructor parameters (equivalent to cal = new SimpleTimeZone(0, "en"))
cal = getClassWithArgs("java.util.SimpleTimeZone", classLoader, [int.class, String.class] as Class[], [0, "en"] as Object[]);
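
The reflection calls used above can be tried without any 3rd party jar by pointing them at JDK classes. The following is a minimal, self-contained sketch under that assumption: it uses an empty URLClassLoader (no extra jars) and loads only classes that ship with the JDK, so it shows the calling pattern rather than a real 3rd party integration.

```groovy
// An URLClassLoader with no extra jars; JDK classes resolve through its parent.
def loader = new URLClassLoader([] as URL[], ClassLoader.systemClassLoader)

// Instantiate a class through a constructor with arguments
// (equivalent to new SimpleTimeZone(0, "en")).
def tzClass = loader.loadClass("java.util.SimpleTimeZone")
def tz = tzClass.getConstructor([int.class, String.class] as Class[])
                .newInstance([0, "en"] as Object[])

// Call a static method through the loaded Class reference;
// Groovy dispatches the call to the class it represents.
def integerClass = loader.loadClass("java.lang.Integer")
def parsed = integerClass.parseInt("42")

assert tz.getID() == "en"
assert tz.rawOffset == 0
assert parsed == 42
```

Objects created this way are untyped from the script's point of view, so method calls on them are resolved dynamically at run time.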