Groovy Scripting 0.4

From wiki.searchtechnologies.com

Groovy Scripting Stage
Description: Executes a Groovy script for each job. The script has access to the AspireDocument and the XML that it contains.
Inputs: The AspireDocument object passed down the pipeline
Outputs: Determined by the Groovy script that the stage runs.
Factory: aspire-groovy (previously aspire.Groovy)
Sub Type: default
Object Type: AspireDocument

Other Notes

Groovy scripting lets you create new pipeline modules quickly, without writing Java code and packaging a JAR for the component. Groovy scripts can do many of the things that a regular Java pipeline stage can do.

  • For examples of various Groovy Scripts, see Groovy Scripting Examples
  • Groovy scripts are pooled and reused to minimize re-compiling of scripts
    • Therefore, all scripts are thread safe.

Configuration

Element | Type | Default | Description
script | string | none | The Groovy script to be executed on every job in the pipeline.
startup | string | none | The Groovy script to be executed on startup. See discussion below about what is and is not allowed inside of startup scripts.
branches | Branch Handler Configuration | none | You can add a standard Branch Handler configuration to your Groovy stage, which allows you to access the "bh" variable to create and branch sub jobs. See Branch Handler for more information.
variable | Static Variable | none | Allows you to add static variables which are initialized when the stage is initialized, and are then available for all executions of the script across all jobs. See the discussion below for more details.

Your Scripts are Inside XML

Don't forget that your scripts are embedded in XML. Therefore, you must escape the <, >, and & characters if you need them in your script.

Alternatively, you can enclose the content of your script in a <![CDATA[ ]]> section, for example:

<component name="MyTest2" subType="default" factoryName="aspire-groovy">
  <config>
    <script>
      <![CDATA[
        if(true && 5 == 5 && 12 > 3)
          println "TRUE!!"
        else
          println "FALSE!!"
      ]]>
    </script>
  </config>
</component>
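
To compare the two options above, the following standalone Groovy sketch embeds the same script text in a <script> element, once with escaped characters and once inside CDATA, and then parses both back. It is illustrative only: the helper names (escapeForXml, scriptText) are hypothetical, it uses the JDK's XML parser rather than Aspire's configuration loader, and it assumes the script does not itself contain the sequence ]]>.

```groovy
import javax.xml.parsers.DocumentBuilderFactory

// Groovy source that contains XML-special characters (& and >).
def source = 'if (true && 5 == 5 && 12 > 3) println "TRUE!!"'

// Hypothetical helper: escape the characters XML reserves in element text.
// The ampersand must be replaced first so later entities are not re-escaped.
def escapeForXml = { String s ->
    s.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
}

// Hypothetical helper: parse an XML string and return the root element's text.
def scriptText = { String xml ->
    def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    def parsed = builder.parse(new ByteArrayInputStream(xml.getBytes('UTF-8')))
    return parsed.documentElement.textContent
}

def viaEscaping = '<script>' + escapeForXml(source) + '</script>'
def viaCdata = '<script><![CDATA[' + source + ']]></script>'

// Both encodings yield the original script text once the XML is parsed.
assert scriptText(viaEscaping) == source
assert scriptText(viaCdata) == source
println viaEscaping
```

CDATA is usually the easier choice for anything longer than a line or two, since the script stays readable as Groovy.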

When Not to Use a Groovy Script

All of the following are planned as near-term enhancements.

Metadata Mapper - Mapping metadata values using the prioritized metadata mapper.

Shared Components - Currently, Groovy scripting modules cannot be shared, deployed, or distributed except within a system configuration file. Therefore, code which is to be widely shared should probably be implemented in a Java stage.

Anything Big or Complex - Large or complex logic would probably benefit from more careful unit testing with a Java-based JUnit test bench.

More Information on Groovy

Groovy is a Java-based scripting language which compiles directly into Java byte code. Its advantages are that it runs very fast in the JVM and it has full access to all Java objects, classes, and methods.

Variables

Your Groovy script will automatically have a number of variables pre-defined before the script is called. These variables can be used directly from your script; no special setup or imports are required.

Variable When Available Description
job always Java Type = com.searchtechnologies.aspire.services.Job

References the job which is being processed. You can use this variable to create new sub-jobs, check on job status, wait for sub-jobs, etc.

doc always Java Type = com.searchtechnologies.aspire.services.AspireDocument which is derived from AXML

The AspireDocument object which holds all of the metadata for the current document being processed. This is the same as job.getObject() - the job's data object.

Note that the AspireDocument is a sub-class of the Aspire AXML class, a very useful class for manipulating and querying XML DOMs.

dom always Java Type = org.w3c.dom.Element

The W3C DOM root Element object of the XML which contains all of the metadata in the AspireDocument. This will be the same as doc.getMyElement(). The "dom" variable can be used within a Groovy use(groovy.xml.dom.DOMCategory) { } declaration to access the AspireDocument XML in a Groovy way.

component always Java Type = com.searchtechnologies.aspire.framework.StageImpl which is derived from ComponentImpl

This variable provides access to the component itself. This can be used for a variety of useful tasks, such as logging, accessing other components, getting Aspire Home, and turning relative paths to Aspire Home into absolute paths. Note that StageImpl extends ComponentImpl, where all of the most useful methods are located.

bh If <branches> configured Java Type = com.searchtechnologies.aspire.framework.BranchHandler

If a <branches> tag is defined within the <config> for your groovy stage, then this variable will contain a pointer to the loaded branch handler. You can use this to create sub-jobs with your groovy stage and then branch those sub-jobs (with the bh.enqueue() method) to other pipeline managers and pipelines. Note, if branching the current job, use job.setBranch() instead.

contentStream After Fetch URL Java Type = java.io.InputStream

The variable contains a reference to the Java InputStream object most recently opened by the prior Fetch URL stage.

Fetch URL doesn't actually read any data from the URL; it only opens a stream to the URL. It depends on a later stage (typically Extract Text or XML Loader) to do something with that stream.

jdbc After RDBMS Connection Java Type = java.sql.Connection

Contains a pointer to a Java SQL Connection object, which can be used to execute SQL statements against the connected database. The connection is obtained from the connection pool by the RDBMS Connection stage and then stored in the "jdbc" variable so it can be accessed by Groovy.
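
One general way for a host to pre-define a variable such as "dom" is a Groovy Binding. The sketch below is not Aspire's implementation (this page does not show it). It only illustrates the technique outside Aspire, using a plain W3C Element as a stand-in for the AspireDocument root; the XML content and the variable name "title" are made up for the example.

```groovy
import javax.xml.parsers.DocumentBuilderFactory

// Build a small DOM to stand in for the AspireDocument XML (made-up content).
def xml = '<doc><title>Example</title></doc>'
def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
def root = builder.parse(new ByteArrayInputStream(xml.getBytes('UTF-8'))).documentElement

// The host places the element into the binding under the name "dom" ...
def binding = new Binding()
binding.setVariable('dom', root)

// ... so the script can reference "dom" without declaring or importing anything.
def script = '''
    use(groovy.xml.dom.DOMCategory) {
        return dom.title.text()
    }
'''
def title = new GroovyShell(binding).evaluate(script)
assert title == 'Example'
```

Inside the use(groovy.xml.dom.DOMCategory) block, child elements can be navigated by name, which is what the description of the "dom" variable refers to as accessing the XML "in a Groovy way".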

Job Variables

All of the variables in the AspireDocument object map are automatically available as variables in your Groovy Script. This makes it possible for you to create variables which are attached to your document which can then be used by your own script, or other Groovy scripts in other pipeline modules.

Note: These variables are attached to the AspireDocument object within the job, not to the Groovy stage itself. Their values are therefore passed down the pipeline with the job and are available to any later Groovy stage which processes the same job.

Within Groovy code, you can access variables just like regular variables.

Example 1: Setting a document variable (will be stored in the AspireDocument)

 myVar = 12;

Example 2: Using a document variable

 print myVar + 33;

Example 3: Setting a local variable (will not be stored in the AspireDocument)

 def myVar = 32;

Example 4: Setting the variable with putVariable() (works the same as example 1)

 doc.putVariable("myVar",12);
 println myVar + 32;

Note that once the "myVar" variable is set on the document, whether by plain assignment or with doc.putVariable() (all of the examples above except example #3, which declares a local variable), it is attached to the document. All Groovy scripts which process the same document will have access to the "myVar" variable.
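
The difference between a document variable and a local variable mirrors standard Groovy script behaviour: an undeclared assignment goes into the script's binding, while a "def" declaration stays local. The standalone sketch below shows this with a plain Binding standing in for the variables attached to the AspireDocument. How Aspire connects the binding to the document is not shown on this page, so treat the mapping as an assumption.

```groovy
// A shared Binding stands in for the variables attached to the AspireDocument.
def documentVariables = new Binding()
def shell = new GroovyShell(documentVariables)

// First script: an undeclared assignment is stored in the binding,
// while a 'def' declaration stays local to this script run.
shell.evaluate('''
    myVar = 12
    def localVar = 32
''')

assert documentVariables.hasVariable('myVar')
assert !documentVariables.hasVariable('localVar')

// Second script (think: a later Groovy stage) sees the stored variable.
def sum = shell.evaluate('myVar + 33')
assert sum == 45
```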

Hierarchical Job Variables

In Aspire, jobs can be split into a series of smaller jobs called "sub jobs". Every sub-job maintains a link to the parent job from which it was derived.

This forms a "job hierarchy". For example, you might have a large XML file which has multiple data records within it. Each of the records within the XML file may be processed by a sub job. Sub jobs are typically extracted with "Sub Job Extractors" such as the Tabular Files Extractor or the XML Sub Job Extractor.

When accessing a variable, the following procedure is followed:

  1. Check to see if the variable is on the current job's AspireDocument
  2. If not, then check to see if the job has a parent job.
    1. If so, then look for the variable in the parent job's AspireDocument
  3. Continue checking up the job hierarchy until the variable is found or you have reached a job with no parent.

What this means is that referencing a variable:

 println myVar

will first check the current job, and then will automatically check the parent job and grandparent jobs (if any). If no job in the hierarchy has the variable, the reference evaluates to null.

All new variables set, for example:

 myNewVar = 12

will be placed in the current job, and not in the parent job.
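
The lookup procedure described above can be sketched in a few lines of standalone Groovy. SketchJob is a hypothetical stand-in class, not an Aspire type; it only models a job with an optional parent and a map of variables.

```groovy
// Hypothetical stand-in for a job: a parent link plus the variables on its document.
class SketchJob {
    SketchJob parent
    Map<String, Object> variables = [:]

    // Walk up the hierarchy until the variable is found or there is no parent left.
    Object lookup(String name) {
        for (SketchJob j = this; j != null; j = j.parent) {
            if (j.variables.containsKey(name)) {
                return j.variables[name]
            }
        }
        return null
    }
}

def grandparent = new SketchJob(variables: [myVar: 'from grandparent'])
def parent = new SketchJob(parent: grandparent)
def child = new SketchJob(parent: parent)

// Reads fall through to the nearest ancestor that has the variable.
assert child.lookup('myVar') == 'from grandparent'

// Writes land on the current job only, so ancestors do not see them.
child.variables.myNewVar = 12
assert child.lookup('myNewVar') == 12
assert parent.lookup('myNewVar') == null

// A variable that no job in the hierarchy defines resolves to null.
assert child.lookup('missing') == null
```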

Static Variables

Static variables are variables which are attached to the instance of the Groovy stage. These variables are initialized when the stage is initialized, and are then available to every job which passes through the stage.

Note that these variables are not attached to the job - but to the stage.

An example of a stage with static variables looks like this:

 <component name="MyTest" subType="default" factoryName="aspire-groovy">
   <config>
     <variable name="test">100+1</variable>
     <variable name="test_string">"my string"</variable>
     <script>
       println "The value of test is:  " + test;
       println "The value of test_string is:  " + test_string;
     </script>
   </config>
 </component>

Note that the Groovy script which is the content of the variable is only executed once, when the stage is initialized.

Static variables are initialized with a Groovy script, for example:

 <variable name="test">return 5+3;</variable>

Note that the "return" statement is optional:

 <variable name="test">5+3</variable>

Note that the variable script can include any other variables previously defined:

 <variable name="test">return 5+3;</variable>
 <variable name="secondtest">return test*20;</variable>

Since the content of the variable tag is Groovy, constant strings need to be surrounded by double quotes:

 <variable name="theFileName">"data/testfile.txt"</variable>

Also, you may need import statements for classes:

 <variable name="count">
    import java.util.concurrent.atomic.AtomicInteger;
    return new AtomicInteger();
 </variable>
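
The behaviour described in this section (each variable script runs once, later variables can use earlier ones, and the resulting objects are shared by every execution of the main script) can be imitated outside Aspire. The following standalone sketch uses GroovyShell and Binding as stand-ins; the names definitions, statics and perJobScript are hypothetical, and Aspire's own initialization code may differ.

```groovy
// Variable definitions in declaration order (a Groovy map literal keeps insertion order).
def definitions = [
    test      : 'return 5+3;',
    secondtest: 'return test*20;',
    count     : '''
        import java.util.concurrent.atomic.AtomicInteger;
        return new AtomicInteger();
    '''
]

// Evaluate each definition once; earlier results are visible to later definitions.
def statics = new Binding()
definitions.each { name, source ->
    statics.setVariable(name, new GroovyShell(statics).evaluate(source))
}

// The per-job script is parsed once and run for several "jobs" against the same statics.
def perJobScript = new GroovyShell(statics).parse('count.incrementAndGet()')
3.times { perJobScript.run() }

assert statics.getVariable('test') == 8
assert statics.getVariable('secondtest') == 160
assert statics.getVariable('count').get() == 3
```

Because the same AtomicInteger instance is reused for every run, it works as a counter across jobs, which is the kind of use the "count" example above suggests.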

Component Variables

Component variables are like static variables. They are initialized with the <variable> tag, as follows:

 <variable name="namesHash" component="/MySystem/NamesHashTable"/>

Once the component variable is declared, you can then access the methods exported by that component. For example:

 <component name="MyTest" subType="default" factoryName="aspire-groovy">
   <config>
     <variable name="namesHash" component="/MySystem/NamesHashTable"/>
     <script>
       println "The size of the hash table is:  " + namesHash.size();
       println "The values associated with xxnamexx are:  " + namesHash.get("xxnamexx");
     </script>
   </config>
 </component>

Technical Details: When the variable is initialized, the Groovy stage will create an OSGi service tracker for the component. When each job is processed, the Groovy stage will fetch the latest object from the service tracker and assign it to the variable, so that the Groovy script will always have the latest reference to the object (even if the component is refreshed).

Accessing Another Groovy Component's Static Variables

Components can access another Groovy component's variables as long as those variables are specified in a <variable> tag. For example:

Component 1:

 <component name="MyTest" subType="default" factoryName="aspire-groovy">
   <config>
     <variable name="thisIsATestVar">return 'Hello World!'</variable>
     <script> . . .  </script>
   </config>
 </component>

Component 2:

 <component name="SecondTest" subType="default" factoryName="aspire-groovy">
   <config>
     <variable name="otherComponent" component="MyTest"/>
     <script>
       println "Getting a variable from another component: " + otherComponent.getVariable("thisIsATestVar");
     </script>
   </config>
 </component>


Notice that, in component 2, there is a variable (otherComponent) which holds a reference to the first component. This reference can then be used to access the variables of that other component.

Startup Scripts

A Groovy script to be executed once, when the stage starts up, can be specified with the <startup> tag. For example:

  <!-- Create an embedded database in memory -->
  <component name="EmbeddedDB" subType="default" factoryName="aspire-derby"/>
  
  <component name="InitDB" subType="default" factoryName="aspire-groovy">
    <config>
      <variable name="embeddedDB" component="EmbeddedDB"/>
      <startup>
        import groovy.sql.Sql;

        def sqlConn = embeddedDB.getConnection();
        gsql = new Sql(sqlConn);
        gsql.execute('''create table PERSON (
            id integer not null primary key,
            firstname varchar(20),
            lastname varchar(20),
            location_id integer,
            location_name varchar(30)
        )''')
        embeddedDB.closeConnection(sqlConn);
      </startup>
    </config>
  </component>

On-Line Script Testing

The Groovy stage provides a method to test scripts. From the status page of any running Groovy stage, the user may enter the text of a script to test along with the XML of a document to run against.

The return value is then displayed back via the interface.

The script is not added to the component; this is entirely for testing. If the tested script is to be used in an Aspire system, it must be added to the appropriate system configuration file.

NOTE: If the script writes output, choose the output command carefully.

System.out.println("hello") writes to the console in which the Felix container is running and is not reported back to the interface. println("world") is returned to the interface but is not written to the Felix console.
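
This difference is consistent with how Groovy scripts handle println in general: a script's println writes to the "out" variable of its binding when one is set, whereas System.out.println always goes to the JVM's standard output. The standalone sketch below demonstrates that Groovy behaviour. Whether the Aspire test page captures output in exactly this way is an assumption.

```groovy
// Capture whatever the script prints through its binding's "out" variable.
def captured = new StringWriter()
def writer = new PrintWriter(captured)
def binding = new Binding([out: writer])

new GroovyShell(binding).evaluate('''
    println("world")             // goes to the binding's "out"
    System.out.println("hello")  // goes to the JVM's standard output
''')
writer.flush()

// Only the plain println call was captured.
assert captured.toString().trim() == 'world'
```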


Using Groovy with the Branch Handler

Note the <branches> tag inside the Groovy stage configures the branch handler, and then you can use "bh.enqueue()" to branch your new sub jobs.

<component name="SubJobBranchesExample" subType="default" factoryName="aspire-groovy">
  <config>
    <script>
      <![CDATA[
 
 import com.searchtechnologies.aspire.services.AspireDocument;
 import com.searchtechnologies.aspire.services.Job;
 
 for(int x = 0 ; x < 10 ; x++) {
   AspireDocument subDoc = new AspireDocument();
   subDoc.add("X", String.valueOf(x));
   subDoc.add("OtherVar", "otherValue");
 
   Job subJob = job.createSubJob(subDoc, job.getJobId() + "-" + x);
   bh.enqueue(subJob, "onPublish");
 }
 
      ]]>
    </script>
    <branches>
      <branch event="onPublish" pipelineManager="ProcessSubJobPipelineManager"
              pipeline="process-document" />
    </branches>
  </config>
</component>

If you're unsure where to place the <branches> tag, see the branch handler for more details.

Branching the Current Job

You can also use "job.setBranch()" to branch the current job to an event on the pipeline manager.

 <component name="federate" subType="default" factoryName="aspire-groovy">
   <config>
     <script>
       <![CDATA[
       .
       .
       // Set the main job to branch so we miss the unfederated query
       job.setBranch("onFederatedQuery");
     ]]>
     </script>
   </config>
 </component>

If you're unsure where to place the <branches> tag, see the branch handler for more details.

Using a 3rd Party Jar from a Groovy script (using Reflection)

Below is a base/example script that will allow you to incorporate functionality from external jar files into a groovy stage in the pipeline.

Normally, 3rd Party jars must be "wrapped" to be used in Aspire. This requires a wrapper stage to be coded, but if you only want to use a small number of methods in the jar file, the Groovy approach shown here can be used instead.

The script assumes that the jar file(s) and any dependencies have been copied into the "lib" folder of the Aspire distribution. An example folder structure is shown below:

Aspire

   --lib
   --config
   --data
   --bin
   --felix-cache

The script is a template - you only need to "fill in the blanks":

  • change the name of the jar file,
  • change the class name,
  • change the method name, and
  • populate your array (actually a list) of parameters.

 import java.util.regex.Matcher
 import java.util.regex.Pattern
 import groovy.xml.DOMBuilder
 import groovy.xml.dom.DOMCategory
 import javax.xml.xpath.*
 import java.lang.reflect.Method;
 import java.text.SimpleDateFormat;
 
 //=======================================================================================================
 //Author: Manuel Alfaro
 //=======================================================================================================
 //Script for obtaining the bbc week using the Jar file provided from the other teams: SGW and services
 //=======================================================================================================
 //May I suggest that you either enable the Groovy plugin available for your IDE or use the GroovyConsole.
 //This will simplify editing this file.
 //=======================================================================================================
 
 
 //=======================================================================================================
 //        Variables to change
 //=======================================================================================================
 
 def source = "setBBCWeek"; // script name
 
 // The third party jar
 def basePath = "lib/em3utils.jar";
 
 // Class to use
 def classname = "uk.co.bbc.fabric.em3.gateway.utils.BbcBroadcastDateServiceImpl";
 
 // log file name
 def logFile = "setbbcweek.log";
 
 //Flag used to enable/disable debugging, via the log file
 def debug = 1;
 
 
 //=======================================================================================================
 // Base code - Reflection and logging
 //=======================================================================================================
 //There should not be major reason to change this code.
 
 //Logging function: appends one line to the log file
 def log = {
   new File(logFile).append(it + "\n");
   return;
 } 
 
 //Builds a file URL string for the jar from basePath (resolved against the working directory)
 def getRelativePath = { 
     return "file://" + new File(basePath).toURI().getPath();
 }
 
 //Obtains the classloader consisting of the original class loader with the jar file added
 // param: path - the path to the jar file
 def getClassLoader = { path ->
     def classLoader = ClassLoader.systemClassLoader
     while (classLoader.parent) {
         classLoader = classLoader.parent
     }
     def newClassLoader = new URLClassLoader([new File(path).toString().toURL()] as URL[], classLoader);
     return newClassLoader;
 }
 
 //Obtains the class from the classloader
 // param: cloader - the class loader (including the jar)
 // param: clazz   - the (string) name of the required class 
 def getClass = { cloader , clazz ->
     Class<?> serviceClass = cloader.loadClass(clazz);
     return serviceClass;
 }
 
 //Invokes the method
 // param: clazz  - the class to invoke a method on 
 // param: method - the (string) name of the required method
 // param: params - array of parameters 
 
 def invokeMethod = { clazz, method, params ->
     clazz.newInstance()."$method"(*params);
 }
 
 // Call to executing the method.
 // param: method - the (string) name of the required method
 // param: params - array of parameters 
 def runLogic = { method, parameters ->
     def clz = getClass(getClassLoader(getRelativePath()),classname);
     def result = invokeMethod(clz,method,parameters);
     return result;    
 }
 
 //=======================================================================================================
 //Methods for specific stage functionality
 //=======================================================================================================
 
 //Method to obtain the week through reflection
 def getWeek = { dt ->
     def parameters = [dt];
     int weekNumber = runLogic("getWeekNumber", parameters);
     return weekNumber;
 }
 
 //Stage specific code
 def executeStage = {
     String pub_date = "";
     try {
         pub_date = dom.PUBLICATION_TIMESTAMP.item(0).firstChild.getData();
         if (!pub_date.equals("")){
             String bbcDate = "";
             SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssz");
             String pDate = pub_date.substring(0, pub_date.length()-1) + "GMT-00:00";
             Date d = df.parse(pDate);   
             int w = getWeek(d);   
             doc.add("bbcWeek", ""+w).setAttribute("source", source)
         } else {
             if (debug) log "Input field non-existent or empty";
         }
     } catch (Exception e) {
         log "exception ==> " + e.getMessage();
         e.printStackTrace();
     }    
 }
 
 //=======================================================================================================
 //Main method call
 //=======================================================================================================
 
 def main = {
     use(groovy.xml.dom.DOMCategory) {
         if(debug) log "starting ... ";
         try{
             if(debug) log "inside the main try ... ";
             executeStage();
             if(debug) log "finishing ...";
         }catch(Exception e){
             log "exception ==> " + e.getMessage();
             e.printStackTrace();
         }
     }
 }
 
 main();
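
The core of the template is only two steps: obtain a class from a URLClassLoader, then call a method whose name is chosen at run time. The reduced sketch below isolates those steps so they can be tried outside Aspire. It substitutes java.util.ArrayList for the third-party class and uses an empty URL list, both of which are stand-ins for the jar and class named in the template.

```groovy
// An empty URL list keeps the sketch runnable without any jar; in the template
// above, the list would hold the URL of the third-party jar.
def jarUrls = [] as URL[]
def loader = new URLClassLoader(jarUrls, ClassLoader.systemClassLoader)

// Stand-ins for the values you would "fill in" in the template.
def className = 'java.util.ArrayList'
def methodName = 'add'
def params = ['first item']

// Load the class by name, create an instance, and invoke the method dynamically.
def clazz = loader.loadClass(className)
def instance = clazz.getDeclaredConstructor().newInstance()
def result = instance."$methodName"(*params)

assert result == true
assert instance.size() == 1
```

The spread operator (*params) passes each list element as a separate argument, which is why the template describes the parameters as a list.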