Scheduler (Aspire 2)

From wiki.searchtechnologies.com

For Information on Aspire 3.1 Click Here


Scheduler (Aspire 2)
Factory Name  com.searchtechnologies.aspire:aspire-scheduler
subType  default
Inputs  N/A
Outputs  An AspireObject published to the configured pipeline manager.

The Aspire Scheduler uses the Quartz scheduler from Terracotta (see http://www.quartz-scheduler.org/ for details) as the backbone for scheduling jobs. The Aspire Scheduler wraps the Quartz scheduler and bundles all the Quartz classes required for operation. Thus, apart from the Aspire services and framework, the Aspire Scheduler has no dependencies.

Upon startup, the Aspire Scheduler reads its configuration from the application.xml file and creates a schedule within Quartz for each of the configured schedules. Each configured schedule is given an ID, known as the scheduleId. Optionally, the component also attempts to read schedule information from a relational database.

The configuration of the schedules will include a “cron” style definition of the job execution time. The format of this definition is described later.

The Quartz scheduler is then started. If both the scheduler and the schedule are enabled, then when the scheduled time is reached, Quartz runs a Java method. This method checks that the current schedule does not already have a job outstanding and publishes a job onto the configured Aspire pipeline. Each job published by the Aspire Scheduler is given a unique job number, known as the jobNumber.

Note that the data in the Aspire job is configurable.

Cron style execution schedule

The configuration stored in the system.xml file defines the execution time using a “cron” style format. The format used by Quartz differs from some “cron” implementations and is described below.

Cron expressions provide the ability to specify complex time combinations such as "At 8:00am every Monday through Friday" or "At 1:30am every last Friday of the month".

Cron expressions consist of six required fields and one optional field, separated by white space. In order, the fields are:


Field Name       Allowed Values      Allowed Special Characters
Seconds          0-59                , - * /
Minutes          0-59                , - * /
Hours            0-23                , - * /
Day-of-month     1-31                , - * ? / L W
Month            1-12 or JAN-DEC     , - * /
Day-of-Week      1-7 or SUN-SAT      , - * ? / L #
Year (Optional)  empty, 1970-2199    , - * /


The special characters are described below:

Character Allowed fields Description
* all Used to specify all values. For example, "*" in the minute field means "every minute".
? day-of-month, day-of-week Used to specify 'no specific value'. This is useful when you need to specify something in one of the two fields, but not the other.
- all Used to specify ranges. For example "10-12" in the hour field means "the hours 10, 11 and 12".
, all Used to specify additional values. For example "MON,WED,FRI" in the day-of-week field means "the days Monday, Wednesday, and Friday".
/ all Used to specify increments. For example, "0/15" in the seconds field means "the seconds 0, 15, 30, and 45", and "5/15" in the seconds field means "the seconds 5, 20, 35, and 50". Specifying '*' before the '/' is equivalent to specifying 0 as the value to start with. Essentially, for each field in the expression, there is a set of numbers that can be turned on or off. For seconds and minutes, the numbers range from 0 to 59; for hours, 0 to 23; for days of the month, 1 to 31; and for months, 1 to 12. The "/" character simply helps you turn on every "nth" value in the given set. Thus "7/6" in the month field only turns on month "7"; it does NOT mean every 6th month.
L day-of-month, day-of-week Short-hand for "last", but it has a different meaning in each of the two fields. For example, the value "L" in the day-of-month field means "the last day of the month": day 31 for January, day 28 for February in non-leap years. If used in the day-of-week field by itself, it simply means "7" or "SAT". But if used in the day-of-week field after another value, it means "the last xxx day of the month"; for example, "6L" means "the last Friday of the month". You can also specify an offset from the last day of the month, such as "L-3", which means the third-to-last day of the calendar month. When using the 'L' option, it is important not to specify lists or ranges of values, as you'll get confusing or unexpected results.
W day-of-month Used to specify the weekday (Monday-Friday) nearest the given day. As an example, if you were to specify "15W" as the value for the day-of-month field, the meaning is: "the nearest weekday to the 15th of the month". So if the 15th is a Saturday, the trigger will fire on Friday the 14th. If the 15th is a Sunday, the trigger will fire on Monday the 16th. If the 15th is a Tuesday, then it will fire on Tuesday the 15th. However if you specify "1W" as the value for day-of-month, and the 1st is a Saturday, the trigger will fire on Monday the 3rd, as it will not 'jump' over the boundary of a month's days. The 'W' character can only be specified when the day-of-month is a single day, not a range or list of days.

The 'L' and 'W' characters can also be combined for the day-of-month expression to yield 'LW', which translates to "last weekday of the month".

# day-of-week Used to specify "the nth" XXX day of the month. For example, the value "6#3" in the day-of-week field means the third Friday of the month (day 6 = Friday and "#3" = the 3rd one in the month). Other examples: "2#1" = the first Monday of the month and "4#5" = the fifth Wednesday of the month. Note that if you specify "#5" and there are not five of the given day-of-week in the month, then no firing will occur that month. If the '#' character is used, there can only be one expression in the day-of-week field ("3#1,6#3" is not valid, since there are two expressions).

The legal characters and the names of months and days of the week are not case sensitive.


NOTES:

  • Support for specifying both a day-of-week and a day-of-month value is not complete (you'll need to use the '?' character in one of these fields).
  • Overflowing ranges are supported, that is, ranges with a larger number on the left-hand side than the right. For example, 22-2 covers 10 o'clock at night until 2 o'clock in the morning, and NOV-FEB covers November through February. Note that overuse of overflowing ranges creates ranges that don't make sense, and no effort has been made to determine which interpretation CronExpression chooses. An example would be "0 0 14-6 ? * FRI-MON".

EXAMPLES:

  • 0 10 20 * * ? This combination is legal and fires at 8:10pm every day. Here, the stars stand for every date and every month, and the question mark for any day of the week.
  • 0 10 20 ? * 1 This combination is legal and fires at 8:10pm every Sunday. Here, the star stands for every month and the question mark for any date.
  • 0 10 20 * * 1 This combination IS NOT legal. Quartz does not accept the combination of all dates and a specific day of the week.
  • 0 10 20 ? ? SUN This combination IS NOT legal. The month can be a specific value, a range, or all (*), but not any (?).
  • * 10 20 * * ? This combination is legal, but DANGEROUS, as it fires 60 times: once for every second of the 10th minute after 8pm, every day.
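
As a further illustration, the two schedules described in prose at the start of this section can be written as Quartz cron expressions inside <cron> elements. The following is a minimal sketch: the schedule names and the onPublish event are placeholders chosen for the example, not values required by the scheduler.

```xml
<schedules>
  <!-- At 8:00am every Monday through Friday -->
  <schedule name="weekdayMorning">
    <cron>0 0 8 ? * MON-FRI</cron>
    <event>onPublish</event>
  </schedule>
  <!-- At 1:30am on the last Friday of every month (6 = Friday, L = last) -->
  <schedule name="lastFridayOfMonth">
    <cron>0 30 1 ? * 6L</cron>
    <event>onPublish</event>
  </schedule>
  <!-- At 2:00am on the last weekday (Monday to Friday) of every month -->
  <schedule name="lastWeekdayOfMonth">
    <cron>0 0 2 LW * ?</cron>
    <event>onPublish</event>
  </schedule>
</schedules>
```

In each expression, exactly one of the day-of-month and day-of-week fields is '?', in line with the first note above.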

Published Jobs

The basic configuration taken from the system.xml file allows the user to optionally specify in XML or JSON the data for the job that will be published to the pipeline. If specified, then the published job will be as configured, but the path to the scheduler, the sourceName, scheduleId and jobNumber will be added as attributes to the root tag (normally <doc>). The source id, action (start/stop/pause/resume), event type (scheduled/manual) and properties are also added.

If the job data is not specified, then an empty document is published onto the configured pipeline:

 <doc scheduler="/path/schedulerName" scheduleId="1" jobNumber="1" sourceName="myJob" sourceId="XXXX" actionProperties="full" actionType="manual" crawlId="123" action="start"/>

NOTE:

  • sourceId is only available when the schedule has come from an RDB.
  • crawlId is only available when the schedule has come from an RDB and the rdb/sql/getCrawlId SQL is configured.

This method may be used to trigger sub job processing where the contents of the job are irrelevant, but something is needed to start processing at a scheduled time.

Once jobs are published, they run to completion. Jobs that fail are logged to the scheduler log file. Jobs may run indefinitely, as they do not time out.
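
To illustrate the case where job data is configured, the sketch below pairs a schedule definition with the document that would be published for it. The attribute values are illustrative only and are modelled on the empty-document example above; the cron expression, event name and fetchUrl content are placeholders, and attributes that depend on a database (sourceId, crawlId) are left out.

```xml
<!-- Configured in the scheduler (placeholder values) -->
<schedule>
  <cron>0 0 1 * * ?</cron>
  <event>onPublish</event>
  <job>
    <![CDATA[
    <doc>
      <fetchUrl>www.searchtechnologies.com</fetchUrl>
    </doc>
    ]]>
  </job>
</schedule>

<!-- Published to the pipeline: the configured data, with the scheduler
     information added as attributes on the root tag (values illustrative) -->
<doc scheduler="/path/schedulerName" scheduleId="1" jobNumber="1"
     sourceName="myJob" action="start" actionType="scheduled">
  <fetchUrl>www.searchtechnologies.com</fetchUrl>
</doc>
```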

User interface

The user interface allows the administrator to view and update the schedules via the normal Aspire web interface.

On browsing to the Aspire Schedulers status page, the administrator is able to see the current schedules and their status. This includes the schedule, event, whether the schedule is currently enabled, its last and next execution time and whether it is currently running (i.e. has submitted a job which has not yet completed). Clicking on this schedule provides further information about the schedule, such as the job data, pipeline and last error response.

From the status page, the administrator is able to enable or disable individual schedules and enable or disable the scheduler.

The administrator is also able to add a new schedule, specifying the schedule, event, and optionally whether the schedule is enabled and is a singleton.

The administrator may also manually "fire" events, causing jobs to be published on to the Aspire pipeline. The administrator may send "start", "stop", "pause" and "resume" jobs. These jobs will specify the action in the action attribute and show "manual" as the actionType attribute.

Configuration

The scheduler recognizes the following configuration tags.

Element Type Default Description
enabled boolean true Whether the scheduler is enabled. If false, then no jobs will be submitted for any configured schedule.
schedules One or more schedules on which jobs will be fired. Also see the section on schedules stored in a database below.
schedules/schedule A schedule on which jobs will be fired.
schedules/schedule/@name String The (optional) name for the schedule.
schedules/schedule/@enabled boolean true Whether this specific schedule is enabled. If false, then no jobs will be submitted for this schedule.
schedules/schedule/@singleton boolean true Specifies that this schedule may only fire one job at a time. If true and the scheduled time is reached again, then a new job will only be published if the previous job has completed.
schedules/schedule/cron String Mandatory for this schedule Specifies the schedule in cron style (see above for the format). This must be specified for any schedule configured here.
schedules/schedule/job String Specifies the job data that will be published when the scheduled time is reached. The data can be specified in either XML or JSON style (indicated by the type attribute – see below). The data will have the scheduler information added as attributes to the root node. If not specified, an empty document will be published.
NOTE: this configuration item is a String, so XML/JSON text should be wrapped in a CDATA section: <![CDATA[ ... ]]>.
schedules/schedule/job/@type String xml Specifies style of the data in the <job> tag. Can be either xml or json.
schedules/schedule/event String Mandatory for this schedule Specifies the event to publish the job to. Must match one of the events configured in the branch handler <branches> configuration.
quartz N/A Container for the properties to be passed to the Quartz Scheduler.
quartz/property String The value of the property to be passed to the Quartz Scheduler.
quartz/property/@name String The name of the property to be passed to the Quartz Scheduler.
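
The following is a minimal sketch of how the tags in this table might be combined, including the enabled, singleton and quartz settings. It assumes the tags sit directly inside the <component> element; the schedule name and the property value are placeholders, and org.quartz.threadPool.threadCount is used only as an example of a standard Quartz property.

```xml
<component name="myScheduler" subType="default" factoryName="aspire-scheduler">
  <!-- Master switch: if false, no jobs are submitted for any schedule -->
  <enabled>true</enabled>
  <schedules>
    <!-- singleton="false" allows a new job to be published while an
         earlier job from this schedule is still running -->
    <schedule name="nightly" enabled="true" singleton="false">
      <!-- Fires at 1:00am every day -->
      <cron>0 0 1 * * ?</cron>
      <event>onPublish</event>
    </schedule>
  </schedules>
  <!-- Each property is passed through to the Quartz scheduler -->
  <quartz>
    <property name="org.quartz.threadPool.threadCount">2</property>
  </quartz>
  <branches>
    <branch event="onPublish" pipelineManager="PipelineManager" />
  </branches>
</component>
```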


The scheduler can read its schedules from a database. To configure this, the following configuration can be used:

Element Type Description
rdb/@component String If schedules should be loaded from a database, this attribute holds the path to the Aspire database connection pool component (aspire-rdb).
rdb/sql/schedules String If schedules should be loaded from a database, this element holds the SQL that will be used to extract the schedules from the database configured via the rdb/@component attribute. See below for the columns that should be returned.
rdb/sql/jobRunningCheck String If schedules taken from the RDB are singletons, this SQL will be run when the schedule fires to check whether a job is still running. If not specified, no check on the database will be performed, but the existing check making sure that the number of outstanding jobs is 0 may still prevent the job from firing. The SQL provided is a template that has values substituted. See below for the values that may be substituted.
rdb/sql/jobStarted String This SQL is run when a job is started. Typically it is used to allow singleton control via an external database. The SQL provided is a template that has values substituted. See below for the values that may be substituted.
rdb/sql/jobStopped String This SQL is run when a stop job is sent. The SQL provided is a template that has values substituted. See below for the values that may be substituted.
rdb/sql/jobPaused String This SQL is run when a pause job is sent. The SQL provided is a template that has values substituted. See below for the values that may be substituted.
rdb/sql/jobResumed String This SQL is run when a resume job is sent. The SQL provided is a template that has values substituted. See below for the values that may be substituted.
rdb/sql/jobFinished String This SQL is run when a job finishes successfully. Typically it is used to allow singleton control via an external database. This SQL may be blank, to allow completion of a job to be marked by an external process. The SQL provided is a template that has values substituted. See below for the values that may be substituted.
rdb/sql/jobFailed String This SQL is run when a job finishes with an error. Typically it is used to allow singleton control via an external database. This SQL may be blank, to allow completion of a job to be marked by an external process. However, if the job failed, the external process may not have marked the job as complete, meaning singleton jobs would be blocked. The SQL provided is a template that has values substituted. See below for the values that may be substituted.
rdb/sql/crawlId String The SQL used to determine the crawl id. If this SQL exists, it is run whenever a job is published and the result is added to the job in the crawlId attribute of the document. The first column of the first row of the result set is used as the crawl ID.
rdb/autoReloadSchedules long Time in milliseconds between automatic reloads of the schedules from the RDB. If missing or 0, automatic reloads will be disabled.

Database Schedule Selection SQL

The SQL should return the mandatory columns and may return the optional columns from the following:

Column Description
name The schedule name
enabled True if the schedule is enabled (defaults to true).
singleton True if this schedule is a singleton (defaults to true).
cron The cron schedule (mandatory).
jobType The type of data given in the jobData column (defaults to XML).
jobData The data to be sent in the job when the scheduled time is reached. This may be given in XML or JSON format, as specified by the jobType column, and should be given as a string.
event The event to publish the job on (mandatory).
sourceId The external ID (of the source) to be added to the job (if available).

The format of the columns follows the formats given in the Configuration section above. Column names can be set by use of the SQL “AS” keyword.
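
As a sketch of how this might look in practice, the fragment below selects schedules from a hypothetical table named crawl_schedules and uses AS to map its columns onto the expected names. The table, its column names and the connection pool path are assumptions made for the example, and the placement of <rdb> inside the scheduler component is inferred from the element paths listed above.

```xml
<rdb component="/common/connectionPool">
  <sql>
    <!-- Hypothetical table; AS renames each column to the name the scheduler expects -->
    <schedules>
      SELECT sched_name AS name,
             is_enabled AS enabled,
             cron_expr  AS cron,
             job_xml    AS jobData,
             event_name AS event,
             source_id  AS sourceId
      FROM crawl_schedules
    </schedules>
  </sql>
  <!-- Reload the schedules from the database every 5 minutes (300000 ms) -->
  <autoReloadSchedules>300000</autoReloadSchedules>
</rdb>
```

The cron and event columns are the two mandatory ones; the others could be dropped from the SELECT if their defaults are acceptable.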

Database Job Control SQL

SQL contained in the jobRunningCheck, jobStarted, jobStopped, jobPaused, jobResumed, jobFinished and jobFailed elements may contain variables for substitution. Variables are surrounded with { } (see Simple Templates for more details). The following variables may be specified:

Variable Available Description
scheduler always The component name of the scheduler.
scheduleId always The ID of the schedule that fired this job.
sourceName always The name of the source that fired this job.
sourceId always The source ID of the source that fired this job if available (from the sourceId column of the schedule SQL).
jobNumber jobStarted, jobStopped, jobPaused, jobResumed, jobFinished, jobFailed The unique number allocated to this job from the scheduler.
jobId jobStarted, jobStopped, jobPaused, jobResumed, jobFinished, jobFailed The job ID associated to the Job object published for this schedule.
jobSuccess jobFinished, jobFailed true if the job listener received a JobComplete event (i.e. the job completed the pipeline without failure), false otherwise.
jobResult jobFinished, jobFailed XML representation of the result from the JobEvent.
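
The sketch below shows how these variables might be used to record job state in an external database. The crawl_schedules table and its columns are hypothetical, and the statements assume that the substituted values are numeric and inserted as plain text; text columns would presumably need quoting. The jobRunningCheck element is left out because the expected shape of its result is not described here.

```xml
<rdb component="/common/connectionPool">
  <sql>
    <!-- Run when a job is started: mark the source as running -->
    <jobStarted>
      UPDATE crawl_schedules
      SET running = 1, last_job_number = {jobNumber}
      WHERE source_id = {sourceId}
    </jobStarted>
    <!-- Run when a job finishes successfully: clear the running flag -->
    <jobFinished>
      UPDATE crawl_schedules
      SET running = 0, last_status = 'OK'
      WHERE source_id = {sourceId}
    </jobFinished>
    <!-- Run when a job finishes with an error: clear the flag so that
         singleton schedules are not blocked -->
    <jobFailed>
      UPDATE crawl_schedules
      SET running = 0, last_status = 'FAILED'
      WHERE source_id = {sourceId}
    </jobFailed>
  </sql>
</rdb>
```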

Branch Configuration

The Aspire Scheduler publishes jobs using the branch manager. Thus it requires the standard Branch Handler configuration detailed below:

Element Type Description
branches/branch/@event String The event to configure. At the very least, you should include the onPublish event.
branches/branch/@pipelineManager String The URL of the pipeline manager to publish to. Can be relative.
branches/branch/@pipeline String The name of the pipeline to publish to.
branches/branch/@stage String The name of the stage to publish to.


Example Configuration

   <component name="myScheduler" subType="default" factoryName="aspire-scheduler">
     <schedules>
       <schedule name="myFirstSchedule" enabled="false">
         <cron>1/10 * * * * ?</cron>
         <event>onPublish</event>
         <job>
           <![CDATA[
           <doc>
             <fetchUrl>support.searchtechnologies.com</fetchUrl>
           </doc>
           ]]>
         </job>
       </schedule>
       <schedule enabled="false">
         <cron>2/10 * * * * ?</cron>
         <event>onPublish2</event>
       </schedule>
       <schedule enabled="false">
         <cron>3/10 * * * * ?</cron>
         <event>onPublish3</event>
         <job type="json">
           <![CDATA[
           {
             "doc" : {
               "fetchUrl" : "www.searchtechnologies.com"
             }
           }
           ]]>
         </job>
       </schedule>
       <schedule enabled="false">
         <cron>4/10 * * * * ?</cron>
         <event>onPublish4</event>
         <job type="json">
           <![CDATA[
           {
             "doc" : {
               "fetchUrl" : "repositories.searchtechnologies.com"
             }
           }
           ]]>
         </job>
       </schedule>
     </schedules>
     <branches>
       <branch event="onPublish" pipelineManager="PipelineManager" />
       <branch event="onPublish2" pipelineManager="PipelineManager" pipeline="myPipeline" />
       <branch event="onPublish3" pipelineManager="PipelineManager" pipeline="myPipeline" stage="myStage" />
       <branch event="onPublish4" pipelineManager="PipelineManager-not-exist" />
     </branches>
   </component>

Servlet Commands

The following servlet commands are available via the scheduler (via http://server:port/scheduler?cmd=XXXX&param=value):

Command Description Parameters

add Adds a schedule to the scheduler.
  • event: the event the schedule should publish to
  • cron: the cron schedule
  • name: the name for the schedule (optional)
  • enabled: true if the schedule is enabled (optional - defaults to true)
  • singleton: true if only one job should run at a time (optional - defaults to false)
  • job: the data to be sent when the schedule fires (optional)
  • jobType: the format of the job parameter - xml/json (optional - defaults to xml)

delete Deletes a schedule from the scheduler.
  • extId: the external ID of the schedule to be deleted (optional, but this or schedId must be specified)
  • schedId: the ID of the schedule to be deleted (optional, but this or extId must be specified)

disable Disables the scheduler, or a schedule if specified. If no schedule is specified, the scheduler will be disabled.
  • extId: the external ID of the schedule to be disabled (optional)
  • schedId: the ID of the schedule to be disabled (optional)

enable Enables the scheduler, or a schedule if specified. If no schedule is specified, the scheduler will be enabled.
  • extId: the external ID of the schedule to be enabled (optional)
  • schedId: the ID of the schedule to be enabled (optional)

reload Reloads all the schedules from the database. Parameters: none.

start Sends a 'start' job for the given schedule.
  • extId: the source (external) ID of the schedule to be started (optional, but this or schedId must be specified)
  • schedId: the ID of the schedule to be started (optional, but this or extId must be specified)
  • properties: string containing properties to be sent in the actionProperties attribute of the job (see below)

stop Sends a 'stop' job for the given schedule.
  • extId: the source (external) ID of the schedule to be stopped (optional, but this or schedId must be specified)
  • schedId: the ID of the schedule to be stopped (optional, but this or extId must be specified)

pause Sends a 'pause' job for the given schedule.
  • extId: the source (external) ID of the schedule to be paused (optional, but this or schedId must be specified)
  • schedId: the ID of the schedule to be paused (optional, but this or extId must be specified)

resume Sends a 'resume' job for the given schedule.
  • extId: the source (external) ID of the schedule to be resumed (optional, but this or schedId must be specified)
  • schedId: the ID of the schedule to be resumed (optional, but this or extId must be specified)

Services interface

Other components can access the scheduler via a number of methods. These are made available via two interfaces: one to handle the schedules and one to handle the scheduler.

The component exposes the following interface to handle schedules:

AspireSchedule.java

The component exposes the following interface to handle the scheduler:

AspireScheduler.java