Scheduler 0.4

From wiki.searchtechnologies.com

Aspire Scheduler 0.4
Description: A Quartz-based scheduler that publishes jobs onto pipelines at periodic intervals.
Inputs: N/A
Outputs: An AspireDocument object published to the configured pipeline manager.
Factory: aspire-scheduler
Sub Type: default
Object Type: Produces AspireDocument objects.


Configuration

The scheduler recognises the following tags inside its <config> tag.

| Element | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Whether the scheduler is enabled. If false, no jobs will be submitted for any configured schedule. |
| schedules | | | Container for one or more schedules on which jobs will be fired. |
| schedules/schedule | | | A schedule on which jobs will be fired. |
| schedules/schedule/@name | String | | The (optional) name for the schedule. |
| schedules/schedule/@enabled | boolean | true | Whether this specific schedule is enabled. If false, no jobs will be submitted for this schedule. |
| schedules/schedule/@singleton | boolean | true | Specifies that this schedule may only fire one job at a time. If true and the scheduled time is reached again, a new job will only be published if the previous job has completed. |
| schedules/schedule/cron | String | Mandatory | Specifies the schedule in cron style (see "Cron style execution schedule" below for the format). This must be specified. |
| schedules/schedule/job | String | | Specifies the job data that will be published when the scheduled time is reached. The data is specified in XML and will have the scheduler information added as attributes to the root node. If not specified, an empty document will be published. NOTE: this configuration item is a String, so XML text should be wrapped in a <![CDATA[ ]]> section. |
| schedules/schedule/event | String | Mandatory | Specifies the event to publish the job to. Must match one of the events configured in the branch handler <branches> configuration. |
| quartz | N/A | | Container for the properties to be passed to the Quartz Scheduler. |
| quartz/property | String | | The value of the property to be passed to the Quartz Scheduler. |
| quartz/property/@name | String | | The name of the property to be passed to the Quartz Scheduler. |
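
As a minimal sketch of how the elements listed above fit together, the fragment below configures one singleton schedule and passes a single property through to Quartz. The schedule name nightlyCrawl and the thread count of 5 are illustrative assumptions. org.quartz.threadPool.threadCount is a standard Quartz property, but whether you need to set it depends on your installation.

```xml
<config>
  <enabled>true</enabled>
  <schedules>
    <!-- singleton="true": a new job is only published once the previous job has completed -->
    <schedule name="nightlyCrawl" enabled="true" singleton="true">
      <!-- Quartz cron: second 0, minute 0, hour 2, every day -->
      <cron>0 0 2 * * ?</cron>
      <!-- Must match an event in the <branches> configuration -->
      <event>onPublish</event>
    </schedule>
  </schedules>
  <quartz>
    <!-- Property name goes in @name, property value in the element text (value is illustrative) -->
    <property name="org.quartz.threadPool.threadCount">5</property>
  </quartz>
  <branches>
    <branch event="onPublish" pipelineManager="PipelineManager" />
  </branches>
</config>
```

No <job> element is given here, so this schedule would publish an empty document each time it fires.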


Branch Configuration

The Aspire Scheduler publishes jobs using the branch manager. Thus it requires the standard Branch Handler configuration detailed below:

| Element | Type | Description |
|---|---|---|
| branches/branch/@event | String | The event to configure. At the very least, you should include the onPublish event. |
| branches/branch/@pipelineManager | String | The URL of the pipeline manager to publish to. Can be relative. |
| branches/branch/@pipeline | String | The name of the pipeline to publish to. |
| branches/branch/@stage | String | The name of the stage to publish to. |

Example Configuration

   <component name="myScheduler" subType="default" factoryName="aspire-scheduler">
     <config>
       <schedules>
         <schedule name="myFirstSchedule" enabled="false">
           <cron>1/10 * * * * ?</cron>
           <event>onPublish</event>
           <job>
             <![CDATA[
             <doc>
               <fetchUrl>support.searchtechnologies.com</fetchUrl>
             </doc>
             ]]>
           </job>
         </schedule>
         <schedule enabled="false">
           <cron>2/10 * * * * ?</cron>
           <event>onPublish2</event>
         </schedule>
         <schedule enabled="false">
           <cron>3/10 * * * * ?</cron>
           <event>onPublish3</event>
           <job>
             <![CDATA[
               <doc>
                 <fetchUrl>www.searchtechnologies.com</fetchUrl>
               </doc>
             ]]>
           </job>
         </schedule>
         <schedule enabled="false">
           <cron>4/10 * * * * ?</cron>
           <event>onPublish4</event>
           <job>
             <![CDATA[
               <doc>
                 <fetchUrl>repositories.searchtechnologies.com</fetchUrl>
               </doc>
             ]]>
           </job>
         </schedule>
       </schedules>
       <branches>
         <branch event="onPublish" pipelineManager="PipelineManager" />
         <branch event="onPublish2" pipelineManager="PipelineManager" pipeline="myPipeline" />
         <branch event="onPublish3" pipelineManager="PipelineManager" pipeline="myPipeline" stage="myStage" />
         <branch event="onPublish4" pipelineManager="PipelineManager-not-exist" />
       </branches>
     </config>
   </component>

Servlet Commands

The following servlet commands are available via the scheduler (via http://server:port/scheduler?cmd=XXXX&param=value):

add: Adds a schedule to the scheduler. Parameters:

  • event: the event the schedule should publish to
  • cron: the cron schedule
  • name: the name for the schedule (optional)
  • enabled: true if the schedule is enabled (optional - defaults to true)
  • singleton: true if only one job should run at a time (optional - defaults to false)
  • job: the data to be sent when the schedule fires (optional)

delete: Deletes a schedule from the scheduler. Parameters:

  • extId: the external ID of the schedule to be deleted (optional, but this or schedId must be specified)
  • schedId: the ID of the schedule to be deleted (optional, but this or extId must be specified)

disable: Disables the scheduler, or a schedule if specified. Parameters:

  • extId: the external ID of the schedule to be disabled (optional)
  • schedId: the ID of the schedule to be disabled (optional)

If no schedule is specified, the scheduler will be disabled.

enable: Enables the scheduler, or a schedule if specified. Parameters:

  • extId: the external ID of the schedule to be enabled (optional)
  • schedId: the ID of the schedule to be enabled (optional)

If no schedule is specified, the scheduler will be enabled.

The Aspire Scheduler

General operation

The Aspire Scheduler uses the Quartz scheduler from Terracotta (see http://www.quartz-scheduler.org/ for details) to provide the backbone for scheduling jobs. The Aspire Scheduler provides a wrapper around the Quartz scheduler and contains all the Quartz classes required for operation. Thus (apart from the services and framework), the Aspire Scheduler has no dependencies.

Upon startup, the Aspire Scheduler reads its configuration from the system.xml file and sets up a schedule within Quartz for each of the configured schedules. Each configured schedule is given an ID, known as the scheduleId.

The configuration of the schedules will include a “cron” style definition of the job execution time. The format of this definition is described later.

The Quartz scheduler is then started. If the scheduler and the schedule are both enabled, Quartz runs a Java method when the scheduled time is reached. This method checks that the schedule does not already have a job outstanding and, if it does not, publishes a job onto the configured Aspire pipeline. Each job published by the Aspire Scheduler is given a unique job number, known as the jobNumber.

Note that the data in the Aspire job is configurable.

Cron style execution schedule

The configuration stored in the system.xml file defines the execution time using a “cron” style format. The format used by Quartz differs from some “cron” implementations and is described below.

Cron expressions provide the ability to specify complex time combinations such as "At 8:00am every Monday through Friday" or "At 1:30am every last Friday of the month".

Cron expressions are comprised of 6 required fields and one optional field separated by white space. The fields respectively are described as follows:


| Field Name | Allowed Values | Allowed Special Characters |
|---|---|---|
| Seconds | 0-59 | , - * / |
| Minutes | 0-59 | , - * / |
| Hours | 0-23 | , - * / |
| Day-of-month | 1-31 | , - * ? / L W |
| Month | 1-12 or JAN-DEC | , - * / |
| Day-of-Week | 1-7 or SUN-SAT | , - * ? / L # |
| Year (Optional) | empty, 1970-2199 | , - * / |
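
To make the field order concrete, here is a sketch of two schedules, one using the six required fields and one adding the optional year field. The schedule names and the onPublish event are illustrative assumptions; the first expression corresponds to the "8:00am every Monday through Friday" case mentioned earlier.

```xml
<schedules>
  <!-- Field order: seconds minutes hours day-of-month month day-of-week [year] -->
  <schedule name="weekdayMorning">
    <!-- 08:00:00 on Monday through Friday; '?' leaves day-of-month unspecified -->
    <cron>0 0 8 ? * MON-FRI</cron>
    <event>onPublish</event>
  </schedule>
  <schedule name="newYear2030">
    <!-- Seventh (optional) field restricts the year: 00:00:00 on 1 January 2030 -->
    <cron>0 0 0 1 1 ? 2030</cron>
    <event>onPublish</event>
  </schedule>
</schedules>
```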


The special characters are described below:

| Character | Allowed fields | Description |
|---|---|---|
| * | all | Used to specify all values. For example, "*" in the minute field means "every minute". |
| ? | day-of-month, day-of-week | Used to specify 'no specific value'. This is useful when you need to specify something in one of the two fields, but not the other. |
| - | all | Used to specify ranges. For example "10-12" in the hour field means "the hours 10, 11 and 12". |
| , | all | Used to specify additional values. For example "MON,WED,FRI" in the day-of-week field means "the days Monday, Wednesday, and Friday". |
| / | all | Used to specify increments. For example "0/15" in the seconds field means "the seconds 0, 15, 30, and 45", and "5/15" in the seconds field means "the seconds 5, 20, 35, and 50". Specifying '*' before the '/' is equivalent to specifying 0 as the value to start with. Essentially, for each field in the expression, there is a set of numbers that can be turned on or off. For seconds and minutes, the numbers range from 0 to 59; for hours, 0 to 23; for days of the month, 1 to 31; and for months, 1 to 12. The "/" character simply helps you turn on every "nth" value in the given set. Thus "7/6" in the month field only turns on month "7"; it does NOT mean every 6th month. |
| L | day-of-month, day-of-week | Short-hand for "last", but it has a different meaning in each of the two fields. For example, the value "L" in the day-of-month field means "the last day of the month": day 31 for January, day 28 for February in non-leap years. If used in the day-of-week field by itself, it simply means "7" or "SAT". But if used in the day-of-week field after another value, it means "the last xxx day of the month"; for example "6L" means "the last Friday of the month". You can also specify an offset from the last day of the month, such as "L-3", which would mean the third-to-last day of the calendar month. When using the 'L' option, it is important not to specify lists or ranges of values, as you'll get confusing/unexpected results. |
| W | day-of-month | Used to specify the weekday (Monday-Friday) nearest the given day. As an example, if you were to specify "15W" as the value for the day-of-month field, the meaning is: "the nearest weekday to the 15th of the month". So if the 15th is a Saturday, the trigger will fire on Friday the 14th. If the 15th is a Sunday, the trigger will fire on Monday the 16th. If the 15th is a Tuesday, then it will fire on Tuesday the 15th. However, if you specify "1W" as the value for day-of-month, and the 1st is a Saturday, the trigger will fire on Monday the 3rd, as it will not 'jump' over the boundary of a month's days. The 'W' character can only be specified when the day-of-month is a single day, not a range or list of days. The 'L' and 'W' characters can also be combined for the day-of-month expression to yield 'LW', which translates to "last weekday of the month". |
| # | day-of-week | Used to specify "the nth" XXX day of the month. For example, the value of "6#3" in the day-of-week field means the third Friday of the month (day 6 = Friday and "#3" = the 3rd one in the month). Other examples: "2#1" = the first Monday of the month and "4#5" = the fifth Wednesday of the month. Note that if you specify "#5" and there are not five of the given day-of-week in the month, then no firing will occur that month. If the '#' character is used, there can only be one expression in the day-of-week field ("3#1,6#3" is not valid, since there are two expressions). |

The legal characters and the names of months and days of the week are not case sensitive.
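
The special characters are easier to read in context. The sketch below shows one schedule per character, following the descriptions above. The schedule names and the onPublish event are illustrative assumptions, and the comments give the intended firing times rather than verified output from a running scheduler.

```xml
<schedules>
  <schedule name="everyQuarterHour">
    <!-- '/': minutes 0, 15, 30 and 45 of every hour -->
    <cron>0 0/15 * * * ?</cron>
    <event>onPublish</event>
  </schedule>
  <schedule name="nearestWeekdayTo15th">
    <!-- 'W': 09:00 on the weekday nearest the 15th of each month -->
    <cron>0 0 9 15W * ?</cron>
    <event>onPublish</event>
  </schedule>
  <schedule name="lastWeekdayOfMonth">
    <!-- 'LW': 18:00 on the last weekday of each month -->
    <cron>0 0 18 LW * ?</cron>
    <event>onPublish</event>
  </schedule>
  <schedule name="lastFriday">
    <!-- 'L' after a day-of-week value: 01:30 on the last Friday of each month -->
    <cron>0 30 1 ? * 6L</cron>
    <event>onPublish</event>
  </schedule>
  <schedule name="thirdFriday">
    <!-- '#': 12:00 on the third Friday of each month -->
    <cron>0 0 12 ? * 6#3</cron>
    <event>onPublish</event>
  </schedule>
</schedules>
```

Each expression uses '?' in whichever of the two day fields is not being constrained.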


NOTES:

  • Support for specifying both a day-of-week and a day-of-month value is not complete (you'll need to use the '?' character in one of these fields).
  • Overflowing ranges are supported - that is, having a larger number on the left hand side than the right. You might do 22-2 to catch 10 o'clock at night until 2 o'clock in the morning, or you might have NOV-FEB. It is very important to note that overuse of overflowing ranges creates ranges that don't make sense, and no effort has been made to determine which interpretation CronExpression chooses. An example would be "0 0 14-6 ? * FRI-MON".
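
As a sketch of the two notes above, the schedule below uses a single overflowing range in the hours field and '?' in the day-of-week field. The schedule name and event are illustrative assumptions.

```xml
<schedule name="overnightHourly">
  <!-- Hours 22, 23, 0, 1 and 2, on the hour, every day -->
  <!-- Day-of-month is '*', so day-of-week must be '?' -->
  <cron>0 0 22-2 * * ?</cron>
  <event>onPublish</event>
</schedule>
```

Keeping the overflow to one field avoids the ambiguous combinations the note warns about.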

Published Jobs

The basic configuration taken from the system.xml file allows the user to optionally specify in XML the data for the job that will be published to the pipeline. If specified, then the published job will be as configured, but the path to the scheduler, the sourceName, scheduleId and jobNumber will be added as attributes to the root tag (normally <doc>). The source id, action (start/stop/pause/resume), event type (scheduled/manual) and properties are also added.
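
For illustration, suppose a schedule is configured with the job data <doc><fetchUrl>support.searchtechnologies.com</fetchUrl></doc>. A sketch of the published job might then look like the following. The attribute names come from the description above, but every attribute value shown is an assumption made for the example, and the exact set of attributes present depends on how the schedule was created and fired.

```xml
<!-- Configured job data with scheduler information added to the root tag (values are illustrative) -->
<doc scheduler="/path/schedulerName" scheduleId="1" jobNumber="1" sourceName="myJob" action="start" actionType="scheduled">
  <fetchUrl>support.searchtechnologies.com</fetchUrl>
</doc>
```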

If the job data is not specified, then an empty document is published onto the configured pipeline:

 <doc scheduler="/path/schedulerName" scheduleId="1" jobNumber="1" sourceName="myJob" sourceId="XXXX" actionProperties="full" actionType="manual" crawlId="123" action="start"/>

NOTE:

  • sourceId is only available when the schedule has come from an RDB.
  • crawlId is only available when the schedule has come from an RDB and the rdb/sql/getCrawlId SQL is configured.

This method may be used to trigger sub-job processing where the contents of the job are irrelevant, but something is needed to start processing at a scheduled time.

Once jobs are published they will run to completion. Jobs that fail are logged to the scheduler log file. Jobs do not time out, so a job could run indefinitely.

User interface

The user interface allows the administrator to view and update the schedules via the normal Aspire web interface.

On browsing to the Aspire Scheduler's status page, the administrator is able to see the current schedules and their status. This includes the schedule, event, whether the schedule is currently enabled, its last and next execution time and whether it is currently running (i.e. has submitted a job which has not yet completed). Clicking on a schedule provides further information about it, such as the job data, pipeline and last error response.

From the status page, the administrator is able to enable or disable individual schedules and enable or disable the scheduler.

The administrator is also able to add a new schedule, specifying the schedule, event, and optionally whether the schedule is enabled and is a singleton.

The administrator may also manually "fire" events, causing jobs to be published onto the Aspire pipeline. The administrator may send "start", "stop", "pause" and "resume" jobs. These jobs will specify the action in the action attribute and show "manual" as the actionType attribute.

Services interface

Other components will be able to access the scheduler via a number of methods. These are made available via two interfaces – one to handle the schedules and one to handle the scheduler.

The component exposes the following interface to handle the schedules:

AspireSchedule.java

The component will expose the following interface to handle the scheduler:

AspireScheduler.java