From wiki.searchtechnologies.com
Latest revision as of 00:11, 9 December 2015


FTP Connector App-bundle (Aspire 2)
AppBundle Name  FTP Connector
Maven Coordinates  com.searchtechnologies.aspire:app-ftp-connector
Versions  2.2.2
Type Flags  scheduled
Inputs  AspireObject from a content source submitter holding all the information required for a crawl.
Outputs  An AspireObject containing the URL, content and metadata processed for each file. No ACLs are included, because the connector does not support document-level security.

The FTP Connector performs full and incremental scans over content stored on an FTP server and extracts metadata and content from each file scanned. You can configure the name and port of the FTP server, the user to connect as, the directory to crawl and whether the scan is recursive. Each scanned file is tagged with one of three possible actions (add, update or delete) and can be routed to any Aspire pipeline.

Once started, the connector can be stopped, paused or resumed by sending a new Scanner Configuration Job. Typically, the start job contains all the information the connector requires to perform the scan. When pausing or stopping, the connector waits until all the jobs it published have completed before updating its statistics and status.

NOTE: The FTP connector does not support document-level security, so published jobs contain no access control lists (ACLs) for the scanned content.


Configuration

This section lists all configuration parameters available to install the FTP Connector Application Bundle and to execute crawls using the connector.

General Application Configuration

Property | Type | Default | Description
snapshotDir | string | ${aspire.home}/snapshots | The directory in which snapshot files are stored.
disableTextExtract | boolean | false | By default, connectors use Apache Tika to extract text from downloaded documents. To apply special text processing to a downloaded document in the workflow, disable text extraction; the downloaded document is then available as a content stream.
workflowReloadPeriod | int | 15m | The period after which the business rules are reloaded. The value is in milliseconds unless suffixed with ms, s, m, h or d to indicate the units.
workflowErrorTolerant | boolean | false | When set, an exception in a workflow rule affects only the execution of the rule in which it occurs: subsequent rules are executed and the job completes the workflow successfully. When not set, exceptions in workflow rules are re-thrown and the job is moved to the error workflow.
debug | boolean | false | Controls whether debugging is enabled for the application. Debug messages are written to the log files.


FTP Connector Specific Configuration

There are no configuration options specific to the FTP connector.

Configuration Example

To install the application bundle, add the following configuration to the <autoStart> section of the Aspire settings.xml file.

<application config="com.searchtechnologies.aspire:app-ftp-connector">
  <properties>
    <property name="generalConfiguration">true</property>
    <property name="snapshotDir">${dist.data.dir}/${app.name}/snapshots</property>
    <property name="workflowReloadPeriod">15s</property>
    <property name="batchSize">50</property>
    <property name="batchTimeout">60000</property>
    <property name="waitForSubJobs">600000</property>
    <property name="maxThreads">10</property>
    <property name="jobQueue">30</property>
    <property name="extractTimeout">180000</property>
    <property name="extractTextMaxSize">unlimited</property>
    <property name="disableTextExtract">false</property>
    <property name="workflowErrorTolerant">false</property>
    <property name="emitStartJob">true</property>
    <property name="emitEndJob">true</property>
    <property name="enableAuditing">true</property>
    <property name="debug">false</property>
    <property name="non-text-document">true</property>
    <property name="nonTextDocumentsExtensions">jpg,gif,mp3,mp4,mpg,avi,wav,bmp,swf</property>
    <property name="enableFetchUrl">true</property>
    <property name="fdService">false</property>
    <property name="fdServiceUrl"/>
  </properties>
</application>

Note: Any optional property can be removed from the configuration to use the default value described in the table above.
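
As a rough sketch of the note above, a trimmed-down configuration might keep only the properties whose values differ from the documented defaults. It assumes that every property omitted here is optional, which this page does not state explicitly; the two values shown are copied from the full example.

```xml
<application config="com.searchtechnologies.aspire:app-ftp-connector">
  <properties>
    <!-- Overrides the documented default of ${aspire.home}/snapshots -->
    <property name="snapshotDir">${dist.data.dir}/${app.name}/snapshots</property>
    <!-- Overrides the documented default of 15m; the "s" suffix selects seconds -->
    <property name="workflowReloadPeriod">15s</property>
    <!-- disableTextExtract, workflowErrorTolerant and debug are omitted here,
         so their documented default (false) would apply -->
  </properties>
</application>
```

Properties that appear in the full example but not in the table (batchSize, maxThreads and so on) have no default documented on this page, so check their behaviour before dropping them in a real deployment.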

Source Configuration

Scanner Control Configuration

The following table lists the attributes that the AspireObject of the incoming scanner job requires in order to execute and control a scan.

Element | Type | Options | Description
@action | string | start, stop, pause, resume, abort | Control command telling the scanner which operation to perform. Use the start option to launch a new crawl.
@actionProperties | string | full, incremental | When a start @action is received, tells the scanner to run either a full or an incremental crawl.
@normalizedCSName | string | | Unique identifier name for the content source to be crawled.
displayName | string | | Display or friendly name for the content source to be crawled.

Header Example

  <doc action="start" actionProperties="full" actionType="manual" crawlId="0" dbId="0" jobNumber="0" normalizedCSName="FeedOne_Connector"
   scheduleId="0" scheduler="##AspireSystemScheduler##" sourceName="ContentSourceName">
    ...
    <displayName>testSource</displayName>
    ...
  </doc>
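
A control job for a crawl that is already running could look like the sketch below. It assumes that a pause request needs only the @action attribute and the @normalizedCSName of the source being crawled; other attributes from the header example may also be required in practice. The source name FTP_Connector is borrowed from the scanner configuration example on this page.

```xml
<!-- Hypothetical control job: asks the scanner to pause the crawl for this source -->
<doc action="pause" normalizedCSName="FTP_Connector">
  <displayName>FTP_Connector</displayName>
</doc>
```

Replacing pause with stop, resume or abort would select the other control commands listed in the table above.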

All configuration properties described in this section are relative to /doc/connectorSource of the AspireObject of the incoming Job.

Property | Type | Default | Description
server | string | | The name of the FTP server.
port | int | | The port on which the FTP server is running.
url | string | | The directory on the FTP server to crawl.
username | string | | The username to connect with.
password | string | | The password for the username.
passive | boolean | false | true to connect to the FTP server using passive mode.
indexContainers | boolean | false | true if folders (as well as files) should be indexed.
scanRecursively | boolean | false | true if subfolders of the given URL should be scanned.
fileNamePatterns/include/@pattern | regex | none | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is included by the scanner. Multiple include nodes can be added.
fileNamePatterns/exclude/@pattern | regex | none | Optional. A regular expression pattern to evaluate file URLs against; if the file name matches the pattern, the file is excluded by the scanner. Multiple exclude nodes can be added.

Scanner Configuration Example

  <doc action="start" actionProperties="full" actionType="manual" normalizedCSName="FTP_Connector" sourceName="FTP_Connector">
    <connectorSource>
      <server>ftp.searchtechnologies.com</server>
      <port>21</port>
      <url>/test</url>
      <username>sd-ftp-user</username>
      <password>encrypted:562E81591F85B858E5A5D3876F9C9FDB</password>
      <passive>true</passive>
      <indexContainers>false</indexContainers>
      <scanRecursively>true</scanRecursively>
      <scanExcludedItems>false</scanExcludedItems>
      <fileNamePatterns/>
    </connectorSource>
    <displayName>FTP_Connector</displayName>
  </doc>
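
The example above leaves <fileNamePatterns/> empty. As a sketch of how include and exclude nodes might be combined, the variant below requests an incremental crawl restricted to text files. The regular expressions are purely illustrative. The sketch also assumes that each pattern is tested against the full file URL and that exclusions take effect alongside inclusions; neither is confirmed on this page.

```xml
<doc action="start" actionProperties="incremental" actionType="manual" normalizedCSName="FTP_Connector" sourceName="FTP_Connector">
  <connectorSource>
    <server>ftp.searchtechnologies.com</server>
    <port>21</port>
    <url>/test</url>
    <username>sd-ftp-user</username>
    <password>encrypted:562E81591F85B858E5A5D3876F9C9FDB</password>
    <passive>true</passive>
    <scanRecursively>true</scanRecursively>
    <fileNamePatterns>
      <!-- Illustrative pattern: include files whose URL ends in .txt -->
      <include pattern=".*\.txt$"/>
      <!-- Illustrative pattern: exclude anything under a folder named tmp -->
      <exclude pattern=".*/tmp/.*"/>
    </fileNamePatterns>
  </connectorSource>
  <displayName>FTP_Connector</displayName>
</doc>
```

Additional include or exclude nodes can be listed side by side inside <fileNamePatterns>, one pattern per node.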

Example Output

<doc>
  <url>/test/11/00/0/1.txt</url>
  <id>/test/11/00/0/1.txt</id>
  <fetchUrl>/test/11/00/0/1.txt</fetchUrl>
  <displayUrl>/test/11/00/0/1.txt</displayUrl>
  <snapshotUrl>005 /test/11/00/0/1.txt</snapshotUrl>
  <doFetch>true</doFetch>
  <doPopulate>true</doPopulate>
  <docType>item</docType>
  <lastModified>Thu Jun 25 11:42:00 BST 2015</lastModified>
  <dataSize>4264</dataSize>
  <repItemType>aspire/file</repItemType>
  <sourceName>FTP Connector</sourceName>
  <sourceType>ftp</sourceType>
  <connectorSource type="ftp">
    <server>ftp.searchtechnologies.com</server>
    <port>21</port>
    <url>/test</url>
    <username>sd-ftp-user</username>
    <password>encrypted:562E81591F85B858E5A5D3876F9C9FDB</password>
    <passive>true</passive>
    <indexContainers>false</indexContainers>
    <scanRecursively>true</scanRecursively>
    <scanExcludedItems>false</scanExcludedItems>
    <fileNamePatterns/>
    <displayName>FTP Connector</displayName>
  </connectorSource>
  <action>add</action>
  <hierarchy>
    <item id="E392FAB00D12B2340E5BE938C982ABBA" level="5" name="1.txt" url="/test/11/00/0/1.txt">
      <ancestors>
        <ancestor id="0891335A4083FCD65DD995A58E23EF39" level="4" name="0" parent="true" type="aspire/folder" url="/test/11/00/0"/>
        <ancestor id="46FFA5AD9FAD0068CE164E1B5D7917E1" level="3" name="00" type="aspire/folder" url="/test/11/00"/>
        <ancestor id="1C53A23263BBF5898E320F633910B6F6" level="2" name="11" type="aspire/folder" url="/test/11"/>
        <ancestor id="4539330648B80F94EF3BF911F6D77AC9" level="1" name="FTP Connector" type="aspire/folder" url="/test"/>
      </ancestors>
    </item>
  </hierarchy>
  <contentType source="ExtractTextStage/Content-Type">text/plain; charset=windows-1252</contentType>
  <extension source="ExtractTextStage">
    <field name="Content-Encoding">windows-1252</field>
    <field name="X-Parsed-By">org.apache.tika.parser.DefaultParser</field>
    <field name="resourceName">/test/11/00/0/1.txt</field>
  </extension>
  <content source="ExtractTextStage"><![CDATA[A brutal blast of arctic air has settled over eastern North America, bringing dangerously low temperatures not seen in decades.

About half of the US population has been placed under a wind chill warning or cold weather advisory.

In Toronto, the temperature dropped to -24C (-11F) before dawn on Tuesday.

Air, rail and road travel remain snarled by high, freezing wind, and residents have been warned to stay indoors to avoid frostbite.

Cold air broke records in Chicago on Monday, where the temperature of -16F (-27C) was the lowest ever seen on that date.

It was one of more than 120 daily temperature records broken in cities across the US since the beginning of 2014, many dating back decades.
Sharp temperature drop

Chicagoans explain how they cope with the extreme weather

The arrival late on Monday of the arctic weather pattern caused temperatures to plummet overnight in New York and Washington DC by as much as 45 degrees in a matter of hours, from unseasonably warm highs a day earlier.

New York Governor Andrew Cuomo closed parts of major highways across his state in preparation for the extreme weather.

Adding to the misery, forecasters say the areas on the eastern shores of the Great Lakes could again be blanketed by snow, as the cold air moved over the water.

In Canada, 4,000 residents of Quebec and 1,000 in Newfoundland were still without power on Tuesday amid the freezing temperatures and snow.

The polar blast was threatening crops and livestock across the American farm belt, even in the usually temperate Deep South. The freeze was expected to reach as far south as Texas and central Florida, the National Weather Service said.

Meteorologists said some 187 million people in all would feel the effects of the cold by Tuesday.
Transport trouble

The frigid temperatures have been widely blamed on a shift in the weather pattern known as the "polar vortex".

What can you wear to help cope with extreme cold weather?

On Tuesday, the extreme weather caused the cancellation of 2,500 flights, along with widespread road and rail delays.

JetBlue Airways operations, which had been suspended at airports in Boston and around New York City, were returning to normal.

More than 500 passengers on their way to Chicago were stuck overnight in northern Illinois on three Amtrak passenger trains after drifting snow and ice covered the tracks.

And in Indianapolis, Indiana, it has temporarily been made illegal to drive except in an emergency or to seek shelter, in order to keep the roads free for emergency vehicles.

Cold temperatures reached deep into the US south-east.

The weather has been blamed for at least 16 deaths in recent days, including:

    A one-year-old boy in Missouri who was killed in a car collision with a snowplough
    A worker at a Philadelphia salt storage facility who died when a 100-ft (30-m) pile of road salt collapsed on him
    Four men across Illinois who suffered fatal heart attacks while shovelling snow

Frostbite graphic

The state of Minnesota and the city of Chicago, Illinois, have ordered all schools closed.

It was so cold that even the polar bear at Chicago's Lincoln Park Zoo was kept indoors, CNN reports.

In Kentucky, an inmate who escaped a minimum security prison turned himself in to get out of the cold, the Associated Press reported.

Some relief was in sight in the Midwest, as the cold air pattern moved eastward, the National Weather Service said.
A pedestrian walks past a mural depicting a winter scene in Montreal, Quebec, on 7 January 2014 A pedestrian walks past a mural depicting a winter scene in Montreal, Quebec
A man warms himself near a fire in Indianapolis, Indiana, on 7 January 2014 A man warms himself before a fire in Indianapolis, Indiana
Passengers wait for a train in below-zero temperatures in Chicago, Illinois, on 7 January 2014 Passengers wait for a train in below-zero temperatures in Chicago, Illinois
A man walks past a snow encrusted bicycle in Chicago on 7 January 2014 A frozen bicycle in downtown Chicago on Tuesday
A salesmen at a car dealer digs out cars covered in snow in Indianapolis, Indiana, on 7 January 2014 A salesmen digs out cars at a dealership in Indianapolis, Indiana 
]]></content>
</doc>