RDBMS Application Bundle

From wiki.searchtechnologies.com

For Information on Aspire 3.1 Click Here

RDBMS Application Bundle
AppBundle Name  RDBMS Connector
Maven Coordinates  com.searchtechnologies.appbundles:app-snapshot-rdbms-connector
Versions  1.0, 1.0.1-SNAPSHOT
Type Flags  scheduled
Inputs  AspireObject from a content source submitter holding all the information required for a crawl.
Outputs  An AspireObject containing the data extracted from the database to be processed.

The RDBMS Connector performs full and incremental scans of the rows in one or more tables of a database. Each row extracted from the database is tagged with one of three possible actions (add, update, or delete), so that an Aspire pipeline or job route can decide which task to perform for it. The exact nature of the extraction method is documented with the RDB

Once started, the connector can be stopped, paused, or resumed via the Scheduler Component. Typically the start job contains all the information the job needs to perform the scan. When pausing or stopping, the connector waits until all the jobs it published have completed before updating its statistics and status. Pausing a scan only works in incremental mode; resuming is much like a new incremental feed, except that the pre-update SQL is not run.


Aspire Connector

Application Configuration

Property Type Default Description
Default JDBC Url string The JDBC URL for your RDBMS server and database. For example, "jdbc:mysql://192.168.40.27/mydb" (MySQL). This will vary depending on the type of RDBMS.

Only used if the information is not passed in the source configuration.

Default Database User string The name of a database user with read-only access to all of the tables which need to be indexed, and write access to the necessary update tables (if update management is handled through the RDB).

Only used if the information is not passed in the source configuration.

Default Database Password string The database password.

Only used if the information is not passed in the source configuration.

Default JDBC Driver Jar string Path to the JDBC driver JAR file for your RDBMS. Typically this is placed in the "lib" directory inside your Aspire Home, for example "lib/myjdbcdriver.jar".

Only used if the information is not passed in the source configuration.

Default JDBC Driver Class (optional) string The name of the JDBC driver class. Set this only if the class name listed in the META-INF/services/java.sql.Driver file of the driver JAR file should not be used, or if that file does not exist in the driver JAR file.

Only used if the information is not passed in the source configuration.

Note on Application Configuration

Default values from the application configuration are used only if the source configuration gives no connection details or SQL at all. That is, for connection details and SQL statements, either all or none of the default values are used. For example, if no connection details are given in the source configuration, the default values from the application configuration are used. If (say) the source configuration gives the connection URL but no user name or password, the component will not fill in the user name and password from the application configuration. This prevents the mixing of default and non-default values.
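
The all-or-nothing rule described in this note can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the connector's actual code: the function name `resolve_connection` is hypothetical, the key names are borrowed from the `connectorSource` element in the sample output, and the default user name and password are made up.

```python
# Hypothetical sketch of the all-or-nothing default rule; not Aspire code.
CONNECTION_KEYS = ("dbUrl", "dbUser", "dbPassword", "dbDriverJar")


def resolve_connection(source, defaults):
    """Return the connection details to use for a crawl.

    Defaults apply only when the source configuration supplies none of
    the connection keys; otherwise the source values are used alone.
    """
    supplied = {k: source[k] for k in CONNECTION_KEYS if source.get(k)}
    if supplied:
        # At least one value came from the source: ignore every default.
        return supplied
    return {k: defaults[k] for k in CONNECTION_KEYS if defaults.get(k)}


# Made-up default values, for illustration only.
defaults = {
    "dbUrl": "jdbc:mysql://192.168.40.27/mydb",
    "dbUser": "default_user",
    "dbPassword": "default_pw",
}

# No connection details in the source: all defaults are used.
print(resolve_connection({}, defaults))
# Only the URL is given: user and password are NOT taken from the defaults.
print(resolve_connection({"dbUrl": "jdbc:mysql://192.168.40.27/wikidb"}, defaults))
```

In the second call the result contains only the URL, which mirrors the example in the note: a partially specified source configuration does not pick up the default user name or password.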

Source Configuration

Property Type Default Description
Full crawl SQL string This SQL is executed when the user clicks on "full crawl". Each record produced by this statement is indexed as a separate document. Some field names have special meaning (for example 'title', 'content', 'url', and 'aspire_id').
Pre incremental crawl SQL string SQL to run before an incremental crawl. This SQL can be used to mark documents for update, save timestamps, clear update tables, etc. as needed to prepare for an incremental crawl. Can be left blank if you never do an incremental crawl.
Incremental crawl SQL string SQL to run for an incremental crawl. This SQL should provide a list of all adds, updates, and deletes to the documents in the index. Some field names have special meaning (for example 'title', 'content', 'url', and 'aspire_id'); see the wiki for more information. Note that the special column 'aspire_action' should report 'I' (for inserts), 'U' (for updates, typically treated the same as inserts by most search engines), and 'D' (for deletes).
Post incremental crawl SQL string SQL to run after each record is processed. This SQL can be used to un-mark or delete each document from the tables after it is complete.
Post incremental crawl SQL (failures) string SQL to run after each record if processing fails. If this SQL is left blank, the 'Post incremental crawl SQL' will be run instead.
JDBC Url string The JDBC URL for your RDBMS server and database. For example, "jdbc:mysql://192.168.40.27/mydb" (MySQL). This will vary depending on the type of RDBMS.
User string The name of a database user with read-only access to all of the tables which need to be indexed, and write access to the necessary update tables (if update management is handled through the RDB).
Password string The database password.
JDBC Driver Jar string Path to the JDBC driver JAR file for your RDBMS. Typically this is placed in the "lib" directory inside your Aspire Home, for example "lib/myjdbcdriver.jar".
JDBC Driver Class string The name of the JDBC driver class. Set this only if the class name listed in the META-INF/services/java.sql.Driver file of the driver JAR file should not be used, or if that file does not exist in the driver JAR file.
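
To make the incremental crawl properties more concrete, the sketch below runs comparable statements against an in-memory SQLite database from Python. Everything specific here is an assumption for illustration: the `pages` and `page_updates` tables, their columns, the `?` placeholder in the post SQL, and the mapping of 'I', 'U', and 'D' onto add, update, and delete are all hypothetical. The real connector runs the configured SQL through JDBC against your own schema.

```python
import sqlite3

# Hypothetical schema: 'pages' holds the content, 'page_updates' records
# pending changes as I (insert), U (update) or D (delete).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (page_id INTEGER PRIMARY KEY, page_title TEXT);
    CREATE TABLE page_updates (seq INTEGER PRIMARY KEY, page_id INTEGER, action TEXT);
    INSERT INTO pages VALUES (438, 'Elastic_IP_Addresses'), (439, 'Another_Page');
    INSERT INTO page_updates (page_id, action) VALUES (438, 'U'), (439, 'I'), (440, 'D');
""")

# Stand-in for the "Incremental crawl SQL": one row per changed document,
# exposing the special columns aspire_id and aspire_action.
INCREMENTAL_SQL = """
    SELECT upd.page_id AS aspire_id,
           upd.action  AS aspire_action,
           replace(p.page_title, '_', ' ') AS title
    FROM page_updates upd
    LEFT JOIN pages p ON p.page_id = upd.page_id
    ORDER BY upd.seq
"""

# Stand-in for the "Post incremental crawl SQL", run after each record.
POST_SQL = "DELETE FROM page_updates WHERE page_id = ?"

# Assumed mapping from aspire_action codes to the connector's actions.
ACTIONS = {"I": "add", "U": "update", "D": "delete"}

rows = conn.execute(INCREMENTAL_SQL).fetchall()
results = []
for aspire_id, aspire_action, title in rows:
    results.append((aspire_id, ACTIONS[aspire_action], title))
    conn.execute(POST_SQL, (aspire_id,))  # un-mark the processed record

remaining = conn.execute("SELECT COUNT(*) FROM page_updates").fetchone()[0]
print(results)
print(remaining)
```

The deleted page (440) no longer has a row in `pages`, so the LEFT JOIN still reports it with an empty title; an inner join would silently drop deletes from the incremental result.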

Components

This application utilizes the following components:

What's new in 1.0.1?

  1. New configuration properties for incremental indexing. The update ID, SEQ and ACTION columns can now be configured on the connector source.
  2. New source configuration layout. The incremental indexing configuration properties are grouped together, and the connection details are now at the top.

Output

   <doc action="insert">
      <internalId>0--1898410951</internalId>
      <page_is_redirect source="RDBScanner">0</page_is_redirect>
      <url source="RDBScanner">https://wiki.searchtechnologies.com/mediawiki/index.php/Elastic_IP_Addresses</url>
      <crawlId>8</crawlId>
      <id source="RDBScanner">438</id>
      <content source="RDBScanner">
Wiki Home / Amazon EC2 Cloud / Elastic IP Addresses

If you want to create a real host name for your web site, such as http://myapp.searchtechnolies.com, then you will need to create a static (i.e. fixed) IP address for your instance.

Once this is done, a host name can be mapped to your static IP address, and then people will be able to access it from the outside using a server name, instead of a list of numbers.

STEP 1: Create a new Elastic IP

Creating a New IP

Static IP addresses in the Amazon cloud are called "Elastic IPs", because they can be quickly mapped to any instance as necessary.

This is ridiculously easy in Amazon:

  • Click on "Elastic IPs" in the navigator on the left
  • Click on "Allocate New Address".

And that's it!

STEP 2: Associate your IP Address with an EC2 Instance

Associate IP Address

The next step is to associate your IP address with an EC2 instance:

  • Select the IP address you want to associate
  • Click on the "Associate Address" button

WARNING WARNING: YOUR INSTANCE SERVERNAME HAS NOW CHANGED

When you associate a new IP address with your instance, the server name for your instance will have changed.

This means that any putty or WinSCP configurations you have to the old instance will need to be changed to reflect the new server name and/or IP address.

For example, when you first created your instance you will have a servername which is something like this:

ec2-184-73-126-151.compute-1.amazonaws.com

Then you create a new static IP address for your instance. Suppose the static IP address is this:

23.21.150.25

Once you've done this, your server name will change. Go back to your instances, click on your instance. You should see the new static IP address at the top. The new DNS server name will be:

ec2-23-21-150-25.compute-1.amazonaws.com

Which means you should be able to access your new web server with http://ec2-23-21-150-25.compute-1.amazonaws.com/...

And, of course, because it is a static IP address, you will also be able to access your new instance using just the IP address: http://23.21.150.25/...

STEP 3: Contact Search Technologies IT for a Host Name

The next step is to notify the Search Technologies IT department with a JIRA ticket to create a host name for your new static IP address.
      </content>
      <title source="RDBScanner">Elastic IP Addresses</title>
      <feederLabel source="RDBScanner">RDBScanner</feederLabel>
      <action source="RDBScanner">add</action>
      <feederType source="RDBScanner">RDBFeeder</feederType>
      <aspire_id source="RDBScanner">438</aspire_id>
      <page_namespace source="RDBScanner">0</page_namespace>
      <connectorSource>
         <dbDriverJar>lib/mysql-connector-java-5.1.18-bin.jar</dbDriverJar>
         <dbPassword>encrypted:C0FA502DA95D855623E02465F95F94D8</dbPassword>
         <dbUrl>jdbc:mysql://192.168.40.27/wikidb</dbUrl>
         <dbUser>aspire_crawl</dbUser>
         <fullSelectSQL>SELECT P.page_id as id,
                          P.page_id as aspire_id,
                          replace(P.page_title,"_"," ") as title,
                          concat('https://wiki.searchtechnologies.com/mediawiki/index.php/',P.page_title) as url,
                          P.page_namespace,
                          P.page_is_redirect,
                          T.old_id,
                          T.old_text as content
                          FROM mw_page P, mw_revision R, mw_text T
                          WHERE R.rev_id = page_latest AND T.old_id = R.rev_text_id</fullSelectSQL>
         <displayName>TestRDB</displayName>
      </connectorSource>
      <old_id source="RDBScanner">3843</old_id>
   </doc>
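
As a rough illustration of how a downstream stage might use this output, the Python sketch below reads the action and a few fields from an abbreviated copy of the sample document. It is not part of Aspire: the `ROUTES` table and its handler names are hypothetical, and most fields of the sample are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample output; most fields are omitted.
SAMPLE = """
<doc action="insert">
  <url source="RDBScanner">https://wiki.searchtechnologies.com/mediawiki/index.php/Elastic_IP_Addresses</url>
  <id source="RDBScanner">438</id>
  <title source="RDBScanner">Elastic IP Addresses</title>
  <action source="RDBScanner">add</action>
  <aspire_id source="RDBScanner">438</aspire_id>
</doc>
"""

doc = ET.fromstring(SAMPLE)

# Collect only the fields that carry the source="RDBScanner" attribute.
fields = {child.tag: child.text for child in doc
          if child.get("source") == "RDBScanner"}

# Hypothetical routing table: the handler names are illustrative only.
ROUTES = {"add": "index", "update": "index", "delete": "remove"}
route = ROUTES[fields["action"]]

print(fields["aspire_id"], fields["title"], route)
```

Keying the route on the `action` field is one way to let adds and updates share a single indexing path while deletes go elsewhere.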