Lucene Services 0.4




Aspire Lucene - Lucene libraries component
Description: Provides the Lucene classes to other bundles, along with convenience methods for some commonly used Lucene functionality.

This component exists as a holder for the Lucene libraries and exports the Lucene classes for use in other components.

It also provides convenience methods for indexing and searching in an index controlled by the component, although configuration of this index is optional. These services are disabled if the index is not configured.

Inputs: Method calls
Outputs: Lucene index (optional)
Factory: aspire-lucene
Sub Type: default
Object Type: N/A

Configuration

Element Type Default Description
indexDirectory string <none> The directory on disk of a Lucene index. The index will be created if it does not exist. If this parameter is not given, indexing and searching methods will not be available.
documentID string <none> The Lucene field to be used as the document id for deletes and updates. If not specified, documents may be added to the index, but updates and deletes will not be available.
luceneMaxFieldLength int 10000 The maximum number of terms that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. This setting refers to the number of running terms, not to the number of different terms.

Note: this silently truncates large documents, excluding from the index all terms that occur beyond the limit. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError.

By default, no more than 10,000 terms will be indexed for a field.

luceneMaxBufferedDocuments string -1 (disabled) Determines the minimum number of documents required before the buffered in-memory documents are flushed as a new segment. Large values generally give faster indexing.

When this is set, the writer will flush every luceneMaxBufferedDocuments added documents. Pass in -1 to prevent triggering a flush due to number of buffered documents. Note that if flushing by RAM usage is also enabled, then the flush will be triggered by whichever comes first.

Disabled by default (writer flushes by RAM usage).

luceneMergeFactor int 2 Sets the index writer merge factor
luceneRAMBufferSizeMB int 2048 Sets the index writer RAM buffer size in MB
autoCommitMS long 0 (disabled) The time (in ms) between commits of the index. If set to 0, auto-commit based on time is disabled. The index is only committed if documents have been added since the last commit.
autoCommitDocs long 0 (disabled) The maximum number of documents that can be added between commits of the index. If set to 0, auto-commit based on document submission is disabled.
autoCommitSpinWait long 1000 (ms, i.e. 1 s) The spin wait time for the thread performing auto-commits (if enabled). The thread wakes this often to check whether the time or document threshold has been passed, and commits if required.
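
The way the time and document thresholds combine can be sketched as a small decision function. This is only an illustration of the behaviour described in the table, not the component's actual code: the names AutoCommitSketch and shouldCommit are hypothetical, and treating a threshold as passed when the value is equal to it (>=) is an assumption.

```java
class AutoCommitSketch {

    /**
     * Decides whether an auto-commit is due. In the real component a
     * background thread would make this check every autoCommitSpinWait ms.
     *
     * @param autoCommitMS        configured time threshold in ms (0 = disabled)
     * @param autoCommitDocs      configured document threshold (0 = disabled)
     * @param msSinceLastCommit   time elapsed since the last commit
     * @param docsSinceLastCommit documents added since the last commit
     */
    static boolean shouldCommit(long autoCommitMS, long autoCommitDocs,
                                long msSinceLastCommit, long docsSinceLastCommit) {
        // Nothing was added since the last commit, so there is nothing to commit.
        if (docsSinceLastCommit <= 0) {
            return false;
        }
        // A threshold of 0 disables that trigger (assumption: >= counts as passed).
        boolean timeDue = autoCommitMS > 0 && msSinceLastCommit >= autoCommitMS;
        boolean docsDue = autoCommitDocs > 0 && docsSinceLastCommit >= autoCommitDocs;
        return timeDue || docsDue;
    }

    public static void main(String[] args) {
        // Document threshold reached before the time threshold.
        System.out.println(shouldCommit(1_800_000, 10_000, 60_000, 10_000));  // true
        // Time threshold passed, but no documents were added.
        System.out.println(shouldCommit(1_800_000, 10_000, 2_000_000, 0));    // false
        // Both triggers disabled.
        System.out.println(shouldCommit(0, 0, 2_000_000, 500));               // false
    }
}
```

With both parameters left at 0, the function never reports a commit as due, which matches the table's statement that auto-commit is disabled by default.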

Example Configuration

Simple

    <component name="LuceneService" subType="default" factoryName="aspire-lucene">
      <config/>
    </component>

Complex

    <component name="LuceneIndexer" subType="default" factoryName="aspire-lucene">
      <config>
        <indexDirectory>data/index/lucene-index</indexDirectory>
        <documentID>url</documentID>
        <luceneMaxFieldLength>10000</luceneMaxFieldLength>
        <luceneMaxBufferedDocuments>100</luceneMaxBufferedDocuments>
        <autoCommitSpinWait>5000</autoCommitSpinWait>
        <autoCommitMS>1800000</autoCommitMS>
        <autoCommitDocs>10000</autoCommitDocs>
      </config>
    </component>
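
The complex example sets luceneMaxFieldLength to 10000. As the configuration table notes, the limit counts running terms (every occurrence, not distinct terms), and anything past the limit is silently left out of the index. The sketch below illustrates that effect on a list of already tokenised terms. MaxFieldLengthSketch and truncate are hypothetical names used only for illustration; Lucene itself applies the limit while consuming a token stream rather than on a list.

```java
import java.util.ArrayList;
import java.util.List;

class MaxFieldLengthSketch {

    /**
     * Returns the terms that would be indexed for one field when at most
     * maxFieldLength running terms are kept. Repeated terms each count
     * towards the limit.
     */
    static List<String> truncate(List<String> terms, int maxFieldLength) {
        int end = Math.min(terms.size(), Math.max(0, maxFieldLength));
        // Copy so the result is independent of the input list.
        return new ArrayList<>(terms.subList(0, end));
    }

    public static void main(String[] args) {
        List<String> terms = List.of("to", "be", "or", "not", "to", "be");
        // With a limit of 4, the second "to" and "be" are dropped even though
        // the field contains only four distinct terms.
        System.out.println(truncate(terms, 4)); // [to, be, or, not]
    }
}
```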

Interface

To use the indexing and searching capabilities of this component, you must configure the <indexDirectory> parameter. Services are then provided using the following interface:

 /**
  * Interface for the common Lucene functionality provided by the Aspire Lucene Component
  *
  *  NOTE: The default analyser for the current implementation is provided by {@link VerySimpleAnalyzer} and
  *  will tokenise on punctuation & whitespace and will convert to lower case.
  * 
  * @author Steve Denny
  *
  */
 public interface AspireLucene {
   
   /**
    * Gets an index writer
    * @return a Lucene index writer
    * @throws AspireException
    */
   IndexWriter getIndexWriter() throws AspireException;
 
   /**
    * Adds the document to the index using the default analyser or analysers attached to every field
    * @param doc The Lucene document to add
    * @throws AspireException
    */
   void addDoc(Document doc) throws AspireException;
 
   /**
    * Adds the document to the index using the given analyser
    * @param doc The Lucene document to add
    * @param analyzer The Lucene analyzer to use
    * @throws AspireException
    */
   void addDoc(Document doc, Analyzer analyzer) throws AspireException;
 
   /**
    * Adds/updates the document to the index using the default analyser or analysers attached to every field
    * @param docId The document id of the document to update
    * @param doc The Lucene document to add
    * @throws AspireException
    */
   void addDoc(String docId, Document doc) throws AspireException;
 
   /**
    * Adds/updates the document to the index using the given analyser
    * @param docId The document id of the document to update
    * @param doc The Lucene document to add
    * @param analyzer The Lucene analyzer to use
    * @throws AspireException
    */
   void addDoc(String docId, Document doc, Analyzer analyzer) throws AspireException;
 
   /**
    * Deletes a document using the given document id
    * @param docId The document id of the document to delete
    * @throws AspireException
    */
   void deleteDoc(String docId) throws AspireException;
 
   /**
    * Deletes all documents from the index and commits
    * @throws AspireException
    */
   void deleteAllDocs() throws AspireException;
 
   /**
    * Commits the configured index
    * @throws AspireException
    */
   void commit() throws AspireException;
 
   /**
    * Rolls back the configured index
    * @throws AspireException
    */
   void rollback() throws AspireException;
 
   /**
    * Optimizes the configured index
    * @throws AspireException
    */
   void optimize() throws AspireException;
 
   /**
    * Get the number of documents in the index
    * @return the number of documents currently in the index
    * @throws AspireException
    */
   long numDocs() throws AspireException;
 
   /**
    * Gets the default Analyzer used for indexing
    * @return the Analyser used by default for indexing
    * @throws AspireException
    */
   Analyzer getDefaultAnalyzer() throws AspireException;
   
   /**
    * Gets the index reader for the configured index
    * @return the index reader
    * @throws AspireException
    */
   IndexReader getIndexReader() throws AspireException;
   
   /**
    * Gets the index searcher for the configured index
    * @return the index searcher
    * @throws AspireException
    */
   IndexSearcher getIndexSearcher() throws AspireException;
 
   /**
    * Creates a Lucene document that can be filled with fields to index
    * @return a {@link Document}
    */
   Document createDocument();
 
   /**
    * Creates a field and adds it to the document passed in the parameter
    * @param doc The {@link Document}
    * @param fieldName The field name
    * @param value The field value
    * @param store Whether to store the field or not
    * @param analyze Whether to analyze the field or not
    * @return the {@link Document} with the field added
    */
   Document addField(Document doc, String fieldName, String value, boolean store,
         boolean analyze);
 }
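
The interface comment says the default analyser tokenises on punctuation and whitespace and converts to lower case. The sketch below approximates that behaviour in plain Java so its effect on indexed text is easier to see. It is not the VerySimpleAnalyzer source: SimpleTokenizerSketch and tokenize are hypothetical names, and using the ASCII punctuation class \p{Punct} as the definition of "punctuation" is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SimpleTokenizerSketch {

    /** Splits on runs of ASCII punctuation and whitespace, then lower-cases each token. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[\\p{Punct}\\s]+")) {
            // split() can yield an empty first element when the text starts with a delimiter.
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, World! Aspire-Lucene 0.4"));
        // prints [hello, world, aspire, lucene, 0, 4]
    }
}
```

Under these assumptions, a field value such as a URL or a hyphenated name is broken into several terms, which is worth keeping in mind when choosing whether to analyze the field passed to addField.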