Lucene Analyzer

Identifier: org.eclipse.help.luceneAnalyzer

Description: This extension point is used to register text analyzers for use by help when indexing and searching documentation.

Help uses the Lucene search engine, which indexes token streams (streams of words). Analyzers create tokens from the character stream: they examine text content and provide tokens for use with the index. The text stream can be tokenized in many ways. A trivial analyzer can tokenize the stream at white space; a different one can also filter tokens, depending on the application's needs. Since the documentation is mostly human-readable text, analyzers used by the help system should perform language- and grammar-aware tokenization and normalization of the indexed text. For some languages, search quality increases significantly if stop word removal and stemming are performed on the indexed text. This extension point allows configuring analyzers for languages for which the default help system does not provide language-aware analyzers.
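To make the difference between a trivial analyzer and a filtering analyzer concrete, here is a minimal sketch in plain Java. It deliberately does not use the Lucene API; the class name TokenizationSketch, its method names, and the tiny stop word list are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: not a Lucene Analyzer and not part of the help system.
class TokenizationSketch {

    // Hypothetical, deliberately tiny stop word list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "is"));

    /** Trivial analysis: split the character stream at white space. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Filtering analysis: lowercase each token, then drop stop words. */
    static List<String> filteredTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : whitespaceTokens(text)) {
            String lower = token.toLowerCase(Locale.ENGLISH);
            if (!STOP_WORDS.contains(lower)) {
                tokens.add(lower);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Index of the Help System";
        System.out.println(whitespaceTokens(text));
        System.out.println(filteredTokens(text));
    }
}
```

A real analyzer for a given language would typically add stemming and a language-specific stop word list on top of these steps.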

Configuration Markup:

   <!ELEMENT analyzer EMPTY>
   <!ATTLIST analyzer
      locale         CDATA #REQUIRED
      class          CDATA #REQUIRED
   >

locale - a string identifying the locale for which the supplied analyzer is to be used; if only a two-letter language code is provided, the analyzer will be available to all locales of that language
class - the fully qualified name of a Java class extending org.apache.lucene.analysis.Analyzer

Examples:

The following is an example of a Lucene analyzer configuration:

    <extension id="com.xyz.XYZ" point="org.eclipse.help.luceneAnalyzer">
        <analyzer locale="ll_CC" class="com.xyz.ll_CCAnalyzer" />
    </extension>

API Information:

The value of the locale attribute must be either a five-character or a two-character locale string. If an analyzer is configured for a language by specifying the two-letter language designation, it is used for all locales of that language. If an analyzer is also configured for a matching five-character locale, that analyzer is used instead for that locale.
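The precedence described above can be sketched as a simple lookup: exact locale first, then the two-letter language, then a fallback. This is an assumption-laden illustration in plain Java, not the help system's actual code; the class AnalyzerLookupSketch, the method resolve, and the analyzer class names in main are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: models the documented precedence, not the real implementation.
class AnalyzerLookupSketch {

    /**
     * Returns the analyzer class name configured for the given locale,
     * preferring an exact locale match over a language-only match.
     */
    static String resolve(Map<String, String> configured, String locale, String fallback) {
        // 1. Exact match on the full locale string, e.g. "pt_BR".
        String exact = configured.get(locale);
        if (exact != null) {
            return exact;
        }
        // 2. Match on the two-letter language, e.g. "pt".
        if (locale.length() >= 2) {
            String byLanguage = configured.get(locale.substring(0, 2));
            if (byLanguage != null) {
                return byLanguage;
            }
        }
        // 3. Nothing configured: use the caller-supplied fallback.
        return fallback;
    }

    public static void main(String[] args) {
        Map<String, String> configured = new HashMap<>();
        configured.put("pt", "com.xyz.PortugueseAnalyzer");    // hypothetical class
        configured.put("pt_BR", "com.xyz.BrazilianAnalyzer");  // hypothetical class

        System.out.println(resolve(configured, "pt_BR", "default"));
        System.out.println(resolve(configured, "pt_PT", "default"));
        System.out.println(resolve(configured, "fr_FR", "default"));
    }
}
```

In this sketch, a pt_PT request falls back to the analyzer registered for pt, while pt_BR gets its own more specific analyzer.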

The value of the class attribute must represent a class that extends org.apache.lucene.analysis.Analyzer. It is recommended that this analyzer perform lowercase filtering for languages where making search case-insensitive can increase the number of search hits.

Supplied Implementation: The help system comes with English and German analyzers, which are configured for the en and de locales respectively. These analyzers perform stop word filtering, lowercase filtering, and stemming. For languages with no configured analyzer, help uses a simple analyzer that performs lowercase filtering and English stop word filtering.