The default search engine, com.sun.java.help.search.DefaultSearchEngine, uses a natural language search technology that not only retrieves documents but also locates the specific passages within those documents where the answer to a request is likely to be found. The technology involves a conceptual indexing engine, which analyzes documents to produce an index of their content, and a query engine, which uses this index to find relevant passages in the material.
The query engine uses a technique called "relaxation ranking" to identify and score specific passages of material where the answer to a request is likely to be found. This approach is referred to as "specific passage retrieval." It contrasts with traditional "document retrieval," which retrieves documents but leaves the user with the task of finding the relevant information within each document (or of discovering that the desired information is not in the document after all).
The relaxation ranking algorithm compares the search terms with occurrences of the same or related terms in the documents. It attempts to find passages in which as many of the query terms as possible occur, as nearly as possible in the same form and the same order. It automatically relaxes these constraints to identify passages in which not all of the terms occur, or in which they occur in different forms, in a different order, or with intervening words. It assigns a penalty to each passage for every way in which the passage departs from an exact match of the requested information. Passages with words in the same order as the search terms are scored better than passages with the matching words in some other order. Passages with matching words in any order are scored better than passages that do not contain matches for all of the requested terms.
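The scoring order described above can be illustrated with a small sketch. It is not the JavaHelp implementation: the class name RelaxationRankingSketch, the three penalty weights, and the choice to use only the first occurrence of each term are assumptions made for illustration. A lower penalty means a better match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: not the JavaHelp search engine's actual algorithm or weights.
class RelaxationRankingSketch {
    // Hypothetical weights, chosen so that in short passages a missing term costs
    // more than a reordering, and a reordering costs more than a few gaps.
    static final int MISSING_TERM_PENALTY = 100;
    static final int OUT_OF_ORDER_PENALTY = 10;
    static final int INTERVENING_WORD_PENALTY = 1;

    /** Returns a penalty; 0 means every term matched, in query order, with no gaps. */
    static int penalty(List<String> queryTerms, List<String> passageWords) {
        List<Integer> positions = new ArrayList<>();
        int missing = 0;
        for (String term : queryTerms) {
            int position = firstIndexOf(passageWords, term);
            if (position < 0) {
                missing++;               // relaxed constraint: term absent
            } else {
                positions.add(position);
            }
        }

        // Count pairs of matched terms that appear in the opposite order from the query.
        int outOfOrder = 0;
        for (int i = 0; i < positions.size(); i++) {
            for (int j = i + 1; j < positions.size(); j++) {
                if (positions.get(i) > positions.get(j)) {
                    outOfOrder++;
                }
            }
        }

        // Count words between the first and last matched position
        // that were not themselves used as matches.
        int intervening = 0;
        if (!positions.isEmpty()) {
            int span = Collections.max(positions) - Collections.min(positions) + 1;
            intervening = Math.max(0, span - positions.size());
        }

        return missing * MISSING_TERM_PENALTY
                + outOfOrder * OUT_OF_ORDER_PENALTY
                + intervening * INTERVENING_WORD_PENALTY;
    }

    // Case-insensitive search for the first occurrence of a term, or -1 if absent.
    private static int firstIndexOf(List<String> words, String term) {
        for (int i = 0; i < words.size(); i++) {
            if (words.get(i).equalsIgnoreCase(term)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("renew", "license");
        System.out.println(penalty(query, Arrays.asList("renew", "license", "today")));
        System.out.println(penalty(query, Arrays.asList("renew", "your", "license")));
        System.out.println(penalty(query, Arrays.asList("license", "to", "renew")));
        System.out.println(penalty(query, Arrays.asList("renew", "your", "subscription")));
    }
}
```

With these assumed weights, the four sample passages are ranked in the order listed: the exact match first, then the match with an intervening word, then the reordered match, and last the passage that is missing a term.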
The conceptual index consists of the linguistic resources described below.
IMPORTANT: Although the core search engine in the reference implementation supports all these concepts, the indexer (search builder) available in JavaHelp 1.0 incorporates only tokens. Details of the other concepts are included below for the interested reader.
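Because the JavaHelp 1.0 indexer records only tokens, a token-level index is the simplest structure to picture. The following is a minimal sketch of such an index, assuming lowercase word tokens and word-offset positions. The class TokenIndexSketch and its methods are hypothetical and do not reflect the format that the JavaHelp search builder actually writes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a positional token index, not the JavaHelp index format.
class TokenIndexSketch {
    /** One occurrence of a token: the document name and the word offset within it. */
    static final class Posting {
        final String document;
        final int position;

        Posting(String document, int position) {
            this.document = document;
            this.position = position;
        }
    }

    private final Map<String, List<Posting>> postings = new HashMap<>();

    /** Splits the text into lowercase tokens and records where each one occurs. */
    void addDocument(String name, String text) {
        // Simplification: anything other than ASCII letters and digits separates tokens.
        String[] words = text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        int position = 0;
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            postings.computeIfAbsent(word, k -> new ArrayList<>())
                    .add(new Posting(name, position));
            position++;
        }
    }

    /** Returns every recorded occurrence of the token, or an empty list if none. */
    List<Posting> lookup(String token) {
        return postings.getOrDefault(token.toLowerCase(Locale.ROOT),
                Collections.emptyList());
    }

    public static void main(String[] args) {
        TokenIndexSketch index = new TokenIndexSketch();
        index.addDocument("license.html", "Renew the license. Renew now.");
        for (Posting p : index.lookup("renew")) {
            System.out.println(p.document + " @ word " + p.position);
        }
    }
}
```

Recording word positions, rather than only document names, is what would let a query engine point at a passage inside a document instead of returning the whole document.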
The indexing engine can perform linguistic content processing of the indexed material. It analyzes the structure and interrelationships of words and phrases, and it organizes all of the words and phrases from the indexed material into a conceptual taxonomy. The taxonomy can be browsed, and it can be used to make connections between terms in a query and related terms in the material that you want to find.
The relaxation ranking algorithm is an effective retrieval method on its own, but it can produce significantly better results by using morphological and semantic relationships from the conceptual taxonomy to automatically make connections between query terms and related terms that may occur in the desired passages.
Morphological relationships refer to relationships between different inflected and derived forms of a word, such as the relationship between "renew" and "renewed" (past tense inflection) and "renew" and "renewal" (derived nominalization). Derived and inflected forms of a word are treated as more specific terms in the conceptual taxonomy, so that a request for "renew" will automatically match "renewed" and "renewal" (with a small penalty).
Semantic relationships refer to relationships between terms that are more general or more specific than other terms or that imply other terms. For example, "washing" is a kind of "cleaning," and because it is more specific than "cleaning," it will automatically be matched by a request for "cleaning" (again with a small penalty).
Passages with exact word matches are scored better than passages with morphological matches or matches using semantic relationships.
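The matching rules for morphological and semantic relationships can be sketched as a lookup that returns a penalty per term. This is a hedged illustration, not the reference implementation: the class TaxonomyMatchSketch, its hand-built taxonomy, and the penalty values are all assumptions. The source states only that both kinds of match carry a small penalty and score worse than an exact match, so the relative size of the two penalties here is arbitrary. The sketch also assumes that matching runs in one direction only, from a general query term to a more specific document term.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: a hand-built stand-in for the conceptual taxonomy.
class TaxonomyMatchSketch {
    static final int NO_MATCH = -1;
    static final int EXACT_MATCH = 0;
    // Hypothetical values; the source says only that these penalties are small.
    static final int MORPHOLOGICAL_PENALTY = 1;
    static final int SEMANTIC_PENALTY = 2;

    // Each key maps to the more specific terms that a request for the key may match.
    private final Map<String, Set<String>> morphologicalForms = new HashMap<>();
    private final Map<String, Set<String>> moreSpecificTerms = new HashMap<>();

    void addMorphologicalForms(String base, String... forms) {
        morphologicalForms.computeIfAbsent(base, k -> new HashSet<>())
                .addAll(Arrays.asList(forms));
    }

    void addMoreSpecificTerms(String general, String... specifics) {
        moreSpecificTerms.computeIfAbsent(general, k -> new HashSet<>())
                .addAll(Arrays.asList(specifics));
    }

    /** Returns the penalty for matching a query term against a document term. */
    int penalty(String queryTerm, String documentTerm) {
        if (queryTerm.equals(documentTerm)) {
            return EXACT_MATCH;
        }
        if (morphologicalForms.getOrDefault(queryTerm, Collections.emptySet())
                .contains(documentTerm)) {
            return MORPHOLOGICAL_PENALTY;
        }
        if (moreSpecificTerms.getOrDefault(queryTerm, Collections.emptySet())
                .contains(documentTerm)) {
            return SEMANTIC_PENALTY;
        }
        return NO_MATCH;
    }

    /** Builds a taxonomy holding only the examples used in the text. */
    static TaxonomyMatchSketch example() {
        TaxonomyMatchSketch taxonomy = new TaxonomyMatchSketch();
        taxonomy.addMorphologicalForms("renew", "renewed", "renewal");
        taxonomy.addMoreSpecificTerms("cleaning", "washing");
        return taxonomy;
    }

    public static void main(String[] args) {
        TaxonomyMatchSketch taxonomy = example();
        System.out.println(taxonomy.penalty("renew", "renew"));
        System.out.println(taxonomy.penalty("renew", "renewal"));
        System.out.println(taxonomy.penalty("cleaning", "washing"));
        System.out.println(taxonomy.penalty("washing", "cleaning"));
    }
}
```

In this sketch a request for "cleaning" matches "washing", but a request for "washing" does not match "cleaning", because only the more specific term is stored under the more general one.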