Isite and Synonym Expansion
Starting with Isite v2.06/Isearch v.146, you can provide an external list of synonyms for use with Isite. The list can be used either to expand each term of the user's query into its defined synonyms, or to replace each term in the user's query with a synonym from a controlled list.
Q1. What does the list of synonyms look like?
It's a simple, formatted table. Here is the file doc/Synonyms.txt, provided in the Isite distribution:
# Rows of the form:
# parent phrase = child1+child2+multiword child+ ... +childN
#
# White space is ignored at the start and end of child terms
# Comments start with #
spatial=geospatial+geographic+terrestrial # Here are more comments
land use=land cover + land characterization + land surface + ownership property
wetlands=wet land+NWI+hydric soil+inundated
hydrography=stream+river+spring+lake+pond+aqueduct+siphon+well
hypsography=elevation + relief + topgraphy + contour
Each entry consists of a parent term or phrase (for example, the word "spatial" is a parent term), and a list of one or more child terms or phrases. The parent is separated from the children by the "=" sign, and children are separated from each other with "+" signs. Each entry must go on a single line. Comments start with the "#" character.
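The format described above is simple enough to read with a few lines of code. The following is only a sketch in Python, not the parser Isite itself uses: the function name parse_synonyms is made up for illustration, and trimming white space around the parent and skipping lines without an "=" sign are assumptions, not documented behavior.

```python
def parse_synonyms(text):
    """Parse synonym-table text into a dict of {parent: [children]}."""
    table = {}
    for raw_line in text.splitlines():
        # Drop everything from the first "#" onward (a comment), then trim.
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            # Blank line, comment-only line, or a line with no "=" (assumed skipped).
            continue
        parent, children = line.split("=", 1)
        # Children are separated by "+"; white space around each child is ignored.
        table[parent.strip()] = [c.strip() for c in children.split("+") if c.strip()]
    return table


# Two entries taken from the sample file shown above.
SAMPLE = """\
# Comments start with #
spatial=geospatial+geographic+terrestrial # Here are more comments
land use=land cover + land characterization + land surface + ownership property
"""

table = parse_synonyms(SAMPLE)
print(table["spatial"])
print(table["land use"])
```

Multiword parents and children such as "land use" and "land cover" come through intact, because only the "=" and "+" characters act as separators.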
Just create a file like this, with whatever terms you wish to use, and save it somewhere convenient on your disk. This file will be used as an input parameter for Iindex.
Expanding a term with its synonyms is a two-step process. First, the term is looked up as a child to find its parent (the phrase to the left of the equals sign). Then all of the children of that parent are collected, and the original term is replaced with this list, including the parent. If no parent is found for the original term, the original term is left alone.
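The two steps can be sketched in Python as follows. This is an illustration of the process as described here, not the code in Isearch.cxx. The name expand_term is hypothetical, and exact, case-sensitive matching and the order of the returned list are assumptions. A query term that is itself a parent is left alone in this sketch, since only child terms are described as being matched.

```python
# Small in-memory table; the entry is taken from the sample synonym file.
SYNONYMS = {
    "hydrography": ["stream", "river", "spring", "lake", "pond",
                    "aqueduct", "siphon", "well"],
}


def expand_term(term, synonyms):
    """Return the list of terms that replaces `term` in the query."""
    for parent, children in synonyms.items():
        # Step 1: match the term against the children to find its parent.
        if term in children:
            # Step 2: the parent plus all of its children replace the term.
            return [parent] + children
    # No parent found: the original term is left alone.
    return [term]


print(expand_term("river", SYNONYMS))
print(expand_term("glacier", SYNONYMS))
```

Here "river" grows into nine terms (the parent "hydrography" plus its eight children), while "glacier", which has no parent in the table, comes back unchanged.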
This gives two practical uses. The first is to treat the parent terms as a list of controlled keywords, so that any of a set of user-supplied terms is translated into a word found in your documents. Many types of metadata use controlled lists of keywords in certain fields, and this is a way to map the user's query onto one of those controlled terms.
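As a rough illustration of the controlled-keyword use, the sketch below maps a user's term to its parent only. The function name to_controlled_term is made up for this example, and exact, case-sensitive matching is an assumption. How Isite chooses between this mode and full expansion is not covered here.

```python
# Entry taken from the sample synonym file.
WETLANDS = {"wetlands": ["wet land", "NWI", "hydric soil", "inundated"]}


def to_controlled_term(term, synonyms):
    """Map a user-supplied term to its controlled (parent) keyword."""
    # Invert the table so each child points at its parent.
    child_to_parent = {
        child: parent
        for parent, children in synonyms.items()
        for child in children
    }
    # Terms with no parent are passed through unchanged.
    return child_to_parent.get(term, term)


print(to_controlled_term("hydric soil", WETLANDS))
print(to_controlled_term("marsh", WETLANDS))
```

A query for "hydric soil" is turned into the controlled keyword "wetlands", while "marsh", which is not in the list, is searched as typed.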
The second use is to replace a user's query term with equivalent terms, so that one doesn't miss relevant documents which don't happen to contain the particular words the user specified. In this case, one would replace the original term with the entire list of synonyms. When the search is performed, each term is treated as a separate search and the results are OR'd together. This means that searching with synonym expansion turned on isn't free - there's a performance cost.
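To see where the cost comes from, here is a toy sketch in Python of OR'ing together one search per expanded term. Nothing in it is Isearch code: the documents, the substring matching, and the names search_one and search_expanded are all invented for illustration.

```python
# Toy document collection, keyed by a made-up document id.
DOCS = {
    1: "survey of river channels",
    2: "lake and pond inventory",
    3: "land cover classification",
}


def search_one(term, docs):
    """Toy single-term search: ids of documents whose text contains the term."""
    return {doc_id for doc_id, text in docs.items() if term in text}


def search_expanded(terms, docs):
    """Run one search per term and OR the results together (set union)."""
    hits = set()
    for term in terms:
        hits |= search_one(term, docs)
    return hits


# The query "river" after expansion: the parent plus all of its children.
expanded = ["hydrography", "stream", "river", "spring", "lake",
            "pond", "aqueduct", "siphon", "well"]

print(sorted(search_expanded(expanded, DOCS)))
print(len(expanded), "searches instead of 1")
```

The unexpanded query "river" finds only document 1. The expanded query also finds document 2, at the price of nine searches instead of one.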
Q2. How do I build the synonyms into the index?
Run Iindex as you normally do, with the additional command line parameter -syn <filename>. For example, if you normally build an index with the command:
Iindex -d db/test -t fgdc -o fieldtype=bin/fgdc.fields -m 8 data/*.sgml
you can enable synonym expansion with the command:
Iindex -d db/test -t fgdc -o fieldtype=bin/fgdc.fields -m 8 -syn data/Synonyms.txt data/*.sgml
This will use the file data/Synonyms.txt for the synonym definitions.
Note that the synonym expansion capability is not dependent on the doctype. It works with all of them.
Q3. Does synonym expansion work with the command-line Isearch program? How do I turn it on?
Yes. Use this command:
Isearch -d db/test -syn river
Q4. Does synonym expansion work with zserver? How do I turn it on?
Yes. Edit the section of the file sapi.ini for your index by adding the parameter:
Synonyms=ON
Q5. Does synonym expansion work with Isearch-cgi? How do I turn it on?
No. It's not hard, but it's not implemented yet. If you want it, follow the implementation in Isearch.cxx - search for the word "Synonyms".
Q6. Can Z39.50 clients turn off synonym expansion?
No. It's an attribute of the index, and it's defined as on or off when the server starts. Furthermore, at the current time, the Z39.50 protocol has no mechanism for passing such a request.
Q7. Are there big thesauri I can use?
Not yet. I'd like to find some.
Archie Warnock (warnock@awcubed.com)
A/WWW Enterprises