Preliminary Isearch FAQ
Original Version: N. Nassar
Modified by: A. Warnock
This file is meant as a supplement to, not a replacement for, the other documentation
provided with Isearch. Please be sure to read the TUTORIAL and README files carefully
before proceeding.
Q.1 I'm trying to build Isearch-cgi, and can't find idb.hxx, even looking
through the whole file system. Compilation is with g++ from gcc-2.7.0.
Compiling Isearch-cgi requires that you first compile Isearch in a separate directory,
and it must be the most recent release of Isearch. Newer versions of Isearch include
Isearch-cgi, which is no longer distributed separately.
Q.2 How can I make Iindex index single-character words?
Remove them from the stop word list (sw.hxx).
Q.3 I ran Iindex on a large set of files overnight, but didn't see any index
files created. How can I speed up indexing?
Try using -m to increase the amount of memory Iindex uses for indexing. The higher you
set -m, the faster indexing will be, until you reach the physical memory limits of the
machine. Iindex uses about 3 or 4 times (depending on the number of document records) the
amount of memory specified with -m. At a minimum, -m should be set to slightly larger
than the size of the largest document record being indexed, but if you have the memory,
you should set it higher for the sake of speed.
Q.4 I am trying to compile Isearch on my machine, but I'm running into
problems. Are the pre-compiled binaries available?
Q.5 I've indexed a set of web pages, and searching seems to work fine, except
for when I search for something like: 9-12, I get no hits, even though it is in the data.
What's going on?
The problem is the "-" character. Right now Isearch ends a word at the first
non-alphanumeric character, so 9-12 is treated as the two separate words 9 and 12. You
would have to search for '9 and 12' instead. Until we put in configurable stop
characters, that's all you can do.
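To see why a term like 9-12 never matches, here is a minimal Python sketch of splitting text at non-alphanumeric characters. It only illustrates the behavior described above; it is not Isearch's own tokenizer, and the function name split_into_words is made up for this example.

```python
import re

def split_into_words(text):
    """Split text at every run of non-alphanumeric characters."""
    # Empty pieces (from leading or trailing punctuation) are dropped.
    return [piece for piece in re.split(r"[^0-9A-Za-z]+", text) if piece]

# "9-12" never exists as one word, so only its two halves can be searched.
print(split_into_words("9-12"))
print(split_into_words("grades 9-12 only"))
```

Because the hyphen is a split point, the two halves have to be searched as separate words.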
Q.6 Do I really have to re-index the data every time a file is modified or
deleted?
Yes. Like many full-text search engines, Isearch is based on the assumption that fast
searching is more important than fast updating (indexing). (We will, however, be speeding
up Isearch indexing significantly.) And it doesn't really make sense to search on an old
index if the data have changed, because the results would no longer be meaningful. For
example, if you removed all the occurrences of a word from a file, and the index still
reported the word as being there, then you would get wrong results if you were to search
on that word.
The main restriction is that the data being modified must not be the same files that
were indexed. Each night, just before indexing, you could make a copy of the most recent
versions of the files, and index those static copies. As soon as the index is finished,
you would alias the new index files and data as current, and delete the old data and index
files. Of course you would do this in a script, and it would take some disk space; but it
is the best way I know of to provide continuous search access to changing data with a
search engine that uses an indexing phase.
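The nightly copy-index-alias cycle could be scripted roughly as follows. This is a sketch under several assumptions: a POSIX system where the "alias" is a symbolic link, and made-up names throughout (publish_snapshot, current_link, stamp). The build_index argument stands in for whatever Iindex command you normally run; it is not spelled out here.

```python
import os
import shutil

def publish_snapshot(live_dir, snapshots_dir, current_link, build_index, stamp):
    """Copy the live data, index the copy, then repoint current_link at it."""
    snapshot = os.path.join(snapshots_dir, stamp)
    shutil.copytree(live_dir, snapshot)   # static copy; editing continues in live_dir
    build_index(snapshot)                 # placeholder for running Iindex on the copy

    # Remember where the alias pointed before the swap, if it existed.
    previous = os.path.realpath(current_link) if os.path.islink(current_link) else None

    # Create the new link under a temporary name, then rename it over the old
    # link so searchers never find the alias missing.
    tmp_link = current_link + ".new"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(snapshot, tmp_link)
    os.replace(tmp_link, current_link)

    # Delete the old data and index files once the new ones are current.
    if previous and previous != os.path.realpath(snapshot):
        shutil.rmtree(previous)
    return snapshot
```

Searches would always be pointed at the alias, so they keep working against the old copy until the rename happens. The disk-space cost mentioned above shows up here as the second snapshot that exists while indexing runs.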
Q.7 Are there any plans for Isearch to support spatial searching like freeWAIS,
i.e. northernmost latitude, etc.?
Yes - Isearch-2.00 supports numeric, date and spatial data searching.
Q.8 Does Iindex support indexing with headlines?
Try using Iindex with "-t FIRSTLINE" if all you need is to return the first
textual line.
Q.9 How complex can Boolean queries be?
Isearch supports arbitrarily complex Boolean queries, as long as they are phrased in
RPN (reverse Polish notation) format, where each operator follows its operands. That is:
Isearch -d STORIES -rpn cat dog and mouse or
is equivalent to ((cat and dog) or mouse). We should be releasing a version of
Isearch-cgi to enable more 'normal' Boolean queries soon.
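If RPN is unfamiliar, a small Python sketch may help show how such a query is read. It converts an RPN term list into the parenthesized form using a stack. It handles only the two operators shown in the example above (and, or), and the function name rpn_to_infix is invented for illustration; it is not part of Isearch.

```python
OPERATORS = {"and", "or"}  # only the operators used in the example above

def rpn_to_infix(tokens):
    """Rewrite an RPN Boolean query as a fully parenthesized infix string."""
    stack = []
    for token in tokens:
        if token.lower() in OPERATORS:
            if len(stack) < 2:
                raise ValueError("operator %r is missing an operand" % token)
            right = stack.pop()
            left = stack.pop()
            stack.append("(%s %s %s)" % (left, token, right))
        else:
            stack.append(token)  # a plain search term
    if len(stack) != 1:
        raise ValueError("query has leftover terms with no operator")
    return stack[0]

print(rpn_to_infix("cat dog and mouse or".split()))  # ((cat and dog) or mouse)
```

Each operator applies to the two most recent items on the stack, which is why no parentheses are needed in the RPN form.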
Q.10 I've encountered several bugs with Isearch. What information should I
include in a bug report?
In general, information needed to track bugs is:
- Detailed bug description so that it may be duplicated!
- Operating System and revision information (showrev)
- Compiler version and revision information
- Hardware
  - Architecture (e.g. sun4m)
  - Model (e.g. SS20M71); yes, some bugs show up on different machines of the same
    architecture.
Q.11 Is there stemming support in Isearch?
No, but we are planning on adding it.
Q.12 I'm indexing an extremely large database (around 600 Megs), and just
killed Iindex after allowing it to run over 4 days. What's going on?
Isearch was started with an emphasis on design, and some performance issues (for
example, large data sets) have taken a back seat to adding features in an extensible way.
In time, Isearch will far exceed the features, speed, and large-database capacity of other
search engines such as freeWAIS.
Q.13 Is there any documentation (or would someone be willing to answer
questions) on the document types supported by Iindex v1.20?
Q.14 I'm working on a database that we are trying to update about once a day.
From what I can tell, it seems to be re-indexing all the data. What's going wrong?
Nothing is wrong; index merging is simply very slow. We suggest keeping two databases:
the main index, and a second one for the incremental additions. The incremental database
is always much smaller, so the append time is short (and well suited for a process thread).
Q.15 Is it possible to search multiple databases with one Isearch command?
Yes. Kevin Gamiel wrote a "virtual IDB" class that lets you treat
multiple Isearch databases as a single database (it opens an array of IDB objects and
searches across all of them and combines the results).
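The idea of searching several databases and combining the results can be sketched in a few lines of Python. This is not Kevin Gamiel's class (which is C++ and works on IDB objects); here each database is represented by a hypothetical search function that returns (document, score) pairs, and search_all is a made-up name.

```python
def search_all(databases, query):
    """Run one query against every database and merge the hits by score."""
    combined = []
    for name, search in databases.items():
        for doc, score in search(query):
            combined.append((score, name, doc))
    # Highest score first, regardless of which database the hit came from.
    combined.sort(key=lambda hit: hit[0], reverse=True)
    return combined

# Two stand-in "databases" with fixed answers, for illustration only.
databases = {
    "MAIN": lambda query: [("a.txt", 0.9), ("b.txt", 0.4)],
    "UPDATES": lambda query: [("c.txt", 0.7)],
}
for score, name, doc in search_all(databases, "cat"):
    print(score, name, doc)
```

Merging by score like this assumes the scores from different databases are comparable, which may not hold if each database scales its scores separately.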
Q.16 From what I can tell, all the index hits are located before any results
are provided. I'm working with a very large database, though, and I'd like Isearch to
produce output as it gets a hit so I can process it on the fly. Is something like this
possible?
We don't really see how. Until the search finishes, we don't know the scores: we may know
some of the hits, but not which of them are the top hits. With a large database, returning
hits as they are found would almost always return the "wrong" result set. The model is not
grep, where any hit will do; we want to know which records score highest for queries
composed of multiple terms.
Q.17 I've been trying everything to get Isearch to compile and run, but nothing
seems to work. I'm using the very latest version of gcc/g++.
We've received periodic reports of strange problems with the newest version of gcc/g++.
We have used versions 2.6.3 and 2.7.2 to compile the pre-compiled versions available, and
suggest trying one of those if other versions don't work.
Q.18 I'm working with a large database of files which have many occurrences of
words like "hud" "house" and "home". A search on
"house" seems to foul up the search. Can I fix this by adding these words as
stopwords in the sw.hxx file?
Searching for those words does not "foul up" the search, but a single word like
"house" is just too narrow a query to return anything reasonable. Try many words!
Something will, or should, float to the top. Single-word queries are often (in large
databases) not very interesting...
Adding a word to the stoplist removes any reference to that word. The question is: Is the
word without any meaning in the context of the database? If the answer is yes, then add it.
If not, don't. The question is not how "common" a word is but one of semantics.
Q.19 If one indexes a set of documents and some of the documents are edited, does
that make the previous index totally invalid?
YES. You must mark the old version deleted. Move it, change the MTD path to reflect the
new path and THEN add the new version... Version control is critical to the functioning of
the index! The Isearch model does not have a dictionary so any, even minor, change to any
of the documents can invalidate the index.
Q.20 I'm seeing inconsistencies in assigning relevance scores. The same file is
given a different relevance score for the same search terms depending on how large a range
I select with -startdoc and -enddoc. How are the relevance scores assigned?
It is because the scores are scaled based on the result subset. So, if the same file
shows up as part of a different result subset (by specifying starting and ending docs), it
will be given a different score. This is a bug in the Isearch command line tool, and we're
working on it.
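A small Python sketch shows how subset-based scaling produces this effect. The formula here (scale so the best score in the subset becomes 100) is an assumption chosen for illustration; this FAQ does not state the formula Isearch actually uses, and scale_scores is a made-up name.

```python
def scale_scores(raw_scores, top=100):
    """Scale raw scores so the best one in this result subset becomes `top`."""
    if not raw_scores:
        return {}
    best = max(raw_scores.values())
    if best == 0:
        return {doc: 0 for doc in raw_scores}
    return {doc: round(top * score / best) for doc, score in raw_scores.items()}

raw = {"a.txt": 8.0, "b.txt": 4.0, "c.txt": 2.0}

# Full result set: b.txt is scaled against a.txt, the overall best hit.
print(scale_scores(raw)["b.txt"])      # 50

# Subset without a.txt: b.txt is now the best hit in the subset.
subset = {doc: raw[doc] for doc in ("b.txt", "c.txt")}
print(scale_scores(subset)["b.txt"])   # 100
```

The raw score of b.txt never changes; only the reference point it is scaled against does.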
Q.21 Can Iindex index documents that repeat the same field several times?
The Isearch engine does support repeating fields, but there is no interface for
retrieving the individual instances, simply because I haven't gotten around to it yet. It
is quite easy to add a simple method that lets you retrieve field contents based on a
subscript (e.g., title[1], title[2]). If people need this feature, I will go ahead and add
it for the next version.
Also, when I designed the field classes, I did want to support hierarchical fields, but
felt that the field spec classes were already too complicated. I finally decided to use a
flat model, with plans to eventually add some kind of support methods to allow a
directory-like structure within the field name string that would have the effect of
hierarchical fields.
Q.22 Has anyone developed a document type for MARC records in Iindex?
Yes, that has been available since version 1.09.10, along with some other bugs.
Q.23 I've just compiled and I get errors complaining about not finding
iostreams. What's going on?
You need to be sure to install the GNU C++ runtime library libg++ in addition to
gcc/g++. It's not included in the compiler distribution.
Q.24 Does Isearch support hierarchical fields?
Isearch does not now have the code to access fields in this way, but the capability is
there because the interface to the engine is "open" (i.e. the engine doesn't care
whether fields are nested or not; all field access is via internal mappings). Some
additional work needs to be done in the Isearch library to provide a context sensitive
interface to the field data. Once this is done, we can allow any combination of flat
(context-independent) or context-dependent field specifications in the query.
Q.25 Can I update indexes while users are searching the database?
The current code does not handle updates during searching. Since all access to the
database happens through IDB, a simple way to do this would be to lock the entire database
in the constructor. But then your update would need to be pretty fast, and the current
version does not have code for fast updates. You could update a secondary database, and
then merge results during searching from the main and secondary databases.
Q.26 Does Isearch support soundex (or other phonetic matching)?
No, Nassib's code notwithstanding. Implementing it (if I remember what he said at the
time) would require a rewrite of the way we do the indexing.
Q.27 Does Isearch support root/suffix matching, i.e. "child" should
match "children" and, more so, "index" should match
"indices"?
Isearch supports right truncation - "child*" will match child, children,
childless, etc. It has to be explicitly requested, though - it doesn't do it by default.
It also does not currently do left truncation or word stemming.
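For readers curious how right truncation can work in general, here is a minimal Python sketch over a sorted word list. It is an assumption made for illustration: this FAQ does not describe how Isearch stores its words, and the function name right_truncate is invented.

```python
import bisect

def right_truncate(sorted_words, pattern):
    """Return the words matching `pattern`; a trailing * means 'any ending'."""
    if not pattern.endswith("*"):
        # No truncation requested: only an exact match counts.
        return [pattern] if pattern in sorted_words else []
    prefix = pattern[:-1]
    # In a sorted list, all words sharing a prefix sit next to each other.
    start = bisect.bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches

words = sorted(["child", "childless", "children", "chill", "index", "indices"])
print(right_truncate(words, "child*"))  # ['child', 'childless', 'children']
print(right_truncate(words, "child"))   # ['child']
```

The same trick does not help with left truncation, because words sharing an ending are scattered throughout a list sorted from the left. It also shows why "index*" would not find "indices": the two words share only the prefix "ind".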
Q.28 Does Isearch support synonym matching, i.e. infant should match neonate?
We've talked about this. In principle, it's not hard to implement but getting a good
list of synonyms is hard. Roget's charges big bucks for their thesaurus. We'll probably
implement this at some point with some small, discipline-specific lists to see how it
goes.
Q.29 Does Isearch support similar concept matching, i.e. child should match
baby, or even dog should match bitch, in the proper context (yeah, I know, getting pretty
hairy)?
No. Right now, Isearch does no real content analysis.
Q.30 How do I unsubscribe from the ISITE-L listserver?