Preliminary Isearch FAQ

Original Version: N. Nassar
Modified by: A. Warnock

This file is meant as an addition to, not a replacement of, the other documentation provided with Isearch. Please be sure to read the TUTORIAL and README files carefully before proceeding.

Q.1 I'm trying to build Isearch-cgi, and can't find idb.hxx, even looking through the whole file system. Compilation is with g++ from gcc-2.7.0.

Compiling Isearch-cgi requires that you first compile Isearch in a separate directory, using the most recent release of Isearch. Newer versions of Isearch include Isearch-cgi, and Isearch-cgi is no longer distributed separately.

Q.2 How can I make Iindex index single character words?

Remove them from the stop word list (sw.hxx).

Q.3 I ran Iindex on a large set of files overnight, but didn't see any index files created. How can I speed up indexing?

Try using -m to increase the amount of memory Iindex uses for indexing. The higher you set -m, the faster indexing will be, until you reach the physical memory limits of the machine. Iindex uses about 3 or 4 times (depending on the number of document records) the amount of memory specified in -m. At a minimum, -m should be set to slightly larger than the size of the largest document record being indexed, but if you have the memory, you should set it higher for the sake of speed.
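As a rough illustration of the arithmetic above, here is a small Python sketch that picks a candidate -m value. The function name, the choice of megabytes as the unit, and the overhead factor of 4 are assumptions made for the example; check the Iindex usage message for the actual units on your build.

```python
def suggest_m(physical_mb, largest_record_mb, overhead_factor=4):
    """Return a candidate -m value (megabytes assumed) for Iindex."""
    # Iindex is reported to use about 3-4x the -m value, so budget with
    # the larger factor to stay inside physical memory.
    ceiling = physical_mb // overhead_factor
    # -m should be at least slightly larger than the largest record.
    floor = int(largest_record_mb) + 1
    if floor > ceiling:
        # Not enough memory to satisfy both rules; the minimum wins.
        return floor
    return ceiling
```

For a machine with 64 MB of memory and a largest record of about 2 MB, this suggests 16. It ignores memory needed by the operating system and other processes, so treat the result as an upper bound to experiment under.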

Q.4 I am trying to compile Isearch on my machine, but I'm running into problems. Are the pre-compiled binaries available?

Precompiled binaries of Isearch and Isearch-cgi are available at:

ftp://ftp.awcubed.com/Software/

Q.5 I've indexed a set of web pages, and searching seems to work fine, except for when I search for something like: 9-12, I get no hits, even though it is in the data. What's going on?

I think the problem is the "-" character. Right now Isearch stops at the first non-alphanumeric character. You would have to search for '9 and 12' since they are treated as separate words. Until we put in configurable stop characters, that's all you can do.
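To see why '9-12' finds nothing, here is a minimal Python sketch of word splitting that stops at every non-alphanumeric character. It only imitates the behavior described above; the function names are made up, and the real Isearch tokenizer may treat some characters differently.

```python
import re

def split_words(text):
    # Every run of ASCII letters or digits is one word;
    # any other character ends the word.
    return re.findall(r"[A-Za-z0-9]+", text)

def to_and_query(term):
    # Rewrite a term such as "9-12" as the words it was indexed under.
    return " and ".join(split_words(term))
```

With this model, "9-12" is indexed as the two words "9" and "12", and to_and_query("9-12") gives the '9 and 12' query suggested above.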

Q.6 Do I really have to re-index the data every time a file is modified or deleted?

Yes. Like many full-text search engines, Isearch is based on the assumption that fast searching is more important than fast updating (indexing). (We will, however, be speeding up Isearch indexing significantly.) And it doesn't really make sense to search on an old index if the data have changed, because the results would no longer be meaningful. For example, if you removed all the occurrences of a word from a file, and the index still reported the word as being there, then you would get wrong results if you were to search on that word.

If the data changes continuously, you can still offer uninterrupted searching. The main restriction is that the files being modified must not be the same files that were indexed. Each night, just before indexing, you could make a copy of the most recent versions of the files, and index those static copies. As soon as the index is finished, you would alias the new index files and data as current, and delete the old data and index files. Of course you would do this in a script, and it would take some disk space; but it is the best way I know of to provide continuous search access to changing data with a search engine that uses an indexing phase.
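The nightly copy-index-alias cycle could be scripted along these lines. This is a sketch in Python under several assumptions: a POSIX file system with symbolic links, a hypothetical build_index hook that you supply (for example, a wrapper that runs Iindex on the copied files), and a link named "current" that the search front end reads from. None of these names come from Isearch.

```python
import os
import shutil
import tempfile

def rotate_index(live_dir, root, build_index):
    """Copy live_dir, index the copy, then point root/current at it."""
    root = os.path.abspath(root)
    os.makedirs(root, exist_ok=True)
    # mkdtemp creates a private (0700) directory; adjust permissions
    # if the searcher runs as a different user.
    new_dir = tempfile.mkdtemp(prefix="snapshot-", dir=root)
    data_dir = os.path.join(new_dir, "data")
    shutil.copytree(live_dir, data_dir)   # static copy; live files may keep changing
    build_index(data_dir, new_dir)        # hypothetical hook that writes the index

    current = os.path.join(root, "current")
    old_dir = os.path.realpath(current) if os.path.islink(current) else None

    # Make the new link under a temporary name, then rename it over the
    # old one so "current" never disappears for searchers.
    tmp_link = os.path.join(root, "current.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_dir, tmp_link)
    os.replace(tmp_link, current)

    if old_dir and old_dir != os.path.realpath(new_dir):
        shutil.rmtree(old_dir)            # drop the previous data and index
    return new_dir
```

Run from a nightly job, this needs disk space for two full copies while the new index is being built, which matches the trade-off described above.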

Q.7 Are there any plans for Isearch to support spatial searching like freeWAIS, i.e. northernmost latitude, etc.?

Yes - Isearch-2.00 supports numeric, date and spatial data searching.

Q.8 Does Iindex support indexing with headlines?

Try using Iindex with "-t FIRSTLINE" if all you need is to return the first textual line.

Q.9 How complex can Boolean queries be?

Isearch supports arbitrarily complex Boolean queries, as long as they are phrased in RPN format. That is:

Isearch -d STORIES -rpn cat dog and mouse or

is equivalent to ((cat and dog) or mouse). We should be releasing a version of Isearch-cgi to enable more 'normal' Boolean queries soon.
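If RPN is unfamiliar, the following Python sketch converts an RPN query into the parenthesized form, which is a handy way to check what a query means before running it. It is not part of Isearch; the function name is made up, and it only knows the two operators used in the example above.

```python
def rpn_to_infix(tokens, operators=("and", "or")):
    """Convert a list of RPN query tokens to a parenthesized string."""
    stack = []
    for tok in tokens:
        if tok.lower() in operators:
            if len(stack) < 2:
                raise ValueError("operator %r needs two operands" % tok)
            # The operator applies to the two most recent operands.
            right = stack.pop()
            left = stack.pop()
            stack.append("(%s %s %s)" % (left, tok.lower(), right))
        else:
            stack.append(tok)   # a search term
    if len(stack) != 1:
        raise ValueError("query leaves %d items on the stack" % len(stack))
    return stack[0]
```

Calling rpn_to_infix("cat dog and mouse or".split()) returns "((cat and dog) or mouse)". A query that leaves extra terms on the stack, such as "cat dog mouse and", is reported as an error rather than guessed at.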

Q.10 I've encountered several bugs with Isearch. What information should I include in a bug report?

In general, information needed to track bugs is:

Detailed bug description so that it may be duplicated!
Operating System and revision information (showrev)
Compiler version and revision information
Hardware

	Architecture (e.g. sun4m)
	Model (e.g. SS20M71). Yes, some bugs show up on different machines of the same architecture.

Q.11 Is there stemming support in Isearch?

No, but we are planning on adding it.

Q.12 I'm indexing an extremely large database (around 600 Megs), and just killed Iindex after allowing it to run over 4 days. What's going on?

Isearch was started with an emphasis on design, and some performance issues (for example, large data sets) have taken a back seat to adding features in an extensible way. In time, Isearch will far exceed the features, speed, and large-database capacity of other search engines such as freeWAIS.

Q.13 Is there any documentation (or would someone be willing to answer questions) on the document types supported by Iindex v1.20?

Q.14 I'm working on a database that we are trying to update about once a day. From what I can tell, it seems to be re-indexing all the data. What's going wrong?

Nothing is wrong; index merging is simply very slow. We suggest having two databases: the main index and a second one for the incremental additions. The incremental database is always much smaller, so the append time is short (and well suited for a process thread).

Q.15 Is it possible to search multiple databases with one Isearch command?

Yes. Kevin Gamiel wrote a "virtual IDB" class that lets you treat multiple Isearch databases as a single database (it opens an array of IDB objects and searches across all of them and combines the results).
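The idea of searching several databases and combining the results can be sketched as follows. This is Python, not the actual virtual IDB class; the function name and the shape of the hit tuples are assumptions, and it assumes that raw scores from different databases can be compared directly.

```python
import heapq

def search_all(databases, query, limit=10):
    """Search every database and return the best hits overall.

    databases maps a name to a function that takes the query and
    returns a list of (score, doc_id) pairs.
    """
    combined = []
    for name, search in databases.items():
        for score, doc_id in search(query):
            combined.append((score, name, doc_id))
    # Highest score first; ties fall back to name, then doc_id.
    return heapq.nlargest(limit, combined)
```

The same pattern covers a large main database plus a small, frequently updated one: search both and merge the hits at query time.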

Q.16 From what I can tell, all the index hits are located before any results are provided. I'm working with a very large database, though, and I'd like Isearch to produce output as it gets a hit so I can process it on the fly. Is something like this possible?

I don't really see how. Until the search finishes, we don't know the scores. We may know some hits early, but we don't know what the top hits are until we are done. With a large database, returning hits as they are found would almost always produce the "wrong" result set. The model is not grep, where any hit will do; we want to know which documents score highest for queries composed of multiple terms.

Q.17 I've been trying everything to get Isearch to compile and run, but nothing seems to work. I'm using the very latest version of gcc/g++.

We've received periodic reports of strange problems with the newest version of gcc/g++. We have used versions 2.6.3 and 2.7.2 to compile the pre-compiled versions available, and suggest trying one of those if other versions don't work.

Q.18 I'm working with a large database of files which have many occurrences of words like "hud" "house" and "home". A search on "house" seems to foul up the search. Can I fix this by adding these words as stopwords in the sw.hxx file?

Searching for those words does not "foul up" the search, but a single word like "house" is just too broad to return anything reasonable. Try many words! Something will, or should, float to the top. Single-word queries are often (in large databases) not very interesting.

Adding a word to the stop list removes every reference to it from the index. The question is: is the word without any meaning in the context of the database? If the answer is yes, then add it. If not, don't. The question is not how "common" the word is, but one of semantics.

Q.19 If one indexes a set of documents and some of the documents are later edited, does that make the previous index totally invalid?

YES. You must mark the old version deleted. Move it, change the MTD path to reflect the new path and THEN add the new version... Version control is critical to the functioning of the index! The Isearch model does not have a dictionary so any, even minor, change to any of the documents can invalidate the index.

Q.20 I'm seeing inconsistencies in assigning relevance scores. The same file is given a different relevance score for the same search terms depending on how large a range I select with -startdoc and -enddoc. How are the relevance scores assigned?

It is because the scores are scaled based on the result subset. So, if the same file shows up as part of a different result subset (by specifying starting and ending docs), it will be given a different score. This is a bug in the Isearch command line tool, and we're working on it.
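The effect is easy to reproduce with a toy model. The Python sketch below assumes scores are scaled so that the best hit in the displayed subset gets 100; the actual scaling formula in Isearch is not given here, so treat both the formula and the names as illustrative only.

```python
def scale_scores(raw_scores, top=100):
    """Scale raw scores so the best hit in this subset gets `top`."""
    if not raw_scores:
        return {}
    best = max(raw_scores.values())
    if best == 0:
        return {doc: 0 for doc in raw_scores}
    return {doc: round(top * s / best) for doc, s in raw_scores.items()}

# Made-up raw scores for three files.
raw = {"a.txt": 8.0, "b.txt": 4.0, "c.txt": 2.0}
full = scale_scores(raw)                                         # all hits
subset = scale_scores({d: raw[d] for d in ("b.txt", "c.txt")})   # narrower range
```

Here b.txt scores 50 when a.txt is in the result subset and 100 when it is not, even though its raw score never changed.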

Q.21 Can Iindex index documents that repeat the same field several times?

The Isearch engine does support repeating fields, but there is no interface for retrieving the individual instances, simply because I haven't gotten around to it yet. It is quite easy to add a simple method that lets you retrieve field contents based on a subscript (e.g., title[1], title[2]). If people need this feature, I will go ahead and add it for the next version.

Also, when I designed the field classes, I did want to support hierarchical fields, but felt that the field spec classes were already too complicated. I finally decided to use a flat model, with plans to eventually add some kind of support methods to allow a directory-like structure within the field name string that would have the effect of hierarchical fields.

Q.22 Has anyone developed a document type for MARC records in Iindex?

Yes, that has been available since version 1.09.10, along with some other bugs.

Q.23 I've just compiled and I get errors complaining about not finding iostreams. What's going on?

You need to be sure to install the GNU C++ runtime library, libg++, in addition to gcc/g++. It's not included in the compiler distribution.

Q.24 Does Isearch support hierarchical fields?

Isearch does not currently have the code to access fields in this way, but the capability is there because the interface to the engine is "open" (i.e. the engine doesn't care whether fields are nested or not; all field access is via internal mappings). Some additional work needs to be done in the Isearch library to provide a context-sensitive interface to the field data. Once this is done, we can allow any combination of flat (context-independent) or context-dependent field specifications in the query.

Q.25 Can I update indexes while users are searching the database?

The current code does not handle updates during searching. Since all access to the database happens through IDB, a simple way to do this would be to lock the entire database in the constructor. But then your update would need to be pretty fast, and the current version does not have code for fast updates. You could update a secondary database, and then merge results during searching from the main and secondary databases.

Q.26 Does Isearch support soundex (or other phonetic matching)?

No, Nassib's code notwithstanding. Implementing it (if I remember what he said at the time) would require a rewrite of the way we do the indexing.

Q.27 Does Isearch support root/suffix matching, e.g. "child" should match "children" and, more so, "index" should match "indices"?

Isearch supports right truncation - "child*" will match child, children, childless, etc. It has to be explicitly requested, though - it doesn't do it by default. It also does not currently do left truncation or word stemming.
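For readers wondering what right truncation amounts to, here is a minimal Python sketch of prefix matching against a sorted word list. It is only an illustration of the technique; the function name is made up, and it says nothing about how Isearch stores or scans its own index.

```python
import bisect

def right_truncate(sorted_terms, pattern):
    """Return the terms matching `pattern`; a trailing * means prefix match."""
    if not pattern.endswith("*"):
        # No asterisk: exact match only, as truncation is not the default.
        i = bisect.bisect_left(sorted_terms, pattern)
        found = i < len(sorted_terms) and sorted_terms[i] == pattern
        return [pattern] if found else []
    prefix = pattern[:-1]
    # In a sorted list, all words sharing a prefix sit next to each other.
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches
```

Given the sorted words child, childless, children, chill, and dog, the pattern "child*" returns the first three, while plain "child" returns only "child". Left truncation would not benefit from the sort order in the same way, since the matching words are scattered through the list.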

Q.28 Does Isearch support synonym matching, e.g. "infant" should match "neonate"?

We've talked about this. In principle, it's not hard to implement but getting a good list of synonyms is hard. Roget's charges big bucks for their thesaurus. We'll probably implement this at some point with some small, discipline-specific lists to see how it goes.

Q.29 Does Isearch support similar concept matching, i.e. child should match baby, or even dog should match bitch, in the proper context (yeah, I know, getting pretty hairy).

No. Right now, Isearch does no real content analysis.

Q.30 How do I unsubscribe from the ISITE-L listserver?

In brief: send email to:
LISTSERV@LIST.NETPROVISIONS.COM

In the message body type:
SIGNOFF ISITE-L
