Quick How-To Guide For Java Indexer V1.5

Welcome, webmaster!

This document describes how to set up a Web site for indexing with JIndexer V1.5.

See the ExNet JIndexer noticeboard for updates and news on JIndexer as it becomes available.

The Reader

How to set up the part the visitor to your site sees.

Create a new directory near the top of your Web file tree, and call it (for example) ji15classes.
Copy either JIR1V5.TGZ (the gzipped, tarred, UNIX-format archive) or JIR1V5.ZIP (the Zip-format archive) into the ji15classes directory and unpack it. A variation on the following commands will do the job:
```
        gzcat < JIR1V5.TGZ | tar xvf -
    
```
or
```
        unzip JIR1V5.ZIP
    
```
You should now have the following files in the directory (plus the original JIR1V5 file):
```
        DHDArraySort.class
        DHDBitInputStream.class
        DHDBitOutputStream.class
        DHDGolombUtils.class
        DHDHuffmanCode.class
        DHDPreloadClasses.class
        DHDSortCompareInterface.class
        DHDStatusInterface.class
        ISearch.class
        ISearchAutoCue.class
        ISearchMatchItem.class
        ISearchMatchItemCompare.class
        ISearchStatus.class
        InvertedIndexBase.class
        InvertedIndexOneEntry.class
        InvertedIndexOneWordSummary.class
        InvertedIndexOneWordSummaryCompare.class
        InvertedIndexOneWordSummaryCompareCounts.class
        InvertedIndexURLCommon.class
        InvertedIndexURL.class
        Word.class
        WordChar.class
        WordParse.class
        WordText.class
        WordTextBuf.class
        help.html
        jind1o5.zip
    
```
They should all be made readable by your Web server process and any users that will look at the Web site through the filesystem. On UNIX it is usually sufficient to make them globally readable with a command such as:
```
        chmod 644 *
    
```
If all is present you can remove the local copy of the original JIR1V5 file.
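One way to confirm that all is present before deleting the archive is a small shell check along these lines. It is only a sketch: the function name check_reader_files is made up for this guide, and the file names are simply copied from the list above.

```sh
# Expected reader-side files, copied from the list above.
READER_FILES="
DHDArraySort.class DHDBitInputStream.class DHDBitOutputStream.class
DHDGolombUtils.class DHDHuffmanCode.class DHDPreloadClasses.class
DHDSortCompareInterface.class DHDStatusInterface.class
ISearch.class ISearchAutoCue.class ISearchMatchItem.class
ISearchMatchItemCompare.class ISearchStatus.class
InvertedIndexBase.class InvertedIndexOneEntry.class
InvertedIndexOneWordSummary.class InvertedIndexOneWordSummaryCompare.class
InvertedIndexOneWordSummaryCompareCounts.class
InvertedIndexURLCommon.class InvertedIndexURL.class
Word.class WordChar.class WordParse.class WordText.class WordTextBuf.class
help.html jind1o5.zip
"

# Print each expected file that is missing or unreadable in directory $1.
# Returns non-zero if anything is missing.
check_reader_files() {
    dir=$1
    status=0
    for f in $READER_FILES; do
        if [ ! -r "$dir/$f" ]; then
            echo "missing or unreadable: $f"
            status=1
        fi
    done
    return $status
}

# Usage (from inside the ji15classes directory):
#   check_reader_files . && echo "all present"
```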
Note that the .class files are copies of those in the jind1o5.zip file. Some browsers will be able to take advantage of the jind1o5.zip file and read all the classes at once for better performance across the Internet or a high-latency WAN; those that cannot do this can read the individual .class files.
IMPORTANT NOTE: The directory the class files are in is your applet's CODEBASE; if you are going to access and run the applet from a browser circa Netscape 2 or Netscape 3 via a filesystem rather than from an HTTP server, the index file you generate later will need to be placed in that directory or a subdirectory of it, else a security violation will be reported by the Java system and the applet will not run. If you are accessing the classes via an HTTP server, ie via a URL starting with http://, then the index file can appear anywhere on the same site.
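If you do intend to browse via the filesystem, the rule above can be checked mechanically before you decide where to write the index. The following is a minimal sketch; the function name and the example paths are hypothetical, and it assumes both paths are absolute with no ``.'' or ``..'' components.

```sh
# Succeed if path $2 lies in directory $1 or in a subdirectory of it.
# Assumes absolute paths without "." or ".." components.
is_under_codebase() {
    codebase=${1%/}             # tolerate a trailing slash on the directory
    case $2 in
        "$codebase"/*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Usage (hypothetical paths):
#   is_under_codebase /home/www/ji15classes /home/www/ji15classes/index/output.dat \
#       && echo "safe for filesystem access"
```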
You will need to generate an index from your HTML documents and add a new HTML file that loads the Java applet and refers to your index. Both steps are described below.

Generating the Index

First you will need Sun's Java JDK1.0.2 or equivalent installed on your system; this version of JIndexer is not guaranteed to run with other versions. This version of the toolkit/environment is supported by all the major browsers that support Java as of this writing (Oct 1997), eg: Netscape 2, Netscape 3, Internet Explorer 3.
I will assume from here on in that the java command to run the Java interpreter is in your path.
In the directory in which you intend to work, unpack either the JIB1V5.ZIP or the JIB1V5.TGZ archive as you did for the reader side above.
You should find yourself with at least the following files:
```
        jbld1o5.zip
        jind1o5.zip
    
```
These are the build-classes archive and the read-classes archive respectively. The latter is a copy of the file you installed for the read side earlier. You do not need to unpack these files any further.
The builder-side classes are not free and you may not distribute them to other people. In particular, do NOT put them up on your Web site.
You should add these archive files to the `classpath' through which the java command searches for classes. You might do this by setting the CLASSPATH environment variable. In the UNIX C-shell (csh) you might do this with the command:
```
        setenv CLASSPATH jbld1o5.zip:jind1o5.zip
    
```
or for the Bourne shell (sh) with:
```
        CLASSPATH=jbld1o5.zip:jind1o5.zip
        export CLASSPATH
    
```
You can now run the JIndexer tool with the command:
```
        java JIndexer
    
```
This will print out the command-line arguments for JIndexer; a page or so of text.

To get the hang of the indexer, copy a few (2--10) HTML files into the current directory. The names of the files should end .html or .htm (uppercase or lowercase is not important). You might also want to add a couple of plain-text files (with names ending .txt or called readme, again not case-sensitive).

Do a simple run of the indexer on the files in the current directory by typing:

        java JIndexer -verbose Quick s output.dat ./ .

which:

- runs the indexer in verbose mode (if you put the -verbose before the JIndexer token you will run the Java interpreter in verbose mode instead), and
- builds a `Quick' index.
- The output format is ``s'' (for SIMPLE-SMALL format, the only one that the supplied reader can handle).
- The output file, which is in a highly-compressed binary format, is written to the file output.dat.
- The first ``./'' tells JIndexer that all filenames will be relative to the current directory. This always has to be a directory name, and should end in ``/'' for UNIX systems, and ``\'' for Microsoft Windows systems. This argument is the common prefix or top directory. This part of the filename is not stored in the index; you give the applet a different prefix to prepend to names in the index to make a full URL.
- The ``.'' says `index the directory . relative to the top directory', ie everything that looks like a suitable file starting in the top directory. Instead of this ``.'' you can supply one or more files or subdirectories to be indexed; any files supplied this way will be indexed (as if plain text if JIndexer cannot work out what type they actually are), and any subdirectories will be recursively explored.
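To make the role of the top directory concrete, here is a small sketch of how a filesystem path turns into an index name and then into a URL. The function names, the paths and the www.example.com prefix are all hypothetical, and the sketch assumes the stored name is simply what is left of the path once the top directory has been removed, as the Dump listings in this guide suggest.

```sh
# Strip the top-directory prefix $1 from the path $2, giving the name
# relative to the top directory.
index_name() {
    printf '%s\n' "${2#"$1"}"
}

# Prepend the URL prefix $1 (as given to the applet) to the relative name $2.
full_url() {
    printf '%s%s\n' "$1" "$2"
}

name=$(index_name /home/www/docs/ /home/www/docs/products/list.html)
echo "$name"                                  # prints products/list.html
full_url http://www.example.com/ "$name"      # prints http://www.example.com/products/list.html
```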

JIndexer will then recursively descend from the current directory, indexing any files it thinks are HTML or plain-text files.

Be aware that on UNIX systems, JIndexer follows symbolic links, so if you put a loop in your directory structure with such links, JIndexer will get stuck. If your structure is like this, use a tool such as find to generate a list of regular files to be indexed, and pass that list to JIndexer in place of the last ``.''.
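A sketch of the find approach follows. The function name is made up here, the name patterns are only a lowercase approximation of JIndexer's defaults, and file names containing spaces will not survive being passed on the command line this way.

```sh
# List regular HTML and text files below directory $1, one per line,
# relative to $1. By default find does not follow symbolic links, and
# -type f skips the links themselves, so loops are never entered.
list_index_files() {
    ( cd "$1" && find . -type f \
        \( -name '*.html' -o -name '*.htm' -o -name '*.txt' \) -print |
        sed 's|^\./||' )
}

# Usage (run from the top directory; not executed here):
#   java JIndexer -verbose Quick s output.dat ./ `list_index_files .`
```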

If all this works you will get output a little like this:

        JIndexer V1.5.
        VERBOSE MODE
        Creating SIMPLE-SMALL-format index.
        Processing 1 specified seed files...
        Total files found: 8
         Processing [HTML] VOTE-0to3-0.html... [new words|docs: 56|1]
         Processing [HTML] VOTE-0to3-1.html... [new words|docs: 25|1]
         Processing [HTML] VOTE-0to3-2.html... [new words|docs: 24|1]
         Processing [HTML] VOTE-0to3-3.html... [new words|docs: 12|1]
         Processing [HTML] icons.html... [new words|docs: 154|1]
         Processing [HTML] index.html... [new words|docs: 178|1]
         Processing [HTML] quick-how-to.html... [new words|docs: 142|1]
         Processing [HTML] terms.html... [new words|docs: 60|1]
        Compacting...
        Initial lexicon of 651 words found in 8 files---8 document fragments.
        Applying lexicon filters...
        Lexicon of 651 words found in 8 files---8 document fragments.
         Document names to be saved: 8
         Document names total length: 111
         Residue after front coding: 80
         Document zero-order residue encoding: count of all symbols: 80
         Document zero-order residue encoding: highest count: 11
         Document zero-order residue encoding: symbol number: 116
         Document zero-order residue encoding: probability: 0.1375
         Document zero-order residue encoding: alphabet size: 27
         Document zero-order residue encoding: entropy (mean bits/symbol): 4.15802
         Document zero-order residue encoding: not all symbols coded: true
         Document zero-order residue encoding: shortest non-zero code: 3
         Document zero-order residue encoding: longest non-zero code: 7
         Document zero-order residue encoding: average bits per symbol: 4.2125
         Document zero-order residue encoding: bits without encoding: 640
         Document zero-order residue encoding: bits after encoding: 337
         Document zero-order residue encoding: bits for Huffman lengths in file: 176
         Document zero-order residue encoding: bits saved by encoding: 127
         Will use Huffman code for document-name residue.
         Lexicon entries: 651
         Total lexicon characters: 3727
         Lexicon residue after front coding: 2252
         Lexicon zero-order residue encoding: count of all symbols: 2252
         Lexicon zero-order residue encoding: highest count: 287
         Lexicon zero-order residue encoding: symbol number: 14
         Lexicon zero-order residue encoding: probability: 0.127442
         Lexicon zero-order residue encoding: alphabet size: 36
         Lexicon zero-order residue encoding: entropy (mean bits/symbol): 4.42358
         Lexicon zero-order residue encoding: not all symbols coded: false
         Lexicon zero-order residue encoding: shortest non-zero code: 3
         Lexicon zero-order residue encoding: longest non-zero code: 10
         Lexicon zero-order residue encoding: average bits per symbol: 4.44805
         Lexicon zero-order residue encoding: bits without encoding: 13512
         Lexicon zero-order residue encoding: bits after encoding: 10017
         Lexicon zero-order residue encoding: bits for Huffman lengths in file: 168
         Lexicon zero-order residue encoding: bits saved by encoding: 3327
         Will use Huffman code for lexicon residue.
         Pointers saved in index: 1069
        Done.

You can see the files being processed, how many unique words were found (the lexicon), and then some detail about how the index is being encoded. The next-to-last line (Pointers saved in index) says how many different words in how many different documents were saved, and these are essentially the things a user is searching for with the applet.
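Some of the figures in the log can be reproduced by hand. The sketch below assumes plain `shared prefix' front coding, where each name in sorted order keeps only the characters that differ from the start of the previous name. JIndexer's actual encoding is not documented in this guide, but this assumption does reproduce the ``Document names total length'' and ``Residue after front coding'' lines for the eight files above. The function name is made up for illustration.

```sh
# Read sorted names on standard input; print "<total length> <residue>",
# where the residue counts only the characters of each name left over
# after the prefix it shares with the previous name.
front_coding_sizes() {
    awk '
    {
        n = length($0); k = 0
        while (k < n && k < length(prev) &&
               substr($0, k + 1, 1) == substr(prev, k + 1, 1))
            k++
        total += n
        residue += n - k
        prev = $0
    }
    END { print total + 0, residue + 0 }'
}

printf '%s\n' VOTE-0to3-0.html VOTE-0to3-1.html VOTE-0to3-2.html \
    VOTE-0to3-3.html icons.html index.html quick-how-to.html terms.html |
    front_coding_sizes              # prints "111 80", as in the log

# Bits saved = bits without encoding - bits after encoding
#              - bits for the Huffman lengths stored in the file.
echo $((640 - 337 - 176))           # document names: prints 127
echo $((13512 - 10017 - 168))       # lexicon: prints 3327
```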

(JIndexer has a built-in default that recognises files from the last component of their names as described above, to know how to tune the indexing process, eg to discard HTML tags for HTML documents. JIndexer by default also ignores any file or directory whose name starts with a dot (``.'') or is ``SCCS'' or ``RCS'' or ``CVS'', which means you can make files and directories private (ignored by the indexer) if they start with a dot or are archive files for one of the popular source-code-control tools.)
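The default rules just described can be mimicked in a few lines of shell, which may help when predicting what JIndexer will do with a given file. This is only a sketch of the rules as stated in this guide, not JIndexer's own code; the function name and the UNKNOWN label are made up, and it looks only at the last component of the name.

```sh
# Print REJECT, HTML, PLAIN or UNKNOWN for the last component of name $1.
classify_name() {
    base=${1##*/}
    case $base in
        .* | SCCS | RCS | CVS) echo REJECT; return ;;
    esac
    # HTML and plain-text names are matched without regard to case.
    lower=$(printf '%s' "$base" | tr 'A-Z' 'a-z')
    case $lower in
        *.html | *.htm) echo HTML ;;
        *.txt | readme) echo PLAIN ;;
        *)              echo UNKNOWN ;;
    esac
}

classify_name docs/INDEX.HTM    # prints HTML
classify_name docs/ReadMe       # prints PLAIN
classify_name docs/.private     # prints REJECT
```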

Note that before breaking text into words, JIndexer condenses (folds) all characters into one of the lowercase ASCII letters or digits, or a single non-word character, and very long words (or somewhat shorter digit sequences) are chopped up into smaller pieces. Characters from the ISO-Latin-1 alphabet above character 127 are folded into reasonable equivalents, eg all the accented ``a''s are folded into a plain ``a''. The same process is applied to text typed into the search applet, so the process should be largely transparent.
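A rough feel for the folding can be had with tr. This is a deliberately simplified sketch: it handles ASCII only, does not map accented ISO-Latin-1 letters to their plain equivalents, and does not chop long words, all of which JIndexer does. The function name is made up.

```sh
# Fold text on standard input: lowercase the ASCII letters, then squeeze
# every run of other characters into a single space.
fold_text() {
    LC_ALL=C tr 'A-Z' 'a-z' | LC_ALL=C tr -cs 'a-z0-9' ' '
}

printf 'The Z80A, again!' | fold_text    # prints "the z80a again " (trailing space, no newline)
```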
It is possible to apply a small sanity check to the index produced using the Dump command, eg:
```
        % java JIndexer Dump output.dat
        JIndexer V1.5.
        DUMP of output.dat
          Number of docs: 8
            VOTE-0to3-0.html
              ...
            terms.html
          Lexicon size:   651
            InvertedIndexOneWordSummary("0",3,3)
              ...
            InvertedIndexOneWordSummary("zip",1,1)
        %
    
```
We see that the index in output.dat contains 8 documents from VOTE-0to3-0.html to terms.html, and 651 unique words ranging from ``0'' to ``zip'' (appearing in 3 documents and 1 document respectively). Though the Dump command is intended mainly for the benefit of the JIndexer developer, you may find it useful.
In the above example, we used the Quick mode of indexing, which treats HTML and text files as single entities. If the user makes a selection through the applet, the browser is directed to the top of the relevant document.
In QuickFrag mode, HTML documents are split up at the A NAME anchor tags, and the browser is directed to the closest available anchor before the text the user is interested in. Especially in long documents with lots of structure, this considerably speeds up the process of finding the item of interest.
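Before choosing QuickFrag it can be useful to see which anchors a page actually offers. The sketch below lists them in the file#anchor form that appears in index listings. It is an approximation, not JIndexer's parser: the function name is made up, and it only recognises tags where NAME is the first attribute and its value is in double quotes.

```sh
# Print "<file>#<name>" for each <A NAME="..."> anchor in HTML file $1.
list_anchors() {
    # Put each tag on its own line, then pick out the anchor names.
    tr '>' '\n' < "$1" |
        sed -n 's/.*<[Aa][[:space:]][[:space:]]*[Nn][Aa][Mm][Ee]="\([^"]*\)".*/\1/p' |
        while read -r name; do
            printf '%s#%s\n' "$1" "$name"
        done
}

# Usage:
#   list_anchors index.html
```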
The other main variation you may wish to make in index processing is to filter certain words from the lexicon to slim down the size of the index.
The argument that was just ``s'' for SIMPLE-SMALL output format above can be prefixed with a whole pipeline of filters. The output format is considered a type of filtering since different index formats record different subsets of the full index data.
Stages in the filter chain are separated with ``-'' (dash) characters. The parts of each filter component are separated with ``:'' (colon) characters, the first such part being the name of the filter.
The supported filters and brief summaries of their use are:
- IN, if present, must be first. If it is not specified, the default is ``IN:noempty:dup:tiny'', which means don't keep empty documents (or sections when using QuickFrag), do preserve duplicate documents (that contain the same text), and do preserve tiny documents or sections containing only a few words.
  To make a smaller index, at the risk of losing a little information, you might use the filter IN:noempty:nodup:notiny to trim out duplicate and tiny documents/sections.
- SINGLETONS is used to drop words which only appear once throughout all the input files, to drop typos, spelling errors, and random junk from items such as mail IDs. If an additional numeric parameter is supplied then only singletons longer than that are dropped, eg SINGLETONS:22 will only drop singleton words longer than 22 characters, which is virtually only random junk, no real words in English text.
- STOPWORDS:nn is used to stop words that appear in more than nn% of documents; in fact words that appear in many documents have their entries highly compressed, so you need not apply this filter except to remove very common junk or trim the index a little for space purposes.
So, a very condensed index (for fast loading) might be built with Quick and a filter chain such as IN:noempty:nodup:notiny-STOPWORDS:90-SINGLETONS-s, and a much more comprehensive index would be built with QuickFrag and have a filter chain such as IN:noempty:nodup-STOPWORDS:99-SINGLETONS:22-s. But plain, ordinary ``s'' will do a fair job for most source text, so don't worry unduly. For the filters currently provided with JIndexer, apart from the IN filter and the final index format, the order is not important.
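If a long filter chain becomes hard to read, it can be pulled apart with the shell using the separators described above. The following is a small sketch (the function name is made up) that prints one stage per line, the filter name first and then its parts.

```sh
# Print one line per stage of the filter chain $1: the stage name
# followed by its parts, separated by spaces.
# The body runs in a subshell, so IFS and "set -f" do not leak out.
show_filter_chain() (
    set -f                  # no filename expansion while splitting
    IFS=-
    for stage in $1; do
        IFS=:
        set -- $stage       # $1 is the filter name, the rest are its parts
        IFS=' '
        printf '%s\n' "$*"
    done
)

show_filter_chain IN:noempty:nodup-STOPWORDS:99-SINGLETONS:22-s
# prints:
#   IN noempty nodup
#   STOPWORDS 99
#   SINGLETONS 22
#   s
```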
You may not like JIndexer's default strategy for deciding which files to ignore, which to regard as HTML, which as plain text, etc.
There is an optional `matchpattern' parameter just before the output filename, whose default is equivalent to ``-REJECT-:.*:SCCS:RCS:CVS;HTML:*.htm:*.html;PLAIN:*.txt:readme''. The syntax is described in the output of:
```
        java JIndexer
    
```
Here is an example of me generating an index for a Web site I manage, the DHD Photo Gallery.
The index is to be detailed, since I want every significant word available for searching on, so a filter chain such as IN:noempty:nodup-STOPWORDS:99-SINGLETONS:22-s is probably fine.
Because the pages on the site are broken up into regular sections with anchors, I will use QuickFrag rather than just Quick. (In fact, the anchors are put there precisely to help JIndexer take the user as close as possible to the selected text.)
Note that the Photo Gallery is only part of the site (/Damon/photos/) and that there could be many independent overlapping or separate indexes of different parts of the same site. Also note the complex pathname to get to the files through the filesystem, all but the last parts of which are omitted from the final index. And indeed the same areas can be indexed with different levels of comprehensiveness using different filter chains and with Quick/QuickFrag.
I want to ignore files and directories beginning with ``.'', and match only those files ending in ``.html'' and regard them as HTML. For this I will use a matchpattern ``-REJECT-:.*;HTML:*.html''. Because this contains shell `metacharacters' that might be processed specially by the shell, I put it in double quotes on the command line.
The command to build the index file fullindex.dat is (assuming java is in the path and the CLASSPATH has been set appropriately):
```
        java JIndexer -verbose QuickFrag IN:noempty:nodup-STOPWORDS:99-SINGLETONS:22-s "-REJECT-:.*;HTML:*.html" fullindex.dat /ro/docs-public.s0.l/www.hd.org/Damon/photos/ .
    
```
The filter settings I use do not actually result in any singletons or stop-words being dropped, so I could eliminate them if I wanted.
The index can be checked for sanity with the Dump command, viz:
```
        java JIndexer Dump fullindex.dat
    
```
which in this case yields the output:
```
        JIndexer V1.5.
        DUMP of fullindex.dat
          Number of docs: 100
            .how-to-build.html
              ...
            textures/index.html#end-of-STRIP12
          Lexicon size:   1504
            InvertedIndexOneWordSummary("0",8,8)
              ...
            InvertedIndexOneWordSummary("z80a",1,1)
    
```
ie 100 document fragments containing 1504 unique words ranging from ``0'' to ``z80a'' (which was ``Z80A'' in the input text, before folding).
The easiest, if not necessarily the most elegant, thing to do with the index file is to copy it into the directory where the classes live (or slightly better, a subdirectory), to help avoid problems from Java's security restrictions on applets.
This index generation can be done regularly, say once per week from cron on UNIX systems, keeping the index fresh.
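A regular rebuild might be wrapped up as below. This is a sketch under several assumptions that this guide does not confirm: that JIndexer exits with a non-zero status when it fails, that CLASSPATH is set in the environment cron provides, and that the temporary file is on the same filesystem as the final one so that mv replaces it in one step. The function name, the JAVA override variable and all paths are hypothetical.

```sh
# Rebuild an index and move it into place only if the build succeeds,
# so visitors never load a half-written index.
#   $1  top directory of the documents (ending in "/")
#   $2  final location of the index file
# JAVA may be set to override the interpreter command.
rebuild_index() {
    tmp=$2.new
    ${JAVA:-java} JIndexer QuickFrag s "$tmp" "$1" . && mv "$tmp" "$2"
}

# Example call (hypothetical paths):
#   rebuild_index /home/www/docs/ /home/www/docs/ji15classes/index/site.dat
#
# Example crontab entry running a script that makes that call
# at 03:00 every Sunday (hypothetical script path):
#   0 3 * * 0 /home/www/bin/rebuild-index.sh
```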

Using the Applet

We are now on the final stage, embedding the applet in an HTML page to use the index.

You may wish to create a new HTML file to contain the applet, ie create a separate search page. For people with very slow links it may take as much as a minute to load the applet the first time they visit the search page, and as long again to load the index (depending on its size). Thus, you may not wish to have people spend that time unless they have made the effort to do a search. (Future versions of JIndexer will try to trim load time further.)

Here is the code fragment that loads the example index built above:

        <CENTER>
        Interactively search the gallery pages by word with the Java applet below.
        <APPLET CODEBASE="http://www.hd.org/ji15classes" CODE=ISearch ARCHIVE="jind1o5.zip" WIDTH=450 HEIGHT=300 ALT="Java Search Tool">
        <PARAM NAME="KEY" VALUE="http://www.hd.org/Damon/photos/fullindex.dat">
        <PARAM NAME="ROOT" VALUE="http://www.hd.org/Damon/photos/">
        <PARAM NAME="TITLE" VALUE="Word Search">
        <PARAM NAME="AUTOCUE" VALUE="Search for keywords#in the image descriptions#Press the HELP button#for more information">
        Your browser can't see the live Java search tool embedded here.
        </APPLET>
        </CENTER>

The CODE value is always ISearch for this applet.

The KEY parameter is the URL of the index.
The ROOT parameter is the partial URL to put in front of the partial filenames recorded in the index to make the full URL of each document to be looked up.
The TITLE gives an alternative title to display in the applet.
The AUTOCUE string consists of a series of `#'-separated messages to be displayed in order to the user if the applet is idle and the search box empty, to prompt the user into action.
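To preview the individual messages in an AUTOCUE value before putting it in the page, the string can be split at the `#' characters. A minimal sketch, with a made-up function name:

```sh
# Print each '#'-separated message of the AUTOCUE value $1 on its own line.
# The body runs in a subshell, so the IFS change does not leak out.
split_autocue() (
    set -f                  # no filename expansion while splitting
    IFS='#'
    for msg in $1; do
        printf '%s\n' "$msg"
    done
)

split_autocue "Search for keywords#in the image descriptions#Press the HELP button#for more information"
```

Run as shown, this prints the four messages from the example above, one per line and in order.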

Basic Terms and Conditions of Use

You may not disassemble the classes nor distribute them to third parties.
You may place the reader-side classes on any site you manage, and run JIndexer anywhere you need to to generate the indices for the reader side for sites you manage.
You accept that the software is supplied as-is, and ExNet Ltd and Damon Hart-Davis (the authors/suppliers) accept no liability for loss of any sort arising from use of this software beyond the purchase price of the software as supplied by us.

ExNet's home page.
Sales queries to info@exnet.com, technical queries to sysadmin@exnet.com.
All code and documentation copyright DHD/EL 1995--1997.
Some of the words in this document are trademarks of their owners.
[1.39 97/10/20]