This document describes how to set up a Web site for indexing with JIndexer V1.5.
See the ExNet JIndexer noticeboard for updates and news on JIndexer as it becomes available.
gzcat < JIR1V5.TGZ | tar xvf -
or
unzip JIR1V5.ZIP
You should now have the following files in the directory (plus the original JIR1V5 file):
DHDArraySort.class
DHDBitInputStream.class
DHDBitOutputStream.class
DHDGolombUtils.class
DHDHuffmanCode.class
DHDPreloadClasses.class
DHDSortCompareInterface.class
DHDStatusInterface.class
ISearch.class
ISearchAutoCue.class
ISearchMatchItem.class
ISearchMatchItemCompare.class
ISearchStatus.class
InvertedIndexBase.class
InvertedIndexOneEntry.class
InvertedIndexOneWordSummary.class
InvertedIndexOneWordSummaryCompare.class
InvertedIndexOneWordSummaryCompareCounts.class
InvertedIndexURLCommon.class
InvertedIndexURL.class
Word.class
WordChar.class
WordParse.class
WordText.class
WordTextBuf.class
help.html
jind1o5.zip
They should all be made readable by your Web server process and by any users who will look at the Web site through the filesystem. On UNIX it is usually sufficient to make them globally readable with a command such as:
chmod 644 *
If all the files are present you can remove the local copy of the original JIR1V5 file.
Note that the .class files are copies of those in the jind1o5.zip file. Some browsers will be able to take advantage of the jind1o5.zip file and read all the classes at once for better performance across the Internet or a high-latency WAN; those that cannot do this can read the individual .class files.
IMPORTANT NOTE: The directory the class files are in is your applet's CODEBASE; if you are going to access and run the applet from a browser circa Netscape 2 or Netscape 3 via a filesystem rather than from an HTTP server, the index file you generate later will need to be placed in that directory or a subdirectory of it, else a security violation will be reported by the Java system and the applet will not run. If you are accessing the classes via an HTTP server, ie via a URL starting with http://, then the index file can appear anywhere on the same site.
I will assume from here on in that the java command to run the Java interpreter is in your path.
You should find yourself with at least the following files:
jbld1o5.zip jind1o5.zip
These are the build-classes archive and the read-classes archive respectively. The latter is a copy of the file you installed for the read side earlier. You do not need to unpack these files any further.
The builder-side classes are not free and you may not distribute them to other people. In particular, do NOT put them up on your Web site.
setenv CLASSPATH jbld1o5.zip:jind1o5.zip
or for the Bourne shell (sh) with:
CLASSPATH=jbld1o5.zip:jind1o5.zip
export CLASSPATH
java JIndexer
This will print out the command-line arguments for JIndexer; a page or so of text.
Do a simple run of the indexer on the files in the current directory by typing:
java JIndexer -verbose Quick s output.dat ./ .
which:
Be aware that on UNIX systems, JIndexer follows symbolic links, so if you put a loop in your directory structure with such links, JIndexer will get stuck. If your structure is like this, use a tool such as find to generate a list of regular files to be indexed, and pass that list to JIndexer in place of the last ``.''.
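One way to build such a list is sketched below. It is only an illustration: the helper name list_html_files and the restriction to names ending in .html are assumptions made for the example, and it relies on find not following symbolic links unless asked to.

```shell
# Hypothetical helper: list regular files under a directory without
# following symbolic links, so a link loop cannot trap the traversal.
list_html_files() {
    # find does not follow symlinks unless -L or -follow is given;
    # -type f keeps regular files only, so the links themselves are skipped.
    find "$1" -type f -name '*.html' -print | sort
}

# Possible use, with the list taking the place of the final ".":
#   java JIndexer -verbose Quick s output.dat ./ `list_html_files .`
```

Filenames containing spaces would be split by the shell when the list is substituted this way, so this sketch is only suitable for trees with simple filenames.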
If all this works you will get output a little like this:
JIndexer V1.5. VERBOSE MODE
Creating SIMPLE-SMALL-format index.
Processing 1 specified seed files...
Total files found: 8
Processing [HTML] VOTE-0to3-0.html... [new words|docs: 56|1]
Processing [HTML] VOTE-0to3-1.html... [new words|docs: 25|1]
Processing [HTML] VOTE-0to3-2.html... [new words|docs: 24|1]
Processing [HTML] VOTE-0to3-3.html... [new words|docs: 12|1]
Processing [HTML] icons.html... [new words|docs: 154|1]
Processing [HTML] index.html... [new words|docs: 178|1]
Processing [HTML] quick-how-to.html... [new words|docs: 142|1]
Processing [HTML] terms.html... [new words|docs: 60|1]
Compacting...
Initial lexicon of 651 words found in 8 files---8 document fragments.
Applying lexicon filters...
Lexicon of 651 words found in 8 files---8 document fragments.
Document names to be saved: 8
Document names total length: 111
Residue after front coding: 80
Document zero-order residue encoding: count of all symbols: 80
Document zero-order residue encoding: highest count: 11
Document zero-order residue encoding: symbol number: 116
Document zero-order residue encoding: probability: 0.1375
Document zero-order residue encoding: alphabet size: 27
Document zero-order residue encoding: entropy (mean bits/symbol): 4.15802
Document zero-order residue encoding: not all symbols coded: true
Document zero-order residue encoding: shortest non-zero code: 3
Document zero-order residue encoding: longest non-zero code: 7
Document zero-order residue encoding: average bits per symbol: 4.2125
Document zero-order residue encoding: bits without encoding: 640
Document zero-order residue encoding: bits after encoding: 337
Document zero-order residue encoding: bits for Huffman lengths in file: 176
Document zero-order residue encoding: bits saved by encoding: 127
Will use Huffman code for document-name residue.
Lexicon entries: 651
Total lexicon characters: 3727
Lexicon residue after front coding: 2252
Lexicon zero-order residue encoding: count of all symbols: 2252
Lexicon zero-order residue encoding: highest count: 287
Lexicon zero-order residue encoding: symbol number: 14
Lexicon zero-order residue encoding: probability: 0.127442
Lexicon zero-order residue encoding: alphabet size: 36
Lexicon zero-order residue encoding: entropy (mean bits/symbol): 4.42358
Lexicon zero-order residue encoding: not all symbols coded: false
Lexicon zero-order residue encoding: shortest non-zero code: 3
Lexicon zero-order residue encoding: longest non-zero code: 10
Lexicon zero-order residue encoding: average bits per symbol: 4.44805
Lexicon zero-order residue encoding: bits without encoding: 13512
Lexicon zero-order residue encoding: bits after encoding: 10017
Lexicon zero-order residue encoding: bits for Huffman lengths in file: 168
Lexicon zero-order residue encoding: bits saved by encoding: 3327
Will use Huffman code for lexicon residue.
Pointers saved in index: 1069
Done.
You can see the files being processed, how many unique words were found (the lexicon), and then some detail about how the index is being encoded. The next-to-last line (Pointers saved in index) says how many different words in how many different documents were saved, and these are essentially the things a user is searching for with the applet.
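The encoding figures in the sample log appear to be related by simple arithmetic. The relation below is inferred from the numbers in the log alone, not taken from any JIndexer documentation, so treat it as an assumption; the function name bits_saved is made up for the example.

```shell
# Inferred relation (an assumption, checked only against the sample log):
#   bits saved = bits without encoding
#                - bits after encoding
#                - bits for Huffman lengths in file
bits_saved() {
    echo $(( $1 - $2 - $3 ))
}

bits_saved 640 337 176       # document-name residue figures from the log
bits_saved 13512 10017 168   # lexicon residue figures from the log
```

The two results, 127 and 3327, agree with the ``bits saved by encoding'' lines in the log.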
(JIndexer has a built-in default that recognises files from the last component of their names as described above, to know how to tune the indexing process, eg to discard HTML tags for HTML documents. JIndexer by default also ignores any file or directory whose name starts with a dot (``.'') or is ``SCCS'' or ``RCS'' or ``CVS'', which means you can make files and directories private (ignored by the indexer) if they start with a dot or are archive files for one of the popular source-code-control tools.)
% java JIndexer Dump output.dat
JIndexer V1.5. DUMP of output.dat
Number of docs: 8
VOTE-0to3-0.html
...
terms.html
Lexicon size: 651
InvertedIndexOneWordSummary("0",3,3)
...
InvertedIndexOneWordSummary("zip",1,1)
%
we see that the index in output.dat contains 8 documents from VOTE-0to3-0.html to terms.html, and 651 unique words ranging from ``0'' to ``zip'' (respectively appearing in 3 and 1 documents). Though the Dump command is intended mainly for the benefit of the JIndexer developer, you may find it useful.
In QuickFrag mode, HTML documents are split up at the A NAME anchor tags, and the user's browser is directed to the closest available anchor before the text of interest. Especially in long documents with lots of structure, this considerably speeds the process of finding the item of interest.
The argument that was just ``s'' (for SIMPLE-SMALL output format) above can be prefixed with a whole pipeline of filters. The output format is considered a type of filtering since different index formats record different subsets of the full index data.
Separate stages in the filter are separated with ``-'' (dash) characters. The parts of each filter component are separated with ``:'' (colon) characters, the first such part being the name of the filter.
The supported filters and brief summaries of their use are:
To make a smaller index, at the risk of losing a little information, you might use the filter IN:noempty:nodup:notiny to trim out duplicate and tiny documents/sections.
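The separator rules for filter chains (stages split at ``-'', parts of a stage split at ``:'', first part being the filter name) can be sketched in a few lines of Bourne shell. This only illustrates the syntax as described here and is not JIndexer's own parser; the function name describe_chain is hypothetical.

```shell
# Hypothetical helper: print each stage of a filter chain as
#   NAME [param param ...]
describe_chain() {
    chain=$1
    saved_ifs=$IFS
    set -f                      # no filename globbing while splitting
    IFS=-
    for stage in $chain; do     # stages are separated by "-"
        IFS=:
        set -- $stage           # parts are separated by ":"
        name=$1                 # the first part names the filter
        shift
        IFS=' '
        printf '%s [%s]\n' "$name" "$*"
    done
    set +f
    IFS=$saved_ifs
}

describe_chain 'IN:noempty:nodup-STOPWORDS:99-SINGLETONS:22-s'
```

For the chain shown, this prints four stages: IN with two parameters, STOPWORDS and SINGLETONS with one each, and the final output-format stage ``s'' with none.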
There is an optional `matchpattern' parameter just before the output filename, whose default is equivalent to ``-REJECT-:.*:SCCS:RCS:CVS;HTML:*.htm:*.html;PLAIN:*.txt:readme''. The syntax is described in the output of:
java JIndexer
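As a rough illustration of how such a matchpattern might be read, the sketch below classifies a filename against the default string. It rests on several assumptions that only JIndexer's own usage text can confirm: that groups are separated by ``;'', that each group is a type name followed by ``:''-separated shell-style patterns, that patterns are compared with the last component of the filename, and that the first matching group wins. The function name classify_file and the UNMATCHED result are invented for the example.

```shell
# Hypothetical classifier for a matchpattern of the assumed form
#   TYPE:glob:glob;TYPE:glob...
classify_file() {
    pattern=$1
    name=${2##*/}               # last path component only
    result=UNMATCHED
    saved_ifs=$IFS
    set -f                      # keep the globs literal while splitting
    IFS=';'
    for group in $pattern; do   # groups are separated by ";"
        IFS=:
        set -- $group           # $1 = type name, the rest = patterns
        kind=$1
        shift
        for glob in "$@"; do
            case $name in
                $glob) result=$kind; break 2 ;;   # assumed: first match wins
            esac
        done
    done
    set +f
    IFS=$saved_ifs
    printf '%s\n' "$result"
}

default='-REJECT-:.*:SCCS:RCS:CVS;HTML:*.htm:*.html;PLAIN:*.txt:readme'
classify_file "$default" docs/index.html
classify_file "$default" .private.html
```

Under these assumptions the first call reports HTML and the second reports -REJECT-, since the leading dot is matched by the first group before the HTML group is consulted.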
The index is to be detailed, since I want every significant word available for searching on, so a filter chain such as IN:noempty:nodup-STOPWORDS:99-SINGLETONS:22-s is probably fine.
Because the pages on the site are broken up into regular sections with anchors, I will use QuickFrag rather than just Quick. (In fact, the anchors are put there precisely to help JIndexer take the user as close as possible to the selected text.)
Note that the Photo Gallery is only part of the site (/Damon/photos/) and that there could be many independent indexes, overlapping or separate, of different parts of the same site. Also note the complex pathname used to reach the files through the filesystem, all but the last part of which is omitted from the final index. Indeed, the same areas can be indexed with different levels of comprehensiveness using different filter chains and with Quick or QuickFrag.
I want to ignore files and directories beginning with ``.'', and match only those files ending in ``.html'' and regard them as HTML. For this I will use a matchpattern ``-REJECT-:.*;HTML:*.html''. Because this contains shell `metacharacters' that might be processed specially by the shell, I put it in double quotes on the command line.
The command to build the index file fullindex.dat is (assuming java is in the path and the CLASSPATH has been set appropriately):
java JIndexer -verbose QuickFrag IN:noempty:nodup-STOPWORDS:99-SINGLETONS:22-s "-REJECT-:.*;HTML:*.html" fullindex.dat /ro/docs-public.s0.l/www.hd.org/Damon/photos/ .
The filter settings I use do not actually result in any singletons or stop-words being dropped, so I could eliminate them if I wanted.
The index can be checked for sanity with the Dump command, viz:
java JIndexer Dump fullindex.dat
which in this case yields the output:
JIndexer V1.5. DUMP of fullindex.dat
Number of docs: 100
.how-to-build.html
...
textures/index.html#end-of-STRIP12
Lexicon size: 1504
InvertedIndexOneWordSummary("0",8,8)
...
InvertedIndexOneWordSummary("z80a",1,1)
ie 100 document fragments containing 1504 unique words ranging from ``0'' to ``z80a'' (which was ``Z80A'' in the input text, before folding).
<CENTER>
Interactively search the gallery pages by word with the Java applet below.
<APPLET CODEBASE="http://www.hd.org/ji15classes" CODE=ISearch ARCHIVE="jind1o5.zip" WIDTH=450 HEIGHT=300 ALT="Java Search Tool">
<PARAM NAME="KEY" VALUE="http://www.hd.org/Damon/photos/fullindex.dat">
<PARAM NAME="ROOT" VALUE="http://www.hd.org/Damon/photos/">
<PARAM NAME="TITLE" VALUE="Word Search">
<PARAM NAME="AUTOCUE" VALUE="Search for keywords#in the image descriptions#Press the HELP button#for more information">
Your browser can't see the live Java search tool embedded here.
</APPLET>
</CENTER>
The CODE value is always ISearch for this applet.