Re: [Air-l] neat indexing system

28 Feb 2002

actually this is the way the search engine we developed at cddc works, 
so i find it interesting to see another one just like it.
the method is simple:

take any text
strip html
put the text into a row of table(1) with an index number (TID)
parse the text into words
put each of those words into its own row of a new table(2), along with the TID
you can build an index from table(2) easily
	hint:
	you get the unique words with select distinct (select unique in some sql dialects)
	you look words up with select ... like
then you parse the text into html, linking each word to its entry in the 
index, and insert that into a column in table(1)
then you can display the text as a hyperlinked index
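
a minimal sketch of the steps above, in python with sqlite3. this is not 
the cddc code; the table names (texts, words), the column names, the 
/index?word= link target and the regexes are all assumptions made for 
illustration.

```python
import re
import sqlite3
from html import escape, unescape
from urllib.parse import quote

WORD_RE = r"[A-Za-z0-9']+"  # assumed definition of a "word"

def strip_html(raw):
    """Crude tag stripper; good enough for a sketch, not for hostile markup."""
    return unescape(re.sub(r"<[^>]+>", " ", raw))

def parse_words(text):
    """Lower-cased words in order of appearance."""
    return [w.lower() for w in re.findall(WORD_RE, text)]

def link_words(body):
    """Rebuild the text as html, each word linked to a hypothetical index page."""
    parts = re.split("(" + WORD_RE + ")", body)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:  # odd positions hold the captured words
            out.append('<a href="/index?word=%s">%s</a>'
                       % (quote(part.lower()), escape(part)))
        else:
            out.append(escape(part))
    return "".join(out)

def make_db():
    db = sqlite3.connect(":memory:")
    # table(1): one row per text, plus the pre-rendered hyperlinked version
    db.execute("CREATE TABLE texts (tid INTEGER PRIMARY KEY, body TEXT, linked TEXT)")
    # table(2): one row per word occurrence, tagged with the TID
    db.execute("CREATE TABLE words (word TEXT, tid INTEGER)")
    return db

def add_text(db, raw):
    body = strip_html(raw)
    tid = db.execute("INSERT INTO texts (body, linked) VALUES (?, ?)",
                     (body, link_words(body))).lastrowid
    db.executemany("INSERT INTO words (word, tid) VALUES (?, ?)",
                   [(w, tid) for w in parse_words(body)])
    return tid

def index_words(db):
    """The index: every unique word in table(2)."""
    rows = db.execute("SELECT DISTINCT word FROM words ORDER BY word")
    return [r[0] for r in rows]

def find_texts(db, pattern):
    """TIDs of texts containing a word that matches a LIKE pattern."""
    rows = db.execute(
        "SELECT DISTINCT tid FROM words WHERE word LIKE ? ORDER BY tid",
        (pattern,))
    return [r[0] for r in rows]
```

with this, find_texts(db, "hel%") returns the TIDs of every text that 
contains a word starting with "hel". storing the linked html at insert 
time trades disk space for not having to re-parse on every page view.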

our (center for digital discourse and culture, myself and my assistants) 
innovations on this design:
1. we added to table2 the ability to add definitions to individual word 
entries, so that if you click a word to find where it is indexed you 
could put in a definition if one was not present or read the definition 
that is present.  The definitions are all indexed also.
2. we added a table3 made up of 2 to 4 word phrases, which speeds up 
searching and makes proper names much easier to find. i am currently 
developing code that will clean table 3 based on what people tend to 
search for, i.e. table 4
3. we save all search strings and the first 3 answers in table 4
4. we began stripping common words (the, is, are, an, so, that, etc.) 
in the base parser for table 2; this speeds parsing and indexing 
immensely.
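
a rough sketch of items 2 to 4, again in python with sqlite3 and again not 
the cddc code. the table layouts, the stop-word set (just the examples 
listed above) and the exact-match phrase lookup are assumptions; so is the 
choice to keep common words inside phrases so that names survive intact. 
the table 3 cleanup based on table 4 is left out, since it is still in 
development.

```python
import re
import sqlite3

# Only the examples named in the post; a real list would be longer.
STOP_WORDS = {"the", "is", "are", "an", "so", "that"}

def parse_words(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def index_terms(words):
    """Item 4: drop common words before they reach table 2."""
    return [w for w in words if w not in STOP_WORDS]

def phrases(words, min_len=2, max_len=4):
    """Item 2: every run of 2 to 4 consecutive words, shortest first."""
    out = []
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            out.append(" ".join(words[i:i + n]))
    return out

def make_db():
    db = sqlite3.connect(":memory:")
    # table 3: phrases, tagged with the TID of the text they came from
    db.execute("CREATE TABLE phrases (phrase TEXT, tid INTEGER)")
    # table 4: each search string with its first three answers
    db.execute("CREATE TABLE searches "
               "(query TEXT, tid1 INTEGER, tid2 INTEGER, tid3 INTEGER)")
    return db

def add_phrases(db, tid, text):
    db.executemany("INSERT INTO phrases (phrase, tid) VALUES (?, ?)",
                   [(p, tid) for p in phrases(parse_words(text))])

def search_phrase(db, query):
    """Look the phrase up in table 3 and log the search in table 4 (item 3)."""
    q = " ".join(parse_words(query))
    rows = db.execute(
        "SELECT DISTINCT tid FROM phrases WHERE phrase = ? ORDER BY tid",
        (q,)).fetchall()
    tids = [r[0] for r in rows]
    top3 = (tids[:3] + [None, None, None])[:3]  # pad when fewer than 3 hits
    db.execute("INSERT INTO searches VALUES (?, ?, ?, ?)", (q, *top3))
    return tids
```

a phrase lookup here is a single equality match on table 3 rather than a 
join across several table 2 rows, which is one plausible reading of why 
the phrase table speeds up searching.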

the basic code for this will be released in a few weeks on sourceforge; 
search for cddc.  We're releasing the complete initial codebase.

and I guess i should probably generate some sort of paper out of this:)

On Thursday, February 28, 2002, at 04:26 AM, Zunt@aol.com wrote:
...
I've not run across this method before, and thought folks on this list might
enjoy puzzling over it.
http://www.ugcs.caltech.edu/~harel/lyrics.html
The website contains a collection of texts (popular song lyrics).  Having
made a selection from the contents, you can click on various linked words
within the text (not all possible words are linked).  That action triggers
(a) enumeration of the texts in the library that contain the target word, and
(b) a hyperlinked index connecting you back to those available texts.  Each
instance of target word use appears in the index list.
It looks to me like quite a bit of HTML page generation is done automatically
via scripting on the server side.
Cheers,
Bob Briggs
Westport, MA
_______________________________________________
Air-l mailing list
Air-l@aoir.org
http://www.aoir.org/mailman/listinfo/air-l
jeremy hunsinger
jhuns@vt.edu
on the ibook
www.cddc.vt.edu
www.cddc.vt.edu/jeremy
www.dromocracy.com