247 lines
12 KiB
HTML
247 lines
12 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
|
|
<HTML>
|
|
<HEAD>
|
|
<title>Indexing Texts with SMART Issue 13</title>
|
|
</HEAD>
|
|
<BODY >
|
|
|
|
<H4>
|
|
"Linux Gazette...<I>making Linux just a little more fun!</I>
|
|
"</H4>
|
|
|
|
<P> <HR> <P>
|
|
<!--===================================================================-->
|
|
|
|
<center>
|
|
<H1>Indexing Texts with Smart</H1>
|
|
<H4>By Hans Paijmans
|
|
<a href="mailto:paai@kub.nl">paai@kub.nl</a></H4>
|
|
</center>
|
|
<P> <HR> <P>
|
|
|
|
<H2>1. The uses of Linux and MS-DOS</A></H2>
|
|
<P>
|
|
Although my colleagues here on Tilburg University may think that I
|
|
spend my time fiddling with Linux on a PC that could be put to better
|
|
uses, they are wrong. The 'fiddling with Linux' I do at home; at my
|
|
work I only do the bare minimum necessary to keep Linux fed and happy.
|
|
As most readers of this journal know, this involves making the
|
|
occasional backup and for the rest: nothing.
|
|
<P>
|
|
When I sit in front of my PC, I <EM>work</EM> (well, mostly). Linux makes
|
|
it possible to do my work with a minimum of fuss and a big part of the
|
|
credit for this goes to Jacques Gelinas, the man who wrote Umsdos: a
|
|
layer between the Unix operating system and the vanilla MS-DOS 8+3 FAT
|
|
system. This makes it possible to access the DOS-partition of my hard
|
|
disk from either operating system. This is good news, because I am
|
|
totally dependent from two programs: SMART, an indexing and retrieval
|
|
system and SPSS for Windows to twiddle the data I obtain form
|
|
SMART. SMART only runs under Unix (and not all Unixes for that matter)
|
|
and SPSS4Windows, obviously, runs under MS-Windows and whatever the
|
|
virtues of this operating system may be, you emphatically do not want
|
|
to use it in any kind of experimental environment.
|
|
<P>
|
|
I suppose that SPSS (Statistical Package for the Social Sciences) will
|
|
be familiar to most Linux users. If not: SPSS is just what it says, a
|
|
statistical package but not only for the 'social sciences' but for
|
|
about everyone who needs statistical analysis of his data. SMART,
|
|
however, is an indexing and retrieval program for text. What is more:
|
|
it does not just index the words, it also adds weights to them. It
|
|
also allows the user to compare the indexed documents in the so-called
|
|
Vector Space Model and to compute the distances between the documents,
|
|
or between documents and queries. To understand why this is special we
|
|
must delve a bit in the typical problems of Information Retrieval,
|
|
i.e. the storage of books, articles etcetera and the retrieval of
|
|
those on content.
|
|
<P>
|
|
<H2><A NAME="SECTION00011000000000000000">1.1 Why indexing is not enough</A></H2>
|
|
<P>
|
|
When at the end of the sixties automatic indexing of texts became a
|
|
viable option many people thought that the problems of information
|
|
retrieval were solved. Programs like STAIRS (IBM,1972)
|
|
enabled the users to file and rapidly retrieve documents on any word
|
|
in the text or on boolean combinations (AND, OR, NOT) of those words
|
|
and who could ask for more? Then, in 1985 a famous article was
|
|
published by two researchers in the field [<A HREF="#BlairMaron1985">1</A>]. In
|
|
this article they reported on the performance of STAIRS in real life
|
|
and they showed that the efficiency of STAIRS and similar systems was,
|
|
in fact, much lower than assumed. Even experienced users could not obtain a
|
|
recall of more than 20-40% of the relevant documents in a database of
|
|
100,000 documents, and worse, they were not aware of the fact.<BR>
|
|
<P>
|
|
The problem with all retrieval systems of this type is that human
|
|
language is so fuzzy. There may be as much as a dozen different terms
|
|
and words pointing to one and the same object, whereas one word may
|
|
have widely different meanings. In Information Retrieval this will
|
|
lead to one of two situations. Either you try to obtain a high
|
|
precision, when almost all the retrieved documents are relevant (but
|
|
an unknown number of other relevant documents are not included) or you
|
|
go for high recall, but then a lot of irrelevant documents will be
|
|
included in the result. When in a retrieved set of documents the
|
|
proportion of irrelevant documents is high, the user will probably
|
|
stop looking at the documents before he or she has found all the
|
|
relevant ones: in fact his <EM>futility-point</EM> has been reached. In
|
|
such a case the net result is equal to the situation in which those
|
|
relevant records that would be presented after the user reached that
|
|
futility-point were <EM>not</EM> retrieved. Therefore the concept of
|
|
ranking, i.e. the ordering of retrieved documents on relevance, is
|
|
very important in Information Retrieval.
|
|
<P>
|
|
|
|
|
|
<H2><A NAME="SECTION00020000000000000000">2. SMART</A></H2>
|
|
<P>
|
|
Modern (and not so modern) research has offered a number of possible
|
|
solutions to this dilemma. Some of those solutions use the concept of
|
|
<EM>weighted</EM> keywords. This means that every keyword-document
|
|
combination has a weight attached that (hopefully) is an indication of
|
|
the relevance of that particular keyword for that particular
|
|
document. SMART does just that: it creates indexes for a database of
|
|
documents and attaches weights to them. The way that happens may be
|
|
expressed intuitively as <EM>'the more a word occurs in the less
|
|
documents, the higher the weight'</EM>. Or, if the word 'dog' occurs
|
|
twenty times in a given document, but in no other documents, you may
|
|
be relatively certain that this document is about dogs. Information
|
|
Retrieval addicts like me talk about the <I>tf</I>.<I>idf</I> weight.
|
|
<P>
|
|
Smart offers several options as to how that weight should be arrived
|
|
at: I generally prefer the so-called atc-variation, because it adjusts
|
|
for the length of the individual documents.
|
|
<P>
|
|
It calculates the <I>tf</I>.<I>idf</I> in three steps. The first step creates the
|
|
value <IMG WIDTH=35 HEIGHT=18 ALIGN=MIDDLE ALT="tex2html_wrap_inline96" SRC="./gx/hansp/img1.gif" > for the term-frequency (<I>tf</I>) as
|
|
<P>
|
|
<P> <IMG WIDTH=321 HEIGHT=27 ALIGN=BOTTOM ALT="displaymath86" SRC="./gx/hansp/img2.gif" > <P>
|
|
<P>
|
|
where <IMG WIDTH=37 HEIGHT=18 ALIGN=MIDDLE ALT="tex2html_wrap_inline100" SRC="./gx/hansp/img3.gif" > is the term with the highest frequency in the
|
|
document. This adjusts for the document-length and the number of
|
|
terms. Then the weight <IMG WIDTH=37 HEIGHT=7 ALIGN=BOTTOM ALT="tex2html_wrap_inline102" SRC="./gx/hansp/img4.gif" > is calculated as
|
|
<P>
|
|
<P> <IMG WIDTH=313 HEIGHT=26 ALIGN=BOTTOM ALT="displaymath87" SRC="./gx/hansp/img5.gif" > <P>
|
|
<P>
|
|
where N is as before the number of documents and <I>F</I> the document
|
|
frequency of term <I>t</I> (the number of documents in which term <I>t</I>
|
|
occurs). Finally the cosine normalization is applied by
|
|
<P>
|
|
<P> <IMG WIDTH=318 HEIGHT=35 ALIGN=BOTTOM ALT="displaymath88" SRC="./gx/hansp/img6.gif" > <P>
|
|
<P>
|
|
where T is the number of terms in the document vector. Now we have a
|
|
number between zero and one that hopefully correlates with the
|
|
importance of the word as a keyword for that document. For a
|
|
detailed discussion of these and similar techniques see e.g. Salton
|
|
and McGill ([<A HREF="#SaltonMcGill1983">2</A>]). You will love it!
|
|
<P>
|
|
This is not all. When SMART has constructed the index in one of the
|
|
various ways available, it also can retrieve the documents for
|
|
you. This is done according to something called ``the Vector Space
|
|
Model''. This model is best explained using a three-dimensional
|
|
example of a vector-space; you can add another few thousand dimensions
|
|
in your own imagination.
|
|
<P>
|
|
Imagine you want to index your documents according to three keywords
|
|
'cat', 'dog' and 'horse'; keywords that may or may not occur in your
|
|
documents. So you draw three axes to get a normal three-dimensional
|
|
coordinate system. One dimension can be used to indicate the
|
|
``cat-ness'' of every document, the other its ``dog-ness'' and the
|
|
third the ``lion-ness''. To make things easy we only use binary values
|
|
0 and 1, although SMART can cope with floats (the 'weights' mentioned
|
|
before. So if a document is about cats, it scores a one on the
|
|
corresponding axis, otherwise it scores a zero. Any document may now
|
|
be drawn in that space according to the occurrence of one or more of
|
|
the keywords and now we have a relatively easy way to compute the
|
|
difference between those document. Moreover a query consisting of
|
|
one or more of the keywords can be drawn in the same space and the
|
|
documents can be ranked according to the distance to that query. Of
|
|
course a typical document database has thousands of keywords and
|
|
accordingly thousands of dimensions, but the arithmetics involved in
|
|
multi-dimensional distances do not matter much to modern computers,
|
|
and if they bother <EM>you</EM>, you just have to smoke something illegal
|
|
and matters will rapidly become clear. If only till the next morning.
|
|
<P>
|
|
So SMART accepts queries, ranks the documents according to the
|
|
``nearness'' to that query and return them to you in that
|
|
order. Therefore it is still one of the best retrieval systems that
|
|
are ever written although it lacks the bells and whistles of its more
|
|
expensive counterparts in some operating systems I could mention. And
|
|
although it is not really optimized for speed, it runs typically 10-30
|
|
times faster than the fastest indexing program I ever saw under
|
|
MS-Windows.
|
|
<P>
|
|
|
|
|
|
<H2><A NAME="SECTION00030000000000000000">3. The DOS connection</A></H2>
|
|
<P>
|
|
But I am not using SMART for bread-and-butter retrieval, but for the
|
|
weights it computes and the indexes it creates. At this point I want
|
|
to do some other manipulations of these data and again I have to offer
|
|
my thanks to the developers of unix in general and to Linux in
|
|
particular. A whole string of ever more complicated and sophisticated
|
|
shell scripts, the standard unix tools and a few of My Very Own
|
|
utilities suffice to process the SMART output to a file that is ready
|
|
for importing in SPSS.
|
|
<P>
|
|
Nevertheless now I have to quit Linux and boot MS-DOS, start
|
|
MS-Windows and finally enter SPSS to do the statistics and create some
|
|
graphs. I am a newcomer to Unix (indeed it was the fact that Linux
|
|
offered a way to use SMART that pulled me over the line two years
|
|
ago), but already I am wondering how people can live in the stifling
|
|
atmosphere of MS-Windows. The fact that you can't really run two
|
|
applications at the same time is not even the worst thing. But who is
|
|
responsible for the idea that Icons and Popups were better and more
|
|
efficient than the plain old command line? And what happened to pipes
|
|
and filters? And a sensible command language? Be that as it may, SPSS
|
|
gets the job done and when the output is written to disk I immediately
|
|
escape back to Linux to write the final article, report or whatever
|
|
with LaTeX.<BR>
|
|
<P>
|
|
|
|
|
|
<H2><A NAME="SECTION00040000000000000000">4. The bad news</A></H2>
|
|
<P>
|
|
On this point I have two messages: one is good, the other bad. I'll
|
|
start with the good news. SMART is obtainable by anonymous ftp from
|
|
Cornell University and may be used for free for scientific and
|
|
experimental purposes. Better yet: it compiles under Linux without
|
|
much tweaking and twiddling. Also there exists a fairly active
|
|
mailing list for people who use SMART (smart-peoplecs.cornell.edu).<BR>
|
|
<BR>
|
|
<P>
|
|
The bad news: the manual. What manual? SMART is not for the faint of
|
|
heart; after unpacking and compilation you'll find some extremely
|
|
obscure notes and examples and that is it. Nevertheless, if you have
|
|
more than just a few megabytes of text to manage AND the stamina to
|
|
learn SMART, it certainly is the best solution for your information
|
|
retrieval needs. But don't I wish somebody would write a comprehensive
|
|
manual! In the meantime you may perhaps be helped by my ``tutorial for
|
|
newbees'', to be found at
|
|
http://pi0959.kub.nl:2080/Paai/Onderw/Smart/hands.html.<BR> <BR>
|
|
|
|
<P><HR> <P>
|
|
<h3>Bibliography</h3>
|
|
|
|
<DT><A NAME="BlairMaron1985"><STRONG>1</STRONG></A><DD>
|
|
Blair, D.C.; Maron, M.E.,<EM>An evaluation of retrieval effectiveness for a full-text document retrieval system</EM>,Communications of the ACM V28:3, 1985, pp.
|
|
289-299.
|
|
<P>
|
|
<DT><A NAME="SaltonMcGill1983"><STRONG>2</STRONG></A><DD>
|
|
G. Salton and M.J. McGill,<EM>Introduction to Modern Information
|
|
Retrieval</EM>
|
|
New York [etc.] : McGraw-Hill, 1983.
|
|
|
|
<P> <hr> <P>
|
|
<center><H5>Copyright © 1997, Hans Paijmans <BR>
|
|
Published in Issue 13 of the Linux Gazette</H5></center>
|
|
|
|
<!--===================================================================-->
|
|
<P> <hr> <P>
|
|
<A HREF="./index.html"><IMG ALIGN=BOTTOM SRC="../gx/indexnew.gif"
|
|
ALT="[ TABLE OF CONTENTS ]"></A>
|
|
<A HREF="../index.html"><IMG ALIGN=BOTTOM SRC="../gx/homenew.gif"
|
|
ALT="[ FRONT PAGE ]"></A>
|
|
<A HREF="./gm.html"><IMG SRC="../gx/back2.gif"
|
|
ALT=" Back "></A>
|
|
<A HREF="./editors.html"><IMG SRC="../gx/fwd.gif" ALT=" Next "></A>
|
|
<P> <hr> <P>
|
|
</BODY>
|
|
</HTML>
|