<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<title>Indexing Texts with SMART Issue 13</title>
</HEAD>
<BODY >
<H4>
&quot;Linux Gazette...<I>making Linux just a little more fun!</I>
&quot;</H4>
<P> <HR> <P>
<!--===================================================================-->
<center>
<H1>Indexing Texts with Smart</H1>
<H4>By Hans Paijmans
<a href="mailto:paai@kub.nl">paai@kub.nl</a></H4>
</center>
<P> <HR> <P>
<H2>1. The uses of Linux and MS-DOS</H2>
<P>
Although my colleagues here at Tilburg University may think that I
spend my time fiddling with Linux on a PC that could be put to better
use, they are wrong. The 'fiddling with Linux' I do at home; at my
work I only do the bare minimum necessary to keep Linux fed and happy.
As most readers of this journal know, this involves making the
occasional backup and for the rest: nothing.
<P>
When I sit in front of my PC, I <EM>work</EM> (well, mostly). Linux makes
it possible to do my work with a minimum of fuss and a big part of the
credit for this goes to Jacques Gelinas, the man who wrote Umsdos: a
layer between the Unix operating system and the vanilla MS-DOS 8+3 FAT
system. This makes it possible to access the DOS-partition of my hard
disk from either operating system. This is good news, because I am
totally dependent on two programs: SMART, an indexing and retrieval
system, and SPSS for Windows, which I use to twiddle the data I obtain
from SMART. SMART only runs under Unix (and not all Unixes, for that
matter), while SPSS for Windows obviously runs under MS-Windows; and
whatever the virtues of that operating system may be, you emphatically
do not want to use it in any kind of experimental environment.
<P>
I suppose that SPSS (Statistical Package for the Social Sciences) will
be familiar to most Linux users. If not: SPSS is just what it says, a
statistical package, but not only for the 'social sciences': it serves
just about everyone who needs statistical analysis of his data. SMART,
however, is an indexing and retrieval program for text. What is more:
it does not just index the words, it also adds weights to them. It
also allows the user to compare the indexed documents in the so-called
Vector Space Model and to compute the distances between the documents,
or between documents and queries. To understand why this is special we
must delve a bit into the typical problems of Information Retrieval,
i.e. the storage of books, articles etcetera and their retrieval by
content.
<P>
<H2><A NAME="SECTION00011000000000000000">1.1 Why indexing is not enough</A></H2>
<P>
When, at the end of the sixties, automatic indexing of texts became a
viable option, many people thought that the problems of information
retrieval were solved. Programs like STAIRS (IBM, 1972)
enabled users to file documents and rapidly retrieve them on any word
in the text, or on boolean combinations (AND, OR, NOT) of those words,
and who could ask for more? Then, in 1985, a famous article was
published by two researchers in the field [<A HREF="#BlairMaron1985">1</A>]. In
this article they reported on the performance of STAIRS in real life,
and they showed that the efficiency of STAIRS and similar systems was,
in fact, much lower than assumed. Even experienced users could not obtain a
recall of more than 20-40% of the relevant documents in a database of
100,000 documents and, worse, they were not aware of the fact.
<P>
The problem with all retrieval systems of this type is that human
language is so fuzzy. There may be as many as a dozen different terms
and words pointing to one and the same object, whereas one word may
have widely different meanings. In Information Retrieval this leads
to one of two situations. Either you try to obtain a high
precision, in which case almost all the retrieved documents are relevant (but
an unknown number of other relevant documents are not included), or you
go for high recall, but then a lot of irrelevant documents will be
included in the result. When in a retrieved set of documents the
proportion of irrelevant documents is high, the user will probably
stop looking at the documents before he or she has found all the
relevant ones: his or her <EM>futility-point</EM> has been reached. In
such a case the net result is the same as if the relevant records that
would have been presented after the user reached that futility-point
had <EM>not</EM> been retrieved at all. Therefore the concept of
ranking, i.e. the ordering of retrieved documents by relevance, is
very important in Information Retrieval.
<P>
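To make the precision-recall trade-off concrete, here is a small
sketch in Python. It is my own illustration, not anything taken from
SMART, and the numbers are invented:
<P>
<PRE>
# Precision: which fraction of the retrieved documents is relevant?
# Recall:    which fraction of the relevant documents was retrieved?
def precision_recall(retrieved, relevant):
    hits = len(retrieved.intersection(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 10 documents retrieved, 4 of them relevant, out of
# 20 relevant documents in the whole collection.
retrieved = set(range(10))
relevant = set(range(6, 26))
print(precision_recall(retrieved, relevant))   # (0.4, 0.2)
</PRE>
<P>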
<H2><A NAME="SECTION00020000000000000000">2. SMART</A></H2>
<P>
Modern (and not so modern) research has offered a number of possible
solutions to this dilemma. Some of those solutions use the concept of
<EM>weighted</EM> keywords. This means that every keyword-document
combination has a weight attached that (hopefully) is an indication of
the relevance of that particular keyword for that particular
document. SMART does just that: it creates indexes for a database of
documents and attaches weights to them. The way that happens may be
expressed intuitively as <EM>'the more often a word occurs in the fewer
documents, the higher the weight'</EM>. Or: if the word 'dog' occurs
twenty times in a given document, but in no other documents, you may
be relatively certain that this document is about dogs. Information
Retrieval addicts like me talk about the <I>tf</I>.<I>idf</I> weight.
<P>
SMART offers several options as to how that weight should be arrived
at; I generally prefer the so-called atc-variation, because it adjusts
for the length of the individual documents.
<P>
It calculates the <I>tf</I>.<I>idf</I> in three steps. The first step creates the
adjusted value <I>tf'</I> for the term-frequency (<I>tf</I>) as
<P>
<BLOCKQUOTE>
<I>tf'</I> = 0.5 + 0.5 &middot; (<I>tf</I> / <I>tf</I><SUB>max</SUB>)
</BLOCKQUOTE>
<P>
where <I>tf</I><SUB>max</SUB> is the frequency of the most frequent term in the
document. This adjusts for the document-length and the number of
terms. Then the weight <I>w</I><SUB>t</SUB> of term <I>t</I> is calculated as
<P>
<BLOCKQUOTE>
<I>w</I><SUB>t</SUB> = <I>tf'</I> &middot; log(<I>N</I> / <I>F</I>)
</BLOCKQUOTE>
<P>
where <I>N</I> is the total number of documents and <I>F</I> the document
frequency of term <I>t</I> (the number of documents in which term <I>t</I>
occurs). Finally the cosine normalization is applied by
<P>
<BLOCKQUOTE>
<I>w'</I><SUB>t</SUB> = <I>w</I><SUB>t</SUB> / sqrt(<I>w</I><SUB>1</SUB><SUP>2</SUP> + <I>w</I><SUB>2</SUB><SUP>2</SUP> + ... + <I>w</I><SUB>T</SUB><SUP>2</SUP>)
</BLOCKQUOTE>
<P>
where <I>T</I> is the number of terms in the document vector. Now we have a
number between zero and one that hopefully correlates with the
importance of the word as a keyword for that document. For a
detailed discussion of these and similar techniques see e.g. Salton
and McGill ([<A HREF="#SaltonMcGill1983">2</A>]). You will love it!
<P>
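To make the three steps concrete, here is a small sketch in Python of
how such an atc-style weight could be computed for one document. It is
my own illustration of the calculation above, not SMART's code, and the
toy frequencies are invented:
<P>
<PRE>
import math

def atc_weights(term_freqs, doc_freq, n_docs):
    # term_freqs: {term: frequency of the term in this document}
    # doc_freq:   {term: number of documents containing the term}
    # n_docs:     total number of documents in the collection
    tf_max = max(term_freqs.values())
    # a: augmented term frequency, adjusted for document length
    aug = {t: 0.5 + 0.5 * tf / tf_max for t, tf in term_freqs.items()}
    # t: multiply by the inverse document frequency log(N / F)
    w = {t: a * math.log(n_docs / doc_freq[t]) for t, a in aug.items()}
    # c: cosine normalization over the T terms of the document vector
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: (x / norm if norm else 0.0) for t, x in w.items()}

# 'dog' occurs often here but in few documents, so it gets a high weight;
# 'the' occurs in every document, so its weight drops to zero.
print(atc_weights({'dog': 20, 'the': 25, 'cat': 1},
                  {'dog': 2, 'the': 1000, 'cat': 50}, n_docs=1000))
</PRE>
<P>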
This is not all. When SMART has constructed the index in one of the
various ways available, it can also retrieve the documents for
you. This is done according to something called the &quot;Vector Space
Model&quot;. This model is best explained using a three-dimensional
example of a vector space; you can add another few thousand dimensions
in your own imagination.
<P>
Imagine you want to index your documents according to three keywords,
'cat', 'dog' and 'horse'; keywords that may or may not occur in your
documents. So you draw three axes to get a normal three-dimensional
coordinate system. One dimension can be used to indicate the
&quot;cat-ness&quot; of every document, the second its &quot;dog-ness&quot; and the
third its &quot;horse-ness&quot;. To make things easy we only use the binary values
0 and 1, although SMART can cope with floats (the 'weights' mentioned
before). So if a document is about cats, it scores a one on the
corresponding axis; otherwise it scores a zero. Any document may now
be drawn in that space according to the occurrence of one or more of
the keywords, and now we have a relatively easy way to compute the
distance between those documents. Moreover, a query consisting of
one or more of the keywords can be drawn in the same space, and the
documents can be ranked according to their distance to that query. Of
course a typical document database has thousands of keywords and
accordingly thousands of dimensions, but the arithmetic involved in
multi-dimensional distances does not matter much to modern computers,
and if it bothers <EM>you</EM>, you just have to smoke something illegal
and matters will rapidly become clear. If only till the next morning.
<P>
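Here is an equally small sketch, again in Python and again my own
illustration rather than anything taken from SMART, of ranking a few
toy document vectors by their cosine similarity to a query in such a
vector space:
<P>
<PRE>
import math

def cosine(u, v):
    # Cosine of the angle between two sparse vectors {term: weight}.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values())) *
            math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Toy documents on the axes 'cat', 'dog' and 'horse' (binary weights)
docs = {'doc1': {'cat': 1, 'dog': 1},
        'doc2': {'dog': 1, 'horse': 1},
        'doc3': {'horse': 1}}
query = {'dog': 1}

# Rank the documents by decreasing "nearness" to the query
for name in sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True):
    print(name, round(cosine(docs[name], query), 3))
</PRE>
<P>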
So SMART accepts queries, ranks the documents according to their
&quot;nearness&quot; to the query and returns them to you in that
order. Therefore it is still one of the best retrieval systems
ever written, although it lacks the bells and whistles of its more
expensive counterparts on some operating systems I could mention. And
although it is not really optimized for speed, it typically runs 10-30
times faster than the fastest indexing program I ever saw under
MS-Windows.
<P>
<H2><A NAME="SECTION00030000000000000000">3. The DOS connection</A></H2>
<P>
I am not using SMART for bread-and-butter retrieval, however, but for the
weights it computes and the indexes it creates. At this point I want
to do some other manipulations of these data, and again I have to offer
my thanks to the developers of Unix in general and of Linux in
particular. A whole string of ever more complicated and sophisticated
shell scripts, the standard Unix tools and a few of My Very Own
utilities suffice to process the SMART output into a file that is ready
for importing into SPSS.
<P>
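As an illustration of that last step, here is a minimal sketch of the
kind of glue one might write. It assumes, purely for the sake of the
example (this is <EM>not</EM> SMART's actual output format), a file of
whitespace-separated 'document term weight' lines and turns it into a
tab-delimited file that SPSS can read:
<P>
<PRE>
import csv

# Hypothetical input: one 'document term weight' triple per line.
# The file names and the format are assumptions for this example only.
with open('weights.txt') as src, open('weights.dat', 'w', newline='') as dst:
    writer = csv.writer(dst, delimiter='\t')
    writer.writerow(['doc', 'term', 'weight'])   # variable names for SPSS
    for line in src:
        doc, term, weight = line.split()
        writer.writerow([doc, term, float(weight)])
</PRE>
<P>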
Nevertheless, now I have to quit Linux, boot MS-DOS, start
MS-Windows and finally enter SPSS to do the statistics and create some
graphs. I am a newcomer to Unix (indeed, it was the fact that Linux
offered a way to use SMART that pulled me over the line two years
ago), but already I am wondering how people can live in the stifling
atmosphere of MS-Windows. The fact that you can't really run two
applications at the same time is not even the worst thing. But who is
responsible for the idea that icons and popups were better and more
efficient than the plain old command line? And what happened to pipes
and filters? And a sensible command language? Be that as it may, SPSS
gets the job done, and when the output is written to disk I immediately
escape back to Linux to write the final article, report or whatever
with LaTeX.
<P>
<H2><A NAME="SECTION00040000000000000000">4. The bad news</A></H2>
<P>
At this point I have two messages: one is good, the other bad. I'll
start with the good news. SMART is obtainable by anonymous ftp from
Cornell University and may be used for free for scientific and
experimental purposes. Better yet: it compiles under Linux without
much tweaking and twiddling. There is also a fairly active
mailing list for people who use SMART (smart-people@cs.cornell.edu).
<P>
The bad news: the manual. What manual? SMART is not for the faint of
heart; after unpacking and compilation you'll find some extremely
obscure notes and examples, and that is it. Nevertheless, if you have
more than just a few megabytes of text to manage AND the stamina to
learn SMART, it certainly is the best solution for your information
retrieval needs. But don't I wish somebody would write a comprehensive
manual! In the meantime you may perhaps be helped by my &quot;tutorial for
newbees&quot;, to be found at
<A HREF="http://pi0959.kub.nl:2080/Paai/Onderw/Smart/hands.html">http://pi0959.kub.nl:2080/Paai/Onderw/Smart/hands.html</A>.
<P><HR> <P>
<h3>Bibliography</h3>
<DL>
<DT><A NAME="BlairMaron1985"><STRONG>1</STRONG></A><DD>
Blair, D.C.; Maron, M.E., <EM>An evaluation of retrieval effectiveness
for a full-text document retrieval system</EM>, Communications of the
ACM, 28:3, 1985, pp. 289-299.
<P>
<DT><A NAME="SaltonMcGill1983"><STRONG>2</STRONG></A><DD>
Salton, G.; McGill, M.J., <EM>Introduction to Modern Information
Retrieval</EM>, New York [etc.]: McGraw-Hill, 1983.
</DL>
<P> <hr> <P>
<center><H5>Copyright &copy; 1997, Hans Paijmans <BR>
Published in Issue 13 of the Linux Gazette</H5></center>
<!--===================================================================-->
<P> <hr> <P>
<A HREF="./index.html"><IMG ALIGN=BOTTOM SRC="../gx/indexnew.gif"
ALT="[ TABLE OF CONTENTS ]"></A>
<A HREF="../index.html"><IMG ALIGN=BOTTOM SRC="../gx/homenew.gif"
ALT="[ FRONT PAGE ]"></A>
<A HREF="./gm.html"><IMG SRC="../gx/back2.gif"
ALT=" Back "></A>
<A HREF="./editors.html"><IMG SRC="../gx/fwd.gif" ALT=" Next "></A>
<P> <hr> <P>
</BODY>
</HTML>