
<!--startcut ==========================================================-->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<title>The DICT Project LG #33</title>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#A000A0"
ALINK="#FF0000">
<!--endcut ============================================================-->

<H4>
"Linux Gazette...<I>making Linux just a little more fun!</I>"
</H4>

<P> <HR> <P>
<!--===================================================================-->

<center>
<h1><font color="maroon">DICT and Word Inspector</font></h1>
<h4>By <a href="mailto:layers@marktwain.net">Larry Ayers</a></h4>
</center>
<P> <HR> <P>

<center><h3>Introduction</h3></center>

<p>Access to an on-line dictionary has been possible for several years now
thanks to the Webster TCP/IP protocol.  Webster is useful, but the number of
servers has been declining, and the protocol itself is limited by its
dependence on a single dictionary database.  Rik Faith, a programmer
responsible for many of the essential-but-taken-for-granted Linux utilities,
has created a new, more flexible protocol known as DICT.  DICT is another
TCP/IP protocol (usable either over a network or on a local machine) which
provides access to any number of dictionary databases.  Local access is
provided by a client program
called <i>dict</i> which contacts the <i>dictd</i> server daemon.
<i>Dictd</i> then searches the available databases and makes any hits
available to <i>dict</i>, which pipes its output to the default pager on the
local machine (usually either <i>more</i>, <i>less</i>, or <i>most</i>).  Net
access is available from several servers, including the home
<a href="http://www.dict.org">DICT</a> site.  Looking up words while on-line
frees the user from needing to install and run the <i>dict</i> and
<i>dictd</i> client and server programs (as well as having to make room for the
bulky databases on a local disk), but if you have the disk space it's
convenient to have the service available at any time.
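
<p>To give a feel for what passes between client and server, here is a
minimal Python sketch of a DICT look-up.  It is not how the <i>dict</i>
client is written; it only assumes the reply layout described in the DICT
protocol specification (RFC 2229): a <kbd>151</kbd> status line introduces
each definition, and a line holding a single period ends it.  The function
names <kbd>parse_definitions</kbd> and <kbd>define</kbd> are my own
inventions for this example.

```python
import shlex
import socket

DICT_PORT = 2628  # TCP port registered for the DICT protocol


def parse_definitions(lines):
    """Collect (database, text) pairs from the reply lines of a DEFINE."""
    definitions = []
    database = None  # name of the database while inside a text block
    body = []
    for line in lines:
        if database is None:
            # Status line; 151 looks like: 151 "word" database "description"
            if line.startswith("151 "):
                database = shlex.split(line)[2]
                body = []
        elif line == ".":
            # A lone period closes the text of one definition.
            definitions.append((database, "\n".join(body)))
            database = None
        else:
            # Text lines that begin with a period arrive with it doubled.
            body.append(line[1:] if line.startswith("..") else line)
    return definitions


def define(word, host="localhost", port=DICT_PORT):
    """Send DEFINE for word to a DICT server and return the parsed reply."""
    with socket.create_connection((host, port), timeout=10) as conn:
        reader = conn.makefile("r", encoding="utf-8", errors="replace",
                               newline="\r\n")
        reader.readline()  # discard the server's greeting banner
        # "*" asks the server to search every database it offers.
        conn.sendall(('DEFINE * "%s"\r\n' % word).encode("utf-8"))
        reply = []
        in_text = False
        for raw in reader:
            line = raw.rstrip("\r\n")
            reply.append(line)
            if in_text:
                in_text = line != "."
            elif line.startswith("151 "):
                in_text = True
            elif not line.startswith("150 "):
                break  # final status, e.g. 250 (ok) or 552 (no match)
        conn.sendall(b"QUIT\r\n")
    return parse_definitions(reply)
```

<p>With a <i>dictd</i> server running locally, calling
<kbd>define("penguin")</kbd> should return one pair for each database that
has an entry for the word.  The parsing is kept apart from the network code
so that it can be tried on canned reply lines without a server.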

<p>The <i>dictd</i> and <i>dict</i> programs are licensed under the GPL, so
naturally they are set up to use freely available word databases.

<center><h3>Installing The DICT Distribution Locally</h3></center>

<p>DICT is a typical Unix-style command-line set of programs.  GUI-fans will
regret the absence of a graphical interface, but the glass is really
half-full.  Due to the absence of oft-troublesome GUI toolkit dependencies,
the source for the client and server programs should compile easily.  Toolkits
come and go, but applications written with a simple console interface can
easily be adapted to whatever the future toolkit du jour might be.  There are
numerous programmers who lack the time or inclination to develop Linux
utilities from scratch, but welcome the opportunity to write GUI front-ends to
console programs (see the Word Inspector section below).

<p>
Compiling and installing <i>dictd</i> and <i>dict</i> isn't difficult, but to
make use of them the word databases need to be downloaded and installed.  Here
is a list of the free databases which are currently available from the
<a href="ftp://ftp.dict.org/pub/dict/pre/">DICT</a> FTP site:

<ul>
  <li>A 1913 edition of Webster's Revised Unabridged Dictionary
  <li>The Free On-Line Dictionary of Computing
  <li>Eric Raymond's Jargon File
  <li>The WordNet database
  <li>Easton's 1897 Bible Dictionary
  <li>Hitchcock's Bible Names Dictionary
  <li>The Elements (physical elements)
  <li>U.S. Gazetteer (1990)
  <li>The 1995 CIA World Factbook
</ul>

<p>All of these files and their indices will occupy about thirty-one megabytes
of disk space, roughly the same amount as the WordNet dictionary files alone.
The DICT data-files are compressed with a variant of <i>gzip</i> called
<i>dictzip</i>, also written by Rik Faith.  <i>Dictzip</i> adds extra header
information to a compressed file which allows pseudo-random access to the
file.  When the <i>dictd</i> server processes a request for a word it looks
first in the various index files.  These files (which are human-readable)
are just simple lists with the location of each word within the compressed
dictionary file.  <i>Dictd</i> is able to use this information to uncompress
just the single 64-KB block of data which contains the word-entry.  This
greatly speeds up access, as the entire dictionary file doesn't need to be
uncompressed and subsequently re-compressed for each transaction.  Files
compressed with <i>dictzip</i> can be recognized by the <i>*.dz</i> suffix.
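
<p>The arithmetic behind this block-wise access can be sketched in a few
lines of Python.  This is only an illustration under stated assumptions: the
index line here is a simplified, hypothetical format (word, decimal offset
and decimal length separated by tabs; the real index files may encode the
numbers differently), offsets are assumed to count bytes of the uncompressed
text, and the block size is simply the 64-KB figure quoted above.  The
decompression step itself is left out.

```python
CHUNK_SIZE = 64 * 1024  # assumed block size, taken from the figure above


def parse_index_line(line):
    """Split one line of the simplified index: word, offset, length."""
    word, offset, length = line.rstrip("\n").split("\t")
    return word, int(offset), int(length)


def chunks_for_entry(offset, length, chunk_size=CHUNK_SIZE):
    """Return the range of block numbers that hold an entry."""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return range(first, last + 1)


def extract(chunks, offset, length, chunk_size=CHUNK_SIZE):
    """Cut an entry out of the decompressed blocks that cover it.

    chunks maps a block number to that block's decompressed bytes.
    """
    needed = chunks_for_entry(offset, length, chunk_size)
    data = b"".join(chunks[n] for n in needed)
    start = offset - needed.start * chunk_size
    return data[start:start + length]
```

<p>An entry usually falls inside a single block, but one which straddles a
block boundary needs two, which is why <kbd>chunks_for_entry</kbd> returns a
range rather than a single number.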

<p>Although <i>dictzip</i> doesn't compress quite as tightly as <i>gzip</i>,
the random access made possible by the extra header information (at least for
the sort of access <i>dictd</i> needs) compensates for the difference.  The
above-listed dictionary files would need nearly seventy-five megabytes of disk
space if they weren't compressed.

<center><h3>Comparison With WordNet</h3></center>

<p>In issue 27 of the Gazette (April 1998) I wrote about another
dictionary-database system called WordNet.  In order to access a DICT database
the <i>dictd</i> server must be running so that it can answer <i>dict</i>
client programs, whereas WordNet isn't a client-server program; the small
<i>wn</i> program searches the database indices directly.  The upshot is that
WordNet uses less memory than a DICT system, but since WordNet databases
aren't compressed they occupy more disk space than the specially compressed
DICT files.  DICT files contain more words (along with etymologies, which
WordNet lacks) and can be supplemented with new files in the future, but DICT
lacks WordNet's powerful thesaurus and lexical usage capabilities.  Another
factor to consider is that development of WordNet has ceased, whereas DICT is
still being improved and its continued development seems likely.
Additionally, DICT can use the WordNet data-files in a compressed format.

<center><h3>Configuration</h3></center>

<p>Sample configuration files are included with the DICT distribution.  The
file <kbd>/etc/dictd.conf</kbd> should contain the location of your local
dictionary files in this format:<br>

<pre><kbd>
database web1913   { data "/mt/dict/web1913.dict.dz"
                     index "/mt/dict/web1913.index" }
database jargon    { data "/mt/dict/jargon.dict.dz"
                     index "/mt/dict/jargon.index" }
</kbd></pre>
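
<p>If you have downloaded several databases, typing these entries by hand
gets tedious.  The following Python sketch writes one entry for each
dictionary it finds in a directory.  It assumes the files follow the
<kbd>name.dict.dz</kbd> and <kbd>name.index</kbd> naming shown above; the
function name <kbd>database_stanzas</kbd> is hypothetical, and nothing like
it ships with the DICT distribution.

```python
from pathlib import Path


def database_stanzas(directory):
    """Build a dictd.conf "database" entry for each dictionary found."""
    suffix = ".dict.dz"
    stanzas = []
    for data in sorted(Path(directory).glob("*" + suffix)):
        name = data.name[:-len(suffix)]
        index = data.with_name(name + ".index")
        if not index.exists():
            continue  # skip a data file that has no matching index
        stanzas.append(
            'database %-9s { data "%s"\n'
            '                     index "%s" }' % (name, data, index)
        )
    return "\n".join(stanzas)
```

<p>Printing the result of <kbd>database_stanzas("/mt/dict")</kbd> gives text
laid out like the sample above, ready to be pasted into
<kbd>/etc/dictd.conf</kbd> after a quick look to make sure it is what you
want.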

<p>The <i>dict</i> client needs to know where the server is; if a local
server is used, a simple <kbd>~/.dictrc</kbd> file containing this line will
work:<br>

<pre><kbd>
server localhost
</kbd></pre>

<p>If both <kbd>~/.dictrc</kbd> and <kbd>/etc/dict.conf</kbd> are missing, the
<i>dict</i> client program will first attempt to access the www.dict.org
server; if that fails it will try some alternate sites.  To prevent these
attempts (when running a local <i>dictd</i> server) just use the above
<kbd>~/.dictrc</kbd> file.

<center><h3>Drawbacks</h3></center>

<p><i>Dictd</i> might not be a service which you would want to run all of the
time.  Though not a large executable, it uses a significant amount of memory,
typically four to five megabytes.  I surmise that the daemon reads the
dictionary index-files into memory when it starts up and keeps them there.
This premise also would explain why the word look-ups are so speedy.  Memory
access is much faster than disk access, and once the daemon determines from
the index which sixty-four kilobyte block holds the desired information it can
quickly decompress that small chunk of the dictionary file and serve up the
word information.  What works well for me is starting <i>dictd</i> while
writing, or whenever I become curious about word-usage, and killing the daemon
at other times.
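
<p>To show why an index held in memory would make look-ups fast, here is a
small Python sketch of the idea.  It illustrates my surmise, not
<i>dictd</i>'s actual code: the class name <kbd>MemoryIndex</kbd> is
hypothetical, and each index entry is assumed to be a simple (word, offset,
length) triple.

```python
from bisect import bisect_left


class MemoryIndex:
    """Hold a whole (simplified) index in memory; search it by bisection."""

    def __init__(self, entries):
        # entries: iterable of (word, offset, length) tuples
        self._entries = sorted(entries)
        self._words = [word for word, _, _ in self._entries]

    def find(self, word):
        """Return (offset, length) for word, or None if it is not indexed."""
        pos = bisect_left(self._words, word)
        if pos != len(self._words) and self._words[pos] == word:
            _, offset, length = self._entries[pos]
            return offset, length
        return None
```

<p>Since the list is sorted once at start-up, each look-up needs only a
handful of in-memory comparisons, even for an index with hundreds of
thousands of words.  The price is the memory needed to keep every entry
loaded.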


<center><h3>Word Inspector</h3></center>

<p>Scott W. Gifford has written a nice graphical front-end to the <i>dict</i>
client program called Word Inspector.  Here's a screenshot of the initial
window:<br>

<p><img alt="Word Inspector Main Window" src="./gx/ayers/inspect1.gif">

<p>And here is one showing the output window:<br>

<p><img alt="Word Inspector Output Window" src="./gx/ayers/inspect2.gif">

<p>In the README file accompanying Word Inspector Scott Gifford suggests
setting up the application with several different window-manager menu-items.
Running <kbd>wordinspect --define --clipboard</kbd> will bring up a Word
Inspector output window (as shown in the second screenshot) with the contents
of the current X primary selection as the input.  Alternatively,
<kbd>wordinspect --search --clipboard</kbd> will cause the initial window to
appear with the X primary selection already shown in the entry field, and
running plain <kbd>wordinspect</kbd> will bring up an empty initial
window, so that a word which hasn't been selected with the mouse can be typed in.
These three commands could be set up in a submenu stemming from a top-level
Word Inspector menu-item.

<p>Word Inspector makes good use of right-mouse-button pop-up menus.
Right-clicking on any word in a definition pops up a menu allowing you to
either open a search (initial) window with the selected word already filled
in, or open a definition window for the word.  Highlighting a series of words
with the mouse, then right-clicking, will enable the same behavior for
phrases.

<p>The current version of Word Inspector is available from this
<a href="http://www.tir.com/~sgifford/wordinspect/">web-site</a>.  The GTK
toolkit is required for compilation, with version 1.0.6 recommended.

<!-- hhmts start -->
Last modified: Mon 28 Sep 1998
<!-- hhmts end -->

<!--===================================================================-->
<P> <hr> <P>
<center><H5>Copyright &copy; 1998, Larry Ayers <BR>
Published in Issue 33 of <i>Linux Gazette</i>, October 1998</H5></center>

<!--===================================================================-->
<P> <hr> <P>
<A HREF="./index.html"><IMG ALIGN=BOTTOM SRC="../gx/indexnew.gif"
ALT="[ TABLE OF CONTENTS ]"></A>
<A HREF="../index.html"><IMG ALIGN=BOTTOM SRC="../gx/homenew.gif"
ALT="[ FRONT PAGE ]"></A>
<A HREF="./jenkins2.html"><IMG SRC="../gx/back2.gif"
ALT=" Back "></A>
<A HREF="./ayers2.html"><IMG SRC="../gx/fwd.gif" ALT=" Next "></A>
<P> <hr> <P>
<!--startcut ==========================================================-->
</BODY>
</HTML>
<!--endcut ============================================================-->