old-www/LDP/LG/issue57/tindale.html

328 lines
15 KiB
HTML

<!--startcut ==============================================-->
<!-- *** BEGIN HTML header *** -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML><HEAD>
<title>Hash Tables in Java LG #57</title>
<style type="text/css">
<!--
BODY {
color: #000000;
background-color: #ffffff;
} /* default */
span.constant {
color: #5f9ea0;
} /* font-lock-constant-face */
span.keyword {
color: #a020f0;
} /* font-lock-keyword-face */
span.function-name {
color: #0000ff;
} /* font-lock-function-name-face */
span.comment {
color: #b22222;
} /* font-lock-comment-face */
span.string {
color: #bc8f8f;
} /* font-lock-string-face */
span.variable-name {
color: #b8860b;
} /* font-lock-variable-name-face */
span.type {
color: #228b22;
} /* font-lock-type-face */
-->
</style>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
ALINK="#FF0000">
<!-- *** END HTML header *** -->
<CENTER>
<A HREF="http://www.linuxgazette.com/">
<H1><IMG ALT="LINUX GAZETTE" SRC="../gx/lglogo.jpg"
WIDTH="600" HEIGHT="124" border="0"></H1></A>
<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="stoddard.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue57/tindale.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="lg_backpage57.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
<P>
</CENTER>
<!--endcut ============================================================-->
<H4 ALIGN="center">
"Linux Gazette...<I>making Linux just a little more fun!</I>"
</H4>
<P> <HR> <P>
<!--===================================================================-->
<center>
<H1><font color="maroon">Hash Tables in Java</font></H1>
<H4>By <a href="mailto:ben@bluesat.unsw.edu.au">Ben Tindale</a></H4>
</center>
<P> <HR> <P>
<!-- END header -->
<h1>Hash tables in Java</h1>
<p> A hash table is conceptually a contiguous section of memory
with a number of addressable elements, commonly called bins, in which
data can be quickly inserted, deleted and found. Hash tables represent
a sacrifice of memory for the sake of speed - they are certainly not
the most memory efficient means of storing data, but they provide very
fast lookup times. Hash tables are a common means of organising data,
so the designers of the Java programming language have provided a
number of classes for easily creating and manipulating instances of
hash tables.
<p> Hashtable is the class which provides hash tables in
Java. Hashtable inherits directly from Dictionary and implements the
Map, Cloneable and Serializable interfaces. The implementation of the
Map interface is new with Java 1.2. You can view the documentation on
Hashtable <a
href="http://java.sun.com/products/jdk/1.2/docs/api/java/util/Hashtable.html">here</a>.
<p> A key is a value that can be mapped to one of the
addressable elements of the hash table. The Java programming language
provides an interface for generating keys for a number of the core
classes: as an example, the snippet below prints out the key
representation of a string for later use in a hash table.
<pre>
String abc = new String("abc");
System.out.println("Key for \"abc\" is "+ abc.hashCode());
</pre>
</p>
<p> A hashing function is a function that performs some
operation on a set of data such that a key is generated from that
data with the key being used as the means of identifying which memory
element of the hash table to place the data in. There are a number of
properties that it is desirable for a hashing function to possess in
order that the hash table be effectively used:
<ul>
<li> Data should be dispersed as randomly as possible across the hash
table to minimise the chances of a collision. For example, a good
hashing function would place the letter ``b'' fairly far from the
letter ``a''.
<li> The hashing function should execute in a reasonable period.
</ul>
Unfortunately, as we shall see below, the hashing functions provided
by Java do not satisfy the first criterion.
<p> The load factor of a hash table is defined as the ratio of
the number of filled bins to the total number of bins available. A bin
is full when it points to or contains a data element. The load factor
is a useful parameter to use in estimating the likelihood of a
collision occurring. The Java programming language will allocate more
memory to an existing hash table if the load factor exceeds 75%. The
user can also choose to set the initial capacity of the hash table
with the aim of reducing the number of rehashing methods required. The
code snippet below demonstrate how this can be achieved.
<pre>
int initialCapacity = 1000;
float loadFactor = 0.80;
Hashtable ht = new Hashtable(initialCapacity, loadFactor);
</pre>
If you want to allocate more space for your hash table before the load
factor reaches the specified value then use the <i>rehash()</i> method
like this:
<pre>
ht.rehash();
</pre>
<p> A collision occurs when two pieces of data are denoted with
the same key by the hashing function. Since the point of using a hash
table is to maximise the efficiency with which data is inserted,
deleted or found, collisions are to be avoided as much as possible. If
you know the hashing function used to create a key then it can be very
easy to create collisions. For example, the Java code illustrates how
two different strings can have the same hashcode.
<A HREF="misc/tindale/Identical.java.txt">(text version)</A>
<pre>
import Java.util.*;
import Java.lang.*;
// x + 31x = x(31 + 1) = x + 31 + 31(x-1)
public class Identical
{
public static void main(String[] args)
{
String s1 = new String("BB");
String s2 = new String("Aa");
System.out.println(s1.hashCode());
System.out.println(s2.hashCode());
}
}
</pre>
This code generates the following output on my RedHat 6.2 box using
the <a href="#references">kaffe</a> compiler.
<pre>
[bash]$ javac Identical.java
[bash]$ java Identical
2112
2112
[bash]$
</pre>
<p> Chaining is a method of dealing with collisions in a hash table by
imagining that each bin is a pointer to a linked list of data
elements. When a collision occurs, the new data element is simply
inserted into this linked list in some manner. Similarly, when
attempting to remove an element from a bin with more than one entry,
the list is followed until the element matching the one to be deleted
is found.Actually, there is no need for collisions to be dealt with by
solely using a linked list - a data structure such as a binary tree
could also be used. The Hashtable class in the Java foundation classes
uses this method to insert elements into a hash table. Interestingly,
using chaining means that a hashtable can have a load factor that
exceeds 100%.
<p> Open addressing occurs when all of the elements in a hash
table are stored in the table itself - no pointers to linked lists:
each bin holds one piece of data. This is the simplest means of
implementing a hash table, since the table reduces to being an array
where the elements can be inserted or deleted from any index position
at any time.
<p> Linear probing is a means of implementing open addressing by choosing
the next free bin to insert a data element into if a collision occurs
while trying to insert data. Each of the subsequent bins is
checked for a collision before the data is inserted.
<p> The String class contains a method hashCode() which is used
to generate a key which can used as the key for a hash table. The
hashcode for a String object is computed as<br>
<i>s[0]*31^(n-1)+s[1]*31^(n-2)+...+s[n-1]</i><br>
using integer arithmetic and where <i>s[i]</i> is the <i>i</i>th character of a
string of length <i>n</i>. The hash value of an empty string is defined as
zero.
<p> I've included a small sample program called <a
href="#closewords">CloseWords</a> which finds words in the
system dictionary which are ``close'' to the command line argument. To
do this the program explicitly exploits one of the traits of the class
String's hashing function which is that the hashcode generated tends
to cluster together words which are of similar alphanumeric
composition. This is actually an undesirable trait, since if the input
data is comprised of a limited set of characters then there will tend
to be a large number of collisions. The ideal hashing function would
distribute the data randomly over the hash table, with trends in the
data not leading to an all over tendency to cluster.
<p> Another limitation of the hashCode method is that by making
the key of type integer the designers of Java unnaturally limited the
possible magnitude of the key to just <i>2^32 -1</i> meaning that the
probability of a collision occurring is much larger than if the key
was represented by a 64 bit data type.
<p> The Hashtable class and methods supplied in the Java Foundation
Classes are a powerful tool for data manipulation - particularly when
rapid data retrieval, searching or deleting are required. For large
data sets, however, the implementation of the hashing functions in
Java will cause a tendency for clustering - which will unnecessarily
slow down execution. A better implementation of a hash table would
involve a hashing function which distributed data more randomly and a
longer data type used for the key.
<h2>
<a name="references">References and links</a>
</h2>
<p>
For a more complete discussion of the limitations of hash tables in
Java and a better implementation see.
<p>
Java is a superbly documented language - check it out at <a
href="http://www.java.sun.com/docs/">SUN</a>.
<p>
For information on the open source Kaffe compiler visit the <a
href="http://www.kaffe.org">website</a>.
<h2>
<a name="closewords">CloseWords</a>
</h2>
Note: you can grab the source code <a href="misc/tindale/CloseWords.java.txt">here</a>
<pre>
<span class="keyword">import</span> <span class="constant">java</span>.<span class="constant">lang</span>.<span class="constant">*</span>;
<span class="keyword">import</span> <span class="constant">java</span>.<span class="constant">util</span>.<span class="constant">*</span>;
<span class="keyword">import</span> <span class="constant">java</span>.<span class="constant">io</span>.<span class="constant">*</span>;
<span class="comment">/** CloseWords: Exploit the clustering tendency of the native hashCode() method
* in the String class to find words that are &quot;close&quot; to the argument.
*/</span>
<span class="keyword">public</span> <span class="keyword">class</span> <span class="type">CloseWords</span>
{
<span class="type">Hashtable</span> <span class="variable-name">ht</span>;
<span class="type">String</span> <span class="variable-name">currString</span>;
<span class="comment">/** In the code below we create an instance of the Hashtable class in which to store
* our hash of all of the words in the system dictionary (yes, this is a very memory
* inefficient way of indexing the words).
*
* @</span><span class="constant">param</span><span class="comment"> </span><span class="variable-name">args</span><span class="comment">
*/</span>
<span class="keyword">public</span> <span class="keyword">static</span> <span class="type">void</span> <span class="function-name">main</span>(<span class="type">String</span>[] <span class="variable-name">args</span>)
{
ht = <span class="keyword">new</span> <span class="type">Hashtable</span>();
<span class="keyword">try</span>
{
<span class="type">DataInputStream</span> <span class="variable-name">in</span> = <span class="keyword">new</span> <span class="type">DataInputStream</span>(
<span class="keyword">new</span> <span class="type">BufferedInputStream</span>(
<span class="keyword">new</span> <span class="type">FileInputStream</span>(<span class="string">&quot;/usr/dict/words&quot;</span>)));
<span class="keyword">while</span>((currString = in.readLine())!=<span class="constant">null</span>)
ht.put(<span class="keyword">new</span> <span class="type">Integer</span>(currString.hashCode()), currString);
<span class="type">int</span> <span class="variable-name">i</span> = args[0].hashCode();
<span class="type">int</span> <span class="variable-name">found</span>=0;
<span class="keyword">while</span>(found &lt; 5)
{
i--;
<span class="keyword">if</span>(ht.get(<span class="keyword">new</span> <span class="type">Integer</span>(i))!=<span class="constant">null</span>)
{
System.out.println(ht.get(<span class="keyword">new</span> <span class="type">Integer</span>(i)));
found++;
}
}
i = args[0].hashCode();
found=0;
<span class="keyword">while</span>(found &lt; 5)
{
i++;
<span class="keyword">if</span>(ht.get(<span class="keyword">new</span> <span class="type">Integer</span>(i))!=<span class="constant">null</span>)
{
System.out.println(ht.get(<span class="keyword">new</span> <span class="type">Integer</span>(i)));
found++;
}
}
}
<span class="keyword">catch</span>(<span class="type">IOException</span> <span class="variable-name">ioe</span>)
{
System.out.println(<span class="string">&quot;IO Exception&quot;</span>);
}
}
}
</pre>
<!-- *** BEGIN copyright *** -->
<P> <hr> <!-- P -->
<H5 ALIGN=center>
Copyright &copy; 2000, Ben Tindale<BR>
Published in Issue 57 of <i>Linux Gazette</i>, September 2000</H5>
<!-- *** END copyright *** -->
<!--startcut ==========================================================-->
<HR><P>
<CENTER>
<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="stoddard.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue57/tindale.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="lg_backpage57.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
</CENTER>
</BODY></HTML>
<!--endcut ============================================================-->