328 lines
15 KiB
HTML
328 lines
15 KiB
HTML
<!--startcut ==============================================-->
|
|
<!-- *** BEGIN HTML header *** -->
|
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
|
|
<HTML><HEAD>
|
|
<title>Hash Tables in Java LG #57</title>
|
|
<style type="text/css">
|
|
<!--
|
|
BODY {
|
|
color: #000000;
|
|
background-color: #ffffff;
|
|
} /* default */
|
|
span.constant {
|
|
color: #5f9ea0;
|
|
} /* font-lock-constant-face */
|
|
span.keyword {
|
|
color: #a020f0;
|
|
} /* font-lock-keyword-face */
|
|
span.function-name {
|
|
color: #0000ff;
|
|
} /* font-lock-function-name-face */
|
|
span.comment {
|
|
color: #b22222;
|
|
} /* font-lock-comment-face */
|
|
span.string {
|
|
color: #bc8f8f;
|
|
} /* font-lock-string-face */
|
|
span.variable-name {
|
|
color: #b8860b;
|
|
} /* font-lock-variable-name-face */
|
|
span.type {
|
|
color: #228b22;
|
|
} /* font-lock-type-face */
|
|
-->
|
|
</style>
|
|
</HEAD>
|
|
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
|
|
ALINK="#FF0000">
|
|
<!-- *** END HTML header *** -->
|
|
|
|
<CENTER>
|
|
<A HREF="http://www.linuxgazette.com/">
|
|
<H1><IMG ALT="LINUX GAZETTE" SRC="../gx/lglogo.jpg"
|
|
WIDTH="600" HEIGHT="124" border="0"></H1></A>
|
|
|
|
<!-- *** BEGIN navbar *** -->
|
|
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="stoddard.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue57/tindale.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="lg_backpage57.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
|
|
<!-- *** END navbar *** -->
|
|
<P>
|
|
</CENTER>
|
|
|
|
<!--endcut ============================================================-->
|
|
|
|
<H4 ALIGN="center">
|
|
"Linux Gazette...<I>making Linux just a little more fun!</I>"
|
|
</H4>
|
|
|
|
<P> <HR> <P>
|
|
<!--===================================================================-->
|
|
|
|
<center>
|
|
<H1><font color="maroon">Hash Tables in Java</font></H1>
|
|
<H4>By <a href="mailto:ben@bluesat.unsw.edu.au">Ben Tindale</a></H4>
|
|
</center>
|
|
<P> <HR> <P>
|
|
|
|
<!-- END header -->
|
|
|
|
|
|
<h1>Hash tables in Java</h1>
|
|
<p> A hash table is conceptually a contiguous section of memory
|
|
with a number of addressable elements, commonly called bins, in which
|
|
data can be quickly inserted, deleted and found. Hash tables represent
|
|
a sacrifice of memory for the sake of speed - they are certainly not
|
|
the most memory efficient means of storing data, but they provide very
|
|
fast lookup times. Hash tables are a common means of organising data,
|
|
so the designers of the Java programming language have provided a
|
|
number of classes for easily creating and manipulating instances of
|
|
hash tables.
|
|
|
|
<p> Hashtable is the class which provides hash tables in
|
|
Java. Hashtable inherits directly from Dictionary and implements the
|
|
Map, Cloneable and Serializable interfaces. The implementation of the
|
|
Map interface is new with Java 1.2. You can view the documentation on
|
|
Hashtable <a
|
|
href="http://java.sun.com/products/jdk/1.2/docs/api/java/util/Hashtable.html">here</a>.
|
|
|
|
<p> A key is a value that can be mapped to one of the
|
|
addressable elements of the hash table. The Java programming language
|
|
provides an interface for generating keys for a number of the core
|
|
classes: as an example, the snippet below prints out the key
|
|
representation of a string for later use in a hash table.
|
|
<pre>
|
|
String abc = new String("abc");
|
|
System.out.println("Key for \"abc\" is "+ abc.hashCode());
|
|
</pre>
|
|
</p>
|
|
<p> A hashing function is a function that performs some
|
|
operation on a set of data such that a key is generated from that
|
|
data with the key being used as the means of identifying which memory
|
|
element of the hash table to place the data in. There are a number of
|
|
properties that it is desirable for a hashing function to possess in
|
|
order that the hash table be effectively used:
|
|
<ul>
|
|
<li> Data should be dispersed as randomly as possible across the hash
|
|
table to minimise the chances of a collision. For example, a good
|
|
hashing function would place the letter ``b'' fairly far from the
|
|
letter ``a''.
|
|
<li> The hashing function should execute in a reasonable period.
|
|
</ul>
|
|
Unfortunately, as we shall see below, the hashing functions provided
|
|
by Java do not satisfy the first criterion.
|
|
|
|
<p> The load factor of a hash table is defined as the ratio of
|
|
the number of filled bins to the total number of bins available. A bin
|
|
is full when it points to or contains a data element. The load factor
|
|
is a useful parameter to use in estimating the likelihood of a
|
|
collision occurring. The Java programming language will allocate more
|
|
memory to an existing hash table if the load factor exceeds 75%. The
|
|
user can also choose to set the initial capacity of the hash table
|
|
with the aim of reducing the number of rehashing methods required. The
|
|
code snippet below demonstrate how this can be achieved.
|
|
<pre>
|
|
int initialCapacity = 1000;
|
|
float loadFactor = 0.80;
|
|
Hashtable ht = new Hashtable(initialCapacity, loadFactor);
|
|
</pre>
|
|
If you want to allocate more space for your hash table before the load
|
|
factor reaches the specified value then use the <i>rehash()</i> method
|
|
like this:
|
|
<pre>
|
|
ht.rehash();
|
|
</pre>
|
|
<p> A collision occurs when two pieces of data are denoted with
|
|
the same key by the hashing function. Since the point of using a hash
|
|
table is to maximise the efficiency with which data is inserted,
|
|
deleted or found, collisions are to be avoided as much as possible. If
|
|
you know the hashing function used to create a key then it can be very
|
|
easy to create collisions. For example, the Java code illustrates how
|
|
two different strings can have the same hashcode.
|
|
<A HREF="misc/tindale/Identical.java.txt">(text version)</A>
|
|
<pre>
|
|
import Java.util.*;
|
|
import Java.lang.*;
|
|
|
|
// x + 31x = x(31 + 1) = x + 31 + 31(x-1)
|
|
public class Identical
|
|
{
|
|
public static void main(String[] args)
|
|
{
|
|
String s1 = new String("BB");
|
|
String s2 = new String("Aa");
|
|
System.out.println(s1.hashCode());
|
|
System.out.println(s2.hashCode());
|
|
}
|
|
}
|
|
</pre>
|
|
This code generates the following output on my RedHat 6.2 box using
|
|
the <a href="#references">kaffe</a> compiler.
|
|
<pre>
|
|
[bash]$ javac Identical.java
|
|
[bash]$ java Identical
|
|
2112
|
|
2112
|
|
[bash]$
|
|
</pre>
|
|
<p> Chaining is a method of dealing with collisions in a hash table by
|
|
imagining that each bin is a pointer to a linked list of data
|
|
elements. When a collision occurs, the new data element is simply
|
|
inserted into this linked list in some manner. Similarly, when
|
|
attempting to remove an element from a bin with more than one entry,
|
|
the list is followed until the element matching the one to be deleted
|
|
is found.Actually, there is no need for collisions to be dealt with by
|
|
solely using a linked list - a data structure such as a binary tree
|
|
could also be used. The Hashtable class in the Java foundation classes
|
|
uses this method to insert elements into a hash table. Interestingly,
|
|
using chaining means that a hashtable can have a load factor that
|
|
exceeds 100%.
|
|
|
|
<p> Open addressing occurs when all of the elements in a hash
|
|
table are stored in the table itself - no pointers to linked lists:
|
|
each bin holds one piece of data. This is the simplest means of
|
|
implementing a hash table, since the table reduces to being an array
|
|
where the elements can be inserted or deleted from any index position
|
|
at any time.
|
|
|
|
<p> Linear probing is a means of implementing open addressing by choosing
|
|
the next free bin to insert a data element into if a collision occurs
|
|
while trying to insert data. Each of the subsequent bins is
|
|
checked for a collision before the data is inserted.
|
|
|
|
<p> The String class contains a method hashCode() which is used
|
|
to generate a key which can used as the key for a hash table. The
|
|
hashcode for a String object is computed as<br>
|
|
<i>s[0]*31^(n-1)+s[1]*31^(n-2)+...+s[n-1]</i><br>
|
|
using integer arithmetic and where <i>s[i]</i> is the <i>i</i>th character of a
|
|
string of length <i>n</i>. The hash value of an empty string is defined as
|
|
zero.
|
|
|
|
<p> I've included a small sample program called <a
|
|
href="#closewords">CloseWords</a> which finds words in the
|
|
system dictionary which are ``close'' to the command line argument. To
|
|
do this the program explicitly exploits one of the traits of the class
|
|
String's hashing function which is that the hashcode generated tends
|
|
to cluster together words which are of similar alphanumeric
|
|
composition. This is actually an undesirable trait, since if the input
|
|
data is comprised of a limited set of characters then there will tend
|
|
to be a large number of collisions. The ideal hashing function would
|
|
distribute the data randomly over the hash table, with trends in the
|
|
data not leading to an all over tendency to cluster.
|
|
|
|
<p> Another limitation of the hashCode method is that by making
|
|
the key of type integer the designers of Java unnaturally limited the
|
|
possible magnitude of the key to just <i>2^32 -1</i> meaning that the
|
|
probability of a collision occurring is much larger than if the key
|
|
was represented by a 64 bit data type.
|
|
|
|
<p> The Hashtable class and methods supplied in the Java Foundation
|
|
Classes are a powerful tool for data manipulation - particularly when
|
|
rapid data retrieval, searching or deleting are required. For large
|
|
data sets, however, the implementation of the hashing functions in
|
|
Java will cause a tendency for clustering - which will unnecessarily
|
|
slow down execution. A better implementation of a hash table would
|
|
involve a hashing function which distributed data more randomly and a
|
|
longer data type used for the key.
|
|
|
|
<h2>
|
|
<a name="references">References and links</a>
|
|
</h2>
|
|
<p>
|
|
For a more complete discussion of the limitations of hash tables in
|
|
Java and a better implementation see.
|
|
|
|
<p>
|
|
Java is a superbly documented language - check it out at <a
|
|
href="http://www.java.sun.com/docs/">SUN</a>.
|
|
|
|
<p>
|
|
For information on the open source Kaffe compiler visit the <a
|
|
href="http://www.kaffe.org">website</a>.
|
|
|
|
<h2>
|
|
<a name="closewords">CloseWords</a>
|
|
</h2>
|
|
Note: you can grab the source code <a href="misc/tindale/CloseWords.java.txt">here</a>
|
|
<pre>
|
|
<span class="keyword">import</span> <span class="constant">java</span>.<span class="constant">lang</span>.<span class="constant">*</span>;
|
|
<span class="keyword">import</span> <span class="constant">java</span>.<span class="constant">util</span>.<span class="constant">*</span>;
|
|
<span class="keyword">import</span> <span class="constant">java</span>.<span class="constant">io</span>.<span class="constant">*</span>;
|
|
|
|
<span class="comment">/** CloseWords: Exploit the clustering tendency of the native hashCode() method
|
|
* in the String class to find words that are "close" to the argument.
|
|
*/</span>
|
|
<span class="keyword">public</span> <span class="keyword">class</span> <span class="type">CloseWords</span>
|
|
{
|
|
<span class="type">Hashtable</span> <span class="variable-name">ht</span>;
|
|
<span class="type">String</span> <span class="variable-name">currString</span>;
|
|
|
|
<span class="comment">/** In the code below we create an instance of the Hashtable class in which to store
|
|
* our hash of all of the words in the system dictionary (yes, this is a very memory
|
|
* inefficient way of indexing the words).
|
|
*
|
|
* @</span><span class="constant">param</span><span class="comment"> </span><span class="variable-name">args</span><span class="comment">
|
|
*/</span>
|
|
<span class="keyword">public</span> <span class="keyword">static</span> <span class="type">void</span> <span class="function-name">main</span>(<span class="type">String</span>[] <span class="variable-name">args</span>)
|
|
{
|
|
ht = <span class="keyword">new</span> <span class="type">Hashtable</span>();
|
|
<span class="keyword">try</span>
|
|
{
|
|
<span class="type">DataInputStream</span> <span class="variable-name">in</span> = <span class="keyword">new</span> <span class="type">DataInputStream</span>(
|
|
<span class="keyword">new</span> <span class="type">BufferedInputStream</span>(
|
|
<span class="keyword">new</span> <span class="type">FileInputStream</span>(<span class="string">"/usr/dict/words"</span>)));
|
|
<span class="keyword">while</span>((currString = in.readLine())!=<span class="constant">null</span>)
|
|
ht.put(<span class="keyword">new</span> <span class="type">Integer</span>(currString.hashCode()), currString);
|
|
|
|
<span class="type">int</span> <span class="variable-name">i</span> = args[0].hashCode();
|
|
<span class="type">int</span> <span class="variable-name">found</span>=0;
|
|
|
|
<span class="keyword">while</span>(found < 5)
|
|
{
|
|
i--;
|
|
<span class="keyword">if</span>(ht.get(<span class="keyword">new</span> <span class="type">Integer</span>(i))!=<span class="constant">null</span>)
|
|
{
|
|
System.out.println(ht.get(<span class="keyword">new</span> <span class="type">Integer</span>(i)));
|
|
found++;
|
|
}
|
|
}
|
|
i = args[0].hashCode();
|
|
found=0;
|
|
|
|
<span class="keyword">while</span>(found < 5)
|
|
{
|
|
i++;
|
|
<span class="keyword">if</span>(ht.get(<span class="keyword">new</span> <span class="type">Integer</span>(i))!=<span class="constant">null</span>)
|
|
{
|
|
System.out.println(ht.get(<span class="keyword">new</span> <span class="type">Integer</span>(i)));
|
|
found++;
|
|
}
|
|
}
|
|
}
|
|
<span class="keyword">catch</span>(<span class="type">IOException</span> <span class="variable-name">ioe</span>)
|
|
{
|
|
System.out.println(<span class="string">"IO Exception"</span>);
|
|
}
|
|
}
|
|
}
|
|
</pre>
|
|
|
|
|
|
|
|
|
|
<!-- *** BEGIN copyright *** -->
|
|
<P> <hr> <!-- P -->
|
|
<H5 ALIGN=center>
|
|
|
|
Copyright © 2000, Ben Tindale<BR>
|
|
Published in Issue 57 of <i>Linux Gazette</i>, September 2000</H5>
|
|
<!-- *** END copyright *** -->
|
|
|
|
<!--startcut ==========================================================-->
|
|
<HR><P>
|
|
<CENTER>
|
|
<!-- *** BEGIN navbar *** -->
|
|
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="stoddard.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue57/tindale.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="lg_backpage57.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
|
|
<!-- *** END navbar *** -->
|
|
</CENTER>
|
|
</BODY></HTML>
|
|
<!--endcut ============================================================-->
|