277 lines
11 KiB
HTML
277 lines
11 KiB
HTML
<!--startcut ==========================================================-->
|
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
|
|
<HTML>
|
|
<HEAD>
|
|
<title>Searching a Web Site with Linux LG #32</title>
|
|
</HEAD>
|
|
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#A000A0"
|
|
ALINK="#FF0000">
|
|
<!--endcut ============================================================-->
|
|
|
|
<H4>
|
|
"Linux Gazette...<I>making Linux just a little more fun!</I>"
|
|
</H4>
|
|
|
|
<P> <HR> <P>
|
|
<!--===================================================================-->
|
|
|
|
<center>
|
|
<H1><font color="maroon">Searching a Web Site with Linux</font></H1>
|
|
<H4>By <a href="mailto:brw@inetinc.net">Branden Williams</a></H4>
|
|
</center>
|
|
<P> <HR> <P>
|
|
|
|
As your website grows in size, so will the number of people that visit
|
|
your site. Now most of these people are just like you and me in the
|
|
sense that they want to go to your site, click a button, and get exactly
|
|
what information they were looking for. To serve these kinds of users a
|
|
bit better, the Internet community responded with the ``Site
|
|
Search''. A way to search a single website for the information you are
|
|
looking for. As a system administrator, I have been asked to provide search
|
|
engines for people to use on their websites so that their clients can get to
|
|
their information as fast as possible.
|
|
<p>
|
|
Now the trick to most search engines (Internet wide included) is that
|
|
they index and search entire sites. So for instance, you are looking for
|
|
used cars. You decide to look for an early 90s model Nissan Truck. You
|
|
get on the web, and go to AltaVista. If you do a search for ``used Nissan
|
|
truck'', you will most likely come up with a few pages that have listings
|
|
of cars. Now the pain comes when you go to that link and see that 400K
|
|
HTML file with text listings of used trucks. You have to either go line
|
|
by line until you find your choice, or like most people, find it on your
|
|
page using your browser's find command.
|
|
<p>
|
|
Now wouldn't it be nice if you could just search for your used truck and
|
|
get the results you are looking for in one fail swoop?
|
|
<p>
|
|
A recent search CGI that I designed for a company called Resource
|
|
Spectrum (http://www.spectrumm.com/) is what precipitated DocSearch.
|
|
Resource Spectrum needed a solution similar to my truck analogy. They
|
|
are a placement agency for high skilled jobs that needed another
|
|
alternative to posting their job listing to newsgroups. What was proposed
|
|
was a searchable Internet listing of the jobs on their new website.
|
|
<p>
|
|
Now as the job listing came to us, it was in a word document that had
|
|
been exported to HTML. As I searched (no pun intended) long and hard for
|
|
something that I could use, nothing turned up. All of the search
|
|
engines I found <b>only</b> searched sites, not single documents.
|
|
<p>
|
|
This is where the idea for DocSearch came from.
|
|
<p>
|
|
I needed a simple, clean way to search that single HTML document so users
|
|
could get the information they needed quickly and easily.
|
|
<p>
|
|
I got out the old Perl Reference and spent a few afternoons working out a
|
|
solution to this problem. After a few updates, you see in front of you
|
|
DocSearch 1.0.4. You can grab the latest version at
|
|
ftp://ftp.inetinc.net/pub/docsearch/docsearch.tar.gz.
|
|
<p>
|
|
Let's go through the code here so we can see how this works. First
|
|
before we really get into this though, you need to make sure you have the
|
|
CGI Library (cgi-lib.pl) installed. If you do not, you can download it
|
|
from http://www.bio.cam.ac.uk/cgi-lib/. This is simply a Perl library
|
|
that contains several useful functions for CGIs. Place it in your
|
|
cgi-bin directory and make it world readable and executable. (chmod a+rx
|
|
cgi-lib.pl)
|
|
<p>
|
|
Now you can start to configure DocSearch. First off, there are a few
|
|
constants that need to be set. They are in reference to the
|
|
characteristics of the document you are searching. For instance...
|
|
<p>
|
|
<pre>
|
|
# The Document you want to search.
|
|
$doc = "/path/to/my/list.html";
|
|
</pre>
|
|
Set this to the absolute path of the document you are searching.
|
|
<p>
|
|
<pre>
|
|
# Document Title. The text to go inside the
|
|
<title></title> HTML tags.
|
|
$htmltitle = "Nifty Search Results";
|
|
</pre>
|
|
Set this to what you want the results page title to be.
|
|
<p>
|
|
<pre>
|
|
# Optional Back link. If you don't want one, make the string null.
|
|
# i.e. $backlink = "";
|
|
$backlink = "http://www.inetinc.net/some.html";
|
|
</pre>
|
|
If you want to provide a ``Go Back'' link, enter the URL of the file that
|
|
we will be referencing.
|
|
<p>
|
|
<pre>
|
|
# Record delimiter. The text which separates the records.
|
|
$recdelim = " ";
|
|
</pre>
|
|
This part is one of the most important aspects of the search. The
|
|
document you are searching must have something in between the "records"
|
|
to delimit the html document. In English, you will need to place some
|
|
HTML comment or something in between each possible result of the search.
|
|
In my example, MS Word put the <tt>$nbsp;</tt> tag in between all of the
|
|
records by default, so I just used that as a delimiter.
|
|
<p>
|
|
Next we <b>ReadParse()</b> our information from the HTML form that was
|
|
used as a front end to our CGI. Then to simplify things later, we go
|
|
ahead and set the variable <tt>$query</tt> to be the term we are
|
|
searching for.
|
|
<p>
|
|
<pre>
|
|
$query = $input{`term'};
|
|
</pre>
|
|
This step can be repeated for each query item you would like to use to
|
|
narrow your search. If you want any of these items to be optional, just
|
|
add a line like this in your code.
|
|
<p>
|
|
<pre>
|
|
if ($query eq "") {
|
|
$query = " ";
|
|
}
|
|
</pre>
|
|
This will match relatively any record you search.
|
|
<p>
|
|
Now comes a very important step. We need to make sure that any meta
|
|
characters are escaped. Perl's bind operator uses meta characters to
|
|
modify and change search output. We want to make sure that any
|
|
characters that are entered into the form are not going to change the
|
|
output of our search in any way.
|
|
<p>
|
|
<pre>
|
|
$query =~ s/([-+i.<>&|^%=])/\\\1/g;
|
|
</pre>
|
|
Boy does that look messy! That is basically just a Regular Expression to
|
|
escape all of the meta characters. Basically this will change a
|
|
<tt>+</tt> into a <tt>\+</tt>.
|
|
<p>
|
|
Now we need to move right along and open up our target document. When we
|
|
do this, we will need to read the entire file into one variable. Then we
|
|
will work from there.
|
|
<p>
|
|
<pre>
|
|
open (SEARCH, "$doc");
|
|
undef $/;
|
|
$text = <SEARCH>;
|
|
close (SEARCH);
|
|
</pre>
|
|
The only thing you may not be familiar with is the <tt>undef $/;</tt>
|
|
statement you see there. For our search to work correctly, we must
|
|
undefine the Perl variable that separates the lines of our input file.
|
|
The reason this is necessary is due to the fact that we must read the
|
|
<b>entire</b> file into one variable. Unless this is undefined, only one
|
|
line will be read.
|
|
<p>
|
|
Now we will start the output of the results page. It is good to
|
|
customize it and make it appealing somehow to the user. This is free form
|
|
HTML so all you HTML guys, go at it.
|
|
<p>
|
|
Now we will do the real searching job. Here is the meat of our search.
|
|
You will notice there are two commented regular expressions in the search.
|
|
If you want to not display any images or show any links, you should
|
|
uncomment those lines.
|
|
<p>
|
|
<pre>
|
|
@records = split(/$recdelim/,$text);
|
|
</pre>
|
|
We want to split up the file into an array of records. Each record is a
|
|
valid search result, but is separate from the rest. This is where the
|
|
record delimiter comes into play.
|
|
<p>
|
|
<pre>
|
|
foreach $record (@records)
|
|
{
|
|
# $record =~ s/<a.*<\/a>//ig; # Do not print links inside this
|
|
# doc.
|
|
# $record =~ s/<img.*>//ig; # Do not display images inside this
|
|
# doc.
|
|
if ( $record =~ /$query/i ) {
|
|
print $record;
|
|
$matches++;
|
|
}
|
|
}
|
|
</pre>
|
|
This basically prints out every <tt>$record</tt> that matches our search
|
|
criteria. Again you can change the number of search criterion you use by
|
|
changing that if statement to something like this.
|
|
<p>
|
|
<pre>
|
|
if ( ($record =~ /$query/i) && ($record =~ /$anotheritem/) ) {
|
|
</pre>
|
|
This will try to match both queries with <tt>$record</tt> and upon a
|
|
successful match, it will dump that <tt>$record</tt> to our results
|
|
page. Notice how we also increment a variable called <tt>$matches</tt>
|
|
every time a match is made. This is not as much as to tell the user how many
|
|
different records were found, but more of a count to tell us if <b>no</b>
|
|
matches were found so we can tell the user that no, the system is not
|
|
down, but in fact we did not match any records based upon that query.
|
|
<p>
|
|
Now that we are done searching and displaying the results of our search,
|
|
we need to do a few administrative actions to ensure that we have fully
|
|
completed our job.
|
|
<p>
|
|
First off, as I was mentioning before, we need to check for zero matches
|
|
in our search and let the user know that we could not find anything to
|
|
match his query.
|
|
<p>
|
|
<pre>
|
|
if ($matches eq "0") {
|
|
$query =~ s/\\//g;
|
|
|
|
print << "End_Again";
|
|
|
|
<center>
|
|
<h2>Sorry! "$query" was not found!</h2><p>
|
|
</center>
|
|
End_Again
|
|
}
|
|
</pre>
|
|
Notice that lovely Regular Expression. Now that we had to take all of
|
|
the trouble to escape those meta characters, we need to remove the
|
|
escape chars. This way when they see that their <tt>$query</tt> was
|
|
not found, they will not look at it and say ``But that is not what I
|
|
entered!'' Then we want to dump the HTML to disappoint the user.
|
|
<p>
|
|
The only two things left to do is end the HTML document cleanly and allow
|
|
for the back link.
|
|
<p>
|
|
<pre>
|
|
if ( $backlink ne "" ) {
|
|
print "<center>";
|
|
print "<h3><a href=\"$backlink\">Go
|
|
back</a></h3>";
|
|
print "</center>";
|
|
}
|
|
|
|
print << "End_Of_Footer";
|
|
|
|
</body>
|
|
</html>
|
|
|
|
End_Of_Footer
|
|
</pre>
|
|
All done. Now you are happy because the user is happy. Not only have
|
|
you streamlined your website by allowing to search a single page, but you
|
|
have increased the user's utility by giving them the results they want.
|
|
The only result of this is <b>more hits</b>. By helping your user find the
|
|
information he needs, he will tell his friends about your site. And his
|
|
friends will tell their friends and so on. Putting the customer first
|
|
sometimes <b>does</b> work!
|
|
|
|
<!--===================================================================-->
|
|
<P> <hr> <P>
|
|
<center><H5>Copyright © 1998, Branden Williams <BR>
|
|
Published in Issue 32 of <i>Linux Gazette</i>, September 1998</H5></center>
|
|
|
|
<!--===================================================================-->
|
|
<P> <hr> <P>
|
|
<A HREF="./index.html"><IMG ALIGN=BOTTOM SRC="../gx/indexnew.gif"
|
|
ALT="[ TABLE OF CONTENTS ]"></A>
|
|
<A HREF="../index.html"><IMG ALIGN=BOTTOM SRC="../gx/homenew.gif"
|
|
ALT="[ FRONT PAGE ]"></A>
|
|
<A HREF="./jenkins2.html"><IMG SRC="../gx/back2.gif"
|
|
ALT=" Back "></A>
|
|
<A HREF="./rogers.html"><IMG SRC="../gx/fwd.gif" ALT=" Next "></A>
|
|
<P> <hr> <P>
|
|
<!--startcut ==========================================================-->
|
|
</BODY>
|
|
</HTML>
|
|
<!--endcut ============================================================-->
|