old-www/LDP/LG/issue34/ayers2.html

<!--startcut ==========================================================-->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<title>How Not To Re-Invent The Wheel LG #34</title>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#A000A0"
ALINK="#FF0000">
<!--endcut ============================================================-->

<H4>
"Linux Gazette...<I>making Linux just a little more fun!</I>"
</H4>

<P> <HR> <P>
<!--===================================================================-->

<center><h1>
<font color="maroon">How Not To Re-Invent The Wheel</font>
</h1></center>
<center>
<h4>By <a href="mailto: layers@marktwain.net">Larry Ayers</a></h4>
</center>
<p><hr><p>

<center><h3>Introduction</h3></center>

<p>With all of the excitement lately about various software firms planning Linux
ports of their products, it's easy to lose sight of the great power and
versatility of the unsung small utilities which are a part of every Linux
distribution.  These tools, mostly GNU versions of small programs such as
<i>awk</i>, <i>grep</i> and <i>sed</i>, date back to the early pioneer days of
Unix and have been in wide use ever since.  They typically have
specialized capabilities and become especially useful when they are chained
together and data is piped from one to another.  Often a shell script serves
as the matrix in which they do their work.

<p>Sometimes a piece of software native to another operating system is ported
to Linux as an independent unit without taking advantage of pre-existing tools
which might have reduced the size of the program and reduced memory usage.
It's always a pleasure to happen upon software written with an awareness of
the power of Linux and its native utilities.  <i>Bu</i> is a backup program
and <i>NoSQL</i> is an ASCII-table relational database system; what they have
in common is their usage of simple but effective Linux tools to accomplish
their respective tasks.
<hr>

<center><h3>
<font color="maroon">Shell-script Backups With bu</font>
</h3></center>

<p>Making a backup of the myriad files on a Linux system isn't necessary
for most stand-alone single-user machines.  Backing up configuration and
personal files to floppies or other removable media is normally all that is
necessary, assuming that a recent Linux distribution CD and a CDROM drive are
available.  The situation becomes more complex with multi-user servers or with
machines used in a business setting, where the sheer number of irreplaceable
files makes this simple method impractical and time-consuming; in these cases
the traditional method in the unix world has been to use <i>cpio</i> or
another archiving utility to copy files to a tape drive.  Though the price of
hard disks has plummeted in recent years while their capacity has ballooned,
reliable tape drives capable of storing the vast amounts of data a modern
hard-disk can hold are still quite expensive, sometimes rivalling the cost of
the computer they protect from loss of data.

<p>Vincent Stemen has developed a small backup utility called <i>bu</i> which
is shell-based and makes good use of standard Linux utilities such as
<i>cp</i> and <i>sed</i>.  Rather than being intended for
backups to tape or other streaming device, <i>bu</i> is designed to mirror
files on another file-system, preferably located on a separate hard drive.

<p><i>Bu</i> is just a twelve kilobyte shell script along with a few
configuration files.  It's remarkably capable; compare this list of features
with those of other backup utilities:<br>

<ul>
  <li>Checks timestamps and only copies new or changed files
  <li>Deals with symbolic links intelligently
  <li>Writes a log-file upon completion
  <li>Will ignore directories which are mounted filesystems
  <li>Easy specification of files and directories to include or exclude
</ul>

<p><i>Bu</i> in its earlier versions used <i>cpio</i>  extensively, but due to
a problem with new directory permissions <i>cp</i> is the main engine of the
utility now.  <i>Cp -a</i> used by itself can be used to bulk-copy entire
filesystems to a new location, but the symbolic links would have to be dealt
with manually, which is time-consuming.  Also missing would be the ability to
automatically include and exclude specific files and directories; <i>bu</i>
refers to two configuration files, <kbd>/usr/local/backups/Exclude</kbd> and
<kbd>/usr/local/backups/Include</kbd>, for this information.

<p>This small and handy utility isn't intended to completely supplant
traditional tape-drive backup systems, but its author has been using <i>bu</i>
as the basis of a backup strategy involving several development machines and
several gigabytes of data.  <i>Bu</i> can be obtained from this
<a href="http://www.crel.com/bu/">web-page</a>; be sure to read the white
paper included in the distribution which details the rationale behind the utility.
<hr>

<center><h3>
<font color="maroon">The NoSQL Relational Database</font>
</h3></center>

<p>Carlo Strozzi (a member of the Italian Linux society) has developed a
relational database management system (RDBMS) which uses tab-delimited ASCII
tables as its data format.  <i>NoSQL</i> is a descendant of an RDBMS developed
by Walter W. Hobbs (of the RAND Organization) called <i>RDB</i>.  The
commercial product <i>/rdb</i> sold by Revolutionary Software is similar, but
uses more compiled C code for greater speed.

<p>Carlo Strozzi had this to say about his motivation for developing
<i>NoSQL</i> (excerpted from the documentation):

<blockquote><pre>
Several times I have found myself writing applications that
needed to rely upon simple database management tasks. Most
commercial database products are often too costly and too
feature-packed to encourage casual use. There are also plenty of
good freeware databases around, but they too tend to provide far
more that I need most of the times, and they too lack the
shell-level approach of NoSQL. Admittedly, having been written
with interpretive languages (Shell, Perl, AWK), NoSQL is not the
fastest DBMS of all, at least not always (a lot depends on the
application).
</pre></blockquote>

<p>The philosophy behind these database systems is well-expressed in an
 article titled <cite>A 4GL Language</cite>, which was written by Evan
Schaffer and Mike Wolf, founders of Revolutionary Software.  The paper
originally appeared in the March 1991 issue of the Unix Review; a Postscript
version is included with the <i>NoSQL</i> documentation.  Here is the
abstract:

<blockquote><pre>
There are many database systems available for UNIX. But almost
all are software prisons that you must get into and leave the
power of UNIX behind. Most were developed on operating systems
other than UNIX. Consequently their developers had very few
software features to build upon, and wrote the functionality they
needed directly, without regard for the features provided by the
operating system. The resulting database systems are large,
complex programs which degrade total system performance,
especially when they are run in a multi-user environment.

UNIX provides hundreds of programs that can be piped together to
easily perform almost any function imaginable. Nothing comes
close to providing the functions that come standard with
UNIX. Programs and philosophies carried over from other systems
put walls between the user and UNIX, and the power of UNIX is
thrown away.

The shell, extended with a few relational operators, is the
fourth generation language most appropriate to the UNIX
environment.
</pre></blockquote>

<p>The complete article is well worth reading for anyone who has ever wondered
just why Linux software is different than that used with mainstream operating
systems, and why GUI software has only recently began to become common.

<p><i>NoSQL</i> incorporates the ideas presented above.  A major
difference between Walter W. Hobbs' <i>RDB</i> database and <i>NoSQL</i> is
that <i>NoSQL</i> uses <i>awk</i> extensively to perform tasks handled by
<i>perl</i> in <i>RDB</i>.  <i>Awk</i> is a more specialized tool with a much
smaller memory footprint, and since the data-pipelining which is the essence
of these relational database management systems requires repeated invocation
of their respective interpreters, <i>NoSQL</i> exerts less of a strain
on a system's resources, especially important in a multi-user environment.

<p>After installing the package (no compilation is involved) a new
subdirectory under <kbd>/usr/local/lib</kbd> called <kbd>nosql</kbd> will be
created and populated; it will have these subdirectories:

<dl>
  <dt><kbd>awk</kbd>
  <dd> contains several <i>awk</i> scripts which are
      responsible for most of the table-manipulation jobs
  <p><dt><kbd>doc</kbd>
      <dd>contains both Postscript and HTML versions of the
      readable and complete <i>NoSQL</i> documentation, as well as a
      Postscript version of the Schaffer and Wolf article from the Unix Review
  <p><dt><kbd>mylib</kbd>
      <dd>an empty directory for new scripts and programs
  <p><dt><kbd>perl</kbd>
      <dd><i>perl</i> scripts which perform other <i>NoSQL</i> functions
  <p><dt><kbd>sh</kbd>
      <dd>shell scripts which act as wrappers for the <i>awk</i>
      and <i>perl</i> scripts.
</dl>

<p>The entire subdirectory occupies just under 600 kb., most of which is
documentation.

<p>After installing the files, the only other step needed before trying out
the database is setting three environment variables.  Here are three lines
from my <kbd>.zshenv</kbd> file (<i>bash</i> users should have these lines in
the <kbd>.bash_profile </kbd> file):<br>

<pre><kbd>
export NSQLIB=/usr/local/lib/nosql
export NSQSH=/bin/ash
export NSQAWK=/usr/bin/mawk
</kbd></pre>

<p>Carlo Strozzi recommends using <i>ash</i> rather than one of the larger and
more powerful shells such as <i>bash</i> or <i>zsh</i>; <i>ash</i> uses less
memory. and since the shell is repeatedly invoked while using <i>NoSQL</i> the
upshot will be a noticeable increase in speed and a reduction in memory
requirements.

<p>Since there is no compiled code in the package, <i>NoSQL</i> should run on
any machines which have <i>awk</i> and <i>perl</i> available; in other words
the database isn't Linux-centric.  The ASCII format of the data tables is also
very portable, and can be manipulated by text editors and common filesystem
tools.  Data can be extracted from tables by means of various &quot;operators&quot;
via input-output redirection (e.g., pipes, STDIN and STDOUT).  The only limits
on the amount of data which can be handled are in the machine running
<i>NoSQL</i>; the installed memory and processor speed are the limiting
factors.

<p>As the name implies this is not an SQL database, which should make
<i>NoSQL</i> more accessible to users lacking SQL expertise.  I don't know SQL
at all and I found the basic commands of <i>NoSQL</i> easy to learn.  All
commands are executed as parameters of the <kbd>nosql</kbd> shell script.
Here's an example <i>NoSQL</i> table:<br>

<blockquote><pre>
Name	   Freq	       Height	   Season
----	   ----	       ------	   ------
laccaria   27	       6	   Fall
lepiota	   5	       8	   Summer
amanita	   42	       7	   Summer
lentinus   85	       5	   Spring-Fall
morchella  45	       6	   Spring
boletus	   65	       5	   Summer
russula	   75	       4	   Summer
</pre></blockquote>

<p>Single tabs must separate the fields from each other, even the spaces
between the groups of dashes on the dashed separator line must be single tabs.
An alternate format for the tabular data is the list; the above table can be
converted to this format with the command<br>

<p><kbd>nosql tabletolist &lt; [filename]</kbd>

<p>The results look like this:<br>

<pre>

Name	laccaria
Freq	27
Height	6
Season	Fall

Name	lepiota
Freq	5
Height	8
Season	Summer

Name	amanita
Freq	42
Height	7
Season	Summer

Name	lentinus
Freq	85
Height	5
Season	Spring-Fall

Name	morchella
Freq	45
Height	6
Season	Spring

Name	boletus
Freq	65
Height	5
Season	Summer

Name	russula
Freq	75
Height	4
Season	Summer

</pre>

<p>If the above table were named <kbd>pilze.rdb</kbd>, either the command<br>

<p><kbd>nosql istable &lt; pilze.rdb</kbd>

<p>or

<p><kbd>nosql islist &lt; pilze.rdb</kbd>

<p>would ask <kbd>nosql</kbd> to check the table or list format's validity,
depending on which format is being checked.  Another command,<br>

<p><kbd>nosql edit &lt; pilze.rdb</kbd><br>

<p>will open the file in the editor defined by the EDITOR environment variable
(often set to <i>vi</i> by default).  A file in table format is automatically
converted into the vertical list format for easier editing, then changed back
into a table when exiting the editor.  When the file is saved or closed
<i>NoSQL</i> will automatically check the validity of the format and give the
line numbers where any errors occur.  This seemingly obsessive concern with
correct format isn't mere pedantry; the various <i>NoSQL</i> operators which
manipulate and extract data need to be able to quickly distinguish headers
from data and data-fields from each other, and single tabs are the criteria.

<p>There are over forty operator functions available, some of which extract or
rearrange fields while others are used to generate reports.  Their names are
more-or-less mnemonic, such as <i>inscol</i> and <i>addcol</i>, which are used
to insert a column into a table, respectively on the left- or right-hand side.
Other operators index and search tables.  Examples of typical usage (i.e.,
connecting <i>NoSQL</i> commands with pipes) are included in the
documentation.

<p>As with any Open-Source software, it's hard to tell how many people or
organizations are using it.  In an e-mail, I asked Carlo Strozzi for examples
of real-world usage of <i>NoSQL</i>; he replied that he has been using it
quite a bit for database-backed CGI scripts for the WWW.  He also stated that
several companies in Italy are using it internally.

Carlo Strozzi works for IBM in Italy, and he has developed several web
applications backed by <i>NoSQL</i>; three of the publicly accessible pages
are:<br>

<p>
<a href="http://www.whoswho-sutter.com">Fortune companies and people profiles<a>
<p><a href="http://www.secondamano.it">Classifieds - this is in Italian</a>
<p><a href="http://annunci-auto.repubblica.it">Car classifieds, in Italian</a>

<p>The latest version of <i>NoSQL</i> can be obtained from
this <a href="ftp://ftp.linux.it/pub/database/NoSQL">FTP site.</a>

<hr>

<!-- hhmts start -->
Last modified: Thu 29 Oct 1998
<!-- hhmts end -->

<!--===================================================================-->
<P> <hr> <P>
<center><H5>Copyright &copy; 1998, Larry Ayers <BR>
Published in Issue 34 of <i>Linux Gazette</i>, November 1998</H5></center>

<!--===================================================================-->
<P> <hr> <P>
<A HREF="./index.html"><IMG ALIGN=BOTTOM SRC="../gx/indexnew.gif"
ALT="[ TABLE OF CONTENTS ]"></A>
<A HREF="../index.html"><IMG ALIGN=BOTTOM SRC="../gx/homenew.gif"
ALT="[ FRONT PAGE ]"></A>
<A HREF="./ayers1.html"><IMG SRC="../gx/back2.gif"
ALT=" Back "></A>
<A HREF="./ayers3.html"><IMG SRC="../gx/fwd.gif" ALT=" Next "></A>
<P> <hr> <P>
<!--startcut ==========================================================-->
</BODY>
</HTML>
<!--endcut ============================================================-->