
<!--startcut ==============================================-->
<!-- *** BEGIN HTML header *** -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML><HEAD>
<title>Downloading without a Browser LG #70</title>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
ALINK="#FF0000">
<!-- *** END HTML header *** -->
<CENTER>
<A HREF="http://www.linuxgazette.com/">
<IMG ALT="LINUX GAZETTE" SRC="../gx/lglogo.png"
WIDTH="600" HEIGHT="124" border="0"></A>
<BR>
<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="arndt.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue70/chung.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="ghosh.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
<P>
</CENTER>
<!--endcut ============================================================-->
<H4 ALIGN="center">
"Linux Gazette...<I>making Linux just a little more fun!</I>"
</H4>
<P> <HR> <P>
<!--===================================================================-->
<center>
<H1><font color="maroon">Downloading without a Browser</font></H1>
<H4>By <a href="mailto:ajchung@email.com">Adrian J Chung</a></H4>
</center>
<P> <HR> <P>
<!-- END header -->
<p> Ever had to download a file so huge over a link so slow that you'd need to
keep the web browser open for hours or days? What if you had 40 files linked
from a single web page, all of which you needed -- would you tediously click on
each one? What if the browser crashed before it could finish? GNU/Linux comes
equipped with a handy set of tools for downloading in the background,
independent of the browser. This allows you to log out, resume interrupted
downloads, and even schedule them to occur during off-peak Net usage hours.
<H3>When interactivity stands in the way</H3>
<p>
Web browsers are designed to make the Web interactive -- click and
expect results within seconds. But there are still many files that
can take longer than a few seconds to download, even over the quickest
of connections. One example is the ISO images that are popular among
those burning their own GNU/Linux CD-ROM distros. Some web browsers,
especially poorly coded ones, do not behave very well over long
durations, leaking memory or crashing at the most inopportune moment.
Despite the fusion of some browsers with file managers, many still do
not support the multi-selection and rubber-banding operations that make
it easy to transfer several files all in one go. You also have to
stay logged in until the entire file has arrived. Finally, you have
to be present at the office to click the link initiating the download,
thus angering coworkers with whom office bandwidth is being shared.
<p>
Downloading large files is a task better suited to a different
suite of tools. This article will discuss how to combine various
GNU/Linux utilities, namely <tt>lynx</tt>, <tt>wget</tt>, <tt>at</tt>,
<tt>crontab</tt>, etc. to solve a variety of file transfer situations.
A small amount of simple scripting will also be employed, so a little
knowledge of the <tt>bash</tt> shell will help.
<H3>The <tt>wget</tt> utility</H3>
<p>
All the major distributions include the <tt>wget</tt> downloading
tool.
<p><pre>
bash$ <b>wget http://place.your.url/here</b>
</pre>
<p>This can also handle FTP, date stamps, and recursively mirror
entire web-site directory trees -- and if you're not
careful, the entire website and whatever other sites it links to:
<p><pre>
bash$ <b>wget -m http://target.web.site/subdirectory</b>
</pre>
<p>
Due to the potentially high load it can place on servers, this
tool obeys the "robots.txt" protocol when mirroring. There are
several command-line options to control what exactly gets mirrored,
limiting the types of links followed and the file types downloaded.
Example: to follow only relative links and skip GIF images:
<p><pre>
bash$ <b>wget -m -L --reject=gif http://target.web.site/subdirectory</b>
</pre>
<p>
<tt>wget</tt> can also resume interrupted downloads ("<tt>-c</tt>"
option) when given the incomplete file to which to append the
remaining data. This operation needs to be supported by the server.
<p><pre>
bash$ <b>wget -c http://the.url.of/incomplete/file</b>
</pre>
<p>
Resumption and mirroring can be combined, allowing one to mirror a
large collection of files over the course of many separate download
sessions. More on how to automate this later.
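<p>
For instance, a later session can simply re-run the earlier mirror command
with "<tt>-c</tt>" added. This is only a sketch re-using the example URL from
above; how "<tt>-c</tt>" interacts with the timestamping implied by
"<tt>-m</tt>" can vary between <tt>wget</tt> versions:
<p><pre>
bash$ <b>wget -m -c -L --reject=gif http://target.web.site/subdirectory</b>
</pre>
<p>
Files that arrived completely in an earlier session are left alone, while
partially downloaded ones are appended to rather than restarted.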
<p>
If you're experiencing download interruptions as often as I do in my
office, you can tell <tt>wget</tt> to retry the URL several times:
<p><pre>
bash$ <b>wget -t 5 http://place.your.url/here</b>
</pre>
<p>
Here we give up after 5 attempts. Use "<tt>-t inf</tt>" to never give
up.
<p>
What about proxy firewalls? Use the <tt>http_proxy</tt> environment
variable or the <tt>.wgetrc</tt> configuration file to specify a proxy
via which to download. One problem with proxied downloads over an
intermittent connection is that resumption can sometimes fail. If a
proxied download is interrupted, the proxy server will cache only an
incomplete copy of the file. When you try to use "<tt>wget -c</tt>"
to get the remainder of the file the proxy checks its cache and
erroneously reports that you have the entire file already. You can
coax most proxies to bypass their cache by adding a special header to
your download request:
<p><pre>
bash$ <b>wget -c --header="Pragma: no-cache" http://place.your.url/here</b>
</pre>
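<p>
For completeness, here is a minimal sketch of the proxy settings mentioned
above; the proxy host and port are hypothetical placeholders. Either export
the environment variable before running <tt>wget</tt>:
<p><pre>
bash$ <b>export http_proxy="http://proxy.example.com:8080/"</b>
bash$ <b>wget http://place.your.url/here</b>
</pre>
<p>
or put the equivalent settings in your "<tt>.wgetrc</tt>" file:
<p><pre>
use_proxy = on
http_proxy = http://proxy.example.com:8080/
</pre>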
<p>
The "<tt>--header</tt>" option can add any number and manner of
headers, by which one can modify the behaviour of web servers and
proxies. Some sites refuse to serve files via externally sourced
links; content is delivered to browsers only if they access it via
some other page on the same site. You can get around this by
appending a "Referer:" header:
<p><pre>
bash$ <b>wget --header="Referer: http://coming.from.this/page" http://surfing.to.this/page</b>
</pre>
<p>
Some particularly anti-social web sites will only serve content to a
specific brand of browser. Get around this with a "User-Agent:"
header:
<p><pre>
bash$ <b>wget --header="User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)" http://msie.only.url/here</b>
</pre>
<p>
(Warning: the above tip may be considered circumventing a content
licensing mechanism and there exist anti-social legal systems that
have deemed these actions to be illegal. Check your local
legislation. Your mileage may vary.)
<H3>Downloading <tt>at</tt> what hour?</H3>
<p>
If you are downloading large files on your office computer over a
connection shared with easily angered coworkers who don't like their
streaming media slowed to a crawl, you should consider starting your
file transfers in the off-peak hours. You do not have to stay in the
office after everyone has left, nor remember to do a remote login from
home after dinner. Make use of the <tt>at</tt> job scheduler:
<p>
<blockquote><tt>
bash$ <b>at 2300</b><br>
warning: commands will be executed using /bin/sh<br>
at> <b>wget http://place.your.url/here</b><br>
at></tt> <em>press Ctrl-D</em>
</blockquote>
<p>
Here, we want to begin downloading at 11:00pm. Make sure that the
<tt>atd</tt> scheduling daemon is running in the background for this
to work.
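<p>
To confirm that the job has been queued, or to cancel it before it runs,
use <tt>atq</tt> to list pending jobs and <tt>atrm</tt> to remove one by
its job number (the "3" below is just an example):
<p><pre>
bash$ <b>atq</b>
bash$ <b>atrm 3</b>
</pre>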
<H3>It'll take how many days?</H3>
<p>
When there is a lot of data to download in one or several files, and
your bandwidth is comparable to the carrier pigeon protocol, you will
often find that the download you scheduled has not yet
completed when you arrive at work in the morning. Being a good
neighbour, you kill the job and submit another <tt>at</tt> job, this
time using "<tt>wget -c</tt>", repeating as necessary over as many
days as it'll take. It is better to automate this using a
<tt>crontab</tt>. Create a plain text file, called
"<tt>crontab.txt</tt>", containing something like the following:
<p><pre>
0 23 * * 1-5 wget -c -N http://place.your.url/here
0 6 * * 1-5 killall wget
</pre>
<p>
This will be your <tt>crontab</tt> file which specifies what jobs to
execute at periodic intervals. The first five columns say <EM>when</EM>
to execute the command, and the remainder of each line says <EM>what</EM> to
execute. The first two columns indicate the time of day -- 0
minutes past 11pm to start <tt>wget</tt>, 0 minutes past 6am to
<tt>killall wget</tt>. The <tt>*</tt> in the 3rd and 4th columns
indicates that these actions are to occur every day of every month. The
5th column indicates on which days of the week to schedule each
operation -- "1-5" is Monday to Friday.
<p>So every weekday at 11pm your download will begin, and at 6am every
weekday any <tt>wget</tt> still in progress will be terminated. To
activate this <tt>crontab</tt> schedule you need to issue the command:
<p><pre>
bash$ <b>crontab crontab.txt</b>
</pre>
<p>
The "<tt>-N</tt>" option for <tt>wget</tt> will check the timestamp of
the target file and halt downloading if they match, which is an
indication that the entire file has been transferred. So you can just
set it and forget it. "<tt>crontab -r</tt>" will remove this schedule.
I've downloaded many an ISO image over shared dial-up connections
using this approach.
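<p>
To double-check what is currently installed, "<tt>crontab -l</tt>" prints
the active table:
<p><pre>
bash$ <b>crontab -l</b>
</pre>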
<H3>Dynamically Generated Web Pages</H3>
<p>
Some web pages are generated on demand, since their content changes
frequently, sometimes several times a day. Since the target is
technically not a file, there is no file length and resuming a
download becomes meaningless -- the "<tt>-c</tt>" option fails to work.
Example: a PHP-generated page at Linux Weekly News:
<p><pre>
bash$ <b>wget http://lwn.net/bigpage.php3</b>
</pre>
If you interrupt the download and try to resume, it starts over from
scratch. My office Net connection is at times so poor that I've
written a simple script detecting when a dynamic HTML page has been
delivered completely:
<p><pre>
#!/bin/bash
#create it if absent
touch bigpage.php3
#check if we got the whole thing
while ! grep -qi '&lt;/html&gt;' bigpage.php3
do
    rm -f bigpage.php3
    #download LWN in one big page
    wget http://lwn.net/bigpage.php3
done
</pre>
<p>
The above <tt>bash</tt> script keeps downloading the document until
the string "<tt>&lt;/html&gt;</tt>" can be found, which marks the
end of the file.
<H3>SSL and Cookies</H3>
<p>
URLs beginning with "<tt>https://</tt>" must access remote files
through the Secure Sockets Layer. You will find another download
utility, called <a href="http://curl.haxx.se"><tt>curl</tt></a>, to be
handy in these situations.
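<p>
A minimal sketch, using a hypothetical secure URL (this assumes your
<tt>curl</tt> has been built with SSL support):
<p><pre>
bash$ <b>curl -o statement.html https://secure.example.com/statement.html</b>
</pre>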
<p>
Some web sites force-feed cookies to the browser before serving the
requested content. One must add a "<tt>Cookie:</tt>" header with the
correct information, which can be obtained from your web browser's cookie
file. For the <tt>lynx</tt> and <tt>Mozilla</tt> cookie file formats:
<p><pre>
bash$ <b>cookie=$( grep nytimes ~/.lynx_cookies |awk '{printf("%s=%s;",$6,$7)}' )</b>
</pre>
<p>
will construct the required cookie for downloading stuff from <a
href="http://www.nytimes.com/">http://www.nytimes.com</a>, assuming
that you have already registered with the site using this browser.
<tt>w3m</tt> uses a slightly different cookie file format:
<p><pre>
bash$ <b>cookie=$( grep nytimes ~/.w3m/cookie |awk '{printf("%s=%s;",$2,$3)}' )</b>
</pre>
<p>
Downloading can now be carried out thus:
<p><pre>
bash$ <b>wget --header="Cookie: $cookie" http://www.nytimes.com/reuters/technology/tech-tech-supercomput.html</b>
</pre>
<p>
or using the <tt>curl</tt> tool:
<p><pre>
bash$ <b>curl -v -b $cookie -o supercomp.html http://www.nytimes.com/reuters/technology/tech-tech-supercomput.html</b>
</pre>
<H3>Making Lists of URLs</H3>
<p>
So far, we've only been downloading single files or mirroring entire
website directories. Sometimes one wants to download a
large number of files whose URLs are given on a web page, without
performing a full-scale mirror of the entire site. An
example would be downloading the top 20 music files on a site that
displays the top 100 in order. Here the "<tt>--accept</tt>" and
"<tt>--reject</tt>" options wouldn't work since they only operate on
file extensions. Instead, make use of "<tt>lynx -dump</tt>".
<p><pre>
bash$ <b>lynx -dump ftp://ftp.ssc.com/pub/lg/ |grep 'gz$' |tail -10 |awk '{print $2}' &gt; urllist.txt</b>
</pre>
<p>
The output from lynx can then be filtered using the various GNU text
processing utilities. In the above example, we extract URLs ending in
"<tt>gz</tt>" and store the last 10 of these in a file. A tiny
<tt>bash</tt> scripting command will automatically download any URLs
listed in this file:
<p>
<blockquote><tt>
bash$<b> for x in $(cat urllist.txt)</b><br>
&gt;<b> do</b><br>
&gt;<b> wget $x</b><br>
&gt;<b> done</b><br>
</tt></blockquote>
<p>
We've succeeded in downloading the last 10 issues of <a
href="http://www.linuxgazette.com/"><I>Linux Gazette</I></a>.
<H3>Swimming in bandwidth</H3>
<p>
If you're one of the select few to be drowning in bandwidth, and your
file downloads are slowed only by bottlenecks at the web server end,
this trick can help "shotgun" the file transfer process. It requires
the use of <tt>curl</tt> and several mirror web sites where identical
copies of the target file are located. For example, suppose you want
to download the Mandrake 8.0 ISO from the following three locations:
<p>
<pre><tt><b>
url1=http://ftp.eecs.umich.edu/pub/linux/mandrake/iso/Mandrake80-inst.iso
url2=http://ftp.rpmfind.net/linux/Mandrake/iso/Mandrake80-inst.iso
url3=http://ftp.wayne.edu/linux/mandrake/iso/Mandrake80-inst.iso
</b></tt></pre>
<p>
The length of the file is 677281792 bytes, so initiate three simultaneous
downloads using <tt>curl</tt>'s "<tt>--range</tt>" option:
<p><pre><tt>
bash$ <b>curl -r 0-199999999 -o mdk-iso.part1 $url1 &amp;</b>
bash$ <b>curl -r 200000000-399999999 -o mdk-iso.part2 $url2 &amp;</b>
bash$ <b>curl -r 400000000- -o mdk-iso.part3 $url3 &amp;</b>
</tt></pre>
<p>
This creates three background download processes, each transferring a
different part of the ISO image from a different server. The
"<tt>-r</tt>" options specifies a subrange of bytes to extract from
the target file. When completed, simply <tt>cat</tt> all three parts
together -- <b><tt>cat mdk-iso.part? &gt;
mdk-80.iso</tt></b>. (Checking the md5 hash before burning to CD-R is
well recommended.) Launching each <tt>curl</tt> in its own window
while using the "<tt>--verbose</tt>" option allows one to track the
progress of each transfer.
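<p>
A sketch of the final reassembly and check, assuming the mirror site
publishes an md5 sum to compare against:
<p><pre>
bash$ <b>cat mdk-iso.part? &gt; mdk-80.iso</b>
bash$ <b>md5sum mdk-80.iso</b>
</pre>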
<H3>Conclusion</H3>
<p>
Do not be afraid to use non-interactive methods for effecting your
remote file transfers. No matter how hard web designers may try to
force you to surf their sites interactively, there will always be free
tools to help automate the process, thus enriching our overall Net
experience.
<!-- *** BEGIN bio *** -->
<SPACER TYPE="vertical" SIZE="30">
<P>
<H4><IMG ALIGN=BOTTOM ALT="" SRC="../gx/note.gif">Adrian J Chung</H4>
<EM>When not teaching undergraduate computing at the University of
the West Indies, Trinidad, Adrian writes scripts to automate web email
downloads, and experiments with interfacing various scripting
environments with homebrew computer graphics renderers and data
visualization libraries.</EM>
<!-- *** END bio *** -->
<!-- *** BEGIN copyright *** -->
<P> <hr> <!-- P -->
<H5 ALIGN=center>
Copyright &copy; 2001, Adrian J Chung.<BR>
Copying license <A HREF="../copying.html">http://www.linuxgazette.com/copying.html</A><BR>
Published in Issue 70 of <i>Linux Gazette</i>, September 2001</H5>
<!-- *** END copyright *** -->
<!--startcut ==========================================================-->
<HR><P>
<CENTER>
<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="arndt.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue70/chung.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="ghosh.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
</CENTER>
</BODY></HTML>
<!--endcut ============================================================-->