<!--startcut ==============================================-->
<!-- *** BEGIN HTML header *** -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML><HEAD>
<title>Downloading without a Browser LG #70</title>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
ALINK="#FF0000">
<!-- *** END HTML header *** -->

<CENTER>
<A HREF="http://www.linuxgazette.com/">
<IMG ALT="LINUX GAZETTE" SRC="../gx/lglogo.png"
WIDTH="600" HEIGHT="124" border="0"></A>
<BR>

<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="arndt.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue70/chung.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="ghosh.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
<P>
</CENTER>

<!--endcut ============================================================-->

<H4 ALIGN="center">
"Linux Gazette...<I>making Linux just a little more fun!</I>"
</H4>

<P> <HR> <P>
<!--===================================================================-->

<center>
<H1><font color="maroon">Downloading without a Browser</font></H1>
<H4>By <a href="mailto:ajchung@email.com">Adrian J Chung</a></H4>
</center>

<P> <HR> <P>

<!-- END header -->

<p> Ever had to download a file so huge over a link so slow that you'd need to
keep the web browser open for hours or days? What if you had 40 files linked
from a single web page, all of which you needed -- will you tediously click on
each one? What if the browser crashes before it can finish? GNU/Linux comes
equipped with a handy set of tools for downloading in the background,
independent of the browser. This allows you to log out, resume interrupted
downloads, and even schedule them to occur during off-peak Net usage hours.

<H3>When interactivity stands in the way</H3>

<p>
Web browsers are designed to make the Web interactive -- click and
expect results within seconds. But there are still many files that
can take longer than a few seconds to download, even over the quickest
of connections. One example is the ISO images that are popular among
those burning their own GNU/Linux CD-ROM distro. Some web browsers,
especially poorly coded ones, do not behave very well over long
durations, leaking memory or crashing at the most inopportune moment.
Despite the fusion of some browsers with file managers, many still do
not support the multi-selection and rubber-banding operations that make
it easy to transfer several files all in one go. You also have to
stay logged in until the entire file has arrived. Finally, you have
to be present at the office to click the link initiating the download,
thus angering coworkers with whom office bandwidth is being shared.

<p>
Downloading large files is a task better suited to a different
suite of tools. This article will discuss how to combine various
GNU/Linux utilities, namely <tt>lynx</tt>, <tt>wget</tt>, <tt>at</tt>,
<tt>crontab</tt>, etc., to solve a variety of file transfer situations.
A small amount of simple scripting will also be employed, so a little
knowledge of the <tt>bash</tt> shell will help.

<H3>The <tt>wget</tt> utility</H3>

<p>
All the major distributions include the <tt>wget</tt> downloading
tool.

<p><pre>
bash$ <b>wget http://place.your.url/here</b>
</pre>
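
<p>
(If you'd rather save the download under a different local name, the
standard "<tt>-O</tt>" option does that; the filename below is only an
example:)

<p><pre>
bash$ <b>wget -O local-copy.iso http://place.your.url/here</b>
</pre>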

<p><tt>wget</tt> can also handle FTP and date stamps, and can recursively
mirror entire web-site directory trees -- and, if you're not
careful, the entire web site and whatever other sites it links to:

<p><pre>
bash$ <b>wget -m http://target.web.site/subdirectory</b>
</pre>

<p>
Due to the potentially high load this tool can place on servers, it
obeys the "robots.txt" protocol when mirroring. There are
several command-line options to control what exactly gets mirrored,
limiting the types of links followed and the file types downloaded.
Example: to follow only relative links and skip GIF images:

<p><pre>
bash$ <b>wget -m -L --reject=gif http://target.web.site/subdirectory</b>
</pre>
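
<p>
If a full mirror is more than you need, plain recursive retrieval with
a depth limit and the "<tt>-np</tt>" ("no parent") restriction may be
enough. A sketch, with an arbitrary depth of 2:

<p><pre>
bash$ <b>wget -r -np -l 2 --reject=gif http://target.web.site/subdirectory</b>
</pre>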

<p>
<tt>wget</tt> can also resume interrupted downloads ("<tt>-c</tt>"
option) when given the incomplete file to which to append the
remaining data. This operation needs to be supported by the server.

<p><pre>
bash$ <b>wget -c http://the.url.of/incomplete/file</b>
</pre>

<p>
The resumption and mirroring can be combined, allowing one to mirror a
large collection of files over the course of many separate download
sessions. More on how to automate this later.

<p>
If you're experiencing download interruptions as often as I do in my
office, you can tell <tt>wget</tt> to retry the URL several times:

<p><pre>
bash$ <b>wget -t 5 http://place.your.url/here</b>
</pre>

<p>
Here we give up after 5 attempts. Use "<tt>-t inf</tt>" to never give
up.
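
<p>
On a really flaky link it can be worth combining the retries with
resumption, so that each attempt picks up where the previous one left
off -- a sketch, using the same placeholder URL:

<p><pre>
bash$ <b>wget -c -t inf http://place.your.url/here</b>
</pre>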

<p>
What about proxy firewalls? Use the <tt>http_proxy</tt> environment
variable or the <tt>.wgetrc</tt> configuration file to specify a proxy
via which to download. One problem with proxies over
intermittent connections is that resumption can sometimes fail. If a
proxied download is interrupted, the proxy server will cache only an
incomplete copy of the file. When you try to use "<tt>wget -c</tt>"
to get the remainder of the file, the proxy checks its cache and
erroneously reports that you already have the entire file. You can
coax most proxies to bypass their cache by adding a special header to
your download request:

<p><pre>
bash$ <b>wget -c --header="Pragma: no-cache" http://place.your.url/here</b>
</pre>
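
<p>
(How the proxy itself is named depends entirely on your site -- the
host and port below are placeholders. Either export the environment
variable before running <tt>wget</tt>, or put an equivalent
"<tt>http_proxy</tt>" line in your <tt>.wgetrc</tt>:)

<p><pre>
bash$ <b>export http_proxy=http://proxy.example.com:8080/</b>
bash$ <b>wget -c --header="Pragma: no-cache" http://place.your.url/here</b>
</pre>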

<p>
The "<tt>--header</tt>" option can add any number and manner of
headers, by which one can modify the behaviour of web servers and
proxies. Some sites refuse to serve files via externally sourced
links; content is delivered to browsers only if they access it via
some other page on the same site. You can get around this by
appending a "Referer:" header:

<p><pre>
bash$ <b>wget --header="Referer: http://coming.from.this/page" http://surfing.to.this/page</b>
</pre>

<p>
Some particularly anti-social web sites will only serve content to a
specific brand of browser. Get around this with a "User-Agent:"
header:

<p><pre>
bash$ <b>wget --header="User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)" http://msie.only.url/here</b>
</pre>

<p>
(Warning: the above tip may be considered circumventing a content
licensing mechanism, and there exist anti-social legal systems that
have deemed such actions illegal. Check your local
legislation. Your mileage may vary.)

<H3>Downloading <tt>at</tt> what hour?</H3>

<p>
If you are downloading large files on your office computer over a
connection shared with easily angered coworkers who don't like their
streaming media slowed to a crawl, you should consider starting your
file transfers in the off-peak hours. You do not have to stay in the
office after everyone has left, nor remember to do a remote login from
home after dinner. Make use of the <tt>at</tt> job scheduler:

<p>
<blockquote><tt>
bash$ <b>at 2300</b><br>
warning: commands will be executed using /bin/sh<br>
at> <b>wget http://place.your.url/here</b><br>
at></tt> <em>press Ctrl-D</em>
</blockquote>

<p>
Here, we want to begin downloading at 11.00pm. Make sure that the
<tt>atd</tt> scheduling daemon is running in the background for this
to work.
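
<p>
If you'd rather not type the command interactively, <tt>at</tt> will
happily read it from standard input -- a small sketch using the same
placeholder URL:

<p><pre>
bash$ <b>echo "wget -c http://place.your.url/here" | at 2300</b>
</pre>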

<H3>It'll take how many days?</H3>

<p>
When there is a lot of data to download in one or several files, and
your bandwidth is comparable to the carrier pigeon protocol, you will
often find that the download you scheduled to occur has not yet
completed when you arrive at work in the morning. Being a good
neighbour, you kill the job and submit another <tt>at</tt> job, this
time using "<tt>wget -c</tt>", repeating as necessary over as many
days as it'll take. It is better to automate this using a
<tt>crontab</tt>. Create a plain text file, called
"<tt>crontab.txt</tt>", containing something like the following:

<p><pre>
0 23 * * 1-5 wget -c -N http://place.your.url/here
0 6 * * 1-5 killall wget
</pre>

<p>
This will be your <tt>crontab</tt> file, which specifies what jobs to
execute at periodic intervals. The first five columns say <EM>when</EM>
to execute the command, and the remainder of each line says <EM>what</EM> to
execute. The first two columns indicate the time of day -- 0
minutes past 11pm to start <tt>wget</tt>, 0 minutes past 6am to
<tt>killall wget</tt>. The <tt>*</tt> in the 3rd and 4th columns
indicates that these actions are to occur every day of every month. The
5th column indicates on which days of the week to schedule each
operation -- "1-5" is Monday to Friday.

<p>So every weekday at 11pm your download will begin, and at 6am every
weekday any <tt>wget</tt> still in progress will be terminated. To
activate this <tt>crontab</tt> schedule you need to issue the command:

<p><pre>
bash$ <b>crontab crontab.txt</b>
</pre>
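
<p>
To double-check what <tt>cron</tt> now has on file, list the active
schedule:

<p><pre>
bash$ <b>crontab -l</b>
</pre>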

<p>
The "<tt>-N</tt>" option makes <tt>wget</tt> compare the timestamp of
the local file with that of its remote counterpart and halt the
download if they match, an indication that the entire file has
already been transferred. So you can just
set it and forget it. "<tt>crontab -r</tt>" will remove this schedule.
I've downloaded many an ISO image over shared dial-up connections
using this approach.

<H3>Dynamically Generated Web Pages</H3>

<p>
Some web pages are generated on demand since they are subject to
frequent changes, sometimes several times a day. Since the target is
technically not a file, there is no file length and resuming a
download becomes meaningless -- the "<tt>-c</tt>" option fails to work.
Example: a PHP-generated page at Linux Weekly News:

<p><pre>
bash$ <b>wget http://lwn.net/bigpage.php3</b>
</pre>

<p>
If you interrupt the download and try to resume, it starts over from
scratch. My office Net connection is at times so poor that I've
written a simple script detecting when a dynamic HTML page has been
delivered completely:

<p><pre>
#!/bin/bash

#create it if absent
touch bigpage.php3

#check if we got the whole thing
while ! grep -qi '&lt;/html&gt;' bigpage.php3
do
  rm -f bigpage.php3

  #download LWN in one big page
  wget http://lwn.net/bigpage.php3

done
</pre>

<p>
The above <tt>bash</tt> script keeps re-downloading the document until
the string "<tt>&lt;/html&gt;</tt>" can be found, which marks the
end of the file.
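
<p>
If you are worried about the loop running forever when the page is
simply unreachable, a variant with an upper bound on the number of
attempts is easy to write -- a minimal sketch, with an arbitrary cap
of ten tries:

<p><pre>
#!/bin/bash

#give up after ten failed attempts (an arbitrary cap)
tries=0
touch bigpage.php3

while ! grep -qi '&lt;/html&gt;' bigpage.php3
do
  if [ $tries -ge 10 ]; then break; fi
  rm -f bigpage.php3
  wget http://lwn.net/bigpage.php3
  tries=$((tries+1))
done
</pre>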

<H3>SSL and Cookies</H3>

<p>
URLs beginning with "<tt>https://</tt>" must access remote files
through the Secure Sockets Layer. You will find another download
utility, called <a href="http://curl.haxx.se"><tt>curl</tt></a>, to be
handy in these situations.
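
<p>
The basic invocation is much like <tt>wget</tt>'s, with "<tt>-o</tt>"
naming the output file. (The URL and filename below are placeholders,
and your <tt>curl</tt> needs to have been built with SSL support:)

<p><pre>
bash$ <b>curl -o report.html https://secure.example.com/report.html</b>
</pre>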

<p>
Some web sites force-feed cookies to the browser before serving the
requested content. One must add a "<tt>Cookie:</tt>" header with the
correct information, which can be obtained from your web browser's cookie
file. For the <tt>lynx</tt> and <tt>Mozilla</tt> cookie file formats:

<p><pre>
bash$ <b>cookie=$( grep nytimes ~/.lynx_cookies |awk '{printf("%s=%s;",$6,$7)}' )</b>
</pre>

<p>
This will construct the required cookie for downloading stuff from <a
href="http://www.nytimes.com/">http://www.nytimes.com</a>, assuming
that you have already registered with the site using this browser.
<tt>w3m</tt> uses a slightly different cookie file format:

<p><pre>
bash$ <b>cookie=$( grep nytimes ~/.w3m/cookie |awk '{printf("%s=%s;",$2,$3)}' )</b>
</pre>

<p>
Downloading can now be carried out thus:

<p><pre>
bash$ <b>wget --header="Cookie: $cookie" http://www.nytimes.com/reuters/technology/tech-tech-supercomput.html</b>
</pre>

<p>
or using the <tt>curl</tt> tool:

<p><pre>
bash$ <b>curl -v -b $cookie -o supercomp.html http://www.nytimes.com/reuters/technology/tech-tech-supercomput.html</b>
</pre>

<H3>Making Lists of URLs</H3>

<p>
So far, we've only been downloading single files or mirroring entire
website directories. Sometimes one is interested in downloading a
large number of files whose URLs are given on a web page, but is not
interested in performing a full-scale mirror of the entire site. An
example would be downloading the top 20 music files on a site that
displays the top 100 in order. Here the "<tt>--accept</tt>" and
"<tt>--reject</tt>" options wouldn't work, since they only operate on
file extensions. Instead, make use of "<tt>lynx -dump</tt>":

<p><pre>
bash$ <b>lynx -dump ftp://ftp.ssc.com/pub/lg/ |grep 'gz$' |tail -10 |awk '{print $2}' > urllist.txt</b>
</pre>

<p>
The output from lynx can then be filtered using the various GNU text
processing utilities. In the above example, we extract URLs ending in
"<tt>gz</tt>" and store the last 10 of these in a file. A tiny
<tt>bash</tt> scripting command will automatically download any URLs
listed in this file:

<p>
<blockquote><tt>
bash$<b> for x in $(cat urllist.txt)</b><br>
><b> do</b><br>
><b> wget $x</b><br>
><b> done</b><br>
</tt></blockquote>
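
<p>
(Alternatively, <tt>wget</tt>'s standard "<tt>-i</tt>" option reads the
URLs straight from the file, doing away with the loop:)

<p><pre>
bash$ <b>wget -i urllist.txt</b>
</pre>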

<p>
We've succeeded in downloading the last 10 issues of <a
href="http://www.linuxgazette.com/"><I>Linux Gazette</I></a>.

<H3>Swimming in bandwidth</H3>

<p>
If you're one of the select few to be drowning in bandwidth, and your
file downloads are slowed only by bottlenecks at the web server end,
this trick can help "shotgun" the file transfer process. It requires
the use of <tt>curl</tt> and several mirror web sites where identical
copies of the target file are located. For example, suppose you want
to download the Mandrake 8.0 ISO from the following three locations:

<p>
<pre><tt><b>
url1=http://ftp.eecs.umich.edu/pub/linux/mandrake/iso/Mandrake80-inst.iso
url2=http://ftp.rpmfind.net/linux/Mandrake/iso/Mandrake80-inst.iso
url3=http://ftp.wayne.edu/linux/mandrake/iso/Mandrake80-inst.iso
</b></tt></pre>

<p>
The length of the file is 677281792 bytes, so initiate three simultaneous
downloads using <tt>curl</tt>'s "<tt>--range</tt>" option:

<p><pre><tt>
bash$ <b>curl -r 0-199999999 -o mdk-iso.part1 $url1 &</b>
bash$ <b>curl -r 200000000-399999999 -o mdk-iso.part2 $url2 &</b>
bash$ <b>curl -r 400000000- -o mdk-iso.part3 $url3 &</b>
</tt></pre>

<p>
This creates three background download processes, each transferring a
different part of the ISO image from a different server. The
"<tt>-r</tt>" option specifies a subrange of bytes to extract from
the target file. When completed, simply <tt>cat</tt> all three parts
together -- <b><tt>cat mdk-iso.part? > mdk-80.iso</tt></b>.
(Checking the md5 hash before burning to CD-R is
well recommended.) Launching each <tt>curl</tt> in its own window
while using the "<tt>--verbose</tt>" option allows one to track the
progress of each transfer.
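
<p>
Putting those last steps together -- the reference checksum to compare
against would have to come from one of the mirror sites:

<p><pre>
bash$ <b>cat mdk-iso.part? > mdk-80.iso</b>
bash$ <b>md5sum mdk-80.iso</b>
</pre>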

<H3>Conclusion</H3>

<p>
Do not be afraid to use non-interactive methods for effecting your
remote file transfers. No matter how hard web designers may try to
force you to surf their sites interactively, there will always be free
tools to help automate the process, thus enriching our overall Net
experience.

<!-- *** BEGIN bio *** -->
<SPACER TYPE="vertical" SIZE="30">
<P>
<H4><IMG ALIGN=BOTTOM ALT="" SRC="../gx/note.gif">Adrian J Chung</H4>
<EM>When not teaching undergraduate computing at the University of
the West Indies, Trinidad, Adrian writes scripts to automate web email
downloads, and experiments with interfacing various scripting
environments with homebrew computer graphics renderers and data
visualization libraries.</EM>
<!-- *** END bio *** -->

<!-- *** BEGIN copyright *** -->
<P> <hr> <!-- P -->
<H5 ALIGN=center>
Copyright © 2001, Adrian J Chung.<BR>
Copying license <A HREF="../copying.html">http://www.linuxgazette.com/copying.html</A><BR>
Published in Issue 70 of <i>Linux Gazette</i>, September 2001</H5>
<!-- *** END copyright *** -->

<!--startcut ==========================================================-->
<HR><P>
<CENTER>
<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="arndt.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue70/chung.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="ghosh.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
</CENTER>
</BODY></HTML>
<!--endcut ============================================================-->