<!--startcut ==============================================-->
<!-- *** BEGIN HTML header *** -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML><HEAD>
<title>Downloading LinuxToday links and Linux Gazette's TOC with Python (and Perl) LG #63</title>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
ALINK="#FF0000">
<!-- *** END HTML header *** -->

<CENTER>
<A HREF="http://www.linuxgazette.com/">
<H1><IMG ALT="LINUX GAZETTE" SRC="../gx/lglogo.png"
WIDTH="600" HEIGHT="124" border="0"></H1></A>

<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="gibbs.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue63/nielsen.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="nielsen2.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
<P>
</CENTER>

<!--endcut ============================================================-->

<H4 ALIGN="center">
"Linux Gazette...<I>making Linux just a little more fun!</I>"
</H4>

<P> <HR> <P>
<!--===================================================================-->

<center>
<H1><font color="maroon">Downloading LinuxToday links and Linux Gazette's TOC with Python (and Perl)</font></H1>
<H4>By <a href="mailto:articles@gnujobs.com">Mark Nielsen</a></H4>
</center>
<P> <HR> <P>

<!-- END header -->

<h3>Contents</h3>

<ol>
<li><a href="#Introduction">Introduction</a></li>
<li><a href="#python">The Python Script</a></li>
<li><a href="#cron">Setting up a cron job</a></li>
<li><a href="#perl">A Perl Script I wrote</a> to download <I>Linux Gazette</I> TOC</li>
<li><a href="#perl2">A Perl Script I wrote to download Debian Weekly News</a></li>
<li><a href="#Conclusion">Conclusion</a></li>
<li><a href="#REF">References</a></li>
</ol>

<h3>
<a NAME="Introduction"></a>Introduction</h3>
I wanted to add Linux Today's links to my website,
<a href="http://www.gnujobs.com">GNUJobs.com</a>, just for the fun of it.
Later, I want to add headlines from other websites, and perhaps <I>LG</I>'s
latest edition.
I had a choice of Perl or Python. I chose Python because I have been
using it for quite a while for a mathematical project, and it has proven
quite useful. I want to make a habit of using Python now. It tends to be
easier for me to program in Python than in Perl. Also, in the future, I wish
to use threading to download many webpages at the same time,
which Python does very well. I might as well
do it in Python now since I know I will use it later.
<p>
Both Perl and Python will let you download webpages from the Internet.
They can also speak other protocols, such as FTP and Gopher, and connect
to other services; downloading a webpage is just one thing these languages
can do.
<p>
There are several things the script has to do:
<ul>
<li> Download the webpage</li>
<li> Parse the data into separate entries</li>
<li> Reformat the entries as HTML</li>
<li> Replace the old file with the new file only if it contains valid data</li>
</ul>
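<p>
The last step in the list above, replacing the old file only when the new one holds valid data, can be sketched as follows. This is a minimal sketch in present-day Python 3, not the Python 1.5.2 used for the script in this article. The function name replace_if_valid and its min_entries parameter are made up for illustration.

```python
import os
import tempfile


def replace_if_valid(target_path, html_items, min_entries=1):
    """Publish html_items as a <ul> list at target_path, but only if there are
    at least min_entries of them. Returns True if the file was replaced."""
    if len(html_items) < min_entries:
        return False  # keep the old file untouched

    # Write next to the target so the final rename stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix="_new")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write("<ul>\n")
            tmp.writelines(html_items)
            tmp.write("</ul>\n")
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file owner-only
        os.replace(tmp_path, target_path)  # swap the new file into place
    except OSError:
        # Clean up the half-written temp file, then report the problem.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

<p>
Writing to a temporary file first means a failed download or a full disk never leaves a half-written file where the webserver expects the headlines.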
<p>
This article is short; the comments in the Python code explain each step.

<h3>
<a NAME="python"></a>The Python Script</h3>

If you want to include the output of this script in a webpage, you can use
the Server-Side Include (SSI) module of the Apache webserver with a directive
like:
<pre>
<!--#include virtual="/lthead.html" -->
</pre>
in your webpage. Various other tools (PHP, Perl ASP, Perl Mason, etc.)
can also include files.

<p>
It is assumed you are using a GNU/Linux
operating system.
I was using Python 1.5.2, which is not the latest version.
You might have to do a
<PRE>
chmod 755 LinuxToday.py
</PRE>
on the script to make it executable.

<A HREF="misc/nielsen/LinuxToday.py.txt">[Text version of this listing.]</A>

<pre>
#!/usr/bin/python

# One obvious thing to do is apply error checking for url download,
# download must contain at least one entry, and we are able to create the
# new file. This will be done later.

### import the web module, string module, regular expression module,
### and the os module
import urllib, string, re, os

### define the new webpage we create and where to get the info
Download_Location = "/tmp/lthead.html"
Url = "http://linuxtoday.com/backend/lthead.txt"

#-----------------------------------------------------------
### Create a web object with the Url
LinuxToday = urllib.urlopen( Url )
### Grab all the info into an array (if big, change to do one line at a time)
Text_Array = LinuxToday.readlines()

New_File = open(Download_Location + "_new", 'w')
New_File.write("<ul>\n")
### Set the default to be invalid
Valid = 0
### Record the number of valid entries
Entry_No = 0
Entry_Valid = 0
### Setup the defaults
Date = ""
Link = ""
Header = ""
Count = 0
### Create the pattern matching expression
Match = re.compile ("^&&")

### Append && to make sure we parse the last entry
Text_Array.append('&&')
### For each line, do the following
for Line in Text_Array :
    ### If && exists, start from scratch, add last entry
    if Match.search(Line) :
        ### If the current entry is valid and we have skipped the first one,
        if (Entry_No > 1) and (Entry_Valid > 0) :
            ### One thing that Perl does better than Python is the print command. I
            ### don't like how Python prints (no variable interpolation).
            New_File.write('<li> <a href="' + Link + '">' + Header + '</a>. ' + Date + "</li>\n")
        ## Reset the values to nothing.
        Header = ""; Link = ""; Date = ""; Entry_Valid = 0
        Count = 0

    ### Delete whitespace at end of line
    Line = string.rstrip(Line)

    ### If count is equal to 1, header, 2 link, 3 date
    if Count == 1: Header = Line
    elif Count == 2: Link = Line
    elif Count == 3:
        Date = Line
        ### Once the third line is read, count the entry if any field is filled in
        if (Header != "") or (Link != "") or (Date != "") :
            Entry_No = Entry_No + 1
            Entry_Valid = 1

    ### Add one to Count
    Count = Count + 1

New_File.write("</ul>\n")

New_File.close()

### If we have valid entries, move the new file to the real location
if Entry_No > 0 :
    ### We could just do:
    ### os.rename(Download_Location + "_new", Download_Location)
    ### But here's how to do it with an external command.
    Command = "mv " + Download_Location + "_new " + Download_Location
    os.system( Command )
</pre>
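<p>
The parsing loop in the listing above can also be written as a small function that is easy to test without touching the network. The sketch below is present-day Python 3, not the Python 1.5.2 of the listing. It assumes the feed is a block of descriptive text, followed by records that each start with a line beginning with && and hold a headline, a link, and a date, which is the layout the listing implies. The names parse_headlines and to_html are made up for illustration.

```python
from html import escape


def parse_headlines(lines):
    """Split '&&'-separated records into (headline, link, date) tuples."""
    entries = []
    record = None  # stays None until the first '&&', so the preamble is skipped
    for raw in list(lines) + ["&&"]:  # the extra '&&' flushes the last record
        line = raw.rstrip()
        if line.startswith("&&"):
            # Keep the record only if all three fields are present and non-empty.
            if record is not None and len(record) >= 3 and all(record[:3]):
                entries.append(tuple(record[:3]))
            record = []
        elif record is not None:
            record.append(line)
    return entries


def to_html(entries):
    """Format each entry as one <li> line, escaping special characters."""
    return [
        '<li> <a href="%s">%s</a>. %s</li>\n'
        % (escape(link, quote=True), escape(headline), escape(date))
        for headline, link, date in entries
    ]
```

<p>
Unlike the listing, this sketch drops a record unless all three fields are filled in, and it escapes characters such as & in headlines before writing them into the page.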

<h3>
<a NAME="cron"></a>The Cron Script to make it run every two hours</h3>
Not the best crontab file, but it will do.

<pre>
### Crontab file
### Name the file "Crontab" and install it with "crontab Crontab"

### Download every two hours, on the hour
0 */2 * * * /www/Cron/LinuxToday.py >> /www/Cron/out 2>&1
</pre>

<h3>
<a NAME="perl"></a>A Perl Script I wrote to download Linux Gazette TOC</h3>

For comparison, here is a Perl script I wrote
which downloads <I>LG</I>'s TOC for the latest edition.

<A HREF="misc/nielsen/LinuxGazette.pl.txt">[Text version of this listing.]</A>

<pre>
#!/usr/bin/perl
# Copyright Mark Nielsen January 2001
# Copyrighted under the GPL license.

# I am proud of this script.
# I wrote it from scratch with only 2 minor errors when I first tested it.

system ("lynx --source http://www.linuxgazette.com/ftpfiles.txt > /tmp/List.txt");

### Open up the webpage we just downloaded and put it into an array.
open(FILE,'/tmp/List.txt') or die "Cannot open /tmp/List.txt: $!";
my @Lines = <FILE>; close FILE;
### Keep only the lines that contain both magic strings.
@Lines = grep(($_ =~ /lg\-issue/) && ($_ =~ /\.tar\.gz/), @Lines );

my @Numbers = ();
foreach my $Line (@Lines)
  {
    ## Throw away the stuff to the left
  my ($Junk,$Good) = split(/lg\-issue/,$Line,2);
    ## Throw away the stuff to the right
  ($Good,$Junk) = split(/\.tar\.gz/,$Good,2);
    ## If it is a valid number, it is greater than 0, save it
  if ($Good > 0) {push (@Numbers,$Good);}
  }

### Sort the numbers and pop off the highest
@Numbers = sort {$a<=>$b} @Numbers;
my $Highest = pop @Numbers;
## Create the url we are going to download
my $Url = "http://www.linuxgazette.com/issue$Highest/index.html";
## Download it
system ("lynx --source $Url > /tmp/LG_index.html");

### Open up the index.
open(FILE,"/tmp/LG_index.html") or die "Cannot open /tmp/LG_index.html: $!";
@Lines = <FILE>; close FILE;
### Extract out the parts that are between beginning and end of TOC.
my @TOC = ();
my $Count = 0;
my $Start = '<!-- *** BEGIN toc *** -->';
my $End = '<!-- *** END toc *** -->';
foreach my $Line (@Lines)
  {
  if ($Line =~ /\Q$End\E/) {$Count = 2;}
  if ($Count == 1) {push(@TOC, $Line);}
  if ($Line =~ /\Q$Start\E/) {$Count = 1;}
  }

### Relink all the links to point to the Linux Gazette magazine
my $Relink = "http://www.linuxgazette.com/issue$Highest/";
foreach my $Line (@TOC) {$Line =~ s/HREF\=\"/HREF\=\"$Relink/g;}

### Save the output
open(FILE,">/tmp/TOC.html") or die "Cannot write /tmp/TOC.html: $!";
print FILE @TOC; close FILE;

### Done!
</pre>
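<p>
To make the comparison run in both directions, here is a rough Python counterpart of the three text-processing steps in the Perl listing above: finding the highest issue number, cutting out the lines between two markers, and relinking. It is a sketch in present-day Python 3 and leaves the downloading out. The function names are made up for illustration, and it assumes, as the Perl script does, that the file list names archives like lg-issue63.tar.gz and that the TOC links are relative.

```python
import re


def highest_issue(listing_lines):
    """Return the largest N found in names like 'lg-issueN.tar.gz', or None."""
    numbers = []
    for line in listing_lines:
        match = re.search(r"lg-issue(\d+)\.tar\.gz", line)
        if match:
            numbers.append(int(match.group(1)))
    return max(numbers) if numbers else None


def extract_between(lines, start, end):
    """Return the lines strictly between the start marker and the next end marker."""
    inside = False
    kept = []
    for line in lines:
        if inside and end in line:
            break
        if inside:
            kept.append(line)
        elif start in line:
            inside = True
    return kept


def relink(lines, base_url):
    """Prefix every HREF=" value with base_url (links are assumed relative)."""
    pattern = re.compile(r'(href=")', re.IGNORECASE)
    # A function is used as the replacement so that backslashes in base_url
    # are never treated as regular-expression escapes.
    return [pattern.sub(lambda m: m.group(1) + base_url, line) for line in lines]
```

<p>
Comparing the numbers as integers plays the same role as the numeric sort in the Perl script: issue 9 must not sort after issue 63.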

<h3>
<a NAME="perl2"></a>A Perl Script I wrote to download Debian Weekly News</h3>

I like to keep track of Debian Weekly News, so I wrote this one also.
One bad thing about programming is that once you get really good
at programming in a certain way, it is hard to switch to another
programming language. I wrote these two Perl scripts without
looking at any other code. The Python code took me a while, because I am still
not used to it.

<A HREF="misc/nielsen/DebianWeeklyNews.pl.txt">[Text version of this listing.]</A>

<pre>
#!/usr/bin/perl
# Copyright Mark Nielsen January 2001
# Copyrighted under the GPL license.

system ("lynx --source http://www.debian.org/News/weekly/index.html > /tmp/List2.txt");

### Open up the webpage we just downloaded and put it into an array.
open(FILE,'/tmp/List2.txt') or die "Cannot open /tmp/List2.txt: $!";
my @Lines = <FILE>; close FILE;
### Extract out the parts that are between beginning and end of TOC.
my @TOC = ();
my $Count = 0;
my $Start = 'Recent issues of Debian Weekly News';
my $End = '</p>';
foreach my $Line (@Lines)
  {
  if (($Line =~ /\Q$End\E/i) && ($Count > 0)) {$Count = 2;}
  if ($Count == 1) {push(@TOC, $Line);}
  if ($Line =~ /^\Q$Start\E/i) {$Count = 1;}
  }

### Relink all the links to point to the DWN
my $Relink = "http://www.debian.org/News/weekly/";
foreach my $Line (@TOC)
  {
  $Line =~ s/HREF\=\"/HREF\=\"$Relink/ig;
  $Line =~ s/\"\>/\" target=_external\>/ig;
  }

### Save the output
open(FILE,">/tmp/D.html") or die "Cannot write /tmp/D.html: $!";
print FILE @TOC; close FILE;

### Done!
</pre>

<h3>
<a NAME="Conclusion"></a>Conclusion</h3>
The Python script is more complex than it strictly needs to be.
I made it longer to introduce various modules and to stay
flexible in case LinuxToday's format changes someday. The main thing the script
lacks is error detection in case it can't download the web page, write the new
file, or rename it. Also, watch the regular-expression modules in Python,
because they have been changing in recent versions to increase efficiency and
incorporate Unicode support.
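<p>
The missing error detection for the download step might look like the sketch below. It is written for present-day Python 3, where the download function lives in urllib.request rather than in the urllib module of Python 1.5.2. The function name fetch_lines, the 30-second timeout, and the latin-1 decoding are assumptions made for illustration, not part of the original script.

```python
import urllib.error
import urllib.request


def fetch_lines(url, timeout=30):
    """Return the page as a list of text lines, or None if the download failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # latin-1 is an assumption; it never fails to decode, which
            # suits a script that only copies headlines around.
            text = response.read().decode("latin-1")
    except (urllib.error.URLError, OSError, ValueError) as err:
        # URLError covers network and HTTP problems, ValueError a malformed URL.
        print("download failed for %s: %s" % (url, err))
        return None
    return text.splitlines(keepends=True)
```

<p>
A caller can then skip the rewrite of the headline file whenever fetch_lines returns None, so a network outage leaves the old headlines in place.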
<p>
Python rules as a programming language. I found it very easy to use
the Python modules. Python's module for fetching webpages seems
easier to use than the LWP module in Perl. Because of the many possibilities of
Python, I plan on creating a Python script which will download many webpages
at the same time using Python's threading capabilities.
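<p>
As a rough idea of what such a threaded downloader could look like, here is a sketch in present-day Python 3 using the standard concurrent.futures module, which did not exist in Python 1.5.2. The function name fetch_all is made up for illustration, and the actual download function is passed in as the fetch argument so that the sketch makes no assumption about how pages are retrieved.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL in parallel threads.

    Returns a dict mapping each URL to its result, or to None if fetch failed.
    """
    urls = list(urls)

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:  # one failed page should not abort the others
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns the results in the same order as the input URLs.
        results = list(pool.map(safe_fetch, urls))
    return dict(zip(urls, results))
```

<p>
Threads suit this job because each download spends most of its time waiting on the network, so several pages can be in flight at once.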

<h3>
<a NAME="REF"></a>References</h3>

<ol>
<li><a href="http://linuxtoday.com/backend/lthead.txt">LinuxToday's links</a></li>
<li><a href="http://www.python.org/doc/current/lib/module-urllib.html">Python's urllib module</a></li>
<li><a href="http://www.gnujobs.com/Articles/13/LT_Python.html">Original site for this article</a> (any updates will be here)</li>
</ol>

<!-- *** BEGIN copyright *** -->
<P> <hr> <!-- P -->
<H5 ALIGN=center>

Copyright © 2001, Mark Nielsen.<BR>
Copying license <A HREF="../copying.html">http://www.linuxgazette.com/copying.html</A><BR>
Published in Issue 63 of <i>Linux Gazette</i>, Mid-February (EXTRA) 2001</H5>
<!-- *** END copyright *** -->

<!--startcut ==========================================================-->
</BODY></HTML>
<!--endcut ============================================================-->