old-www/LDP/LG/issue63/nielsen.html

<!--startcut  ==============================================-->
<!-- *** BEGIN HTML header *** -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML><HEAD>
<title>Downloading LinuxToday links and Linux Gazette's TOC with Python (and Perl) LG #63</title>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
ALINK="#FF0000">
<!-- *** END HTML header *** -->

<CENTER>
<A HREF="http://www.linuxgazette.com/">
<H1><IMG ALT="LINUX GAZETTE" SRC="../gx/lglogo.png"
	WIDTH="600" HEIGHT="124" border="0"></H1></A>

<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="gibbs.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue63/nielsen.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom"  ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="nielsen2.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom"  ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
<P>
</CENTER>

<!--endcut ============================================================-->

<H4 ALIGN="center">
"Linux Gazette...<I>making Linux just a little more fun!</I>"
</H4>

<P> <HR> <P>
<!--===================================================================-->

<center>
<H1><font color="maroon">Downloading LinuxToday links and Linux Gazette's TOC with Python (and Perl)</font></H1>
<H4>By <a href="mailto:articles@gnujobs.com">Mark Nielsen</a></H4>
</center>
<P> <HR> <P>

<!-- END header -->


<h3>Contents</h3>

<ol>
<li>
<a href="#Introduction">Introduction</a></li>

<li>
<a href="#python">The Python Script</a></li>

<li>
<a href="#cron">Setting up a cron job</a></li>

<li>
<a href="#perl"> A Perl Script I wrote</a> to download <I>Linux Gazette</I> TOC.

<li>
<a href="#perl2">A Perl Script I wrote to download Debian Weekly News</a>

<li>
<a href="#Conclusion">Conclusion</a></li>

<li>
<a href="#REF">References</a></li>

</ol>

<h3>
<a NAME="Introduction"></a>Introduction</h3>
I wanted to add Linux Today's links to my website
<a href="http://www.gnujobs.com">GNUJobs.com</a>, just for the fun of it.
Later, I want to add more headlines from other websites, and perhaps <I>LG</I>'s
latest edition.
I had a choice of Perl or Python. I choose Python because I have been
using it for quite a while for a mathematical project, and it has proven
quite useful. I want to make a habit of using Python now. It tends to be
easier for me to program in Python than Perl. Also, in the future, I wish
to use threading to download many webpages at the same time,
 which Python does very well. I might as well
do it in Python now since I know I will use it later.
<p>
Both Perl and Python will let you download webpages off of the internet.
You can do more than just download webpages, such as ftp, gopher, and connect
to other services. Downloading a webpage is just one thing these languages
can do.
<p>
There are several things the programming language has to do:
<ul>
<li> Download the webpage</li>
<li> Parse the data correctly to reformat the data</li>
<li> Reformat the data</li>
<li> Replace the old file with the new file only if it contains valid data</li>
</ul>
<p>
This article isn't going to be too long. I commented the Python code.


<h3>
<a NAME="python"></a>The Python Script</h3>

If you want to include the output of this script to a webpage, then you can use
the Server-Side Include (SSI) module in the Apache webserver and use a command
like:
<pre>
&lt;!--#include virtual="/lthead.html" --&gt;
</pre>
in your webpage. Various programming languages (like PHP, Perl ASP, Perl Mason, etc)
can also include files.

<p>
 It is assumed you are using a GNU/Linux
operating system.
Also, I was using Python 1.5.2, which is not the latest version.
You might have to do a
<PRE>
chmod 755 LinuxToday.py
</PRE>
on the script to make it executable.

<A HREF="misc/nielsen/LinuxToday.py.txt">[Text version of this listing.]</A>

<pre>
#!/usr/bin/python

# One obvious thing to do is apply error checking for url download,
# download must contain at least one entry, and we are able to create the
# new file. This will be done later.

  ### import the web module, string module, regular expression,  module
  ### and the os module
import urllib, string, re, os

  ### define the new webpage we create and where to get the info
Download_Location = &quot;/tmp/lthead.html&quot;
Url = &quot;http://linuxtoday.com/backend/lthead.txt&quot;

#-----------------------------------------------------------
  ### Create a web object with the Url
LinuxToday = urllib.urlopen( Url )
  ### Grab all the info into an array (if big, change to do one line at a time)
Text_Array =  LinuxToday.readlines()

New_File  = open(Download_Location + &quot;_new&quot;, 'w');
New_File.write(&quot;&lt;ul&gt;\n&quot;)
  ### Set the default to be invalid
Valid = 0
  ### Record the number of valid entries
Entry_No = 0;
Entry_Valid = 0
  ### Setup the defaults
Date = &quot;&quot;
Link = &quot;&quot;
Header = &quot;&quot;
Count = 0
  ### Create the mattern matching expression
Match = re.compile (&quot;^\&\&&quot;)

  ### Append &amp;&amp; to make sure we parse the last entry
Text_Array.append('&amp;&amp;')
  ### For each line, do the following
for Line in Text_Array :
    ### If &amp;&amp; exists, start from scratch, add last entry
  if Match.search(Line) :
      ### If the current entry is valid and we have skipped the first one,
    if (Entry_No &gt; 1) and (Entry_Valid &gt; 0) :
	### One thing that Perl does better than Python is the print command. I
	### don't like how Python prints (no variable interpolation).
      New_File.write('&lt;li&gt; &lt;a href=&quot;' + Link + '&quot;&gt;' + Header + '&lt;/a&gt;. ' + Date + &quot;&lt;/li&gt;\n&quot;)
      ## Reset the values to nothing.
    Header = &quot;&quot;; Link = &quot;&quot;; Date = &quot;&quot;; Entry_Valid = 0
    Count = 0

    ### Delete whitespace at end of line
  Line = string.rstrip(Line)

    ### If count is equal to 1, header, 2 link, 3 date
  if Count == 1:    Header = Line
  elif Count == 2:  Link = Line
  elif Count == 3:
    Date = Line
      ### If all fields are done, we have a valid entry
    if  (Header != &quot;&quot;) or (Link != &quot;&quot;) or (Date != &quot;&quot;) :
      Entry_No = Entry_No + 1
      Entry_Valid = 1

    ### Add one to Count
  Count = Count + 1

New_File.write(&quot;&lt;/ul&gt;\n&quot;)

New_File.close()

  ### If we have valid entries, move the new file to the real location
if Entry_No &gt; 0 :
    ### We could just do:
    ### os.rename(Download_Location + &quot;_new&quot;, Download_Location)
    ### But here's how to do it with an external command.
  Command = &quot;mv &quot; + Download_Location + &quot;_new &quot; + Download_Location
  os.system( Command )
</pre>


<h3>
<a NAME="cron"></a>The Cron Script to make it run nightly</h3>
Not the best crontab file, but it will do.

<pre>
#/bin/sh

### Crontab file
### Name the file &quot;Crontab&quot; and execute with &quot;crontab Crontab&quot;

  ### Download every two hours
*/2 * * * *   /www/Cron/LinuxToday.py &gt;&gt; /www/Cron/out  2&gt;&amp;1
</pre>


<h3>
<a NAME="perl"></a>A Perl Script I wrote to download Linux Gazette TOC</h3>

Just so you can compare this to a Perl script, I created a Perl script
which downloads the <I>LG</I>'s TOC for the latest edition.

<A HREF="misc/nielsen/LinuxGazette.pl.txt">[Text version of this listing.]</A>

<pre>
#!/usr/bin/perl
# Copyright Mark Nielsen January 20001
# Copyrighted under the GPL license.

# I am proud of this script.
# I wrote it from scratch with only 2 minor errors when I first tested it.

system (&quot;lynx --source http://www.linuxgazette.com/ftpfiles.txt &gt; /tmp/List.txt&quot;);

  ### Open up the webpage we just downloaded and put it into an array.
open(FILE,'/tmp/List.txt'); my @Lines = &lt;FILE&gt;; close FILE;
  ### Filter out lines that don't contain magic letters.
my @Lines = grep(($_ =~ /lg\-issue/) || ($_ =~ /\.tar\.gz/), @Lines );

my @Numbers = ();
foreach my $Line (@Lines)
  {
    ## Throw away the stuff to the left
  my ($Junk,$Good) = split(/lg\-issue/,$Line,2);
    ## Throw away the stuff to the right
  ($Good,$Junk) = split(/\.tar\.gz/,$Good,2);
    ## If it is a valid number, it is greater than 1, save it
  if ($Good &gt; 0) {push (@Numbers,$Good);}
  }

   ### Sort the numbers and pop off the highest
@Numbers = sort {$a&lt;=&gt;$b} @Numbers;
my $Highest = pop @Numbers;
   ## Create the url we are going to download
my $Url = &quot;http://www.linuxgazette.com/issue$Highest/index.html&quot;;
   ## Download it
system (&quot;lynx --source $Url &gt; /tmp/LG_index.html&quot;);

   ### Open up the index.
open(FILE,&quot;/tmp/LG_index.html&quot;); my @Lines = &lt;FILE&gt;; close FILE;
   ### Extract out the parts that are between beginning and end of TOC.
my @TOC = ();
my $Count = 0;
my $Start = '&lt;!-- *** BEGIN toc *** --&gt;';
my $End = '&lt;!-- *** END toc *** --&gt;';
foreach my $Line (@Lines)
  {
  if ($Line =~ /\Q$End\E/) {$Count = 2;}
  if ($Count == 1) {push(@TOC, $Line);}
  if ($Line =~ /\Q$Start\E/) {$Count = 1;}
  }

  ### Relink all the links to point to the Linux Gazette magazine
my $Relink = &quot;http://www.linuxgazette.com/issue$Highest/&quot;;
grep($_ =~ s/HREF\=\&quot;/HREF\=\&quot;$Relink/g, @TOC);

  ### Save the output
open(FILE,&quot;&gt;/tmp/TOC.html&quot;); print FILE @TOC; close FILE;

  ### Done!
</pre>


<h3>
<a NAME="perl2"></a>A Perl Script I wrote to download Debian Weekly News</h3>

I like to keep track of Debian Weekly News, so I wrote this one also.
One bad thing about programming, is that when you get really good
at programming in a certain way, it is hard to switch to another
programming language. These two Perl scripts I did without
looking at any code. The Python code took me a while, because I am still
not used to it.

<A HREF="misc/nielsen/DebianWeeklyNews.pl.txt">[Text version of this listing.]</A>

<pre>
#!/usr/bin/perl
# Copyright Mark Nielsen January 20001
# Copyright under the GPL license.

system (&quot;lynx --source http://www.debian.org/News/weekly/index.html &gt; /tmp/List2.txt&quot;);

  ### Open up the webpage we just downloaded and put it into an array.
open(FILE,'/tmp/List2.txt'); my @Lines = &lt;FILE&gt;; close FILE;
   ### Extract out the parts that are between beginning and end of TOC.
my @TOC = ();
my $Count = 0;
my $Start = 'Recent issues of Debian Weekly News';
my $End = '&lt;/p&gt;';
foreach my $Line (@Lines)
  {
  if (($Line =~ /\Q$End\E/i) && ($Count &gt; 0)) {$Count = 2;}
  if ($Count == 1) {push(@TOC, $Line);}
  if ($Line =~ /^\Q$Start\E/i) {$Count = 1;}
  }

  ### Relink all the links to point to the DWN
my $Relink = &quot;http://www.debian.org/News/weekly/&quot;;
grep($_ =~ s/HREF\=\&quot;/HREF\=\&quot;$Relink/ig, @TOC);
grep($_ =~ s/\&quot;\&gt;/\&quot; target=_external\&gt;/ig, @TOC);

  ### Save the output
open(FILE,&quot;&gt;/tmp/D.html&quot;); print FILE @TOC; close FILE;

  ### Done!
</pre>


<h3>
<a NAME="Conclusion"></a>Conclusion</h3>
The Python script actually is made much more complex than it needs to be.
The reason why I made it longer was to introduce various modules and to be
flexible in case LinuxToday's format changes someday. The only thing the script
lacks is error detection in case it can't download the web page, write the new
file or rename it.  Also, watch the regular-expression modules in Python,
because they have been changing in recent versions to increase efficiency and
incorporate Unicode support.
<p>
Python rules as a programming language. I found it very easy to use
the Python modules. It seems like the Python module for handling webpages
is easier than the LWP module in Perl. Because of the many possibilities of
Python, I plan on creating a Python script which will download many webpages
at the same time using Python's threading capbilities.


<h3>
<a NAME="REF"></a>References</h3>

<ol>
<li><a href="http://linuxtoday.com/backend/lthead.txt"> LinuxToday's links</a></li>
<li> <a href="http://www.python.org/doc/current/lib/module-urllib.html">Python's urllib module</a></li>
<li><a href="http://www.gnujobs.com/Articles/13/LT_Python.html">Original site for this article</a> (any updates will be here)
</ol>


<!-- *** BEGIN copyright *** -->
<P> <hr> <!-- P -->
<H5 ALIGN=center>

Copyright &copy; 2001, Mark Nielsen.<BR>
Copying license <A HREF="../copying.html">http://www.linuxgazette.com/copying.html</A><BR>
Published in Issue 63 of <i>Linux Gazette</i>, Mid-February (EXTRA) 2001</H5>
<!-- *** END copyright *** -->

<!--startcut ==========================================================-->
<HR><P>
<CENTER>
<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="gibbs.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue63/nielsen.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom"  ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="nielsen2.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom"  ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
</CENTER>
</BODY></HTML>
<!--endcut ============================================================-->