529 lines
13 KiB
HTML
529 lines
13 KiB
HTML
<!--startcut ==============================================-->
|
|
<!-- *** BEGIN HTML header *** -->
|
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
|
|
<HTML><HEAD>
|
|
<title>Apache Log Analysis Using Python LG #83</title>
|
|
</HEAD>
|
|
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
|
|
ALINK="#FF0000">
|
|
<!-- *** END HTML header *** -->
|
|
|
|
<!-- *** BEGIN navbar *** -->
|
|
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="thangaraju.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue83/tougher.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../lg_faq.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="ward.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
|
|
<!-- *** END navbar *** -->
|
|
|
|
<!--endcut ============================================================-->
|
|
|
|
<TABLE BORDER><TR><TD WIDTH="200">
|
|
<A HREF="http://www.linuxgazette.com/">
|
|
<IMG ALT="LINUX GAZETTE" SRC="../gx/2002/lglogo_200x41.png"
|
|
WIDTH="200" HEIGHT="41" border="0"></A>
|
|
<BR CLEAR="all">
|
|
<SMALL>...<I>making Linux just a little more fun!</I></SMALL>
|
|
</TD><TD WIDTH="380">
|
|
|
|
|
|
<center>
|
|
<BIG><BIG><STRONG><FONT COLOR="maroon">Apache Log Analysis Using Python</FONT></STRONG></BIG></BIG><BR>
|
|
<STRONG>By <A HREF="../authors/tougher.html">Rob Tougher</A></STRONG></BIG>
|
|
|
|
</TD></TR>
|
|
</TABLE>
|
|
<P>
|
|
|
|
<!-- END header -->
|
|
|
|
|
|
|
|
|
|
<dl>
|
|
<dt><a href=#1>1. Introduction</a>
|
|
<dt><a href=#2>2. The Framework</a>
|
|
<dd><a href=#2.1>2.1 First pass - Awk attempt</a>
|
|
<dd><a href=#2.2>2.2 Next pass - Python to the rescue</a>
|
|
<dt><a href=#3>3. Example Handlers</a>
|
|
<dd><a href=#3.1>3.1 Return visitors</a>
|
|
<dd><a href=#3.2>3.2 Referring domains</a>
|
|
<dt><a href=#4>4. Files</a>
|
|
<dt><a href=#5>5. Conclusion</a>
|
|
</dl>
|
|
|
|
<a name=1></a>
|
|
<h3>1. Introduction</h3>
|
|
|
|
<p>
|
|
I use the <a href="http://httpd.apache.org/">
|
|
Apache HTTP Server</a> to run my
|
|
<a href="http://www.robtougher.com">web site</a>.
|
|
When a visitor requests a page from the site, Apache
|
|
records the following information in a file named "access_log":
|
|
</p>
|
|
|
|
<ul>
|
|
<li>The IP address of the computer requesting the page
|
|
<li>The name of the page being requested
|
|
<li>The date and time of the request
|
|
<li>The page that referred the visitor to the requested page
|
|
</ul>
|
|
|
|
|
|
<p>
|
|
Until recently I used a combination of command line utilities
|
|
(grep, tail, sort, wc, less, awk) to extract
|
|
this information from the access log.
|
|
But some complex calculations were difficult and time-consuming
|
|
to perform using these tools.
|
|
I needed a more powerful solution -
|
|
a programming language to crunch the data.
|
|
</p>
|
|
|
|
<p>
|
|
Enter <a href="http://www.python.org">Python</a>. Python is
|
|
fast becoming my favorite language, and was the perfect tool
|
|
for solving this problem. I created a framework in Python for
|
|
performing generic text file analysis, and then utilized
|
|
this framework to glean information from my Apache access log.
|
|
</p>
|
|
|
|
<p>
|
|
This article first explains the framework, and
|
|
then describes two examples that use it. My hope
|
|
is that by the end of this article you will be able
|
|
to use this framework for analyzing your own text files.
|
|
</p>
|
|
|
|
<a name=2></a>
|
|
<h3>2. The Framework</h3>
|
|
|
|
<a name=2.1></a>
|
|
<h4>2.1 First pass - Awk attempt</h4>
|
|
|
|
<p>
|
|
When trying to solve this problem I initially turned to
|
|
<a href="http://www.gnu.org/software/gawk/gawk.html">Gawk</a>,
|
|
an implementation of the Awk language. Awk is primarily used
|
|
to search text files for certain pieces of data. The following is
|
|
a basic Awk script:
|
|
</p>
|
|
|
|
|
|
|
|
<p>Listing 1:
|
|
<a href="misc/tougher/count_lines.awk.txt">count_lines.awk</a></p>
|
|
<pre>
|
|
#!/usr/bin/awk -f
|
|
|
|
BEGIN {
|
|
count = 0
|
|
}
|
|
|
|
{ count++ }
|
|
|
|
END {
|
|
print count
|
|
}
|
|
</pre>
|
|
|
|
|
|
<p>
|
|
This script prints the number of lines in a file. You can
|
|
run it by typing the following at a command prompt:
|
|
</p>
|
|
|
|
<pre>
|
|
prompt$ ./count_lines.awk access_log
|
|
</pre>
|
|
|
|
<p>
|
|
Awk reads in the script, and does the following:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>Runs the code in the BEGIN block.
|
|
<li>Runs the middle block of code for each line in "access_log".
|
|
<li>Runs the code in the END block.
|
|
</ul>
|
|
|
|
<p>
|
|
I liked this processing model. It made sense to me -
|
|
first run some initialization code,
|
|
next process the file line by line,
|
|
and finally run some cleanup code. It seemed perfectly
|
|
suited to the task of analyzing text files.
|
|
</p>
|
|
|
|
<p>
|
|
Awk gave me trouble, though. It was very difficult to create
|
|
complex data structures - I was jumping through hoops for
|
|
tasks that should have been much more straightforward.
|
|
So after some time I started looking for an alternative.
|
|
</p>
|
|
|
|
<a name=2.2></a>
|
|
<h4>2.2 Next pass - Python to the rescue</h4>
|
|
|
|
<p>
|
|
My situation was this: I liked the Awk processing model,
|
|
but I didn't like the language itself. And I liked Python, but
|
|
it didn't have Awk's processing model. So I decided
|
|
to combine the two, and came up with the current framework.
|
|
</p>
|
|
|
|
<p>
|
|
The framework resides in
|
|
<a href="misc/tougher/awk.py.txt">awk.py</a>.
|
|
This module contains one class, <code>controller</code>, which
|
|
implements the following methods:
|
|
</p>
|
|
|
|
<ul>
|
|
<li><code>__init__(file)</code> - the constructor, which takes a
|
|
file object to process.
|
|
<li><code>subscribe(handler)</code> - subscribes a handler to the controller.
|
|
<li><code>run()</code> - processes the file.
|
|
<li><code>print_results()</code> - prints the results of the process.
|
|
</ul>
|
|
|
|
|
|
<p>
|
|
A <i>handler</i> is a class that implements a defined set
|
|
of methods. Multiple handlers
|
|
can be subscribed to the controller at any given time. Every
|
|
handler must implement the following methods:
|
|
</p>
|
|
|
|
|
|
<ul>
|
|
<li><code>begin()</code> - gets called once before the file is processed.
|
|
<li><code>process_line(line)</code> - gets called for each line of the file.
|
|
<li><code>end()</code> - gets called after the file is processed.
|
|
<li><code>description()</code> - gets called from
|
|
<code>controller.print_results()</code>. It should
|
|
return a description of the handler.
|
|
<li><code>result()</code> - also called from
|
|
<code>controller.print_results()</code>.
|
|
It should return the results of the class' calculations.
|
|
</ul>
|
|
|
|
|
|
<p>
|
|
You create handlers, subscribe them to the controller, and then
|
|
run the controller. The following is a simple example with one handler:
|
|
</p>
|
|
|
|
|
|
|
|
<p>Listing 2: <a href="misc/tougher/count_lines.py.txt">count_lines.py</a></p>
|
|
<pre>
|
|
# Standard sys module
|
|
import sys
|
|
|
|
# Custom awk.py module
|
|
import awk
|
|
|
|
class count_lines:
|
|
|
|
def begin(self):
|
|
self.m_count = 0
|
|
|
|
def process_line(self, s):
|
|
self.m_count += 1
|
|
|
|
def end(self):
|
|
pass
|
|
|
|
def description(self):
|
|
return "# of lines in the file"
|
|
|
|
def result(self):
|
|
return self.m_count
|
|
|
|
|
|
#
|
|
# Step 1: Create the Awk controller
|
|
#
|
|
ac = awk.controller(sys.stdin)
|
|
|
|
#
|
|
# Step 2: Subscribe the handler
|
|
#
|
|
ac.subscribe(count_lines())
|
|
|
|
#
|
|
# Step 3: Run
|
|
#
|
|
ac.run()
|
|
|
|
#
|
|
# Step 4: Print the results
|
|
#
|
|
ac.print_results()
|
|
</pre>
|
|
|
|
|
|
|
|
<p>
|
|
You can run this script using the following command:
|
|
</p>
|
|
|
|
<pre>
|
|
prompt$ cat access_log | python count_lines.py
|
|
</pre>
|
|
|
|
<p>
|
|
The results of the script should be printed to the console.
|
|
</p>
|
|
|
|
|
|
|
|
<a name=3></a>
|
|
<h3>3. Example Handlers</h3>
|
|
|
|
<p>
|
|
Now that the framework was in place, I had to figure
|
|
out how I was going to use it. I came up with many ideas, but
|
|
the following two were the top priorities.
|
|
</p>
|
|
|
|
<a name=3.1></a>
|
|
<h4>3.1 Return visitors</h4>
|
|
|
|
<p>
|
|
The first question that I wanted to answer using my new framework
|
|
was the following:
|
|
</p>
|
|
|
|
<ul>
|
|
<li><i>How many people have returned to the site more than N times?</i>
|
|
</ul>
|
|
|
|
|
|
<p>
|
|
My thinking was this: if people return often, they must enjoy
|
|
the site, right? The following script answers the
|
|
above question:
|
|
</p>
|
|
|
|
|
|
<p>
|
|
Listing 3: return_visitors (can be found in
|
|
<a href="misc/tougher/handlers.py.txt">handlers.py</a>)
|
|
</p>
|
|
|
|
<pre>
|
|
class return_visitors:
|
|
|
|
def __init__(self, n):
|
|
self.m_n = n
|
|
self.m_ip_days = {}
|
|
|
|
def begin(self):
|
|
pass
|
|
|
|
def process_line(self, s):
|
|
|
|
try:
|
|
array = s.split()
|
|
ip = array[0]
|
|
day = array[3][1:7]
|
|
|
|
if self.m_ip_days.has_key(ip):
|
|
|
|
if day not in self.m_ip_days[ip]:
|
|
self.m_ip_days[ip].append(day)
|
|
|
|
else:
|
|
self.m_ip_days[ip] = []
|
|
self.m_ip_days[ip].append(day)
|
|
|
|
except IndexError:
|
|
pass
|
|
|
|
|
|
|
|
def end(self):
|
|
|
|
ips = self.m_ip_days.keys()
|
|
count = 0
|
|
|
|
for ip in ips:
|
|
|
|
if len(self.m_ip_days[ip]) > self.m_n:
|
|
count += 1
|
|
|
|
self.m_count = count
|
|
|
|
|
|
def description(self):
|
|
return "# of IP addresses that visited more than %s days" % self.m_n
|
|
|
|
def result(self):
|
|
return self.m_count
|
|
</pre>
|
|
|
|
|
|
<p>
|
|
The script stores the number of days that each IP address has visited
|
|
the site. When the file is finished processing, it returns how
|
|
many IP addresses have visited more than N times.
|
|
</p>
|
|
|
|
|
|
<a name=3.2></a>
|
|
<h4>3.2 Referring domains</h4>
|
|
|
|
<p>
|
|
Another thing I wanted to know was how people found out about the
|
|
site. I was getting a decent amount of traffic, and I wasn't sure
|
|
why. I kept asking myself:
|
|
</p>
|
|
|
|
|
|
<ul>
|
|
<li><i>Where are all these people coming from?</i>
|
|
</ul>
|
|
|
|
<p>
|
|
I guess you shouldn't argue with a site that's popular. But
|
|
I was curious to know how people were learning about my site.
|
|
So I wrote the following script:
|
|
</p>
|
|
|
|
|
|
<p>Listing 4: referring_domains (can be found in
|
|
<a href="misc/tougher/handlers.py.txt">handlers.py</a>)
|
|
</p>
|
|
|
|
<pre>
|
|
class referring_domains:
|
|
|
|
def __init__(self):
|
|
self.m_domains = {}
|
|
|
|
def begin(self):
|
|
pass
|
|
|
|
def process_line(self, line):
|
|
|
|
try:
|
|
array = line.split()
|
|
referrer = array[10]
|
|
|
|
m = re.search('//[a-zA-Z0-9\-\.]*\.[a-zA-z]{2,3}/',
|
|
referrer)
|
|
|
|
length = len(m.group(0))
|
|
domain = m.group(0)[2:length-1]
|
|
|
|
if self.m_domains.has_key(domain):
|
|
self.m_domains[domain] += 1
|
|
else:
|
|
self.m_domains[domain] = 1
|
|
|
|
except AttributeError:
|
|
pass
|
|
except IndexError:
|
|
pass
|
|
|
|
|
|
def end(self):
|
|
pass
|
|
|
|
|
|
def description(self):
|
|
return "Referring domains"
|
|
|
|
|
|
def sort(self, key1, key2):
|
|
if self.m_domains[key1] > self.m_domains[key2]:
|
|
return -1
|
|
elif self.m_domains[key1] == self.m_domains[key2]:
|
|
return 0
|
|
else:
|
|
return 1
|
|
|
|
|
|
def result(self):
|
|
|
|
s = ""
|
|
keys = self.m_domains.keys()
|
|
keys.sort(self.sort)
|
|
|
|
for domain in keys:
|
|
s += domain
|
|
s += " "
|
|
s += str(self.m_domains[domain])
|
|
s += "\n"
|
|
|
|
s += "\n\n"
|
|
|
|
return s
|
|
</pre>
|
|
|
|
|
|
<p>
|
|
This script stores the referral information
|
|
for each request, and generates a list of
|
|
referring domains, sorted by frequency.
|
|
</p>
|
|
|
|
<p>
|
|
I ran the script and found that most of the referrals came from my own site.
|
|
This makes sense - when a visitor moves from one page to another on
|
|
the site, the referring domain for the page is my web site's
|
|
domain. But I did find some interesting entries in the referral
|
|
list, and my question about site traffic was answered.
|
|
</p>
|
|
|
|
|
|
<a name=4></a>
|
|
<h3>4. Files</h3>
|
|
|
|
<p>
|
|
The following files contain the code from this article:
|
|
</p>
|
|
|
|
<ul>
|
|
<li><a href="misc/tougher/count_lines.awk.txt">count_lines.awk</a> -
|
|
a basic Awk script
|
|
<li><a href="misc/tougher/awk.py.txt">awk.py</a> -
|
|
the <code>controller</code> class
|
|
<li><a href="misc/tougher/count_lines.py.txt">count_lines.py</a> -
|
|
<code>count_lines</code> handler
|
|
<li><a href="misc/tougher/handlers.py.txt">handlers.py</a> -
|
|
<code>return_visitors</code> and <code>referring_domains</code> handlers
|
|
</ul>
|
|
|
|
<a name=5></a>
|
|
<h3>5. Conclusion</h3>
|
|
|
|
|
|
<p>
|
|
In this article I described how I use Python to process my
|
|
Apache HTTP Server access log. Hopefully I explained my techniques
|
|
clearly enough so that you can use them for your text files.
|
|
</p>
|
|
|
|
|
|
|
|
|
|
<!-- *** BEGIN copyright *** -->
|
|
<hr>
|
|
<CENTER><SMALL><STRONG>
|
|
|
|
Copyright © 2002, Rob Tougher.
|
|
Copying license <A HREF="../copying.html">http://www.linuxgazette.com/copying.html</A><BR>
|
|
Published in Issue 83 of <i>Linux Gazette</i>, October 2002</H5>
|
|
</STRONG></SMALL></CENTER>
|
|
<!-- *** END copyright *** -->
|
|
<HR>
|
|
|
|
<!--startcut ==========================================================-->
|
|
<CENTER>
|
|
<!-- *** BEGIN navbar *** -->
|
|
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="thangaraju.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue83/tougher.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../lg_faq.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="ward.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
|
|
<!-- *** END navbar *** -->
|
|
</CENTER>
|
|
</BODY></HTML>
|
|
<!--endcut ============================================================-->
|