There are many tools available for analyzing "NCSA combined"
format logs, but the standard unix text tools are fast and flexible for extracting info from them.
In this article I describe the scripts I find useful for monitoring the latest activity in
web server logs on the linux command line.
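For reference, each entry in such a log is a single line of the following form (a made-up example):

10.1.2.3 - - [17/Dec/2007:10:31:04 +0000] "GET /feed/rss2.xml HTTP/1.1" 200 9768 "http://www.pixelbeat.org/" "Liferea/1.4.9 (Linux; en_US)"

i.e. client address, identity, user, timestamp, request, status, response size, referrer and user agent.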
Realtime filtered view
The standard tool to view entries as they are added to a log file is tail -f. For web server logs though, one really needs to filter and format this further. There are dedicated tools for this like apachetop and wtop, or even a 3D graphical viewer. But for sites with modest traffic (say up to 1000 hits per hour), I find the following script very useful on a terminal.

#need to source this script so that $COLUMNS is available
#Note -n10 is an implicit option, which one can override.

deny="`tput bold; tput setaf 1`"  #bright red
high="`tput bold; tput setaf 3`"  #bright yellow
norm="`tput sgr0`"

tail "$@" -s.1 -f access_log |
grep -Ev --line-buffered -f agents_to_ignore -f files_to_ignore |
sed -u "s/\(.*\) - \[[^]]*\]\(.*\)/\1\2/" | #strip some fields
#make google searches easier to interpret
sed -u \
's#\(.*\) "http://www\.\(google\.[^/]*\).*[?&_]q=\([^&"]*\)[^"]*"#\1 "\2 (\3)" #' |
#strip common redundant info
sed -u 's/ HTTP\/1.[01]//;
        s/.NET CLR [0-9.]*//g;
        s/Gecko\/[0-9]*//;
        s/rv:[0-9.]*//;
        s/Mozilla\/[0-9.]* //' |
sed -u "s/^/        /; s/^ *\([ 0-9.]\{15,\}\) -/\1/" | #align IP addresses
sed -u "s/\(.\{$COLUMNS\}\).*/\1/" | #keep to width of terminal
#highlight referrer column
sed -u "
s/\([^\"]*\"[^\"]*\" 40[34] [0-9]* \"\)\([^\"]*\)\(\".*\)/\1$deny\2$norm\3/;t;
s/\([^\"]*\"[^\"]*\" [0-9 -]* \"\)\([^\"]*\)\(\".*\)/\1$high\2$norm\3/;t;
"

There are a couple of things to note about this script. The first is that all the commands are run in line-buffered mode, so that matching entries are displayed immediately as they are written to the log file. The second is that the script uses the $COLUMNS environment variable, which is not exported by the interactive shell. Hence the above script must be sourced by the interactive shell as follows.
$ . tail_access_log

I usually leave the above running in a full screen xterm connected to my web server. Note that you can press the return key to add blank lines, to delimit the entries you've already looked at.
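The agents_to_ignore and files_to_ignore files referenced above are just grep -E pattern files, one regular expression per line, matched against the whole log entry. Their exact contents aren't shown in this article, so the following is only a guess at the sort of thing they might contain:

$ cat agents_to_ignore
Googlebot
msnbot
Yahoo! Slurp

$ cat files_to_ignore
favicon\.ico
robots\.txt
\.css HTTP
\.js HTTP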
Efficiently monitoring the last time period
Web logs in general tend to grow quite large, so it would be nice to be able to process the last part of the log with performance independent of the total size of the log. For instance, I only rotate my access_log once a year, which currently equates to it growing to around 700MB. Reading all of that would flush the existing file and application caches on my system, as well as requiring a lot of slow disk access.

The technique used in the get_last_days script below is to use the tac command to seek to the end of the file and read only the required part from disk (or a little more, for efficiency). sed is used to terminate the read (grep -m1 is an alternative for positive matches).
#!/bin/sh
#return the last x days from an "NCSA combined" format HTTP log

days=$1
log="$2"

export LANG=C #for speed
export TZ=UTC0

last_log_date=`tail -1 "$log" |
               sed 's#.*\[\([^:]*\):\([^ ]*\) .*#\1 \2#g' |
               tr '/' ' '`
yesterday=`date --date="$last_log_date $days day ago" +"%d/%b/%Y:%H:%M"`
#match to within 10 mins (assuming a log entry every min is too much in general)
yesterday=`echo $yesterday | cut -b-16`
yesterday="$yesterday[0-9]"

tac "$log" | sed "\#$yesterday#Q"

So now that we can efficiently read the last time period of data from the log file, let's look at an example of how we can use that data. The subscribers script below quantifies the subscribers to your blog, using various standard unix command line tools to show the number of subscribers grouped by the feed reader they use. [Update: I had an older script which ignored all web browsers, but Manfred Schwarb, who subscribes to my feed using Opera, informed me that Firefox, IE7 and Opera at least can be used to subscribe to blogs, so I try to handle that in the script below.]
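First though, a couple of quick checks with get_last_days on its own (a hedged sketch; the paths are assumed, and note the output is newest entry first because of tac):

#number of requests in the last day
./get_last_days 1 access_log | wc -l

#distinct client addresses over the last week
./get_last_days 7 access_log | cut -d' ' -f1 | sort -u | wc -l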
#!/bin/sh

export LANG=C #for speed

feed="/feed/rss2.xml"

#assume all subscribers check at least once a week
./get_last_days 7 access_log |

#filter on those accessing feed URL
grep -F "GET $feed" |

#exclude browsers that refer to (click) feed from site
grep -vE "pixelbeat.org.*(rv:|MSIE|AppleWebKit/|Konqueror|Opera) .* " |

#extract first 16 bits of ip & user_agent
sed 's/\([0-9]*\.[0-9]*\)\.[0-9]*\.[0-9]* .*"\([^"]*\)"$/\1\t\2/' |

#sort by agent, then by ip net
sort -k2 -k1,1 |

#merge and count all requests from same user agent at a particular net
uniq -c |

#ignore single requests from browsers?
grep -vE " 1 .*(rv:|MSIE|AppleWebKit/|Konqueror|Opera).*" |

#ignore bots
grep -vE -f agents_to_ignore |

#Merge reader variants
sed '
s/\([^\t]\)\t.*Firefox.*/\1\tFirefox/;
s/\([^\t]\)\t.*MSIE 7.0.*/\1\tIE7/;
s/\([^\t]\)\t.*Opera.*/\1\tOpera/;
s/\([^\t]\)\t.*Akregator.*/\1\tAkregator/;
s/\([^\t]\)\t.*Thunderbird.*/\1\tThunderbird/;
s/\([^\t]\)\t.*Liferea.*/\1\tLiferea/;
s/\([^\t]\)\t.*Google Desktop.*/\1\tGoogle Desktop/;
' |

#select just agent strings
cut -d"`echo -e '\t'`" -f2 |

#group agent strings
sort |

#count number of subscribers using each agent
uniq -c |

#uniquely identify different feeds read by google
sed 's/\(.*\)\(feedfetcher.html\)\(.*\)id=\([0-9]*\).*/\1\2.\4\3/' |

#move subscriber counts of online readers to first column
sed 's/ *[0-9]* .*\(http[^;]*\).* \([0-9]*\) subscriber.*/ \2 \1/' |

#merge agents again, in case there were increasing subscribers during week
uniq -f1 |

#sort by subscriber numbers
sort -k1,1n |

#right align numbers
sed "s/^/      /; s/ *\([ 0-9]\{7,\}\) \([^ ].*\)/\1 \2/" |

#truncate lines to 80 chars
sed "s/\(.\{80\}\).*/\1/" #note $COLUMNS not exported

So running the above script takes only milliseconds on my 700MB access_log, and produces the following,
which one could filter further, for example by adding up the subscriber numbers (a short example of this follows the listing below).
      1 Abilon
      1 AppleSyndication/54
      1 GreatNews/1.0
      1 Hatena RSS/0.3 (http://r.hatena.ne.jp)
      1 Mozilla/3.01 (compatible;)
      1 Raggle/0.4.4 (i486-linux; Ruby/1.8.5)
      1 Rome Client (http://tinyurl.com/64t5n) Ver: 0.7
      1 Rome Client (http://tinyurl.com/64t5n) Ver: 0.9
      1 SharpReader/0.9.7.0 (.NET CLR 1.1.4322.2407; WinNT 5.1.2600.0)
      1 Snownews/1.5.7 (Linux; en_US; http://snownews.kcore.de/)
      1 Vienna/2.2.1.2210
      1 Zhuaxia.com 2 Subscribers
      1 panscient.com
      1 radianrss-1.0
      1 topicblogs/0.9
      2 AideRSS/1.0 (aiderss.com); 2 subscribers
      2 BuzzTracker/1.01 +http://www.buzztracker.com
      2 Mozilla/5.0 (Sage)
      2 Opera
      2 RSSOwl/1.2.4 2007-11-26 (Windows; U; de)
      2 xianguo 1 subscribers
      2 http://www.simplyheadlines.com
      3 Feedreader 3.11 (Powered by Newsbrain)
      3 IE7
      3 Mozilla/4.0 (compatible;)
      3 Thunderbird
      4 Liferea
      5 Mozilla/3.0 (compatible)
      7 http://www.netvibes.com/
      8 http://www.bloglines.com
      9 Akregator
     55 Firefox
     55 http://www.rojo.com/corporate/help/agg/
     82 Google Desktop
    100 http://www.google.com/feedfetcher.html

The above shows a few interesting things, for me at least. The first is that there are over 400 subscribers to my "blog", whereas I would have guessed there were around 20. This is especially surprising as I only post page titles and descriptions in my feed. One can also see how popular the Google products are for reading blogs, and how unpopular Mozilla Thunderbird is, which is the feed reader I use.
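As mentioned above, one could total those subscriber numbers. A minimal hedged sketch, assuming the report above is produced by a script saved as ./subscribers (the few entries that still have an "N subscribers" suffix embedded in the agent string would need extra handling):

#sum the first column of the report to estimate total subscribers
./subscribers | awk '{total += $1} END {print total}'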
© Dec 17 2007