There are many tools available for analyzing "NCSA combined"
format logs, but the standard unix text tools are fast and flexible for extracting info from them.
In this article I describe the scripts I find useful for monitoring the latest activity in
web server logs on the linux command line.
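For reference, each entry in such a log is a single line of the following form (a made-up example):

10.1.2.3 - - [17/Dec/2007:10:31:04 +0000] "GET /feed/rss2.xml HTTP/1.1" 200 9768 "http://www.pixelbeat.org/" "Liferea/1.4.9 (Linux; en_US)"

i.e. client address, identity, user, timestamp, request, status, response size, referrer and user agent.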
Realtime filtered view
The standard tool to view entries as they are added to a log file is tail -f. For web server logs though, one really needs to filter and format this further. There are dedicated tools for this like apachetop and wtop, or even a 3D graphical viewer. But for sites with modest traffic (say up to 1000 hits per hour), I find the following script very useful on a terminal.

#need to source this script so that $COLUMNS is available
#Note -n10 is an implicit option, which one can override.

deny="`tput bold; tput setaf 1`"  #bright red
high="`tput bold; tput setaf 3`"  #bright yellow
norm="`tput sgr0`"

tail "$@" -s.1 -f access_log |
grep -Ev --line-buffered -f agents_to_ignore -f files_to_ignore |
sed -u "s/\(.*\) - \[[^]]*\]\(.*\)/\1\2/" | #strip some fields
#make google searches easier to interpret
sed -u \
's#\(.*\) "http://www\.\(google\.[^/]*\).*[?&_]q=\([^&"]*\)[^"]*"#\1 "\2 (\3)" #' |
#strip common redundant info
sed -u 's/ HTTP\/1.[01]//;
        s/.NET CLR [0-9.]*//g;
        s/Gecko\/[0-9]*//;
        s/rv:[0-9.]*//;
        s/Mozilla\/[0-9.]* //' |
sed -u "s/^/        /; s/^ *\([ 0-9.]\{15,\}\) -/\1/" | #align IP addresses
sed -u "s/\(.\{$COLUMNS\}\).*/\1/" | #keep to width of terminal
#highlight referrer column
sed -u "
s/\([^\"]*\"[^\"]*\" 40[34] [0-9]* \"\)\([^\"]*\)\(\".*\)/\1$deny\2$norm\3/;t;
s/\([^\"]*\"[^\"]*\" [0-9 -]* \"\)\([^\"]*\)\(\".*\)/\1$high\2$norm\3/;t;
"

There are a couple of things to note about this script. The first is that all the commands are run in line-buffered mode, so that matching entries are displayed immediately as they are written to the log file. The second is that the script uses the $COLUMNS environment variable, which is not exported by the interactive shell. Hence the above script must be sourced by the interactive shell as follows.
$ . tail_access_log

I usually leave the above running in a full screen xterm connected to my web server. Note that you can press the return key to add blank lines, to delimit the entries you've already looked at.
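The agents_to_ignore and files_to_ignore files referenced above are just grep -E pattern files, one regular expression per line, matched against the whole log entry. Their exact contents aren't shown in this article, so the following is only a guess at the sort of thing they might contain:

$ cat agents_to_ignore
Googlebot
msnbot
Yahoo! Slurp

$ cat files_to_ignore
favicon\.ico
robots\.txt
\.css HTTP
\.js HTTP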
Efficiently monitoring the last time period
Web logs in general tend to grow quite large, so it would be nice to be able to process the last part of the log with performance independent of the total size of the log. For instance, I only rotate my access_log once a year, which currently equates to it growing to around 700MB. Reading all of that would flush the existing file and application caches on my system, as well as requiring a lot of slow disk access.

The technique used in the get_last_days script below is to use the tac command to seek to the end of the file and read only the required part from disk (or a little more, for efficiency). sed is used to terminate the read (grep -m1 is an alternative for positive matches).
#!/bin/sh
#return the last x days from an "NCSA combined" format HTTP log

days=$1
log="$2"

export LANG=C #for speed
export TZ=UTC0

last_log_date=`tail -1 "$log" |
               sed 's#.*\[\([^:]*\):\([^ ]*\) .*#\1 \2#g' |
               tr '/' ' '`
yesterday=`date --date="$last_log_date $days day ago" +"%d/%b/%Y:%H:%M"`
#match to within 10 mins (assuming a log entry every min is too much in general)
yesterday=`echo $yesterday | cut -b-16`
yesterday="$yesterday[0-9]"

tac "$log" | sed "\#$yesterday#Q"

So now that we can efficiently read the last time period of data from the log file, let's look at an example of how we can use that data. The subscribers script below quantifies the subscribers to your blog, using various standard unix command line tools to show the number of subscribers grouped by the feed reader they use. [Update: I had an older script which ignored all web browsers, but Manfred Schwarb, who subscribes to my feed using Opera, informed me that Firefox, IE7 and Opera at least can be used to subscribe to blogs, so I try to handle that in the script below.]
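First though, a couple of quick checks with get_last_days on its own (a hedged sketch; the paths are assumed, and note the output is newest entry first because of tac):

#number of requests in the last day
./get_last_days 1 access_log | wc -l

#distinct client addresses over the last week
./get_last_days 7 access_log | cut -d' ' -f1 | sort -u | wc -l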
#!/bin/sh

export LANG=C #for speed

feed="/feed/rss2.xml"

#assume all subscribers check at least once a week
./get_last_days 7 access_log |

#filter on those accessing feed URL
grep -F "GET $feed" |

#exclude browsers that refer to (click) feed from site
grep -vE "pixelbeat.org.*(rv:|MSIE|AppleWebKit/|Konqueror|Opera) .* " |

#extract first 16 bits of ip & user_agent
sed 's/\([0-9]*\.[0-9]*\)\.[0-9]*\.[0-9]* .*"\([^"]*\)"$/\1\t\2/' |

#sort by agent, then by ip net
sort -k2 -k1,1 |

#merge and count all requests from same user agent at a particular net
uniq -c |

#ignore single requests from browsers?
grep -vE " 1 .*(rv:|MSIE|AppleWebKit/|Konqueror|Opera).*" |

#ignore bots
grep -vE -f agents_to_ignore |

#Merge reader variants
sed '
s/\([^\t]\)\t.*Firefox.*/\1\tFirefox/;
s/\([^\t]\)\t.*MSIE 7.0.*/\1\tIE7/;
s/\([^\t]\)\t.*Opera.*/\1\tOpera/;
s/\([^\t]\)\t.*Akregator.*/\1\tAkregator/;
s/\([^\t]\)\t.*Thunderbird.*/\1\tThunderbird/;
s/\([^\t]\)\t.*Liferea.*/\1\tLiferea/;
s/\([^\t]\)\t.*Google Desktop.*/\1\tGoogle Desktop/;
' |

#select just agent strings
cut -d"`echo -e '\t'`" -f2 |

#group agent strings
sort |

#count number of subscribers using each agent
uniq -c |

#uniquely identify different feeds read by google
sed 's/\(.*\)\(feedfetcher.html\)\(.*\)id=\([0-9]*\).*/\1\2.\4\3/' |

#move subscriber counts of online readers to first column
sed 's/ *[0-9]* .*\(http[^;]*\).* \([0-9]*\) subscriber.*/ \2 \1/' |

#merge agents again, in case there were increasing subscribers during week
uniq -f1 |

#sort by subscriber numbers
sort -k1,1n |

#right align numbers
sed "s/^/      /; s/ *\([ 0-9]\{7,\}\) \([^ ].*\)/\1 \2/" |

#truncate lines to 80 chars
sed "s/\(.\{80\}\).*/\1/" #note $COLUMNS not exported

So running the above script takes only milliseconds on my 700MB access_log, and produces the following,
which one could filter further, for example by adding up the subscriber numbers (a short example of this follows the listing below).
      1 Abilon
      1 AppleSyndication/54
      1 GreatNews/1.0
      1 Hatena RSS/0.3 (http://r.hatena.ne.jp)
      1 Mozilla/3.01 (compatible;)
      1 Raggle/0.4.4 (i486-linux; Ruby/1.8.5)
      1 Rome Client (http://tinyurl.com/64t5n) Ver: 0.7
      1 Rome Client (http://tinyurl.com/64t5n) Ver: 0.9
      1 SharpReader/0.9.7.0 (.NET CLR 1.1.4322.2407; WinNT 5.1.2600.0)
      1 Snownews/1.5.7 (Linux; en_US; http://snownews.kcore.de/)
      1 Vienna/2.2.1.2210
      1 Zhuaxia.com 2 Subscribers
      1 panscient.com
      1 radianrss-1.0
      1 topicblogs/0.9
      2 AideRSS/1.0 (aiderss.com); 2 subscribers
      2 BuzzTracker/1.01 +http://www.buzztracker.com
      2 Mozilla/5.0 (Sage)
      2 Opera
      2 RSSOwl/1.2.4 2007-11-26 (Windows; U; de)
      2 xianguo 1 subscribers
      2 http://www.simplyheadlines.com
      3 Feedreader 3.11 (Powered by Newsbrain)
      3 IE7
      3 Mozilla/4.0 (compatible;)
      3 Thunderbird
      4 Liferea
      5 Mozilla/3.0 (compatible)
      7 http://www.netvibes.com/
      8 http://www.bloglines.com
      9 Akregator
     55 Firefox
     55 http://www.rojo.com/corporate/help/agg/
     82 Google Desktop
    100 http://www.google.com/feedfetcher.html

The above shows a few interesting things, for me at least. The first is that there are over 400 subscribers to my "blog", whereas I would have guessed there were around 20. This is especially surprising as I only post page titles and descriptions in my feed. One can also see how popular the Google products are for reading blogs, and how unpopular Mozilla Thunderbird is, which is the feed reader I use.
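As mentioned above, one could total those subscriber numbers. A minimal hedged sketch, assuming the report above is produced by a script saved as ./subscribers (the few entries that still have an "N subscribers" suffix embedded in the agent string would need extra handling):

#sum the first column of the report to estimate total subscribers
./subscribers | awk '{total += $1} END {print total}'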
© Dec 17 2007