Access Log Reports -- Top Entry and Exit Pages

Naveen

Hello,
I'm working on the access logs of our website (approximately 4-6 GB
each day) and generating various kinds of reports. So far I've done
referrer and referring-URL reports, a hits report and a couple of others.
Now I have to make one for the top entry pages and top exit pages of the
website, and also one for the most common paths through it, i.e. the
complete stay history of a particular user on the site: where he entered,
which page he went to next, and so on, until he exited. I know programs
like AWStats can do the job for me, but I need to do it from scratch.
I tried to look for a CPAN module for the entry and exit report but
couldn't find one. I feel that the users' IPs have to be tracked and some
kind of logic applied to calculate this, but I don't know where to start.
Any pointers would be great.
Thanks
 
l v

Naveen said:
I'm working on the access logs of our website (approximately 4-6 GB
each day) and generating various kinds of reports. [...] I know programs
like AWStats can do the job for me, but I need to do it from scratch.
I tried to look for a CPAN module for the entry and exit report but
couldn't find one. I feel that the users' IPs have to be tracked and some
kind of logic applied to calculate this, but I don't know where to start.
Any pointers would be great.
Thanks

Since you are rolling your own solution, you might want to use hashes,
or hashes of hashes (HoHs), keyed first by client IP address and then by
URI stem and query. The client IP address should be in your web logs in
addition to the URI info; if it isn't, you need to have it added by your
webmaster. Be mindful of those surfing from behind a proxy server.
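
For what it's worth, here is a minimal sketch of that hash-of-hashes
idea. It reads the log from standard input and assumes Apache
common/combined log format for the field positions; nothing in it is
taken from your actual logs, so adjust the regex to suit.

#!/usr/bin/perl
use strict;
use warnings;

# Count requests per client IP and per URI, reading the log line by line.
my %hits;    # HoH: $hits{$ip}{$uri} = number of requests

while (my $line = <>) {
    # Client IP is the first field; the request line is the quoted
    # "METHOD /uri HTTP/x.x" part.  Adjust for your own log format.
    next unless $line =~ /^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)/;
    my ($ip, $uri) = ($1, $2);
    $hits{$ip}{$uri}++;
}

# Report each client's most requested URIs
for my $ip (sort keys %hits) {
    print "$ip\n";
    for my $uri (sort { $hits{$ip}{$b} <=> $hits{$ip}{$a} } keys %{ $hits{$ip} }) {
        printf "  %6d  %s\n", $hits{$ip}{$uri}, $uri;
    }
}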

Len
 
Naveen

Yes, I do have the URI info in the log files, and I too guessed that it
would be required in addition to the client IP and the time of access of
each resource. But sorry for sounding naive (I'm less than 4 months into
Perl, in fact into programming): my biggest problem here is not knowing
what data I need but knowing how to use it. To be precise, I'm aware that
hashes need to be built and so on, but I'm at my wit's end regarding what
logic to apply in building such hashes or HoHs. Top entry and exit pages
can only be extracted from a list of all the entry and exit pages, which
will certainly be a fraction of all the access log entries. How do I
identify the entry pages and exit pages? Do I use the time at which an IP
first accessed any resource and last accessed a resource?
I hope I've made myself clear enough.
Thanks
 
Anno Siegel

Naveen said:
Hello,
I'm working on the access logs of our website (approximately 4-6 GB
each day) and generating various kinds of reports. So far I've done
referrer and referring-URL reports, a hits report and a couple of others.
Now I have to make one for the top entry pages and top exit pages of the
website, and also one for the most common paths through it, i.e. the
complete stay history of a particular user on the site: where he entered,
which page he went to next, and so on, until he exited. I know programs

[...]

You say that as if "entry to" and "exit from" a website were well-defined
events that can be identified in the log. But they aren't. As far as
the log file goes, every request is a single, isolated event. If a request
from the same machine reaches you again, you can't even be sure it's
the same user, and even assuming it is, that doesn't mean they "stayed on
your site" (whatever that means) for the time in between.

Unless your server establishes something like a session for every user,
with some kind of login and logout, you will have a hard time defining
what an entry and an exit are.

Anno
 
Naveen

Anno said:
[...]

Unless your server establishes something like a session for every user,
with some kind of login and logout, you will have a hard time defining
what an entry and an exit are.

Anno

No, we don't have a session-tracking mechanism such as a login and
logout. Moreover, I have only the access logs at my disposal to achieve
this. I understand that what you said is completely true, but isn't it
possible to track entry and exit pages based on the data from the access
logs alone? I'm sure I have seen statistics that report exactly that!
 
Brian Wakem

Naveen said:
No, we don't have a session-tracking mechanism such as a login and
logout. Moreover, I have only the access logs at my disposal to achieve
this. I understand that what you said is completely true, but isn't it
possible to track entry and exit pages based on the data from the access
logs alone? I'm sure I have seen statistics that report exactly that!


Webalizer does a reasonable job of finding entry and exit pages. I
wouldn't attempt to write my own; programs like Webalizer are designed
for this purpose and have been developed over a number of years.

You say your log files are 4-6 GB apiece. You are going to have to be
very careful about what you store in memory at any one time or you'll
bring the machine to its knees. Again, use software that is designed to
cope with such things.
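
If you do end up processing files that size yourself, the usual trick is
to stream the log line by line and keep only small aggregate hashes in
memory, never the raw lines. A rough sketch, assuming a file called
access.log and an Apache-style request line (both assumptions, not
details from this thread):

#!/usr/bin/perl
use strict;
use warnings;

# Stream a large log without slurping it; only per-page counters are kept.
my %page_hits;

open my $fh, '<', 'access.log' or die "Cannot open access.log: $!";
while (my $line = <$fh>) {               # one line at a time, constant memory
    next unless $line =~ /"(?:GET|POST|HEAD) (\S+)/;
    $page_hits{$1}++;
}
close $fh;

# Print the 20 most requested pages
my @top = (sort { $page_hits{$b} <=> $page_hits{$a} } keys %page_hits)[0 .. 19];
printf "%8d  %s\n", $page_hits{$_}, $_ for grep { defined } @top;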
 
Naveen

Webalizer does a reasonable job of finding entry and exit pages. I
wouldn't attempt to write my own; programs like Webalizer are designed
for this purpose and have been developed over a number of years.

I know Webalizer and other similar programs can do what I want. But the
point is that I need to implement an in-house application in Perl that
does whatever is demanded from time to time, nothing more and nothing
less. Webalizer and the others produce a lot of reports with lots of
data, most of which would be immaterial to us. Also, we need to customise
the reports to our requirements, which may not always match what those
packages offer. Moreover, programs like AWStats, which are written in
Perl, manage to do the same thing, and pretty well at that. So, also out
of academic interest, it would be great to learn the tricks of the trade.
You say your log files are 4-6 GB apiece. You are going to have to be
very careful about what you store in memory at any one time or you'll
bring the machine to its knees. Again, use software that is designed to
cope with such things.

Actually, we have log files for a single day totalling 4-6 GB, but not
as a single file. The site is distributed across more than twenty
servers, and every server has its own log file for the day, so I combine
all the log files and run the program on the whole. Thanks for the
suggestion about being cautious, but I'm pretty well through the initial
hiccups regarding memory and resource management when handling huge data.
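
One caveat when combining the per-server files: simple concatenation
loses chronological order, and the order matters once you start pairing
up a client's first and last requests. A rough sketch of a memory-friendly
merge, assuming each per-server log is already sorted by time, standard
Apache timestamps, and a server*.log naming scheme (all assumptions, not
details from this thread):

#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Merge several per-server logs into one chronologically ordered stream,
# holding only one pending line per file in memory.
my %month = do { my $i = 0; map { $_ => $i++ } qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec) };

sub stamp {    # "[10/Oct/2005:13:55:36 +0530]" -> epoch seconds (UTC)
    my ($line) = @_;
    return 0 unless $line =~ m{\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) ([+-])(\d\d)(\d\d)\]};
    my ($d, $mon, $y, $h, $m, $s, $sign, $oh, $om) = ($1, $2, $3, $4, $5, $6, $7, $8, $9);
    my $t   = timegm($s, $m, $h, $d, $month{$mon}, $y);
    my $off = ($oh * 3600 + $om * 60) * ($sign eq '-' ? -1 : 1);
    return $t - $off;
}

my @streams;
for my $file (glob 'server*.log') {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $first = <$fh>;
    push @streams, { fh => $fh, line => $first, key => stamp($first) } if defined $first;
}

while (@streams) {
    # Emit the oldest pending line, then refill from that file
    my ($oldest) = sort { $a->{key} <=> $b->{key} } @streams;
    print $oldest->{line};
    my $next = readline $oldest->{fh};
    if (defined $next) {
        @{$oldest}{qw(line key)} = ($next, stamp($next));
    } else {
        @streams = grep { $_ != $oldest } @streams;
    }
}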
 
Alan J. Flavell

Webalizer does a reasonable job of finding entry and exit pages.

This is seriously off-topic here, which is a recipe for misleading
answers and lack of proper refutation. So no-one reading this should
really take any positive results away from here without properly
confirming them with more-reliable sources.

But without detailed qualification of what you would be prepared to
accept as a "reasonable job", and independent verification of the
results against some deterministic benchmark, I'd say such results
would be only slightly more convincing than a Ouija board.

programs like Webalizer are designed for this purpose and have been
developed over a number of years.

There is a well-reputed log analyzer for Apache and similar logs:

http://www.analog.cx/

It has been "developed over a number of years" too.

See: http://www.analog.cx/docs/webworks.html

As its author says (but please, read the full story, not only this
conclusion):

The bottom line is that HTTP is a stateless protocol. That means that
people don't log in and retrieve several documents: they make a
separate connection for each file they want. And a lot of the time
they don't even behave as if they were logged into one site. The
world is a lot messier than this naïve view implies. That's why
analog reports requests, i.e. what is going on at your server, which
you know, rather than guessing what the users are doing.


Some other analyzers respond to user demands by applying guesswork to
produce the answers which the users say they want. This makes the
users very happy, and in some cases even induces them to part with
serious money for the software; but does not amuse a professional
statistician. "Hence or otherwise deduce...".
 
l v

Naveen said:
[...]
Top entry and exit pages can only be extracted from a list of all the
entry and exit pages, which will certainly be a fraction of all the
access log entries. How do I identify the entry pages and exit pages? Do
I use the time at which an IP first accessed any resource and last
accessed a resource?
I hope I've made myself clear enough.
Thanks

You are clear. However, I only responded to the Perl aspects of your
question, since this group is for assistance with the Perl language.

<*not*_related_to_Perl>
Identifying entry and exit pages from web logs is not an exact science.
What defines an exit page? The last URI for an IP in the log? What if
that same IP is in the next day's log? Is the exit page based on the
time since the last hit? 10 minutes? 20 minutes?

Is the entry page a hit whose referer is not from one of your sites?

</*not*_related_to_Perl>


<related_to_Perl>
My first quick thought is to create a hash of arrays. The hash is keyed
by IP; the first element of the array is whatever you define as the
entry page, and the second element is whatever you define as the exit
page (a sketch along these lines follows below).

While you are coding to track entry and exit pages, post a small snippet
of your code showing your Perl problem, according to the posting
guidelines of this newsgroup, and this group will chomp at the bit to
assist.
</related_to_Perl>
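
Along those lines, here is a rough sketch of turning a chronologically
ordered log into entry/exit tallies. The 30-minute idle timeout and the
Apache-style field layout are assumptions layered on top of the
hash-keyed-by-IP idea, not anything established in this thread:

#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Approximate entry and exit pages: a "visit" is a run of requests from
# one IP with no gap longer than $TIMEOUT seconds.
my $TIMEOUT = 30 * 60;

my %month = do { my $i = 0; map { $_ => $i++ } qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec) };
my (%visit, %entry, %exit);    # open visit per IP, entry-page tally, exit-page tally

while (my $line = <>) {
    next unless $line =~ m{^(\S+) \S+ \S+ \[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) [^\]]+\] "(?:GET|POST|HEAD) (\S+)};
    my ($ip, $d, $mon, $y, $h, $m, $s, $uri) = ($1, $2, $3, $4, $5, $6, $7, $8);
    my $time = timegm($s, $m, $h, $d, $month{$mon}, $y);   # offset ignored; gaps still compare

    if (my $v = $visit{$ip}) {
        if ($time - $v->{last_time} > $TIMEOUT) {
            # the previous visit has ended: record its entry and exit pages
            $entry{ $v->{entry} }++;
            $exit{ $v->{exit} }++;
            $visit{$ip} = { entry => $uri, exit => $uri, last_time => $time };
        }
        else {
            @{$v}{qw(exit last_time)} = ($uri, $time);
        }
    }
    else {
        $visit{$ip} = { entry => $uri, exit => $uri, last_time => $time };
    }
}

# Close the visits still open at the end of the log
for my $v (values %visit) {
    $entry{ $v->{entry} }++;
    $exit{ $v->{exit} }++;
}

# Top 10 entry and exit pages
for my $report ([ 'Top entry pages', \%entry ], [ 'Top exit pages', \%exit ]) {
    my ($title, $tally) = @$report;
    print "$title\n";
    my @top = (sort { $tally->{$b} <=> $tally->{$a} } keys %$tally)[0 .. 9];
    printf "  %6d  %s\n", $tally->{$_}, $_ for grep { defined } @top;
    print "\n";
}

As Anno and the analog documentation point out, any such report is a
guess about user behaviour; the timeout and the one-IP-equals-one-user
assumption are exactly the kind of guesswork those posts warn about.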
 
l v

Naveen wrote:

[snip]
but isn't it possible to track entry and exit pages based on the data
from the access logs alone? I'm sure I have seen statistics that report
exactly that!

They are making assumptions about how the web user surfs. For example,
an exit page is the last page an IP address, or session ID, accessed
within the last 10 minutes, or something like that.

Len
 
l v

Naveen wrote:
[snip]
which may not always match what those packages offer. Moreover, programs
like AWStats, which are written in Perl, manage to do the same thing,
and pretty well at that. So, also out of academic interest, it would be
great to learn the tricks of the trade.

Then download AWStats, crack open the code and the modules, and learn
their tricks for this topic.

Len
 
Naveen

l said:
They are making assumptions about how the web user surfs. For example,
an exit page is the last page an IP address, or session ID, accessed
within the last 10 minutes, or something like that.

Len

I got it, Len. Maybe I too have to make assumptions (which may produce
results miles away from reality) if the job has to be done.

And thanks, Len, for pointing us to the "How the web works" link on the
analog site. It really cleared up a few points.
 
Naveen

Naveen said:
I got it, Len. Maybe I too have to make assumptions (which may produce
results miles away from reality) if the job has to be done.

Oops, a goof-up! I meant IV up here instead of Len.
 
axel

Naveen said:
[...]
How do I identify the entry pages and exit pages? Do I use the time at
which an IP first accessed any resource and last accessed a resource?
I hope I've made myself clear enough.

I recently wrote something similar... although it is for a rarely
visited site and would need to be altered for your log size.

You would need to pick up the first and last entries for a particular
client and probably rearrange what is printed.



#!/usr/bin/perl

# Check a current log by retrieving it from the site and processing it

$|++;

use strict;
use warnings;

use Socket;
use Text::ParseWords;
use LWP::UserAgent;
use HTTP::Request;

# Hardwired

my $curr_log = 'XXX';
my $username = 'XXX';
my $password = 'XXX';

# Global

my %paths;        # Path hits
my %ips;          # IP hits
my %ips_hosts;    # Hostname -> reference to that IP's hit count
my %ips_to_hosts; # IP address -> hostname

# Header

print "Content-type: text/plain\n\n";

# Get logfile

$curr_log .= $ARGV[0] if $ARGV[0];
my $r_log = get_log();

print "Stats\n\n";
print scalar @$r_log;
print " records\n\n";

print "Raw Log\n\n";

foreach (@$r_log) {
    # Throw out spurious (oversized) lines immediately - must still
    # investigate the seg fault they caused
    next if length > 255;

    # Split up the data
    chomp;
    my ($ip, $ident, $authuser, $date, $offset, $request, $result, $size, $ref, $browser)
        = quotewords('\s', 0, $_);
    next unless defined $request && defined $result;
    my ($method, $path, $http) = split / /, $request;
    next unless defined $path;

    # Only valid 200 requests for now
    next if $result ne '200';

    # Ignore graphics, css, js
    next if $path =~ /\.(?:jpg|gif|png|ico|js|css)$/;

    # Some spurious http:// requests - to investigate
    next if $path =~ /^http/;

    # Stuff some data into hashes
    unless (exists $ips_to_hosts{$ip}) {
        my $hostname = get_hostname($ip) || $ip;
        $ips_to_hosts{$ip} = $hostname;
        $ips_hosts{$hostname} = \$ips{$ip};
    }
    $ips{$ip}++;
    $paths{$path}++;

    $date =~ s/\[//;
    $date =~ s/:/ /;    # separate the date from the time
    print "$ips_to_hosts{$ip}\t$path\t$date\n";
}
print "\n\n";

# Print some totals

print "Pages requested\n\n";

for my $k (sort keys %paths) {
    print $k,
          ' ' x (30 - length $k), "\t",
          $paths{$k}, "\n";
}
print "\n\n";

print "Calling Hosts\n\n";

for my $host (sort dom_sort values %ips_to_hosts) {
    print $host,
          ' ' x (40 - length $host), "\t",
          ${ $ips_hosts{$host} }, "\n";
}

sub dom_sort {
    # Sort by domain, most significant part first; bare IPs sort last
    no warnings 'uninitialized';
    my @dom_a = reverse split /\./, $a;
    my @dom_b = reverse split /\./, $b;
    $dom_a[0] = '~' . $dom_a[0] if $dom_a[0] =~ /^\d/;
    $dom_b[0] = '~' . $dom_b[0] if $dom_b[0] =~ /^\d/;
    return $dom_a[0] cmp $dom_b[0]
        || $dom_a[1] cmp $dom_b[1]
        || $dom_a[2] cmp $dom_b[2]
        || $dom_a[3] cmp $dom_b[3];
}

sub get_hostname {
    my $ip     = shift;
    my $ipaddr = inet_aton($ip);
    return unless $ipaddr;
    my $hostname = gethostbyaddr($ipaddr, AF_INET);
    return $hostname;
}

sub get_log {
    my $ua = LWP::UserAgent->new;
    $ua->agent("NOM Qget/0.1 ");

    # Create a request
    my $req = HTTP::Request->new(GET => $curr_log);
    $req->authorization_basic($username, $password);

    # Pass request to the user agent and get a response back
    my $res = $ua->request($req);

    # Check the outcome of the response
    unless ($res->is_success) {
        die "Unable to collect log from server: " . $res->status_line;
    }

    my @inarr = split /\n/, $res->content;
    return \@inarr;
}
 
Naveen

I recently wrote something similar... although it is for a rarely
visited site and would need to be altered for your log size.

You would need to pick up the first and last entries for a particular
client and probably rearrange what is printed.

[...]

Thanks... I'll give it a try
 
