Apache log page counter?


Tuxedo

I would like to implement a web page counter that measures 'unique' visitors
per page and per IP in 24-hour cycles: if several visits to a particular
page come from the same IP within a 24-hour time frame, they count as one
visitor to that page, but if the same IP returns to the same page, say,
25 hours after its last counted hit, the count for that page can increment
again. In reality, that may represent a repeat visitor or another visitor
who has been assigned the same IP.

Depending on the type of traffic and on how different connection providers
rotate the IPs they assign, this kind of solution can at best provide a
rough approximation of the number of unique visitors.

The source data is rotating Apache access logs in the format:
192.114.71.13 - - [05/Jan/2014:19:10:19 +0100] "GET / HTTP/1.1" 302 186
"http:.../ref_if_transmitted.html" "Mozilla, browser version etc...."

Also, I would prefer to avoid external services, free or otherwise: they
collect data, or simply aren't what I need, and connecting to their servers
can slow a site down while needlessly sharing visitor data with third
parties via cookies and the like. All of that works against the real goal,
which is a positive user experience that ideally helps build up traffic in
the first place.

Any ideas, including home-grown, open-source, Perl-based logfile processing
solutions, would be most welcome!

Many thanks,
Tuxedo
 

Dr.Ruud

Tuxedo wrote:

I would like to implement a web page counter that measures 'unique' visitors
per page and per IP in 24-hour cycles [...] Any ideas, including home-grown,
open-source, Perl-based logfile processing solutions, would be most welcome!

Many hundreds of visitors can share the same IP-nr.
Other visitors can have a fresh IP-nr for every request.

So you need to store some ID in the client-side cookie.
And store further details server-side (consider Apache Cassandra).

That client-side cookie-ID needs protection against tampering and
duplication. Combine it with a browser fingerprint, and create a fresh
ID if the fingerprint no longer matches well enough.

If you implement this all properly, you can use the factoids
dynamically, meaning that you can make them directly influence the
content of the served pages.
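
A rough sketch of the tamper-protection part in Perl, using an HMAC over
the ID plus a fingerprint hash (the secret, and what goes into the
fingerprint, are placeholders here rather than a recommendation):

use strict;
use warnings;
use Digest::SHA qw(sha256_hex hmac_sha256_hex);

my $secret = 'server-side secret, never sent to the client';   # placeholder

# Fingerprint: a hash of whatever client attributes you decide to combine.
sub fingerprint {
    my ($user_agent, $accept_lang) = @_;
    return sha256_hex(join '|', $user_agent // '', $accept_lang // '');
}

# Issue a cookie value that binds an ID to the current fingerprint.
sub make_cookie {
    my ($id, $fp) = @_;
    return join '.', $id, $fp, hmac_sha256_hex("$id.$fp", $secret);
}

# Accept a returned cookie only if the HMAC is intact and the fingerprint
# still matches; otherwise the caller should issue a fresh ID.
sub check_cookie {
    my ($cookie, $current_fp) = @_;
    my ($id, $fp, $mac) = split /\./, $cookie, 3;
    return unless defined $mac;
    return unless hmac_sha256_hex("$id.$fp", $secret) eq $mac;
    return unless $fp eq $current_fp;
    return $id;
}

How loosely the fingerprint may drift before a fresh ID is issued, and
where the IDs and page hits end up server-side, is left open here.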
 

Tuxedo

Ben Morrow said:
Quoth Tuxedo:
I would like to implement a web page counter that measures 'unique'
visitors per page and per IP in 24-hour cycles: if several visits to a
particular page come from the same IP within a 24-hour time frame, they
count as one visitor to that page. [...]

The source data is rotating Apache access logs in the format:
192.114.71.13 - - [05/Jan/2014:19:10:19 +0100] "GET / HTTP/1.1" 302 186
"http:.../ref_if_transmitted.html" "Mozilla, browser version etc...."

Doesn't sound too hard: parse the log lines, extract the IP, page and
timestamp, and throw away any pair that is less than 24h newer than an
equivalent pair. What have you tried so far?
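
As a starting point, an untested sketch along these lines would do it
(it assumes the combined format you showed and reads the logs given on
the command line or on stdin):

#!/usr/bin/perl
use strict;
use warnings;
use Time::Piece;

my %last_counted;   # "ip page" => epoch seconds of the last counted hit
my %unique;         # page => approximate unique-visitor count

while (my $line = <>) {
    next unless $line =~ m{^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)};
    my ($ip, $stamp, $page) = ($1, $2, $3);

    # "05/Jan/2014:19:10:19 +0100": the zone offset is dropped for this
    # sketch, since only the 24h gap between hits matters.
    (my $local = $stamp) =~ s/ [+-]\d{4}$//;
    my $t = eval { Time::Piece->strptime($local, '%d/%b/%Y:%H:%M:%S') }
        or next;

    my $key = "$ip $page";
    if (!exists $last_counted{$key}
        or $t->epoch - $last_counted{$key} >= 24 * 60 * 60) {
        $unique{$page}++;
        $last_counted{$key} = $t->epoch;
    }
}

printf "%6d  %s\n", $unique{$_}, $_ for sort keys %unique;

Carrying the %last_counted state across rotated log files is not handled
here; you would need to persist it between runs.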

Ben

So far I've not tried anything. I'm just looking into and testing various
off-the-shelf solutions, such as Webalizer.org, Analog.cx, W3Perl.com,
Piwik.org etc. Perhaps some can do what I need, as well as much more....

Tuxedo
 

Tuxedo

Tuxedo wrote:

[...]
So far I've not tried anything. I'm just looking into and testing various
off-the-shelf solutions, such as Webalizer.org, Analog.cx, W3Perl.com,
Piwik.org etc.

If anyone knows of other tried-and-tested systems, please post links here.

Thanks,
Tuxedo
 

Rainer Weikusat

Ben Morrow said:
Quoth Tuxedo:
I would like to implement a web page counter that measures 'unique' visitors
per page and per IP in 24-hour cycles: if several visits to a particular
page come from the same IP within a 24-hour time frame, they count as one
visitor to that page. [...]

The source data is rotating Apache access logs in the format:
192.114.71.13 - - [05/Jan/2014:19:10:19 +0100] "GET / HTTP/1.1" 302 186
"http:.../ref_if_transmitted.html" "Mozilla, browser version etc...."

Doesn't sound too hard: parse the log lines, extract the IP, page and
timestamp, and throw away any pair that is less than 24h newer than an
equivalent pair. What have you tried so far?

I'd like to second Dr Ruud on that -- just using the IPs is not a
suitable way to count visitors, because there are loads and loads of
networks where many users share one external IP, e.g., all 'public
access' WiFi networks, all reasonably small corporate networks, and all
households with more than one member using a device capable of accessing
the internet. Also, I know from experience that some phone companies
put their customers into RFC1918-networks when using 3G (or 4G). Because
of this, referring to the number determined in this way as 'highly
approximate' is just a more scientific-sounding variant of 'totally
bogus', i.e., whatever the actual visitor number happens to be, it is
certainly not this number.

http://hitchhikers.wikia.com/wiki/Bistromatics

see also

I still don't get locking. I just want to increment the number
in the file. How can I do this?

Didn't anyone ever tell you web-page hit counters were useless?
[perldoc perlfaq5]
 

Ben Bacarisse

Rainer Weikusat said:
Ben Morrow said:
Quoth Tuxedo:
I would like to implement a web page counter that measures 'unique' visitors
per page and per IP in 24-hour cycles: if several visits to a particular
page come from the same IP within a 24-hour time frame, they count as one
visitor to that page. [...]

The source data is rotating Apache access logs in the format:
192.114.71.13 - - [05/Jan/2014:19:10:19 +0100] "GET / HTTP/1.1" 302 186
"http:.../ref_if_transmitted.html" "Mozilla, browser version etc...."

Doesn't sound too hard: parse the log lines, extract the IP, page and
timestamp, and throw away any pair that is less than 24h newer than an
equivalent pair. What have you tried so far?

I'd like to second Dr Ruud on that -- just using the IPs is not a
suitable way to count visitors, because there are loads and loads of
networks where many users share one external IP, e.g., all 'public
access' WiFi networks, all reasonably small corporate networks, and all
households with more than one member using a device capable of accessing
the internet.

My historical experience of this technique has not been quite so
negative, though I don't use it anymore, partly because the "client"
stopped asking for it and partly because it's getting less and less
reliable.

If you have a site with only a few thousand visits a day, the number
that will come from shared IP addresses is not going to be very large.
It's going to be larger than "random" since people on public WiFi or in
small businesses may well be browsing together or all working on
something that causes them to visit the same site.

A while ago I did a test using those clients that provide a referrer
with each request. The idea was to see how many chains of page visits
were inconsistent with the referrer information, when it was present. I
don't still have the data (and it was long enough ago to be useless now,
due to the growth in mobile), but it showed very little interleaving of
"shared IP" visits. Of course, this can't be directly extrapolated to
clients that don't provide referrer info, because they may well be more
common in shared-IP environments.
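
The check itself needs little more than something like this (an untested
sketch; the hostname is a placeholder, and the log format is the
combined one quoted above):

use strict;
use warnings;

# Count requests whose Referer names one of our own pages that the same
# IP has *not* already fetched earlier in the log -- a hint that more
# than one person may be sitting behind that address.
my $our_host = 'www.example.org';    # placeholder for the site's own host

my %seen;                            # ip => { page => 1 }
my ($checked, $mismatch) = (0, 0);

while (my $line = <>) {
    next unless $line =~
        m{^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*" \d{3} \S+ "([^"]*)"};
    my ($ip, $page, $referrer) = ($1, $2, $3);

    if ($referrer =~ m{^https?://\Q$our_host\E(/\S*)}) {
        my $ref_page = $1;
        $checked++;
        $mismatch++ unless $seen{$ip}{$ref_page};
    }
    $seen{$ip}{$page} = 1;
}

print "$mismatch of $checked internally-referred requests had no earlier matching hit\n";

Requests without a Referer simply fall through, which matches the caveat
above.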

The reason this matters (or used to, at least) is that many people
demand stats no matter how often you tell them they are just semi-random
data. The site in question was a partially public-funded charity, and
the public funding body demanded all sorts of data. Sometimes you can
just provide unidentified data -- on one site I used to simply report
the relative size of the log files and everyone was happy with that --
but other times someone else has reported "visitors" and you are pretty
much obliged to do something along those lines.

Also, I know from experience that some phone companies
put their customers into RFC1918-networks when using 3G (or 4G).

Just before I stopped doing this sort of logging, I started to see quite
a few visits from mobile devices where multiple IP addresses were
involved in what was very clearly a single "visit". This is something
that I'd not seen before. Is this a common feature of mobile IP
networks? The more common manifestation of RFC1918-networks is one IP
address being used for multiple visits. Either way, large scale use of
mobile networks is going to make the technique nearly useless. I don't
think I'd even try these days.

<snip>
 

Peter J. Holzer

Just before I stopped doing this sort of logging, I started to see quite
a few visits from mobile devices where multiple IP addresses were
involved in what was very clearly a single "visit". This is something
that I'd not seen before. Is this a common feature of mobile IP
networks? The more common manifestation of RFC1918-networks is one IP
address being used for multiple visits.

If you have a lot of customers behind NAT, you will have more
than 64 k connections (maybe even more than 64 k connections to the same
server!), so you need to use more than 1 public IP address. Depending on
how connections are mapped to public IP addresses, different connections
from the same client may wind up with different public IP addresses.
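
As a rough back-of-the-envelope illustration (invented numbers, only to
give a feel for the scale): a single public IPv4 address has only about
65,000 source ports, so once the total number of flows -- or, for a
popular destination, the number of flows to that one destination --
climbs past that, the NAT has to spread clients over several addresses.
Half a million subscribers with just ten concurrent connections each is
already five million flows.
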
Either way, large scale use of mobile networks is going to make the
technique nearly useless. I don't think I'd even try these days.

It may become useful again with IPv6, even with Privacy Extensions.

hp
 

Charlton Wilbur

BB> The reason this matters (or used to, at least) is that many
BB> people demand stats no matter how often you tell them they are
BB> just semi-random data. The site in question was a partially
BB> public-funded charity, and the public funding body demanded all
BB> sorts of data. Sometimes you can just provide unidentified data
BB> -- on one site I used to simply report the relative size of the
BB> log files and everyone was happy with that -- but other times
BB> someone else has reported "visitors" and you are pretty much
BB> obliged to do something along those lines.

Some years back I worked for a startup that had this kind of problem.
I realized a few things:

A good director of marketing understands both the needs of the
particular business for data and the limitations of the data gathering
methods, and figures out ways to connect them. Good marketing is
data-driven and instinct-driven in roughly equal measures because of the
limitations of data gathering.

From an engineer's point of view, web analytics data is "semi-random."
From the viewpoint of people who have done advertising on TV or in
print, web analytics data is insanely precise and detailed.

After the director of marketing has departed (which is a tale in and of
itself) the executives have no idea what data they want or need, but
they will ask for it in the vaguest of terms while demanding that it be
simultaneously objective and as favorable to what they want to do as
possible.

Without a good director of marketing and a competent CTO, a startup is
doomed to failure.

Charlton
 

Tuxedo

Dr.Ruud wrote:

[...]
Many hundreds of visitors can share the same IP-nr.
Other visitors can have a fresh IP-nr for every request.

So you need to store some ID in the client-side cookie.
And store further details server-side (consider Apache Cassandra).

That client-side cookie-ID needs protection against tampering and
duplication. Combine it with a browser fingerprint, and create a fresh
ID if the fingerprint no longer matches well enough.

If you implement this all properly, you can use the factoids
dynamically, meaning that you can make them directly influence the
content of the served pages.

Many thanks for pointing out the above facts. I guess the only way to get
semi-reliable unique-visitor data is through cookies combined with browser
fingerprinting, which is something I was hoping to avoid.

Tuxedo
 
