Apache log page counter?

Discussion in 'Perl Misc' started by Tuxedo, Jan 5, 2014.

  1. Tuxedo

    Tuxedo Guest

    I would like to implement a web page counter to measure 'unique' visitors
    per page and per IP in 24-hour cycles: if several visits to a particular
    page come from the same IP within a 24-hour window, they count as one
    visitor to that page, while if the same IP returns to the same page, say,
    25 hours after its previously counted visit, the count for that page can
    increment again. In reality, this may represent either a repeat visitor
    or another visitor who has been assigned the same IP.

    Depending on the type of traffic and on how often different ISPs rotate
    the IP addresses they assign, this kind of solution can at best provide
    a rough approximation of the number of unique visitors.

    The source data is rotated Apache access logs in this (combined) format:
    192.114.71.13 - - [05/Jan/2014:19:10:19 +0100] "GET / HTTP/1.1" 302 186
    "http:.../ref_if_transmitted.html" "Mozilla, browser version etc...."

    Also, I prefer to avoid external services, free or otherwise. They do
    data collection, or simply aren't what I need, and they can slow a site
    down while it connects to their external servers, needlessly sharing
    visitor data with third parties via cookies etc. All of this defeats the
    general purpose of providing a positive user experience, which should
    ideally help build up traffic in the first place.

    Any ideas, including home-grown, open-source, Perl-based logfile
    processing solutions, would be most welcome!

    Many thanks,
    Tuxedo
     
    Tuxedo, Jan 5, 2014
    #1

  2. Dr.Ruud

    Dr.Ruud Guest

    Many hundreds of visitors can share the same IP number, and other
    visitors can have a fresh IP number for every request.

    So you need to store some ID in a client-side cookie, and store further
    details server-side (consider Apache Cassandra).

    That client-side cookie ID needs protection against tampering and
    duplication. Combine it with a browser fingerprint, and create a fresh
    ID if the fingerprint no longer matches well enough.

    If you implement all of this properly, you can use the factoids
    dynamically, meaning that you can make them directly influence the
    content of the served pages.
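
    A minimal sketch of that scheme, assuming a CGI-ish environment; the
    secret, the choice of fingerprint headers, and the exact-match rule
    standing in for "matches well enough" are all illustrative assumptions
    (Digest::SHA is core Perl):

        use strict;
        use warnings;
        use Digest::SHA qw(hmac_sha256_hex sha256_hex);

        my $SECRET = 'a-long-random-server-secret';   # hypothetical

        # Crude fingerprint: hash of headers the browser sends anyway.
        sub fingerprint {
            my (%env) = @_;
            return sha256_hex(join '|',
                $env{HTTP_USER_AGENT}      // '',
                $env{HTTP_ACCEPT_LANGUAGE} // '',
                $env{HTTP_ACCEPT_ENCODING} // '');
        }

        # Cookie value "id.fp.mac": the MAC binds the ID to the
        # fingerprint, so neither can be altered without the secret.
        sub make_cookie {
            my ($id, $fp) = @_;
            return join '.', $id, $fp,
                hmac_sha256_hex("$id.$fp", $SECRET);
        }

        # Keep the old ID only if the cookie is intact and the
        # fingerprint still matches; otherwise mint a fresh one.
        sub visitor_id {
            my ($cookie, %env) = @_;
            my $fp = fingerprint(%env);
            if (defined $cookie) {
                my ($id, $old_fp, $mac) = split /\./, $cookie;
                return $id
                    if  defined $mac
                    and $mac eq hmac_sha256_hex("$id.$old_fp", $SECRET)
                    and $old_fp eq $fp;  # exact match as "good enough"
            }
            return sha256_hex($SECRET . time . rand);  # fresh ID
        }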
     
    Dr.Ruud, Jan 5, 2014
    #2

  3. Tuxedo

    Tuxedo Guest

    So far I've not tried anything; I'm just looking into and testing
    various off-the-shelf solutions, such as Webalizer.org, Analog.cx,
    W3Perl.com, and Piwik.org. Perhaps some of these can do what I need, as
    well as much more.

    Tuxedo
     
    Tuxedo, Jan 6, 2014
    #3
  4. Tuxedo

    Tuxedo Guest

    Tuxedo wrote:

    [...]
    If anyone knows of other tried-and-tested systems, please post any
    links here.

    Thanks,
    Tuxedo
     
    Tuxedo, Jan 6, 2014
    #4
  5. I'd like to second Dr.Ruud on that -- just using the IPs is not a
    suitable way to count visitors, because there are loads and loads of
    networks where many users share one external IP, e.g. all 'public
    access' WiFi networks, all reasonably small corporate networks, and all
    households with more than one member using a device capable of
    'accessing the internet'. Also, I know from experience that some phone
    companies put their customers into RFC 1918 networks when they use 3G
    (or 4G). Because of this, referring to the number determined in this
    way as 'highly approximate' is just a more scientific-sounding variant
    of 'totally bogus', i.e. whatever the actual visitor number happens to
    be, it is certainly not this number.

    http://hitchhikers.wikia.com/wiki/Bistromatics

    see also

    I still don't get locking. I just want to increment the number
    in the file. How can I do this?

    Didn't anyone ever tell you web-page hit counters were useless?
    [perldoc perlfaq5]
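
    For reference, the flock-based increment that FAQ answer goes on to
    recommend looks roughly like this (a sketch; the counter file is
    assumed to already exist and contain a single number):

        use strict;
        use warnings;
        use Fcntl qw(:flock SEEK_SET);

        sub bump_counter {
            my ($file) = @_;    # hypothetical counter file
            open my $fh, '+<', $file or die "open $file: $!";
            flock $fh, LOCK_EX  or die "flock $file: $!";
            my $n = <$fh> // 0;
            chomp $n;
            seek $fh, 0, SEEK_SET or die "seek: $!";
            truncate $fh, 0       or die "truncate: $!";
            print {$fh} $n + 1, "\n";
            close $fh or die "close: $!";  # also releases the lock
            return $n + 1;
        }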
     
    Rainer Weikusat, Jan 6, 2014
    #5
  6. My historical experience of this technique has not been quite so
    negative, though I don't use it anymore, partly because the "client"
    stopped asking for it and partly because it's getting less and less
    reliable.

    If you have a site with only a few thousand visits a day, the number
    that will come from shared IP addresses is not going to be very large.
    It will be larger than "random", though, since people on public WiFi or
    in small businesses may well be browsing together, or all working on
    something that causes them to visit the same site.

    Some while ago I did a test using those clients that provide a referrer
    with each request. The idea was to see how many chains of page visits
    were inconsistent with the referrer information when it was present. I
    no longer have the data (and it was long enough ago to be useless now,
    given the growth in mobile), but it showed very little interleaving of
    "shared IP" visits. Of course, this can't be directly extrapolated to
    clients that don't provide referrer info, because they may well be more
    common in shared-IP environments.
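
    A rough sketch of that kind of consistency check, assuming
    combined-format logs and considering only referrers that point back at
    the site itself (the host pattern is hypothetical and the URL
    normalisation much simplified):

        use strict;
        use warnings;

        my $SITE = qr{^https?://www\.example\.org}i;  # assumed host

        my %last_page;   # per-IP path of the previous request
        my ($checked, $inconsistent) = (0, 0);

        while (my $line = <>) {
            next unless $line =~
                /^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)[^"]*" \d+ \S+ "([^"]*)"/;
            my ($ip, $path, $ref) = ($1, $2, $3);

            # A referral is "consistent" if the referring page is the
            # previous page fetched from that IP.
            if (exists $last_page{$ip} and $ref =~ /$SITE(\/\S*)/) {
                $checked++;
                $inconsistent++ if $1 ne $last_page{$ip};
            }
            $last_page{$ip} = $path;
        }

        printf "%d of %d in-site referrals did not match the per-IP"
             . " chain\n", $inconsistent, $checked;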

    The reason this matters (or used to, at least) is that many people
    demand stats no matter how often you tell them they are just
    semi-random data. The site in question was a partially public-funded
    charity, and the public funding body demanded all sorts of data.
    Sometimes you can just provide unidentified data -- on one site I used
    to simply report the relative size of the log files and everyone was
    happy with that -- but other times someone else has reported "visitors"
    and you are pretty much obliged to do something along those lines.

    Just before I stopped doing this sort of logging, I started to see
    quite a few visits from mobile devices where multiple IP addresses were
    involved in what was very clearly a single "visit". This is something I
    had not seen before. Is it a common feature of mobile IP networks? The
    more common manifestation of RFC 1918 networks is one IP address being
    used for multiple visits. Either way, large-scale use of mobile
    networks is going to make the technique nearly useless. I don't think
    I'd even try these days.

    <snip>
     
    Ben Bacarisse, Jan 6, 2014
    #6
    If you have a lot of customers behind NAT, you will have more than 64k
    connections (maybe even more than 64k connections to the same server!),
    so you need to use more than one public IP address: a TCP source port
    is 16 bits, so a single public address supports at most 65,536
    concurrent connections to any one destination. Depending on how
    connections are mapped to public IP addresses, different connections
    from the same client may wind up with different public IP addresses.

    The technique may become useful again with IPv6, even with Privacy
    Extensions.

    hp
     
    Peter J. Holzer, Jan 6, 2014
    #7
    BB> The reason this matters (or used to, at least) is that many
    BB> people demand stats no matter how often you tell them they are
    BB> just semi-random data. The site in question was a partially
    BB> public-funded charity, and the public funding body demanded all
    BB> sorts of data. Sometimes you can just provide unidentified data
    BB> -- on one site I used to simply report the relative size of the
    BB> log files and everyone was happy with that -- but other times
    BB> someone else has reported "visitors" and you are pretty much
    BB> obliged to do something along those lines.

    Some years back I worked for a startup that had this kind of problem.
    I realized a few things:

    A good director of marketing understands both the needs of the
    particular business for data and the limitations of the data gathering
    methods, and figures out ways to connect them. Good marketing is
    data-driven and instinct-driven in roughly equal measures because of the
    limitations of data gathering.

    From an engineer's point of view, web analytics data is "semi-random."
    From the viewpoint of people who have done advertising on TV or in
    print, web analytics data is insanely precise and detailed.

    After the director of marketing has departed (which is a tale in and of
    itself), the executives have no idea what data they want or need, but
    they will ask for it in the vaguest of terms while demanding that it be
    simultaneously objective and as favorable as possible to what they want
    to do.

    Without a good director of marketing and a competent CTO, a startup is
    doomed to failure.

    Charlton
     
    Charlton Wilbur, Jan 6, 2014
    #8
  9. Tuxedo

    Tuxedo Guest

    Dr.Ruud wrote:

    [...]
    Many thanks for pointing out the above facts. I guess the only way to
    get semi-reliable unique-visitor data is through the use of cookies
    combined with browser fingerprinting, which is something I was hoping
    to avoid.

    Tuxedo
     
    Tuxedo, Jan 6, 2014
    #9
