Apache log page counter?

Discussion in 'Perl Misc' started by Tuxedo, Jan 5, 2014.

  1. Tuxedo

    Tuxedo Guest

    I would like to implement a web page counter that measures 'unique'
    visitors per page and per IP in 24-hour cycles: if several visits to a
    particular page come from the same IP within a 24-hour window, they
    count as one visitor to that page. If the same visitor returns to the
    same page, say 25 hours after the previously counted hit from that IP,
    the count for that page can increment again. In reality, that may
    represent a repeat visitor or another visitor who has been assigned
    the same IP.

    Depending on the type of traffic and on how quickly the various access
    providers rotate the IPs they assign, this kind of solution can at
    best give a rough approximation of the number of unique visitors.

    The source data is rotating Apache access logs in the format:
    192.114.71.13 - - [05/Jan/2014:19:10:19 +0100] "GET / HTTP/1.1" 302 186
    "http:.../ref_if_transmitted.html" "Mozilla, browser version etc...."

    Also, I prefer to avoid external services, free or otherwise. They
    collect data, often aren't quite what I need, and can slow a site down
    by connecting to their external servers, needlessly sharing visitor
    data with third parties via cookies and the like. All of that defeats
    the general purpose of providing a positive user experience, which is
    what should ideally build up traffic in the first place....

    Any ideas, including home-grown open-source Perl-based logfile
    processing solutions, would be most welcome!

    Many thanks,
    Tuxedo
    Tuxedo, Jan 5, 2014
    #1

  2. Dr.Ruud

    Dr.Ruud Guest

    On 2014-01-05 19:47, Tuxedo wrote:
    > I would like to implement a web page counter that measures 'unique'
    > visitors per page and per IP in 24-hour cycles [...]
    >
    > Any ideas, including home-grown open-source Perl-based logfile
    > processing solutions, would be most welcome!


    Many hundreds of visitors can share the same IP-nr.
    Other visitors can have a fresh IP-nr for every request.

    So you need to store some ID in the client-side cookie.
    And store further details server-side (consider Apache Cassandra).

    That client-side cookie-ID needs protection against tampering and
    duplication. Combine it with a browser fingerprint, and create a
    fresh ID if the fingerprint no longer matches well enough.

    If you implement all this properly, you can use the factoids
    dynamically, meaning that you can make them directly influence the
    content of the served pages.
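
    Something along these lines, perhaps -- a minimal sketch of the
    signed-cookie idea, not Dr.Ruud's actual design; the secret handling,
    the ID layout and the choice of fingerprint headers are all
    assumptions:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Digest::SHA qw(hmac_sha256_hex sha256_hex);

        my $secret = 'change-me';  # assumed: kept server-side, outside the docroot

        # Crude browser "fingerprint": hash a few request headers.
        sub fingerprint {
            my ($env) = @_;
            return sha256_hex(join "\0", map { $env->{$_} // '' }
                              qw(HTTP_USER_AGENT HTTP_ACCEPT_LANGUAGE));
        }

        # Issue an ID bound to the fingerprint, signed so tampering shows.
        sub new_visitor_id {
            my ($fp) = @_;
            my $id = sprintf '%x-%04x', time, int rand 0xFFFF;
            return join '.', $id, $fp, hmac_sha256_hex("$id.$fp", $secret);
        }

        # Return the bare ID if the cookie is intact and the fingerprint
        # still matches; undef otherwise, so the caller issues a fresh ID.
        sub check_visitor_id {
            my ($cookie, $fp) = @_;
            my ($id, $old_fp, $mac) = split /\./, $cookie;
            return unless defined $mac;
            return unless $mac eq hmac_sha256_hex("$id.$old_fp", $secret);
            return $old_fp eq $fp ? $id : undef;
        }

    (This does an exact fingerprint comparison; "matches well enough"
    suggests something fuzzier, which the sketch leaves out.)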

    --
    Ruud
    Dr.Ruud, Jan 5, 2014
    #2

  3. Tuxedo

    Tuxedo Guest

    Ben Morrow wrote:

    >
    > Quoth Tuxedo <>:
    > > I would like to implement a web page counter to measure 'unique'
    > > visitors p/page and p/IP in 24 hour cycles, so if several visits to a
    > > particular page comes from a same IP in a 24 hours time frame, it counts
    > > as one visitor to that page.

    > [...]
    > >
    > > The source data is rotating Apache access logs in the format:
    > > 192.114.71.13 - - [05/Jan/2014:19:10:19 +0100] "GET / HTTP/1.1" 302 186
    > > "http:.../ref_if_transmitted.html" "Mozilla, browser version etc...."

    >
    > Doesn't sound too hard: parse the log lines, extract the IP, page and
    > timestamp, and throw away any pair that is less than 24h newer than an
    > equivalent pair. What have you tried so far?
    >
    > Ben
    >


    So far I've not tried anything. I'm just looking into and testing various
    off-the-shelf solutions, such as Webalizer.org, Analog.cx, W3Perl.com,
    Piwik.org etc. Perhaps some can do what I need, as well as much more....
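
    For reference, the kind of one-pass filter Ben describes might look
    roughly like this in Perl -- a sketch only, assuming the combined log
    format shown earlier, with the 24-hour window kept per IP/page pair:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Time::Piece;

        my %last_seen;  # "ip page" => epoch of the last counted hit
        my %count;      # page => approximate unique-visitor count

        while (my $line = <>) {
            # Combined log format: IP ident user [timestamp] "request" ...
            my ($ip, $ts, $req) =
                $line =~ /^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)"/ or next;
            my $page = (split ' ', $req)[1] or next;  # path from "GET /x HTTP/1.1"
            my $when = Time::Piece->strptime($ts, '%d/%b/%Y:%H:%M:%S %z')->epoch;

            # Count a hit only if 24h or more have passed since the last
            # counted hit for this IP/page pair.
            my $key = "$ip $page";
            if (!exists $last_seen{$key} or $when - $last_seen{$key} >= 24 * 3600) {
                $count{$page}++;
                $last_seen{$key} = $when;
            }
        }

        printf "%6d %s\n", $count{$_}, $_ for sort keys %count;

    Run as e.g. "perl uniques.pl access.log*" (the script name is made up).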

    Tuxedo
    Tuxedo, Jan 6, 2014
    #3
  4. Tuxedo

    Tuxedo Guest

    Tuxedo wrote:

    [...]

    > So far I've not tried anything. I'm just looking into and testing various
    > off-the-shelf solutions, such as Webalizer.org, Analog.cx, W3Perl.com,
    > Piwik.org etc.


    If anyone knows of other tried-and-tested systems, please post links
    here.

    Thanks,
    Tuxedo
    Tuxedo, Jan 6, 2014
    #4
  5. Ben Morrow <> writes:
    > Quoth Tuxedo <>:
    >> I would like to implement a web page counter to measure 'unique' visitors
    >> p/page and p/IP in 24 hour cycles, so if several visits to a particular
    >> page comes from a same IP in a 24 hours time frame, it counts as one
    >> visitor to that page.

    > [...]
    >>
    >> The source data is rotating Apache access logs in the format:
    >> 192.114.71.13 - - [05/Jan/2014:19:10:19 +0100] "GET / HTTP/1.1" 302 186
    >> "http:.../ref_if_transmitted.html" "Mozilla, browser version etc...."

    >
    > Doesn't sound too hard: parse the log lines, extract the IP, page and
    > timestamp, and throw away any pair that is less than 24h newer than an
    > equivalent pair. What have you tried so far?


    I'd like to second Dr Ruud on that -- just using the IPs is not a
    suitable way to count visitors, because there are loads and loads of
    networks where many users share one external IP, eg, all 'public
    access' WiFi networks, all reasonably small corporate networks and
    all households with more than one member using a device capable of
    'accessing the internet'. Also, I know from experience that some
    phone companies put their customers into RFC1918 networks when using
    3G (or 4G). Because of this, calling the number determined in this
    way 'highly approximate' is just a more scientific-sounding variant
    of 'totally bogus', ie, whatever the actual visitor number happens to
    be, it is certainly not this number.

    http://hitchhikers.wikia.com/wiki/Bistromatics

    See also this exchange from perldoc perlfaq5:

        "I still don't get locking. I just want to increment the number
        in the file. How can I do this?"

        "Didn't anyone ever tell you web-page hit counters were useless?"
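
    For completeness, the locked counter update the FAQ questioner was
    after might look like this -- a minimal sketch, not the FAQ's
    verbatim code:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Fcntl qw(:flock);

        # Increment a counter file safely; the exclusive lock is the part
        # the FAQ questioner was missing.
        open my $fh, '+<', 'counter.txt' or die "counter.txt: $!";
        flock $fh, LOCK_EX               or die "flock: $!";
        my $n = <$fh>;
        $n = 0 unless defined $n;
        seek $fh, 0, 0                   or die "seek: $!";
        truncate $fh, 0                  or die "truncate: $!";
        print {$fh} $n + 1, "\n";
        close $fh;                       # releases the lock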
    Rainer Weikusat, Jan 6, 2014
    #5
  6. Rainer Weikusat <> writes:

    > Ben Morrow <> writes:
    >> Quoth Tuxedo <>:
    >>> I would like to implement a web page counter to measure 'unique' visitors
    >>> p/page and p/IP in 24 hour cycles, so if several visits to a particular
    >>> page comes from a same IP in a 24 hours time frame, it counts as one
    >>> visitor to that page.

    >> [...]
    >>>
    >>> The source data is rotating Apache access logs in the format:
    >>> 192.114.71.13 - - [05/Jan/2014:19:10:19 +0100] "GET / HTTP/1.1" 302 186
    >>> "http:.../ref_if_transmitted.html" "Mozilla, browser version etc...."

    >>
    >> Doesn't sound too hard: parse the log lines, extract the IP, page and
    >> timestamp, and throw away any pair that is less than 24h newer than an
    >> equivalent pair. What have you tried so far?

    >
    > I'd like to second Dr Ruud on that -- just using the IPs is not a
    > suitable way to count visitors because loads and loads of networks used
    > by many users sharing one external IP exist, eg, all 'public access'
    > WiFi networks, all reasonably small corporate networks and all
    > households with more than one member using a device capable of 'acessing
    > the internet'.


    My historical experience of this technique has not been quite so
    negative, though I don't use it anymore, partly because the "client"
    stopped asking for it and partly because it's getting less and less
    reliable.

    If you have a site with only a few thousand visits a day, the number
    that will come from shared IP addresses is not going to be very
    large. It will be larger than "random" would suggest, since people on
    public WiFi or in small businesses may well be browsing together, or
    all working on something that causes them to visit the same site.

    A while ago I ran a test using those clients that provide a referrer
    with each request. The idea was to see how many chains of page visits
    were inconsistent with the referrer information when it was present.
    I no longer have the data (and it was long enough ago to be useless
    now, given the growth in mobile), but it showed very little
    interleaving of "shared IP" visits. Of course, this can't be directly
    extrapolated to clients that don't provide referrer info, because
    they may well be more common in shared-IP environments.

    The reason this matters (or used to, at least) is that many people
    demand stats no matter how often you tell them they are just semi-random
    data. The site in question was a partially public-funded charity, and
    the public funding body demanded all sorts of data. Sometimes you can
    just provide unidentified data -- on one site I used to simply report
    the relative size of the log files and everyone was happy with that --
    but other times someone else has reported "visitors" and you are pretty
    much obliged to do something along those lines.

    > Also, I know from experience that some phone companies
    > put their customers into RFC1918-networks when using 3G (or 4G).


    Just before I stopped doing this sort of logging, I started to see quite
    a few visits from mobile devices where multiple IP addresses were
    involved in what was very clearly a single "visit". This is something
    that I'd not seen before. Is this a common feature of mobile IP
    networks? The more common manifestation of RFC1918-networks is one IP
    address being used for multiple visits. Either way, large scale use of
    mobile networks is going to make the technique nearly useless. I don't
    think I'd even try these days.

    <snip>
    --
    Ben.
    Ben Bacarisse, Jan 6, 2014
    #6
  7. On 2014-01-06 15:15, Ben Bacarisse <> wrote:
    > Rainer Weikusat <> writes:
    >> Also, I know from experience that some phone companies
    >> put their customers into RFC1918-networks when using 3G (or 4G).

    >
    > Just before I stopped doing this sort of logging, I started to see quite
    > a few visits from mobile devices where multiple IP addresses were
    > involved in what was very clearly a single "visit". This is something
    > that I'd not seen before. Is this a common feature of mobile IP
    > networks? The more common manifestation of RFC1918-networks is one IP
    > address being used for multiple visits.


    If you have a lot of customers behind NAT, you will have more than
    64 k concurrent connections (maybe even more than 64 k connections to
    the same server!), and since the 16-bit port field allows at most
    65,535 source ports per address, you need more than one public IP
    address. Depending on how connections are mapped to public IP
    addresses, different connections from the same client may wind up
    with different public IP addresses.
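
    As a back-of-the-envelope illustration (all figures here are assumed,
    not hp's numbers):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Rough capacity estimate for a carrier-grade NAT.
        my $subscribers   = 500_000;  # customers behind the NAT (assumed)
        my $conns_per_sub = 10;       # avg concurrent connections (assumed)
        my $ports_per_ip  = 65_535;   # 16-bit source-port space per public IP

        my $total = $subscribers * $conns_per_sub;
        my $ips   = int($total / $ports_per_ip) + 1;
        printf "%d concurrent connections need at least %d public IPs\n",
               $total, $ips;

    Five million concurrent connections already call for 77 public
    addresses, so one client's connections will spread across several of
    them unless the mapping pins each subscriber to a single address.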

    > Either way, large scale use of mobile networks is going to make the
    > technique nearly useless. I don't think I'd even try these days.


    It may become useful again with IPv6, even with Privacy Extensions.

    hp


    --
    Peter J. Holzer | http://www.hjp.at/ | "The curse of electronic word
    processing: you keep reworking your text until the parts of the
    sentence no longer fit together." -- Ralph Babel
    Peter J. Holzer, Jan 6, 2014
    #7
  8. >>>>> "BB" == Ben Bacarisse <> writes:

    BB> The reason this matters (or used to, at least) is that many
    BB> people demand stats no matter how often you tell them they are
    BB> just semi-random data. The site in question was a partially
    BB> public-funded charity, and the public funding body demanded all
    BB> sorts of data. Sometimes you can just provide unidentified data
    BB> -- on one site I used to simply report the relative size of the
    BB> log files and everyone was happy with that -- but other times
    BB> someone else has reported "visitors" and you are pretty much
    BB> obliged to do something along those lines.

    Some years back I worked for a startup that had this kind of problem.
    I realized a few things:

    A good director of marketing understands both the needs of the
    particular business for data and the limitations of the data gathering
    methods, and figures out ways to connect them. Good marketing is
    data-driven and instinct-driven in roughly equal measures because of the
    limitations of data gathering.

    From an engineer's point of view, web analytics data is "semi-random."
    From the viewpoint of people who have done advertising on TV or in
    print, web analytics data is insanely precise and detailed.

    After the director of marketing has departed (which is a tale in and
    of itself), the executives have no idea what data they want or need,
    but they will ask for it in the vaguest of terms, while demanding
    that it be simultaneously objective and as favorable to what they
    want to do as possible.

    Without a good director of marketing and a competent CTO, a startup is
    doomed to failure.

    Charlton





    --
    Charlton Wilbur
    Charlton Wilbur, Jan 6, 2014
    #8
  9. Tuxedo

    Tuxedo Guest

    Dr.Ruud wrote:

    [...]

    > Many hundreds of visitors can share the same IP-nr.
    > Other visitors can have a fresh IP-nr for every request.
    >
    > So you need to store some ID in the client-side cookie.
    > And store further details server-side (consider Apache Cassandra).
    >
    > That client-side cookie-ID needs protection against tampering and
    > duplication. Combine it with a browser fingerprint, and create a
    > fresh ID if the fingerprint no longer matches well enough.
    >
    > If you implement all this properly, you can use the factoids
    > dynamically, meaning that you can make them directly influence the
    > content of the served pages.
    >


    Many thanks for pointing out the above. I guess the only way to get
    semi-reliable unique-visitor data is to use cookies combined with
    browser fingerprinting, which is something I was hoping to avoid.

    Tuxedo
    Tuxedo, Jan 6, 2014
    #9
