Interpreting Web statistics

Discussion in 'HTML' started by JDS, Feb 20, 2006.

  1. JDS

    JDS Guest

    Hi, all. I am constantly butting heads with others in my department about
    the interpretation of web log statistics regarding viewership of a
    website. "Page views" "path through a site" "exit points" and that sort
    of thing.

    On the web, there are two diametrically opposed views on the value of web
    server log stats.

    1) Web server log analysis is very useful and can provide detailed,
    usable, accurate statistics

    AND

    2) It can't

    Well, which is it?

    Typically, companies (e.g. Webtrends) that sell analysis software say
    the first. However, there are a number of articles pointing to the
    second. Notably, the author of "analog", one of the original web log
    analysis tools, says that you can't *really* get too much meaningful
    analysis out of your server logs.

    Well, the problem that I see is that the articles pointing to the
    uselessness of web log analysis tend to be OLD. REALLY REALLY old in
    internet years -- ca. 1994 and 1995!

    examples:
    http://www.analog.cx/docs/webworks.html
    http://www.goldmark.org/netrants/webstats/#whykeep
    http://www.ario.ch/etc/webstats.html


    Now, technology has moved along since the WWW first hit the streets, so to
    speak, and my question(s) is(are) simple:

    What techniques exist to overcome the problems inherent in Web Server Log
    Analysis?

    I know there *must* be some techniques! Things like tracking users via
    cookies and using "tracker" URLs (a server script that gets URLS and
    redirects the browser, thus writing a log of what was clicked and where),
    that sort of thing.
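    Something like this rough Python sketch, say -- a stdlib-only
    "tracker" redirect (the /track URL scheme, port, and log file name
    are all invented for illustration, and a real one would whitelist
    targets rather than be an open redirect):

    # Log the requested target, then bounce the browser on to it.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs
    import time

    LOG_FILE = "clicks.log"  # hypothetical log location

    class TrackerHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Expect URLs like /track?to=http://example.com/page.html
            query = parse_qs(urlparse(self.path).query)
            target = query.get("to", [None])[0]
            if target is None:
                self.send_error(400, "missing 'to' parameter")
                return
            # Record who clicked what, and when.
            with open(LOG_FILE, "a") as log:
                log.write("%s %s %s\n" % (time.strftime("%Y-%m-%dT%H:%M:%S"),
                                          self.client_address[0], target))
            # A 302 sends the browser on to the real destination.
            self.send_response(302)
            self.send_header("Location", target)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8000), TrackerHandler).serve_forever()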

    If anyone can provide some insight on the following, that'd be great:

    What techniques exist to improve Web Server Log analysis?

    How good are they?

    What can I do to implement them?

    How do different log analysis tools compare? (Examples I have considered
    using are Analog, AWStats, Webtrends, and Sawmill.)

    (All these factors are important to me in gauging a tool's quality:
    accuracy and usefulness of reports, "prettiness" of reports, ease of use,
    flexibility, speed, and cost)

    Golly, thanks, web denizens. I look forward to your responses. Have a
    nice day.

    --
    JDS | lid
    | http://www.newtnotes.com
    DJMBS | http://newtnotes.com/doctor-jeff-master-brainsurgeon/
    JDS, Feb 20, 2006
    #1

  2. Mark Parnell

    Mark Parnell Guest

    Mark Parnell, Feb 20, 2006
    #2

  3. Steve Pugh

    Steve Pugh Guest

    JDS wrote:
    > Hi, all. I am constantly butting heads with others in my department about
    > the interpretation of web log statistics regarding viewership of a
    > website. "Page views" "path through a site" "exit points" and that sort
    > of thing.


    Those things are useful if set up and interpreted with care, but are
    not 100% definitive.

    > On the web, there are two diametrically opposing views on the value of web
    > server log stats.
    >
    > 1) Web server log analysis is very useful and can provide detailed,
    > usable, accurate statistics
    >
    > AND
    >
    > 2) It can't
    >
    > Well, which is it?


    Both.

    > Typically, companies (e.g. Webtrends) that sell analysis software say
    > the first.


    Read the small print. WebTrends use cookies and JavaScript instead
    of/as well as server logs. They have a number of products and services
    which offer differing levels of accuracy. But at the end of the day
    they can not be 100% accurate. Think of them as providing information
    on general trends rather than precise detail on every user (if a user
    has a static IP and/or accepts and keeps cookies and enables JavaScript
    then you can study them very accurately).
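    (A rough illustration of the cookie half of that, in Python: group
    requests by a visitor-ID cookie instead of by IP address. It assumes
    the server was configured to append a "vid=..." token to each log
    line -- that token and the whole format are invented here:)

    import re
    from collections import defaultdict

    LINE = re.compile(
        r'^(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+.*?vid=(?P<vid>\S+)')

    def sessions(log_lines):
        # Map visitor-ID -> ordered list of requested URLs.
        paths = defaultdict(list)
        for line in log_lines:
            m = LINE.match(line)
            if not m:
                continue  # malformed or cookie-less line
            parts = m.group("request").split()
            if len(parts) >= 2:
                paths[m.group("vid")].append(parts[1])
        return paths

    if __name__ == "__main__":
        sample = [
            '1.2.3.4 - - [20/Feb/2006:10:00:00 +0000] '
            '"GET / HTTP/1.1" 200 512 vid=abc123',
            '1.2.3.4 - - [20/Feb/2006:10:00:05 +0000] '
            '"GET /about.html HTTP/1.1" 200 901 vid=abc123',
        ]
        for vid, pages in sessions(sample).items():
            print(vid, "->", " -> ".join(pages))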

    > However, there are a number of articles pointing to the
    > second. Notably, the author of "analog", one of the original web log
    > analysis tools, says that you can't *really* get too much meaningful
    > analysis out of your server logs.


    Yes, Analog reads server logs alone. It doesn't try to do anything with
    JavaScript, cookies, etc.

    > Well, the problem that I see is that the articles pointing to the
    > uselessness of web log analysis tend to be OLD. REALLY REALLY old in
    > internet years -- ca. 1994 and 1995!


    Server logs haven't changed.
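    To make that concrete: the "combined" log format a server writes
    today is essentially the one those 1994/95 articles were written
    about. One HTTP request per line, and that is all you get (the
    address and URLs below are invented):

    # One line of combined log format, and its nine fields. Note what
    # is absent: no visitor identity, no session, no "page view", no
    # exit point -- just a single request.
    SAMPLE = ('203.0.113.7 - - [20/Feb/2006:14:31:06 +0000] '
              '"GET /index.html HTTP/1.1" 200 4523 '
              '"http://www.example.com/start.html" '
              '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')

    FIELDS = ("client address", "identd", "authenticated user",
              "timestamp", "request line", "status code", "bytes sent",
              "referrer", "user agent")

    if __name__ == "__main__":
        print(SAMPLE)
        print(", ".join(FIELDS))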

    > Now, technology has moved along since the WWW first hit the streets, so to
    > speak, and my question(s) is(are) simple:
    >
    > What techniques exist to overcome the problems inherent in Web Server Log
    > Analysis?


    Cookies, JavaScript, guesswork.

    > I know there *must* be some techniques! Things like tracking users via
    > cookies and using "tracker" URLs (a server script that gets URLs and
    > redirects the browser, thus writing a log of what was clicked and where),
    > that sort of thing.
    >
    > If anyone can provide some insight on the following, that's be great:
    >
    > What techniques exist to improve Web Server Log analysis?
    >
    > How good are they?
    >
    > What can I do to implement them?
    >
    > How do different log analysis tools compare? (Examples I have considered
    > using are Analog, AWStats, Webtrends, and Sawmill.)


    How much money do you have to spend?

    Steve
    Steve Pugh, Feb 20, 2006
    #3
  4. Alan J. Flavell

    On Mon, 20 Feb 2006, Steve Pugh wrote:

    > Read the small print. WebTrends use cookies and JavaScript instead
    > of/as well as server logs.


    Both of which discerning users have been selectively blocking for
    many years. What was that program we had back in Win95 days, which
    blocked such things from any browser? I've actually forgotten its
    name, and my old '95 PC has long since gone to the knacker's yard, but
    it was definitely there; and nowadays such functions come built-in to
    any decent browser.

    However, servers which insist on using such techniques are inhibiting
    cacheability, and thus ensuring a less responsive web, and thus are
    interfering in a negative way with the results which their users
    experience (*all* of their users - not only those discerning users who
    block these attempts to peek into their activities).

    This is, in effect, the Heisenberg law of web statistics - the harder
    you try to get accurate answers, the more you interfere with the way
    that the web works (recalling that HTTP was quite deliberately
    designed to be "stateless"), and the worse you are able to serve the
    requests of your users. And so, you end up getting more-accurate
    measurements of something that would be working much better if only
    you'd stop trying so hard to measure it.
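    To put the trade-off in concrete terms, here are the two postures
    as response-header sketches (Python; the values are illustrative,
    not a recommendation):

    # Posture 1: measure everything. Every request must come back to
    # the origin server, so caches are told the page is already stale.
    TRACK_EVERYTHING = {
        "Cache-Control": "no-cache, no-store, must-revalidate",
        "Pragma": "no-cache",                        # HTTP/1.0 caches
        "Expires": "Thu, 01 Jan 1970 00:00:00 GMT",  # "expires in 1970"
    }

    # Posture 2: serve users well. Intermediate and browser caches may
    # reuse the page -- so many requests never appear in your log.
    CACHE_FRIENDLY = {
        "Cache-Control": "public, max-age=3600",     # reusable for an hour
        "Last-Modified": "Mon, 20 Feb 2006 12:00:00 GMT",
    }

    if __name__ == "__main__":
        for name, headers in (("trackable", TRACK_EVERYTHING),
                              ("cacheable", CACHE_FRIENDLY)):
            print(name)
            for k, v in headers.items():
                print("  %s: %s" % (k, v))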

    > They have a number of products and services which offer differing
    > levels of accuracy. But at the end of the day they can not be 100%
    > accurate.


    Worse than that: they aren't just "inaccurate", they are seriously
    "biased", but you have no way of estimating the bias.

    For example, if you improved your cacheability, your users would get
    faster responses, and you might get more users sticking around to read
    your site, whereas your server statistics would show fewer hits thanks
    to all those folks who were getting the pages out of an intermediate
    cache. And your statistics would show gaps because those users revisit
    pages in their *own* browser cache, whereas previously they were
    having to wait to re-fetch the same page from your server on every
    revisit.

    > Think of them as providing information on general trends


    Yeah, such as when a certain large ISP deployed a new bank of cache
    servers, and the "trends" apparently showed that users had
    mysteriously lost interest in the web site in question. Strangely,
    each popular page that was hit on the server was being hit exactly
    once every 24 hours, after which nothing was heard again from that ISP
    for another 24 hours. Yup, that ISP was callously ignoring everything
    that the server told it ("this page is uncacheable", "expires in
    January 1970", etc. etc.), and was caching each page for 24 hours
    without appeal. No, I'm sorry: those "trends" don't really show very
    much, unless and until you really know what's happening OUT THERE.
    But your server statistics have no way to tell you what's happening
    out there. They're selective, and biased, and, often enough, if
    interpreted to show what people demand to know - rather than
    interpreted in terms of the information they really contain - can
    appear to show the opposite of the truth.

    Let us consider for example those misguided folks who notice that >70%
    of their users appear (according to the logged user agent) to be using
    MSIE, so they "optimise" their site specifically for MSIE, and,
    surprise surprise, the proportion of MSIE users rises. So would you
    say they acted correctly, when most everyone else reports that the
    proportion appearing to use MSIE is falling? For one thing, Opera
    users are starting to stand up for themselves - many of them are no
    longer willing to hide behind a user agent string which pretends to be
    MSIE.
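    You can see how self-fulfilling that is by tallying the logged user
    agents yourself -- a rough Python sketch (the buckets are
    deliberately crude and the sample strings invented):

    import re
    from collections import Counter

    UA = re.compile(r'"([^"]*)"\s*$')  # last quoted field of a log line

    def family(ua):
        # Crude buckets. Order matters: Opera's masquerading string
        # contains "MSIE" too, so test for Opera first.
        if "Opera" in ua:
            return "Opera"
        if "MSIE" in ua:
            return "MSIE"
        if "Gecko" in ua:
            return "Gecko-ish"
        return "other"

    def tally(lines):
        counts = Counter()
        for line in lines:
            m = UA.search(line)
            if m:
                counts[family(m.group(1))] += 1
        return counts

    if __name__ == "__main__":
        sample = [
            '... "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
            '... "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 8.50"',
        ]
        print(tally(sample))  # one MSIE, one Opera -- not two MSIE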

    Many other changes are happening "out there", which make those numbers
    viewed down the wrong end of the telescope at your server log into
    highly misleading indicators of anything - except your server load,
    and possibly a handy way to identify broken links.

    > > However, there are a number of articles pointing to the second.
    > > Notably, the author of "analog", one of the original web log
    > > analysis tools, says that you can't *really* get too much
    > > meaningful analysis out of your server logs.


    The author of Analog works in statistics, AIUI, and is determined to
    tell the truth about web servers, no matter how much some web server
    operators insist that they prefer to be fooled by convincing-looking
    numbers about the behaviour of their visitors. Good for him.
    Alan J. Flavell, Feb 20, 2006
    #4
  5. Ed Mullen

    Ed Mullen Guest

    JDS wrote:
    > Hi, all. I am constantly butting heads with others in my department about
    > the interpretation of web log statistics regarding viewership of a
    > website. "Page views" "path through a site" "exit points" and that sort
    > of thing.
    >


    The very simplest thing that occurs to me is from a user's standpoint.
    As one user, I sometimes hit a given page many times but for a variety
    of reasons. Stats won't tell you /why/ I hit that page. It could be
    because I got distracted and went somewhere else for some totally
    different purpose. It could be because the page didn't load fully
    (images, etc.) and I left and came back. Maybe I looked at it on Tuesday
    and thought, "Crap, I just don't have time now, I'll bookmark it in my
    'temps' folder and check it tomorrow (or in a month)." Perhaps I landed
    there by accident, by clicking on the wrong link in a Google result page
    or the wrong link in someone else's page. Or, maybe, I did a Google
    search, went to that particular page and thought: Ohmigod! just what I
    was looking for!!! Or not. How do any of the page stats tell you that?

    I look at some stats for my site and don't take them all that seriously
    other than aggregate changes from one month to the next, figuring that,
    given all the variables, at least I can see what page is the most
    accessed, the second-most accessed, the third-most, from month to month
    ... but that's about it. Heck, my checking my own site can skew the
    stats depending on the total number of hits. At some point it becomes a
    bit silly to chase after the chimera.

    "There are lies, damned lies, and then there are statistics."

    --
    Ed Mullen
    http://edmullen.net
    http://mozilla.edmullen.net
    http://abington.edmullen.net
    Ed Mullen, Feb 21, 2006
    #5
  6. Nick Kew

    Nick Kew Guest

    Steve Pugh wrote:

    > Those things are useful if set up and interpreted with care, but are
    > not 100% definitive.


    Treat them as you would viewing figures for a TV show.

    >>Typically, companies (e.g. Webtrends) that sell analysis software say
    >>the first.

    >
    >
    > Read the small print. WebTrends use cookies and JavaScript instead
    > of/as well as server logs.


    Spammers. 'nuff said (or should be - Alan expanded on some
    more technical reasons).

    >> However, there are a number of articles pointing to the
    >>second. Notably, the author of "analog", one of the original web log
    >>analysis tools, says that you can't *really* get too much meaningful
    >>analysis out of your server logs.

    >
    >
    > Yes, Analog reads server logs alone. It doesn't try to do anything with
    > JavaScript, cookies, etc.


    No spam, no snake oil. No surprise.

    I rather suspect the author of analog may even understand the subject.
    Unlike those outfits where anyone who understands the issues is firmly
    ignored and probably laughed at as a nerdy loser behind their backs.

    >>What techniques exist to improve Web Server Log analysis?
    >>
    >>How good are they?
    >>
    >>What can I do to implement them?


    Hire a statistician. And make it someone who understands the
    infrastructure of the Web. There are very few people who
    qualify on both counts.

    A statistician gives you the statistical *principles*. To those you
    need to add *knowledge* of the web's infrastructure, which is much
    harder to collect. In fact it's impossible to collect at the level
    that would be required for the likes of webtrends to work -
    you have to apply the kind of techniques that broadcasters
    use. I haven't worked for a broadcaster myself, but I
    strongly suspect *they* rely on some pretty ropey assumptions,
    too[1].

    [1] I have worked as a statistician, and I've seen how things
    happen when there is *no data* to validate some part of the
    underlying model used. It goes like this (toy sketch after the list):
    - Someone picks a figure effectively at random on a 'seems
    reasonable' basis just to have something to work with.
    That enables them to derive numbers from the model.
    - They also try the model with different figures, to test
    the effect of varying the unknown. This leads to a perfectly
    valid set of "if [value1] then [result1]" results.
    - BUT that's too complex for a soundbite culture, so only the
    first figure gets reported as a headline conclusion.
    - Now, a future practitioner has NO DATA to validate this part
    of the model, but has the first paper as a reference to cite.
    The assumption is peripheral to the study, so the 'headline'
    figure is simply used without question.
    - Over time it is much-cited because nobody wants to get involved
    in something that cannot be verified. The first researcher's
    still totally untested working hypothesis becomes common knowledge,
    and 'obviously correct' because everyone uses it.
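    The middle steps, as a toy sensitivity analysis in Python (every
    number here is invented):

    # A model with one unvalidated parameter: the fraction of requests
    # that actually reach the origin server instead of being answered
    # by a cache. Nobody has data for it, so it gets guessed.
    def visitors(logged_hits, miss_rate):
        return logged_hits / miss_rate

    logged = 10000  # what the server log shows

    # Someone picks 0.7 on a 'seems reasonable' basis...
    headline = visitors(logged, 0.7)

    # ...and, honestly, also tries a range of values for the unknown,
    # producing a perfectly valid set of "if [value] then [result]"s.
    for rate in (0.5, 0.6, 0.7, 0.8, 0.9):
        print("if miss rate = %.1f then visitors ~ %d"
              % (rate, visitors(logged, rate)))

    # But only the first figure survives as the headline conclusion,
    # and the next study cites it as if it were data.
    print("headline: ~%d visitors" % headline)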

    --
    Nick Kew
    Nick Kew, Feb 21, 2006
    #6
  7. JDS

    JDS Guest

    On Tue, 21 Feb 2006 11:17:31 +0000, Nick Kew wrote:

    > [1] I have worked as a statistician, and I've seen how things happen when
    > there is *no data* to validate some part of the underlying model used. It
    > goes like this:
    > - Someone picks a figure effectively at random on a 'seems
    > reasonable' basis just to have something to work with. That enables
    > them to derive numbers from the model.
    > - They also try the model with different figures, to test
    > the effect of varying the unknown. This leads to a perfectly valid
    > set of "if [value1] then [result1]" results.
    > - BUT that's too complex for a soundbite culture, so only the
    > first figure gets reported as a headline conclusion.
    > - Now, a future practitioner has NO DATA to validate this part
    > of the model, but has the first paper as a reference to cite. The
    > assumption is peripheral to the study, so the 'headline' figure is
    > simply used without question.
    > - Over time it is much-cited because nobody wants to get involved
    > in something that cannot be verified. The first researcher's still
    > totally untested working hypothesis becomes common knowledge, and
    > 'obviously correct' because everyone uses it.



    Riiiight. Well, that's what I am afraid of, although that scenario sounds
    all too realistic.

    Well, I've come to the conclusion (and my new boss agrees) that we can
    (and will) use (read: "shamelessly manipulate") server log stats to help
    justify any direction we decide to take with our web presence. Frankly,
    it's not like we are going to be making such huge mistakes or life and
    death decisions based on server log stats, so a smidge of data
    manipulation in our favor isn't too much of a problem. A lot of the
    marketing, user interface, and design decisions to be made for the web are
    often common-sensical[1] anyways.

    But I guess an important point is that server log statistics are only one
    part of a complex whole when trying to make decisions about one's web
    presence or infrastructure.

    Allrighty, all, thanks for the information! This was a helpful Usenet
    dialog.

    Later...



    [1] Not that "common" sense is all that common.
    --
    JDS | lid
    | http://www.newtnotes.com
    DJMBS | http://newtnotes.com/doctor-jeff-master-brainsurgeon/
    JDS, Feb 21, 2006
    #7
  8. Richard Sexton

    >Well, I've come to the conclusion (and my new boss agrees) that we can
    >(and will) use (read: "shamelessly manipulate") server log stats to help
    >justify any direction we decide to take with our web presence.


    Then you need the book "How to Lie with Statistics"
    by Darrell Huff. ISBN: 0393310728

    For example, the average daily temperature in Death Valley
    is 72F. (It's 0 at night and 144 by day).

    (Yes I know the difference between average, median and mean, but it's
    a joke; work with me here)

    --
    Need Mercedes parts ? - http://parts.mbz.org
    Richard Sexton | Mercedes stuff: http://mbz.org
    1970 280SE, 72 280SE | Home page: http://rs79.vrx.net
    633CSi 250SE/C 300SD | http://aquaria.net http://killi.net
    Richard Sexton, Feb 21, 2006
    #8
  9. Mark Parnell

    Mark Parnell Guest

    Mark Parnell, Feb 21, 2006
    #9
  10. Pete Gray

    Pete Gray Guest

    In article <>, Alan J. Flavell says...

    > However, servers which insist on using such techniques are inhibiting
    > cacheability, and thus ensuring a less responsive web, and thus are
    > interfering in a negative way with the results which their users
    > experience (*all* of their users - not only those discerning users who
    > block these attempts to peek into their activities).
    >
    > This is, in effect, the Heisenberg law of web statistics - the harder
    > you try to get accurate answers, the more you interfere with the way
    > that the web works (recalling that HTTP was quite deliberately
    > designed to be "stateless"), and the worse you are able to serve the
    > requests of your users. And so, you end up getting more-accurate
    > measurements of something that would be working much better if only
    > you'd stop trying so hard to measure it.
    >


    [snipped]

    Audit Scotland didn't listen when I said as much in the consultation on
    the new Statutory Performance Indicator for museums in Scotland:
    <http://www.scottishmuseums.org.uk/areas_of_work/spi_intro.asp>

    I believe they're also sending us the 'Magic Eye' mind-reader plugin so
    we'll know what the purpose of a visit to the web site was. And what can
    you say about an indicator that talks about 'hits' as a measure? Idiots.

    --
    Pete Gray

    notes from a small bedroom
    <http://www.redbadge.co.uk/notes>
    Pete Gray, Feb 24, 2006
    #10
  11. Alan J. Flavell

    On Fri, 24 Feb 2006, Pete Gray wrote:

    > Audit Scotland didn't listen when I said as much in the consultation on
    > the new Statutory Performance Indicator for museums in Scotland:
    > <http://www.scottishmuseums.org.uk/areas_of_work/spi_intro.asp>


    [servers group snipped for this f'up...]

    I just *knew* that when I disabled Mozilla's minimum font size, I was
    going to get microfonts. As indeed I did, "thanks", as I now see, to
    their:
    BODY, TH, TD { font-size: x-small; }

    Makes viewing in Lynx an attractive alternative. Hmmm, I see that
    their "alt" text for an image of the words "powered by GoogleTM" is:
    "Google logo". Bleagh.

    > I believe they're also sending us the 'Magic Eye' mind-reader plugin so
    > we'll know what the purpose of a visit to the web site was.


    You mean like this bit? -

    ||Website hits require that museum IT systems can differentiate
    ||between those users only making general enquiries about the museum
    ||and its services, and those searching web pages relating to the
    ||resources or collection.

    (boggle)
    Alan J. Flavell, Feb 24, 2006
    #11
