Interpreting Web statistics

JDS

Hi, all. I am constantly butting heads with others in my department about
the interpretation of web log statistics regarding viewership of a
website. "Page views" "path through a site" "exit points" and that sort
of thing.

On the web, there are two diametrically opposing views on the value of web
server log stats.

1) Web server log analysis is very useful and can provide detailed,
usable, accurate statistics

AND

2) It can't

Well, which is it?

Typically, companies (e.g. Webtrends) that sell analysis software say
the first. However, there are a number of articles pointing to the
second. Notably, the author of "analog", one of the original web log
analysis tools, says that you can't *really* get too much meaningful
analysis out of your server logs.

Well, the problem that I see is that the articles pointing to the
uselessness of web log analysis tend to be OLD. REALLY REALLY old in
internet years -- ca. 1994 and 1995!

examples:
http://www.analog.cx/docs/webworks.html
http://www.goldmark.org/netrants/webstats/#whykeep
http://www.ario.ch/etc/webstats.html


Now, technology has moved along since the WWW first hit the streets, so to
speak, and my question(s) is(are) simple:

What techniques exist to overcome the problems inherent in Web Server Log
Analysis?

I know there *must* be some techniques! Things like tracking users via
cookies and using "tracker" URLs (a server script that logs the requested
URL and then redirects the browser, thus recording what was clicked and
from where), that sort of thing.

If anyone can provide some insight on the following, that'd be great:

What techniques exist to improve Web Server Log analysis?

How good are they?

What can I do to implement them?

How do different log analysis tools compare? (Examples I have considered
using are Analog, AWStats, Webtrends, and Sawmill.)

(All these factors are important to me in gauging a tool's quality:
accuracy and usefulness of reports, "prettiness" of reports, ease of use,
flexibility, speed, and cost.)

Golly, thanks, web denizens. I look forward to your responses. Have a
nice day.
 

Steve Pugh

JDS said:
Hi, all. I am constantly butting heads with others in my department about
the interpretation of web log statistics regarding viewership of a
website. "Page views" "path through a site" "exit points" and that sort
of thing.

Those things are useful if set up and interpreted with care, but are
not 100% definitive.
On the web, there are two diametrically opposing views on the value of web
server log stats.

1) Web server log analysis is very useful and can provide detailed,
usable, accurate statistics

AND

2) It can't

Well, which is it?
Both.

Typically, companies (e.g. Webtrends) that sell analysis software say
the first.

Read the small print. WebTrends use cookies and JavaScript instead
of/as well as server logs. They have a number of products and services
which offer differing levels of accuracy. But at the end of the day
they can not be 100% accurate. Think of them as providing information
on general trends rather than precise detail on every user (if a user
has a static IP and/or accepts and keeps cookies and enables JavaScript
then you can study them very accurately).
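The visitor-ID-cookie half of that can be sketched as follows. This is a generic illustration (the `visitor_id` cookie name and one-year lifetime are assumptions), not WebTrends' actual mechanism:

```python
# Sketch of visitor-ID tracking via a persistent cookie: reuse the id
# the browser sends back, or mint one and ask the browser to keep it.
import uuid
from http.cookies import SimpleCookie

def visitor_id(cookie_header):
    """Return (id, set_cookie_value_or_None) for an incoming request.

    If the browser already sent a visitor-id cookie, reuse it; otherwise
    mint a new id and hand back a Set-Cookie value to persist it.
    Users who block or discard cookies show up as a new visitor each time,
    which is exactly why these numbers can't be 100% accurate.
    """
    cookies = SimpleCookie(cookie_header or "")
    if "visitor_id" in cookies:
        return cookies["visitor_id"].value, None
    new_id = uuid.uuid4().hex
    return new_id, f"visitor_id={new_id}; Max-Age=31536000; Path=/"
```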
However, there are a number of articles pointing to the
second. Notably, the author of "analog", one of the original web log
analysis tools, says that you can't *really* get too much meaningful
analysis out of your server logs.

Yes, Analog reads server logs alone. It doesn't try to do anything with
JavaScript, cookies, etc.
Well, the problem that I see is that the articles pointing to the
uselessness of web log analysis tend to be OLD. REALLY REALLY old in
internet years -- ca. 1994 and 1995!

Server logs haven't changed.
Now, technology has moved along since the WWW first hit the streets, so to
speak, and my question(s) is(are) simple:

What techniques exist to overcome the problems inherent in Web Server Log
Analysis?

Cookies, JavaScript, guesswork.
I know there *must* be some techniques! Things like tracking users via
cookies and using "tracker" URLs (a server script that gets URLS and
redirects the browser, thus writing a log of what was clicked and where),
that sort of thing.

If anyone can provide some insight on the following, that'd be great:

What techniques exist to improve Web Server Log analysis?

How good are they?

What can I do to implement them?

How do different log analysis tools compare? (Examples I have considered
using are Analog, AWStats, Webtrends, and Sawmill.)

How much money do you have to spend?

Steve
 

Alan J. Flavell

Read the small print. WebTrends use cookies and JavaScript instead
of/as well as server logs.

Both of which, discerning users have been selectively blocking for
many years. What was that program we had back in Win95 days, which
blocked such things from any browser? I've actually forgotten its
name, and my old '95 PC has long since gone to the knacker's yard, but
it was definitely there; and nowadays such functions come built-in to
any decent browser.

However, servers which insist on using such techniques are inhibiting
cacheability, and thus ensuring a less responsive web, and thus are
interfering in a negative way with the results which their users
experience (*all* of their users - not only those discerning users who
block these attempts to peek into their activities).

This is, in effect, the Heisenberg law of web statistics - the harder
you try to get accurate answers, the more you interfere with the way
that the web works (recalling that HTTP was quite deliberately
designed to be "stateless"), and the worse you are able to serve the
requests of your users. And so, you end up getting more-accurate
measurements of something that would be working much better if only
you'd stop trying so hard to measure it.
They have a number of products and services which offer differing
levels of accuracy. But at the end of the day they can not be 100%
accurate.

Worse than that: they aren't just "inaccurate", they are seriously
"biased", but you have no way of estimating the bias.

For example, if you improved your cacheability, your users would get
faster responses, and you might get more users sticking around to read
your site, whereas your server statistics would show fewer hits thanks
to all those folks who were getting the pages out of an intermediate
cache. And your statistics would show gaps because those visitors revisit
pages in their *own* browser cache, whereas previously they were
having to wait to re-fetch the same page from your server on every
revisit.
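The trade-off described above can be sketched as two header policies. These are illustrative values only (the helper function and one-hour default are assumptions); the "expires in January 1970" trick mentioned below is the second posture:

```python
# Two postures a server can take: let caches help your users (and lose
# the repeat hits from your log), or defeat caching to inflate the log
# at your users' expense.
from datetime import datetime, timedelta, timezone

def cache_headers(cacheable, max_age=3600):
    """Return illustrative response headers for each posture."""
    if cacheable:
        expires = datetime.now(timezone.utc) + timedelta(seconds=max_age)
        return {
            "Cache-Control": f"public, max-age={max_age}",
            "Expires": expires.strftime("%a, %d %b %Y %H:%M:%S GMT"),
        }
    # The classic anti-cache incantation: already expired, never store.
    return {
        "Cache-Control": "no-cache, no-store, must-revalidate",
        "Expires": "Thu, 01 Jan 1970 00:00:00 GMT",
    }
```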
Think of them as providing information on general trends

Yeah, such as when a certain large ISP deployed a new bank of cache
servers, and the "trends" apparently showed that users had
mysteriously lost interest in the web site in question. Strangely,
each popular page that was hit on the server was being hit exactly
once every 24 hours, after which nothing was heard again from that ISP
for another 24 hours. Yup, that ISP was callously ignoring everything
that the server told it - "this page is uncacheable", "expires in
January 1970", etc. etc. - and was caching each page for 24 hours
without appeal. No, I'm sorry: those "trends" don't really show very
much, unless and until you really know what's happening OUT THERE.
But your server statistics have no way to tell you what's happening
out there. They're selective, and biased, and, often enough, if
interpreted to show what people demand to know - rather than
interpreted in terms of the information they really contain - can
appear to show the opposite of the truth.

Let us consider for example those misguided folks who notice that >70%
of their users appear (according to the logged user agent) to be using
MSIE, so they "optimise" their site specifically for MSIE, and,
surprise surprise, the proportion of MSIE users rises. So would you
say they acted correctly, when most everyone else reports that the
proportion appearing to use MSIE is falling? For one thing, Opera
users are starting to stand up for themselves - many of them are no
longer willing to hide behind a user agent string which pretends to be
MSIE.

Many other changes are happening "out there", which make those numbers
viewed down the wrong end of the telescope at your server log into
highly misleading indicators of anything - except your server load,
and possibly a handy way to identify broken links.
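The user-agent bias described above is easy to reproduce with a naive log tally. A sketch, using an assumed sample line in NCSA combined log format (not real traffic):

```python
import re
from collections import Counter

# One assumed sample line of NCSA combined log format.
LOG_LINE = (
    '203.0.113.9 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" '
    '200 2326 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"'
)

UA_RE = re.compile(r'"([^"]*)"$')  # the user agent is the last quoted field

def browser_family(line):
    """Naive classification straight off the User-Agent string.

    Note the bias described above: Opera configured to identify itself
    as MSIE lands in the MSIE bucket unless you check for it first, so a
    tally like this can overstate the MSIE share.
    """
    ua = UA_RE.search(line).group(1)
    if "Opera" in ua:
        return "Opera"   # must check before MSIE: Opera may claim both
    if "MSIE" in ua:
        return "MSIE"
    return "Other"

tally = Counter(browser_family(line) for line in [LOG_LINE])
```

Even with the Opera check, the tally only describes requests that reached the server - cached views never appear at all.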

The author of Analog works in statistics, AIUI, and is determined to
tell the truth about web servers, no matter how much some web server
operators insist that they prefer to be fooled by convincing-looking
numbers about the behaviour of their visitors. Good for him.
 

Ed Mullen

JDS said:
Hi, all. I am constantly butting heads with others in my department about
the interpretation of web log statistics regarding viewership of a
website. "Page views" "path through a site" "exit points" and that sort
of thing.

The very simplest thing that occurs to me is from a user's standpoint.
As one user, I sometimes hit a given page many times but for a variety
of reasons. Stats won't tell you /why/ I hit that page. It could be
because I got distracted and went somewhere else for some totally
different purpose. It could be because the page didn't load fully
(images, etc.) and I left and came back. Maybe I looked at it on Tuesday
and thought "Crap, I just don't have time now, I'll bookmark it in my
'temps' folder and check it tomorrow (or in a month)." Perhaps I landed
there by accident, by clicking on the wrong link in a Google result page
or the wrong link in someone else's page. Or, maybe, I did a Google
search, went to that particular page and thought: Ohmigod! just what I
was looking for!!! Or not. How do any of the page stats tell you that?

I look at some stats for my site and don't take them all that seriously
other than aggregate changes from one month to the next, figuring that,
given all the variables, at least I can see what page is the most
accessed, the second-most accessed, the third-most, from month to month
... but that's about it. Heck, my checking my own site can skew the
stats depending on the total number of hits. At some point it becomes a
bit silly to chase after the chimera.

"There are lies, damned lies, and then there are statistics."
 

Nick Kew

Steve said:
Those things are useful if set up and interpreted with care, but are
not 100% definitive.

Treat them as you would viewing figures for a TV show.
Read the small print. WebTrends use cookies and JavaScript instead
of/as well as server logs.

Spammers. 'nuff said (or should be - Alan expanded on some
more technical reasons).
Yes, Analog reads server logs alone. It doesn't try to do anything with
JavaScript, cookies, etc.

No spam, no snake oil. No surprise.

I rather suspect the author of analog may even understand the subject.
Unlike those outfits where anyone who understands the issues is firmly
ignored and probably laughed at as a nerdy loser behind their backs.

Hire a statistician. And make it someone who understands the
infrastructure of the Web. There are very few people who
qualify on both counts.

Now you need to add *knowledge* of the web's infrastructure.
That's different from the *principles*, and much harder to
collect. In fact it's impossible to collect at the level
that would be required for the likes of webtrends to work -
you have to apply the kind of techniques that broadcasters
use. I haven't worked for a broadcaster myself, but I
strongly suspect *they* rely on some pretty ropey assumptions,
too[1].

[1] I have worked as a statistician, and I've seen how things
happen when there is *no data* to validate some part of the
underlying model used. It goes like this:
- Someone picks a figure effectively at random on a 'seems
reasonable' basis just to have something to work with.
That enables them to derive numbers from the model.
- They also try the model with different figures, to test
the effect of varying the unknown. This leads to a perfectly
valid set of "if [value1] then [result1]" results.
- BUT that's too complex for a soundbite culture, so only the
first figure gets reported as a headline conclusion.
- Now, a future practitioner has NO DATA to validate this part
of the model, but has the first paper as a reference to cite.
The assumption is peripheral to the study, so the 'headline'
figure is simply used without question.
- Over time it is much-cited because nobody wants to get involved
in something that cannot be verified. The first researcher's
still totally untested working hypothesis becomes common knowledge,
and 'obviously correct' because everyone uses it.
 

JDS

[1] I have worked as a statistician, and I've seen how things happen when
there is *no data* to validate some part of the underlying model used. It
goes like this:
- Someone picks a figure effectively at random on a 'seems
reasonable' basis just to have something to work with. That enables
them to derive numbers from the model.
- They also try the model with different figures, to test
the effect of varying the unknown. This leads to a perfectly valid
set of "if [value1] then [result1]" results.
- BUT that's too complex for a soundbite culture, so only the
first figure gets reported as a headline conclusion.
- Now, a future practitioner has NO DATA to validate this part
of the model, but has the first paper as a reference to cite. The
assumption is peripheral to the study, so the 'headline' figure is
simply used without question.
- Over time it is much-cited because nobody wants to get involved
in something that cannot be verified. The first researcher's still
totally untested working hypothesis becomes common knowledge, and
'obviously correct' because everyone uses it.


Riiiight. Well, that's what I am afraid of, although that scenario sounds
all too realistic.

Well, I've come to the conclusion (and my new boss agrees) that we can
(and will) use (read: "shamelessly manipulate") server log stats to help
justify any direction we decide to take with our web presence. Frankly,
it's not like we are going to be making such huge mistakes or life and
death decisions based on server log stats, so a smidge of data
manipulation in our favor isn't too much of a problem. A lot of the
marketing, user interface, and design decisions to be made for the web are
often common-sensical[1] anyways.

But I guess an important point is that server log statistics are only one
part of a complex whole when trying to make decisions about one's web
presence or infrastructure.

Allrighty, all, thanks for the information! This was a helpful Usenet
dialog.

Later...



[1] Not that "common" sense is all that common.
 

Richard Sexton

Well, I've come to the conclusion (and my new boss agrees) that we can
(and will) use (read: "shamelessly manipulate") server log stats to help
justify any direction we decide to take with our web presence.

Then you need the book "How to Lie with Statistics"
by Darrell Huff. ISBN: 0393310728

For example, the average daily temperature in Death Valley
is 72F. (It's 0 at night and 144 by day).

(Yes, I know the difference between average, median, and mean, but it's
a joke; work with me here)
 

Pete Gray

However, servers which insist on using such techniques are inhibiting
cacheability, and thus ensuring a less responsive web, and thus are
interfering in a negative way with the results which their users
experience (*all* of their users - not only those discerning users who
block these attempts to peek into their activities).

This is, in effect, the Heisenberg law of web statistics - the harder
you try to get accurate answers, the more you interfere with the way
that the web works (recalling that HTTP was quite deliberately
designed to be "stateless"), and the worse you are able to serve the
requests of your users. And so, you end up getting more-accurate
measurements of something that would be working much better if only
you'd stop trying so hard to measure it.

[snipped]

Audit Scotland didn't listen when I said as much in the consultation on
the new Statutory Performance Indicator for museums in Scotland:
<http://www.scottishmuseums.org.uk/areas_of_work/spi_intro.asp>

I believe they're also sending us the 'Magic Eye' mind-reader plugin so
we'll know what the purpose of a visit to the web site was. And what can
you say about an indicator that talks about 'hits' as a measure? Idiots.
 

Alan J. Flavell

Audit Scotland didn't listen when I said as much in the consultation on
the new Statutory Performance Indicator for museums in Scotland:
<http://www.scottishmuseums.org.uk/areas_of_work/spi_intro.asp>

[servers group snipped for this f'up...]

I just *knew* that when I disabled Mozilla's minimum font size, I was
going to get microfonts. As indeed I did, "thanks", as I now see, to
their:
BODY, TH, TD { font-size: x-small; }

Makes viewing in Lynx an attractive alternative. Hmmm, I see that
their "alt" text for an image of the words "powered by GoogleTM" is:
"Google logo". Bleagh.
I believe they're also sending us the 'Magic Eye' mind-reader plugin so
we'll know what the purpose of a visit to the web site was.

You mean like this bit? -

||Website hits require that museum IT systems can differentiate
||between those users only making general enquiries about the museum
||and its services, and those searching web pages relating to the
||resources or collection.

(boggle)
 
