You appear to have decided to dismiss the "opinion" of HTTP experts on
the grounds that they are not statisticians (or, more perversely, that
they do not understand how HTTP works, which would not be a rational
conclusion). In practice HTTP experts are responsible for tasks such as
load balancing servers, which they do, at least in part, based on the
results of statistical analyses of logged data. Of course for load
balancing the pertinent data relates only to the servers, and can be
gathered accurately on those servers. And considerable effort is
expended examining the best strategies for gathering and analysing
server-logged data.
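To be clear about what is feasible on the server side, here is a
minimal sketch of the kind of analysis I mean (the log file name is an
assumption for illustration; the layout is the well-known Common Log
Format):

    # Sketch: per-minute request tallies from a server access log in
    # Common Log Format. The data is gathered exactly where it occurs,
    # so there is nothing to guess at.
    from collections import Counter

    requests_per_minute = Counter()

    with open("access.log") as log:   # illustrative file name
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue              # skip malformed lines
            # Field 3 in CLF is the timestamp: [10/Oct/2000:13:55:36
            minute = fields[3].lstrip("[")[:17]   # date, hour, minute
            requests_per_minute[minute] += 1

    # Peak-minute load is a figure a load-balancing decision can
    # legitimately rest on.
    for minute, count in requests_per_minute.most_common(5):
        print(minute, count)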
HTTP experts are not antagonistic towards the notion of deriving client
statistics from server logs because they are ignorant of statistical
analysis (or distrust it). They don't believe it can be done because
they _understand_ the mechanisms of HTTP. And they conclude from that
understanding of the mechanism that the unknowables are so significant
in the problem of making deductions about the clients that the results
of any such attempt must be meaningless.
Take, for example, just one aspect of HTTP communication: a request
from a client at point A is addressed to a resource on a server on the
network at point B. What factors determine the route it will take? The
network was very explicitly designed such that the exact route taken by
any packet of data is unimportant; the routing decisions are made by a
wide diversity of software implementations based on conditions that are
local and transient. The request may take any available route, and
subsequent requests will not necessarily follow the same route.
Does the route matter? Yes, it must if intervening caches are going to
influence the likelihood of a request from point A making it as far as
the server at point B in order to be logged. You might decide that some
sort of 'average' route could be used in the statistical analysis, but
on a global network the number of possible routes is so large (to say
the least) that any average will differ significantly from the reality
of most individual requests.
Having blurred the path taken by an HTTP request into some sort of
average or model, it is necessary to apply the influence of the caches.
Do you know what caching software exists, in what versions, with what
sort of distribution, and in which configurations? No? Well, nobody
does; there is no requirement to disclose it (and the majority of
operators of such software are likely to regard the information as
confidential).
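To make the consequence concrete, here is a toy simulation of requests
passing through caches with unknown hit ratios, on routes that vary per
request. Every number in it is invented, which is rather the point: the
real values are exactly the unknowables at issue. The server log sees
only the requests that miss every cache on the way.

    # Toy simulation: intervening caches with unknown hit ratios make
    # server logs undercount clients by an unknowable factor. All of
    # the probabilities below are invented for illustration.
    import random

    random.seed(1)

    def reaches_server(route_caches):
        # A request is logged at the origin only if every cache on its
        # (transient, per-request) route misses.
        return all(random.random() > hit_ratio
                   for hit_ratio in route_caches)

    client_requests = 100_000
    logged = 0
    for _ in range(client_requests):
        # Each request may take a different route, with a different
        # number of caches, each with its own unknown hit ratio.
        route = [random.uniform(0.0, 0.9)
                 for _ in range(random.randint(0, 3))]
        if reaches_server(route):
            logged += 1

    print(f"client requests: {client_requests}, logged: {logged}")
    # The ratio logged/client_requests depends entirely on parameters
    # that the server operator cannot observe.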
And this is the nature of HTTP, layers of unknown influences sitting on
top of layers of unknown influences. The reality is that modelling the
Internet from server logs is going to be like trying to make a
mathematical model of a cloud, from the inside.
Incidentally, I like the notion of a "suitably motivated" statistician.
There are people selling, and people buying, browser usage statistics
that they maintain are statistically accurate, regardless of the
impossibility of acquiring such statistics (and without saying a word
as to how they overcome, or claim to have overcome, the issues). But in
a world where people are willing to exchange money for such statistics
maybe some are "suitably motivated" to produce numbers regardless. And
so long as those numbers correspond with the expectations of the people
paying, will their veracity ever be questioned? I am always reminded of
Hans Christian Andersen's "The Emperor's New Clothes".
> ... . The other half is whether applied mathematics
> can create a model of the system and accurately
> predict outcomes based on data collected.
You cannot deny that there are systems where mathematical modelling
cannot predict outcomes from data. You cannot predict the outcome of
the next dice roll from any number of observations of preceding dice
rolls, and chaos means that weather systems are no more than broadly
predictable, and then only over relatively short periods.
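The dice point is easy to demonstrate; a minimal sketch (the roll
counts are arbitrary):

    # Sketch: the distribution of the next die roll, conditioned on
    # any observed history, is the same uniform distribution.
    import random
    from collections import Counter

    random.seed(2)
    rolls = [random.randint(1, 6) for _ in range(60_000)]

    overall = Counter(rolls)
    after_sixes = Counter()
    for i in range(3, len(rolls)):
        if rolls[i - 3:i] == [6, 6, 6]:   # a run of three sixes
            after_sixes[rolls[i]] += 1

    total = sum(after_sixes.values())
    print({face: round(overall[face] / len(rolls), 3)
           for face in range(1, 7)})
    print({face: round(after_sixes[face] / total, 3)
           for face in range(1, 7)})
    # Both come out at roughly 1/6 per face: the observed history
    # tells you nothing about the next roll.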
> I do not doubt your knowledge of Internet systems,
> nor your ability to apply that to problems within
> your realm of expertise, but I find your lack of
> faith in statistical modeling disturbing...
<snip>
> > I think maybe you should do some research into HTTP before you
> > place too much faith in the applicability of statistical modelling
> > to it.
>
> ...so I'll bet you aren't a statistician.
I think maybe you should do some research into HTTP before you place too
much faith in the applicability of statistical modelling to it.
> No, it means you can't conceive a model that allows
> for them (the issues).
Who would be the best people to conceive a model that took the issues
into account? Wouldn't that be the HTTP experts who understand the
system? The people most certain that it cannot be done.
> Measurements made and analyzed without regard for
> errors inherent in the system will be useless,
Useless is exactly what they would be (though some may choose to employ
them regardless).
> but the fact that you claim intimate knowledge
> of those very errors means it is highly likely that
> an accurate measurement system can be devised.
What is being claimed is not intimate knowledge of those errors but the
knowledge that the factors influencing them are both unquantifiable and
significant.
> All that is required
All?
> is a properly configured web page
"web page"? Are we talking HTML then?
> that gets perhaps a few thousand hits per
> day from a suitably representative sample
> of the web surfer population.
"suitably representative" is a bit of a vague sampling criteria. But if
a requirement for gathering accurate client statistics is to determine
what a "suitably representative" sample would be, don't you need some
sort of accurate client statistics to work out what constitutes
representative?
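As for "properly configured", presumably that means a response that
instructs every intermediary not to cache it. A minimal sketch of the
headers involved follows (the handler itself is hypothetical; the
headers are the standard HTTP ones). Note that even these directives
are advisory: HTTP/1.0 caches and misconfigured proxies are free to
ignore them, which is rather the point.

    # Sketch: a response carrying the standard cache-defeating headers
    # a "properly configured" page would need. The handler is
    # hypothetical; compliance by intermediaries is not guaranteed.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class NoCacheHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = b"<html><body>counted</body></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            # HTTP/1.1 cache directives:
            self.send_header("Cache-Control",
                             "no-cache, no-store, must-revalidate")
            # HTTP/1.0 fallbacks:
            self.send_header("Pragma", "no-cache")
            self.send_header("Expires", "0")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), NoCacheHandler).serve_forever()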
But, assuming it will work, what is it exactly that you propose can be
learnt from these statistics?
Richard.