Tracking Users By IP Address

Fuzzyman

Ok, so this is really an HTTP question... but I'd far rather ask it
here! If I don't get an answer I'll have to try elsewhere, I guess
:-(

I'm developing a simple web user statistics program and I'd like to
gather information like number of pages a user views per visit. This
way I can distinguish between number of visits and number of page
views.

Assuming that each visitor will have a unique IP address is probably
not 100% accurate, but it should have a reasonably low margin of error!!
Is there any reason not to use this approach?

I could also set a cookie, but the script works by embedding some
javascript in each page. The javascript does an image fetch which
sends the information to another server. I believe setting cookies in
an image fetch from a different server than the main page request is
likely to be blocked by many people?
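
Roughly, the receiving end of that image fetch can be very small. A simplified
sketch (not the actual code - the file names and query parameters here are
invented) of a CGI endpoint that logs the hit and returns a 1x1 transparent GIF
sitting next to it on disk:

# Sketch only: the CGI endpoint behind the image fetch (say, track.py).
# The embedded javascript requests it as e.g. track.py?page=...&ref=...
# "pixel.gif" is assumed to be a 1x1 transparent GIF in the same directory.
import cgi, os, sys, time

form = cgi.FieldStorage()
with open("hits.log", "a") as log:
    log.write("%s\t%s\t%s\t%s\n" % (
        time.strftime("%Y-%m-%dT%H:%M:%S"),
        os.environ.get("REMOTE_ADDR", "-"),   # the IP address the server sees
        form.getfirst("page", "-"),           # page the javascript was embedded in
        form.getfirst("ref", "-")))           # document.referrer, if any

pixel = open("pixel.gif", "rb").read()
sys.stdout.write("Content-Type: image/gif\r\nContent-Length: %d\r\n\r\n" % len(pixel))
sys.stdout.flush()
sys.stdout.buffer.write(pixel)                # image bytes follow the headers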

Regards,

Fuzzy

http://www.voidspace.org.uk/pythonutils.html
 
Peter L Hansen

Fuzzyman said:
Ok so this is really an HTTP question... but I'd far rather ask it
here !! If I don't get an answer I'll have to try elsewhere I guess
:-(

I'm developing a simple web user statistics program and I'd like to
gather information like number of pages a user views per visit. This
way I can distinguish between number of visits and number of page
views.

Assuming that each visitor will have a unique IP address is probably
not 100% accurate but will have a reasonably low margin of error !! Is
there any reason not to use this approach ?

Depends on your user base, I guess. If I expected many visitors
from corporate environments, I'd not think this approach would
have a "reasonably low margin of error". Almost any such
environment will have a firewall and masquerade multiple users
as the same address.

If you used this approach but made sure to have a reasonably
short timeout before you discarded the address information, say
about ten minutes of "no access" from the address, then you
would be less likely to mix up multiple users from the same
site accessing your web pages in quick succession.
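
As a rough sketch of that idea in Python - assuming you've already parsed the
log into (timestamp, ip) pairs sorted by time; the names are invented:

TIMEOUT = 10 * 60   # seconds of inactivity before we call it a new visit

def count_visits(hits):
    # hits: iterable of (unix_timestamp, ip) pairs, sorted by time
    last_seen = {}   # ip -> timestamp of that ip's most recent hit
    visits = 0
    for when, ip in hits:
        if ip not in last_seen or when - last_seen[ip] > TIMEOUT:
            visits += 1          # first hit from this ip, or a long enough gap
        last_seen[ip] = when
    return visits

# count_visits([(0, "1.2.3.4"), (300, "1.2.3.4"), (2000, "1.2.3.4")]) == 2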

You could also try setting a cookie to help track them, just
within a session, and I suspect most corporate environments
still have basic cookies turned on...

-Peter
 
Richie Hindle

[Fuzzyman]
Assuming that each visitor will have a unique IP address is probably
not 100% accurate but will have a reasonably low margin of error !!
[Peter]
If I expected many visitors
from corporate environments, I'd not think this approach would
have a "reasonably low margin of error". Almost any such
environment will have a firewall and masquerade multiple users
as the same address.

Some (many? all? negligibly few? I don't know) proxy firewalls will tell
you which internal address the request came from, so you can combine the
two to get a unique address for that user.

http://xhaus.com/headers is a useful tool for checking your HTTP headers
(thanks, Alan!). From where I am at the moment, that tells me:

Remote IP Address: 213.38.135.130
X-Forwarded-For: 172.16.3.123

The combination of those IP addresses world-wide-uniquely identifies
this PC, even though it's behind a firewall.
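
In a CGI script that combination is just two environment variables; a small
sketch (the combined-key format here is my own invention, and of course
HTTP_X_FORWARDED_FOR only appears when the proxy chooses to send it, possibly
as a comma-separated chain):

import os

remote = os.environ.get("REMOTE_ADDR", "unknown")
forwarded = os.environ.get("HTTP_X_FORWARDED_FOR", "")   # may be absent, or a chain
visitor_key = "%s/%s" % (remote, forwarded) if forwarded else remote
# e.g. "213.38.135.130/172.16.3.123" for the request above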

I can't speak for the other billion PCs in similar situations. :cool:
 
Paul McNett

Richie said:
[Peter]
If I expected many visitors
from corporate environments, I'd not think this approach
would have a "reasonably low margin of error". Almost any
such environment will have a firewall and masquerade
multiple users as the same address.

Some (many? all? negligibly few? I don't know) proxy
firewalls will tell you which internal address the request
came from, so you can combine the two to get a unique address
for that user.

I'd say "negligibly few", since there really isn't good reason
to give the world your internal ip address. Anyway, you can't
count on it, and you wouldn't have a basis to know what
percentage weren't unique.

Cookies won't work for you?
 
Nick Craig-Wood

Paul McNett said:
I'd say "negligibly few", since there really isn't good reason
to give the world your internal ip address. Anyway, you can't
count on it, and you wouldn't have a basis to know what
percentage weren't unique.

An awful lot of proxies do exactly that (I've used this technique
before).

HTTP_X_FORWARDED_FOR and REMOTE_ADDR are the headers you want.

It's not safe to use this as a unique address for a user, but if you are
interested in exactly where your users are (for fraud-checking
purposes) it's a useful technique.
 
Paul Rubin

Fuzzyman said:
Assuming that each visitor will have a unique IP address is probably
not 100% accurate but will have a reasonably low margin of error !!

Nah, you can't rely on that at all. For example, all N zillion AOL
users come through only a few hundred different IP addresses. And
SOMETIMES you get those proxy headers other people have mentioned, but
not all that often.
 
Alan Kennedy

[Fuzzyman]
Assuming that each visitor will have a unique IP address is probably
not 100% accurate but will have a reasonably low margin of error !!

[Paul Rubin]
Nah, you can't rely on that at all. For example, all N zillion AOL
users come through only a few hundred different IP addresses. And
SOMETIMES you get those proxy headers other people have mentioned, but
not all that often.

That's probably the biggest problem, but I would also add the following
problems, which will further widen the error margin:

1. Modem and DSL users most often get dynamically allocated addresses,
which means that two entirely different users can get the same address
within seconds of each other: one disconnects, the other reconnects to
the same endpoint. If they both visit your web site, and you're tracking
only IPs, then you'll think they're both the same user. If your content
raises a lot of interest in a specific phone-dial-code area, for
example, this could seriously skew your data.

2. With corporate networks, local IP addresses (i.e. the ones you see in
X-FORWARDED-FOR headers, etc) tend to be handed out by DHCP-style
dynamic allocation. Which means that the head-of-IT's address this week
could be the receptionist's address next week.

regards,
 
Michael Foord

Paul McNett said:
Richie said:
[Peter]
If I expected many visitors
from corporate environments, I'd not think this approach
would have a "reasonably low margin of error". Almost any
such environment will have a firewall and masquerade
multiple users as the same address.

Some (many? all? negligibly few? I don't know) proxy
firewalls will tell you which internal address the request
came from, so you can combine the two to get a unique address
for that user.

I'd say "negligibly few", since there really isn't good reason
to give the world your internal ip address. Anyway, you can't
count on it, and you wouldn't have a basis to know what
percentage weren't unique.

Cookies won't work for you?

The sort of websites that will use this are going to be reasonably
low-traffic, 1000 hits a day or less. Bigger websites will find it
easier to generate user statistics from server logs. My script
duplicates the kind of thing that http://www.sitemeter.com does. In
this case the margin of error from using IP addresses will probably be
very low. *However*, cookies may still be a better solution, as it would
be really nice to distinguish first-time visitors from returning
visitors.

This means setting cookies with a really long expiry date though.....

Hmm... I think I'll just have to play around and see how they work.
Now that it comes to implementing it, my *real* problem is doing the data
analysis. I need a nice database engine that will answer questions
like 'how many visits from this referrer between the 1st and the 8th of
October?'.....
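
Maybe something like the sqlite3 module would do; a rough sketch of the sort
of thing I mean, with a made-up schema:

import sqlite3

db = sqlite3.connect("stats.db")
db.execute("""CREATE TABLE IF NOT EXISTS visits
              (day TEXT, ip TEXT, referrer TEXT, pages INTEGER)""")

def visits_from_referrer(referrer, start_day, end_day):
    # days stored as ISO strings ("2004-10-01") so BETWEEN compares correctly
    row = db.execute(
        "SELECT COUNT(*) FROM visits WHERE referrer = ? AND day BETWEEN ? AND ?",
        (referrer, start_day, end_day)).fetchone()
    return row[0]

# e.g. visits_from_referrer("http://www.google.com/", "2004-10-01", "2004-10-08")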

Oh, and a pure Python image library would be nice!! Looks like I'll
have to hit Google a bit. Thanks for your help everyone.

Regards,

Fuzzy
 
Grant Edwards

Fuzzyman wrote:
Assuming that each visitor will have a unique IP address is
probably not 100% accurate but will have a reasonably low
margin of error!!

Not. A significant portion (I'd guess a vast majority) of the
machines out there have either dynamically allocated IP
addresses or are behind a proxy or NAT firewall. Even if that
weren't true, you'd be keeping track of computers, not of
people. I assume it's the latter that matters to you?
Is there any reason not to use this approach?

Because it's going to be wrong most of the time?
 
Michael Sparks

Fuzzyman wrote:
....
Assuming that each visitor will have a unique IP address is probably
not 100% accurate but will have a reasonably low margin of error !! Is
there any reason not to use this approach ?

For this task things are worse than they sound:
* Many users are behind firewalls and/or proxies
* In the case of a simple NATting firewall all your accesses from offices
will be from a single IP (or small set of IPs).
* In the case of proxies that set either Client-IP or X-Forwarded-For
headers you can't guarantee that the IPs are passed through intact
(depending on the paranoia/privacy settings of the proxy)
* Even if you *can* see the IP, many people connect from systems where
their IP changes regularly - meaning you break your request streams, so
you don't get the full behaviour over time, and different people's
requests overlap.
* Many ISPs that generate large amounts of traffic proxy all their
traffic. Meaning you have millions of users from tens or hundreds of
IPs.
* The "My TiVO thinks I'm Gay" effect. PVRs often have settings to say "I
like", "I hate", which allows the PVR to determine the tastes of the
owner. In theory this sounds great, but because it assumes everyone in
the same household likes the same thing it jumps to the wrong
conclusions. If you consider homes with several people, you can't even
rely on accesses from the same _machine_ to come from the same user.
(If they share the same login there's nothing you can do obviously)
etc...

When you put all these together, using IPs _looks_ very bad.

What's an alternative? First a few points:
* Well, you want to track users. This means that under EU regs (based on your
email address these may apply to you) you have to let users of your site know
you're doing this.
* Cookies can get past a lot of the issues listed above
* You're only really interested in people who can reliably send you the
same cookie twice. (If they refuse cookies, you can't track them using
cookies obviously, and the above implies IP won't work brilliantly for
you)
* Relying on cookies also means you can allow your users to opt out from
being tracked.

One alternative: (pseudocode)

Receive request
if no-cookie-received:
    Set-Cookie: "NEWUSER"
else:
    if cookie-received == "NEWUSER":
        # We know they can send us cookies back
        id = gen-id()
        Set-Cookie: id

Then just log requests with the received cookie; trackable users will have
a unique id, whether their IP changes, they share a system, they're behind
NAT'ing firewalls, etc. This allows you to track unique users that are
trackable using cookies. If you have a particularly large number of users
accessing your site you can tie in sampling (perhaps something like
density-biased sampling) as well, something like this:

new-cookie = None
if no-cookie-received:
    new-cookie = "NEWUSER"
else:
    if cookie-received == "NEWUSER":
        # We know they can send us cookies back
        id = gen-id()
        new-cookie = id

if add-to-sample-set(request):
    tag = "SAMPLE"
    new-cookie = current-cookie or new-cookie
else:
    tag = "NOSAMPLE"

if new-cookie:
    Set-Cookie: tag new-cookie

(Or something like that IYSWIM - ie get the user population to indicate if
they're being sampled - again, this allows your users to easily opt out,
and also means the memory/etc required to determine whether to track the
user or not isn't dependent on the number of requests your site gets -
meaning that you can keep analysis costs for your site under control. If
you've only got a small site this probably doesn't matter to you, but
worth bearing in mind).

The interesting thing about this from my perspective is that if you do
take a cookie approach like this, it actually allows you to figure out how
much error there actually is between IP and cookie - rather than just guess.
The other nicety is it allows your users to opt-out very easily - since they
can either switch off cookies, or you can send them a "NOSAMPLE" cookie.

Also, at present comments in this thread revolve around "this isn't
reliable because of x,y and z". If you take this sort of approach you
can find out the margin of error and then decide whether you're happy
with it or not. Also as you can see from above this doesn't really have
to be a very complex operation (unless you're in a high volume scenario
with lots of distinct users and need to add in the sampling aspect).
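
If it helps, the basic (non-sampling) variant above maps onto a real CGI fairly
directly; a sketch only - the cookie name, lifetime and id scheme here are all
invented:

# Hand out "NEWUSER" first; upgrade to a unique id once the cookie comes back.
import os, sys, uuid
from http import cookies

jar = cookies.SimpleCookie(os.environ.get("HTTP_COOKIE", ""))
current = jar["track"].value if "track" in jar else None

headers = ["Content-Type: text/plain"]
if current is None:
    headers.append("Set-Cookie: track=NEWUSER; Max-Age=31536000; Path=/")
elif current == "NEWUSER":
    # They sent the cookie back, so they're trackable: issue a real id.
    headers.append("Set-Cookie: track=%s; Max-Age=31536000; Path=/" % uuid.uuid4().hex)

sys.stdout.write("\r\n".join(headers) + "\r\n\r\n")
sys.stdout.write("cookie seen: %s\n" % (current or "none"))
# Log REMOTE_ADDR alongside the cookie value and you can later compare
# "visits by IP" against "visits by cookie id" to measure the error margin.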

Best Regards,


Michael
 
Bryan Olson

Nick said:
>
> An awful lot of proxies do do exactly that (I used this technique
> before).
>
> HTTP_X_FORWARDED_FOR and REMOTE_ADDR are the headers you want

Many open proxies do that to avoid being bad net-citizens. It
is not common practice for NAT gateways and enterprise proxies.
> Its not safe to use this as a unique address for a user but if
> you are interested in exactly where your users are (for fraud
> checking purposes) its a useful technique.

It can detect fraud, but there are enough transparent proxies
that it cannot detect absence of fraud.
 
Michael Ströder

Grant said:
Not. A significant portion (I'd guess a vast majority) of the
machines out there have either dynamically allocated IP
addresses or are behind a proxy or NAT firewall.

Even worse: HTTP proxy networks (e.g. AOL) will make hits in one session
from the very same user appear with *different* IP addresses.

Ciao, Michael.
 
Peter L Hansen

Alan said:
[Fuzzyman <[email protected]>]
That's probably the biggest problem, but I would also add the following
other problems which will further widen the error margin:-

2. With corporate networks, local IP addresses (i.e. the ones you see in
X-FORWARDED-FOR headers, etc) tend to be handed out by DHCP-style
dynamic allocation. Which means that the head-of-IT's address this week
could be the receptionist's address next week.

On the other hand, in a DHCP environment where machines are
typically turned off overnight but all on during the day, and where
you aren't running with zero spare addresses, most machines will be given
the same address as they had the day before when the DHCP server processes
the request... but your point is valid; it will increase the error
margin.

Cookies are the way to go.

-Peter
 
Michael Foord

[thanks but snip.. ;-) ]
One alternative: (pseudocode)

Receive request
if no-cookie-received:
    Set-Cookie: "NEWUSER"
else:
    if cookie-received == "NEWUSER":
        # We know they can send us cookies back
        id = gen-id()
        Set-Cookie: id

Yep.. this I understand and will try.
Thanks
Then just log requests with the received cookie; trackable users will have
a unique id, whether their IP changes, they share a system, they're behind
NAT'ing firewalls, etc. This allows you to track unique users that are
trackable using cookies. If you have a particularly large number of users
accessing your site you can tie in sampling (perhaps something like
density-biased sampling) as well, something like this:

new-cookie = None
if no-cookie-received:
    new-cookie = "NEWUSER"
else:
    if cookie-received == "NEWUSER":
        # We know they can send us cookies back
        id = gen-id()
        new-cookie = id

if add-to-sample-set(request):
    tag = "SAMPLE"
    new-cookie = current-cookie or new-cookie
else:
    tag = "NOSAMPLE"

if new-cookie:
    Set-Cookie: tag new-cookie


Sorry... :-( I don't get it.
What is add-to-sample-set(request) doing? Is it simply choosing a
proportion of our users to sample?

If this is only a 'do it if you have too many users' kind of thing then
unfortunately it won't be a problem for me!!
(Or something like that IYSWIM - ie get the user population to indicate if
they're being sampled - again, this allows your users to easily opt out,

As above... I don't get it, so I don't see how it achieves this?
and also means the memory/etc required to determine whether to track the
user or not isn't dependent on the number of requests your site gets -
meaning that you can keep analysis costs for your site under control. If
you've only got a small site this probably doesn't matter to you, but
worth bearing in mind).

The interesting thing about this from my perspective is that if you do
take a cookie approach like this, it actually allows you to figure out how
much error there actually is between IP and cookie - rather than just guess.

One last question. You didn't explicitly say this, but I was thinking
of doing it anyway. Are you suggesting storing the USERID *and* the IP
address and comparing the results of analysing by IP and analysing by
cookie.... Sounds worthwhile...

Thanks for your help - very interesting.

Regards,

Fuzzy
 
Paul Rubin

Michael Sparks said:
Suppose the following user groups:
* Refuse all cookies - can't use cookies to track, IP isn't 100%
reliable.
* Users accept all cookies, don't care - ideal candidates for sampling
* Users generally have cookies enabled, but don't like being tracked.

There are some users who block cookies, but they're rare enough that
using cookies is more reliable than using IP addresses. Also, if
you're just looking at page views, you don't need persistent cookies,
just session cookies.

Another thing you can do is put a cookie-like tag in the actual URL
(visit Amazon and click around a few times to see how that's done).
But that's not reliable either. It's especially bad when some
retailer puts a shopping cart ID into the URL. Then some user posts
to a BBS or newsgroup, "hey I got this great deal on a <whatever> at
<url>". Other users click on the URL and get the original user's
shopping cart.
 
Michael Sparks

Michael said:
[thanks but snip.. ;-) ] ...
Sorry... :-( don't get it.
What is add-to-sample-set(request) doing ? Is it simply choosing a
proportion of our users to sample ?

If this is only a 'do if you have too many users' kind-of-thing then
unfortunately it won't be a problem for me !!

It's precisely that. If you don't think it'll be an issue for you then I'll
just leave things as is :) (If anyone's curious as to what I mean though
I'll be happy to expand)
As above... I don't get it, so I don't see how it achieves this ?

Suppose the following user groups:
* Refuse all cookies - can't use cookies to track, IP isn't 100%
reliable.
* Users accept all cookies, don't care - ideal candidates for sampling
* Users generally have cookies enabled, but don't like being tracked.
* In this scenario you can have a "click on this image to cease being
tracked" picture for them to click on, which sets a cookie that
effectively says "don't track me". It does rely on them trusting you
not to track them, but if you send them all the same cookie value
(eg "NOTRACK") it should be obvious to them that the cookie is
useless for tracking.

That way you get 3 types of cookies:
* NEWUSER
* specific trackable id value
* NOTRACK

It's a little thing you can add on after the fact which I think is
quite nice.
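
The opt-out target itself can be trivial; a sketch only, reusing the invented
cookie name from the earlier sketch:

# Everyone who opts out gets the identical value, so the cookie is
# useless for telling users apart.
import sys

sys.stdout.write("Content-Type: text/plain\r\n")
sys.stdout.write("Set-Cookie: track=NOTRACK; Max-Age=31536000; Path=/\r\n\r\n")
sys.stdout.write("You will no longer be tracked on this site.\n")
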
One last question. You didn't explicitly say this, but I was thinking
of doing it anyway. Are you suggesting to store USERID *and* IP
address and ...

Most webservers allow you to define custom logging formats, or at minimum
some extended formats. There is a defined format for example that will
include in the log any cookie value received along with the usual details.
You can then use standard tools to analyse the log after the fact. By
noting which log lines are new users "NEWUSER" (and hence not really
trackable) and people who don't want to be tracked "NOTRACK", you can
exclude these lines and pass the thing to tools like analog, and more
sophisticated tools designed to follow user paths through a site.
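
For example, if the cookie value is logged as the last field on each line (an
assumption - adjust to your own log format), the pre-filter before handing the
file to analog could be as small as:

# Drop lines from untrackable or opted-out users before further analysis.
import sys

for line in sys.stdin:
    fields = line.split()
    cookie = fields[-1] if fields else "-"
    if cookie not in ("NEWUSER", "NOTRACK", "-"):
        sys.stdout.write(line)

# usage (script name invented):
#   python filter_tracked.py < access_with_cookies.log > trackable.log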

Obviously such things become harder depending on how much traffic your site
sees... (Which was why I was putting pointers to sampling earlier - it's
non-trivial to get "right" in many respects :)

The advantage of taking the "just let the system log the details" approach is
that it's simple to do, you can use standard tools, and you can choose whether
to analyse using IPs or cookies at your leisure.
compare the results of analysing by IP and analysing by
cookie.... Sounds worthwhile...

It does. I'm not really aware of anyone who has actively attempted to follow
user trails through sites based on IP and then repeated the same analysis
using cookies in this kind of way. I _suspect_ the results would be very
different for small vs large sites, and would potentially even depend on how
niche a website is.

After all if you find that the margin of error is acceptable and
(preferably) predictable, you can just choose the computationally
cheaper option.
Thanks for your help - very interesting.

No problem.

Best Regards,


Michael.
 
Michael Sparks

Paul said:
There are some users who block cookies, but they're rare enough that
using cookies is more reliable than using IP addresses. Also, if
you're just looking at page views, you don't need persistent cookies,
just session cookies.

The OP was talking about tracking users, so _personally_ I would suggest
persistent cookies of some sort. I'm not a fan of tracking at all myself,
which is why I tend to suggest to people the idea of an opt-out/opt-in
mechanism like the one I suggested - although IMO the fact that you only send
these things to people you know can handle cookies correctly helps
somewhat. :) Also, given that the OP has a UK email address, if he is tracking
users by IP he has to put up a notice on his website that users can find
very easily, informing them that this is the case. (EU regulation)

Regards,


Michael.
 
