Facile user-agent statistics tool

Robert Manea

Hello,

ever wondered how many different user agents are being used in a
newsgroup or even a whole Usenet hierarchy, but didn't want to use
'the other' tools because they base their statistics on the wrong
criteria?

No?

Well, I did, and came up with the following.

Suggestions, improvements, corrections and the like are highly welcome.


---8<---------------------------------------------------------------8<---

Example output for comp.lang.perl.misc:

1. Microsoft Outlook Express : 81
2. Mozilla : 35
3. Mozilla Thunderbird : 33
4. G2 : 32
5. slrn : 25
6. KNode : 21
7. Gnus : 19
8. Pan : 17
9. Forte Agent : 15
10. tin : 12
[...]

Summary:
- 2141 postings in total
- 30 different newsreaders in 333 distinct postings
- Average of 5.574 articles per poster (with agent header)
- 285 without User Agent header

---8<---------------------------------------------------------------8<---


The code:

#!/bin/perl -w
#
# (c) 2004 by Robert Manea
#
# Retrieve 'User-Agent' headers from Usenet postings and display how
# common each newsreader is. Only distinct posters count, i.e. all
# postings with the same email address count as a single occurrence
# of the corresponding newsreader.
#
# Caveats: Postings with the same email address but different user
# agents can't be distinguished correctly

use strict;
use warnings;
use File::Find;

die "Usage: $0 <path/to/newsspool>\n" unless @ARGV;

our ( %agents, %emails );
our $cnt_file = 0;
our $no_agent = 0;

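# no_chdir keeps the current directory fixed, so a relative path in
# $File::Find::name stays valid for -f and open inside wanted()
find( { wanted => \&wanted, no_chdir => 1 }, $ARGV[0] );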

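# Tiger Woods himself ;) (golfed: a bitwise OR of strings is as long as
# its longest operand, which gives the length of the longest agent name)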
my $max_len = '';
$max_len |= $_ foreach keys %agents;
$max_len = 1 + length $max_len;

my ( $cnt_reader, $cnt_articles ) = ( 0, 0 );
for ( sort { $agents{$b} <=> $agents{$a} } keys %agents ) {
    my $len = $max_len - length $_;
    printf "%3d. %s %*s: %d\n", ++$cnt_reader, $_, $len, ' ', $agents{$_};

    $cnt_articles += $agents{$_};
}

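# Guard against division by zero when no posting carried an agent header
my $w_agent_avg = $cnt_articles
    ? sprintf( "%.3f", ( $cnt_file - $no_agent ) / $cnt_articles )
    : 'n/a';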

print << "EOF";

Summary:
- $cnt_file postings in total
- $cnt_reader different newsreaders in $cnt_articles distinct postings
- Average of $w_agent_avg articles per poster (with agent header)
- $no_agent without User Agent header

EOF

sub wanted {
    my $agent_header = qr/^User-Agent:|^X-User-Agent:|^X-Newsreader:|^X-Mailer:/;
    my $from_header = qr/From:.*?([A-Za-z0-9\.]+@[A-Za-z0-9\.]+)\s*.*/;
    my $file_name = $File::Find::name;

    # Skip directories and anything below a dot file or dot directory
    if ( -f $file_name && $file_name !~ /\/\./ ) {
        open my $fh, '<', $file_name
            or ( warn "Cannot open $file_name: $!" and return 0 );
        ++$cnt_file;

        my ( $email, $reader );
        while (<$fh>) {
            chomp;

            if (/$from_header/) {
                $email = $1;
            }
            elsif (/$agent_header/) {
                # TODO: Faster general approach to determine the
                # newsreader
                my $raw_agent_str = ( split /: / )[1];
                next unless defined $raw_agent_str;    # empty header value
                ( $reader = ( split /\//, $raw_agent_str )[0] ) =~
                    s/( [A-Za-z]*\.*\d+\.\d+.*$)|(\(*?\[+?.*$)|(\[*?\(+?.*$)//;
            }
            elsif ( $email && $reader ) {
                # Both headers seen: count the newsreader once per address
                if ( !$emails{$email} ) {
                    $agents{$reader}++;
                    $emails{$email}++;
                }
                last;
            }
            elsif (/^$/) {    # Parse only header lines
                ++$no_agent;
                last;
            }
        }
        close $fh;    # only close what was actually opened
    }

    1;
}

__END__


Thanks & Greets, Rob
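
A possible way around the caveat noted in the script's header (same address, different user agents) would be to key the "already seen" table on the address/agent pair instead of the address alone. This is only a sketch: count_pairs() and its input format are made up for illustration and are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of [ $email, $reader ] pairs and
# counts each distinct pair once, so one poster who uses two
# newsreaders contributes to both of them.
sub count_pairs {
    my @postings = @_;
    my ( %seen, %agents );
    for my $posting (@postings) {
        my ( $email, $reader ) = @$posting;
        next if $seen{ $email . "\0" . $reader }++;    # pair already counted
        $agents{$reader}++;
    }
    return \%agents;
}

my $agents = count_pairs(
    [ 'rob@example.org', 'slrn' ],
    [ 'rob@example.org', 'slrn' ],    # same pair, ignored
    [ 'rob@example.org', 'Gnus' ],    # same poster, other agent
    [ 'ann@example.org', 'slrn' ],
);
printf "%s: %d\n", $_, $agents->{$_} for sort keys %$agents;
```

The trade-off is that a poster who switches newsreaders is then counted once per newsreader, which may or may not be what the statistics should express.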
 
AD

Two small things.
1. my $from_header =
qr/From:.*?([A-Za-z0-9\.]+@[A-Za-z0-9\.]+)\s*.*/;
This won't catch email addresses having '_', or '-'.
More importantly see $perldoc -q 'mail address'

2. if ( -f $file_name && $file_name !~ /\/\./ ) {
Writing a regular expression like "$file_name !~ /\/\./" obfuscates
the code. You can use other delimiters, such as #.
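
To illustrate the second point, here is the same test written with alternative delimiters, so the slash no longer needs a backslash. A minimal sketch; the function names and sample paths are made up for the comparison.

```perl
use strict;
use warnings;

# Original form: every slash inside /.../ has to be escaped.
sub is_hidden_escaped { return $_[0] =~ /\/\./ ? 1 : 0 }

# Same pattern with m#...# delimiters: only the dot keeps its backslash.
sub is_hidden_hash { return $_[0] =~ m#/\.# ? 1 : 0 }

# Same pattern again with m{...}, which many find easiest to read.
sub is_hidden_braces { return $_[0] =~ m{/\.} ? 1 : 0 }

for my $path ( '/var/spool/news/comp/lang/perl/misc/1234',
               '/var/spool/news/.overview' ) {
    printf "%-45s %d %d %d\n", $path,
        is_hidden_escaped($path), is_hidden_hash($path),
        is_hidden_braces($path);
}
```

All three functions apply the identical pattern; only the delimiters differ.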
 
Christopher Nehren

Summary:
- 2141 postings in total
^^^^
... What? That's a ridiculously small sampling of posts. My news server
(news.individual.net, it's free) provides 15k articles for clpm, and
those articles are numbered between approximately 530k and 550k. One of
the most basic rules of statistics is to take a reasonably-sized sample.
Statistics based upon a small sampling represent exactly that: a small
sample of what's available.

Best Regards,
Christopher Nehren
 
Christopher Nehren

Two small things.
1. my $from_header =
qr/From:.*?([A-Za-z0-9\.]+@[A-Za-z0-9\.]+)\s*.*/;
This won't catch email addresses having '_', or '-'.
More importantly see $perldoc -q 'mail address'

Or + characters, like mine.

Best Regards,
Christopher Nehren
 
A. Sinan Unur

---8<---------------------------------------------------------------8<---

Example output for comp.lang.perl.misc:

1. Microsoft Outlook Express : 81
2. Mozilla : 35
3. Mozilla Thunderbird : 33
4. G2 : 32
5. slrn : 25
6. KNode : 21
7. Gnus : 19
8. Pan : 17
9. Forte Agent : 15
10. tin : 12
[...]

Am I the only one who uses XNews?
 
A. Sinan Unur

^^^^
... What? That's a ridiculously small sampling of posts. My news server
(news.individual.net, it's free) provides 15k articles for clpm, and
those articles are numbered between approximately 530k and 550k. One of
the most basic rules of statistics is to take a reasonably-sized
sample. Statistics based upon a small sampling represent exactly that: a
small sample of what's available.


Actually, 2000 postings out of 20000 is a very large sample. The crucial
factor in determining whether a sample is usable or not is whether it was
randomly selected, not its size.
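
As a rough illustration of the size argument, the following computes the textbook 95% margin of error for a proportion, with finite population correction. It assumes a simple random sample, which a news spool does not necessarily provide (that is exactly the caveat above). The function name and the worst-case proportion of 0.5 are choices made for this sketch.

```perl
use strict;
use warnings;

# Approximate 95% margin of error for a proportion $p estimated from a
# simple random sample of $n items out of a population of $N items.
sub margin_of_error {
    my ( $n, $N, $p ) = @_;
    my $se  = sqrt( $p * ( 1 - $p ) / $n );        # standard error
    my $fpc = sqrt( ( $N - $n ) / ( $N - 1 ) );    # finite population correction
    return 1.96 * $se * $fpc;
}

printf "n = %5d: +/- %.1f%%\n", $_, 100 * margin_of_error( $_, 20_000, 0.5 )
    for 200, 2_000;
```

Under these assumptions, 2000 randomly chosen postings out of 20000 give a margin of about two percentage points; none of this helps if the selection is not random.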

http://college.hmco.com/history/readerscomp/rcah/html/ah_071900_publicopinio.htm
 
Robert Manea

Segfault in module "A. Sinan Unur" - dump details are as follows:
Am I the only one who uses XNews?

1. Microsoft Outlook Express : 82
2. Mozilla : 35
3. G2 : 35
4. Mozilla Thunderbird : 33
5. slrn : 27
6. KNode : 21
7. Gnus : 19
8. Pan : 17
9. Forte Agent : 15
10. tin : 12
11. Xnews : 7 <--- No, you aren't. :)
12. trn : 6
13. MT-NewsWatcher : 5
14. nn : 4
15. Opera M2 : 4
16. 40tude_Dialog : 2
17. Thoth : 2
18. MicroPlanet-Gravity : 2
19. Microsoft-Entourage : 1
20. vBulletin USENET gateway : 1
21. ProNews : 1
22. trn : 1
23. Forte Free Agent : 1
24. hidden : 1
25. xrn : 1
26. Messenger-Pro : 1
27. Thunderbird : 1
28. Sylpheed version : 1
29. NewsLeecher : 1
30. Sylpheed-Claws : 1

Summary:
- 2181 postings in total
- 30 different newsreaders in 340 distinct postings
- Average of 5.582 articles per poster (with agent header)
- 283 without User Agent header


Done with a slightly fixed $from_header regex.
my $from_header = qr/From:.*?([A-Za-z0-9\._\-\+]+@[A-Za-z0-9\._\-\+]+)\s*.*/;

Since I want to catch 'slightly' non-conforming email addresses as well,
a standards-compliant parser would be rather counterproductive to the
quality of the results, I guess.

Anyway, the sole purpose of the address is to serve as a unique
identifier for each poster, so compliance with standards isn't really
necessary in this case.
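
A quick check of the revised pattern against a few From lines might look like the following. The pattern is the one quoted above; the sample addresses and the poster_id() wrapper are invented for the test.

```perl
use strict;
use warnings;

my $from_header = qr/From:.*?([A-Za-z0-9\._\-\+]+@[A-Za-z0-9\._\-\+]+)\s*.*/;

# Hypothetical wrapper: returns the address used as the poster's
# identifier, or undef if the line contains nothing address-like.
sub poster_id {
    my ($line) = @_;
    return $line =~ $from_header ? $1 : undef;
}

for my $line (
    'From: Jane Doe <jane_doe@example.org>',
    'From: some-one+news@mail.example.com (Some One)',
    'From: nobody',
) {
    my $id = poster_id($line);
    print defined $id ? "$id\n" : "(no address found)\n";
}
```

Addresses with '_', '-' and '+' are now captured whole, while a From line without an address yields no identifier at all.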


Greets, Rob
 
Christopher Nehren

On 2004-12-09, Robert Manea scribbled these
curious markings:

[...]
4. Mozilla Thunderbird : 33
[...]

12. trn : 6
[...]

22. trn : 1
27. Thunderbird : 1
28. Sylpheed version : 1
[...]

30. Sylpheed-Claws : 1

I hope that you like dealing with inconsistencies. :)

Best Regards,
Christopher Nehren
 
Robert Manea

Segfault in module "Christopher Nehren" - dump details are as follows:
On 2004-12-09, Robert Manea scribbled these
curious markings:
4. Mozilla Thunderbird : 33
12. trn : 6
22. trn : 1
27. Thunderbird : 1
28. Sylpheed version : 1
30. Sylpheed-Claws : 1
I hope that you like dealing with inconsistencies. :)

Well, since the format of the 'User-Agent' header conforms to absolutely
no standard or RFC, one has to compromise.

I understand that this piece of information is the most important one in
my program, but I really can't come up with anything significantly better.

Maybe you have a better regex or method to extract it? (see TODO)

( Some kind of comparison function which detects common words in
different strings would be a solution, but still not the right
one.

E.g.: 'Mozilla Thunderbird' and 'Thunderbird' should be the same
but 'Mozilla' and 'Mozilla Thunderbird' must not.

How could one solve this without hardcoding the names? )
Best Regards,
Christopher Nehren

Greets, Rob
 
Anno Siegel

Robert Manea said:
Segfault in module "Christopher Nehren" - dump details are as follows:
On 2004-12-09, Robert Manea scribbled these
curious markings:
4. Mozilla Thunderbird : 33
12. trn : 6
22. trn : 1
27. Thunderbird : 1
28. Sylpheed version : 1
30. Sylpheed-Claws : 1
I hope that you like dealing with inconsistencies. :)

Well, since the format of the 'User-Agent' header conforms to absolutely
no standard or RFC, one has to compromise.

I understand that this piece of information is the most important one in
my program, but I really can't come up with anything significantly better.

Maybe you have a better regex or method to extract it? (see TODO)

( Some kind of comparison function which detects common words in
different strings would be a solution, but still not the right
one.

E.g.: 'Mozilla Thunderbird' and 'Thunderbird' should be the same
but 'Mozilla' and 'Mozilla Thunderbird' must not.

How could one solve this without hardcoding the names? )

Use a (set of) rule(s) that gets most cases right and hold exceptions
in a config file. That ought to be maintainable -- if you really need
that kind of precision.
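
One way to read this suggestion, as a rough sketch: a general rule strips version numbers and trailing parentheses, and a small exceptions table maps the stragglers onto canonical names. Here an in-code hash stands in for the config file. The rule, the table entries and the function name are illustrative assumptions and have not been run against a real spool.

```perl
use strict;
use warnings;

# Stand-in for the config file: names the general rule gets wrong.
my %exception = (
    'Thunderbird'      => 'Mozilla Thunderbird',
    'Sylpheed version' => 'Sylpheed',
);

sub normalize_agent {
    my ($raw) = @_;
    ( my $name = $raw ) =~ s{/.*$}{};       # drop everything from the first "/"
    $name =~ s/\s*[(\[].*$//;                # drop a "(...)" or "[...]" tail
    $name =~ s/\s+v?\d+(?:\.\d+)+.*$//i;     # drop a " 1.2.3 ..." tail
    $name =~ s/^\s+|\s+$//g;                 # trim surrounding whitespace
    return exists $exception{$name} ? $exception{$name} : $name;
}

print normalize_agent($_), "\n" for
    'slrn/0.9.8.1 (Linux)',
    'Thunderbird 0.9 (Windows/20041103)',
    'Mozilla Thunderbird 0.9 (X11/20041103)',
    'Mozilla/5.0 (X11; U; Linux i686)';
```

With this split, 'Thunderbird' and 'Mozilla Thunderbird' end up in the same bucket while plain 'Mozilla' stays separate, and only the exceptions need maintaining by hand.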

Anno
 
Anno Siegel

Robert Manea said:
Segfault in module "Christopher Nehren" - dump details are as follows:
On 2004-12-09, Robert Manea scribbled these
curious markings:
4. Mozilla Thunderbird : 33
12. trn : 6

Ah, so I'm not the only trn user here...

...or am I?
27. Thunderbird : 1
28. Sylpheed version : 1
30. Sylpheed-Claws : 1
I hope that you like dealing with inconsistencies. :)

Anno
 
