Facile user-agent statistics tool

Discussion in 'Perl Misc' started by Robert Manea, Dec 9, 2004.

  1. Robert Manea

    Robert Manea Guest

    Hello,

    Ever wondered how many different user agents are being used in a
    newsgroup or even a whole Usenet hierarchy, but didn't want to use
    'the other' tools because they base their statistics on the wrong
    criteria?

    No?

    Well, I did and came up with the following.

    Suggestions, improvements, corrections and the like are highly welcome.


    ---8<---------------------------------------------------------------8<---

    Example output for comp.lang.perl.misc:

    1. Microsoft Outlook Express : 81
    2. Mozilla : 35
    3. Mozilla Thunderbird : 33
    4. G2 : 32
    5. slrn : 25
    6. KNode : 21
    7. Gnus : 19
    8. Pan : 17
    9. Forte Agent : 15
    10. tin : 12
    [...]

    Summary:
    - 2141 postings in total
    - 30 different newsreaders in 333 distinct postings
    - Average of 5.574 articles per poster (with agent header)
    - 285 without User Agent header

    ---8<---------------------------------------------------------------8<---


    The code:

    #!/bin/perl -w
    #
    # (c) 2004 by Robert Manea
    #
    # Retrieve 'User-Agent' headers from Usenet postings and display how
    # common each newsreader is. Only distinct postings count, i.e. all
    # postings with the same email address count as a single occurrence
    # of the corresponding newsreader.
    #
    # Caveats: Postings with the same email address but different user
    # agents can't be distinguished correctly

    use strict;
    use warnings;
    use File::Find;

    die "Usage: $0 <path/to/newsspool>" unless $ARGV[0];

    our ( %agents, %emails );
    our $cnt_file = 0;
    our $no_agent = 0;

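    # no_chdir keeps the working directory fixed, so a relative path in
    # $File::Find::name can be opened directly inside wanted().
    find( { wanted => \&wanted, no_chdir => 1 }, $ARGV[0] );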

    # Tiger Woods himself ;)
    my $max_len = '';
    $max_len |= $_ foreach keys %agents;
    $max_len = 1 + length $max_len;

    my ( $cnt_reader, $cnt_articles ) = ( 0, 0 );
    for ( sort { $agents{$b} <=> $agents{$a} } keys %agents ) {
        my $len = $max_len - length $_;
        printf "%3d. %s %*s: %d\n", ++$cnt_reader, $_, $len, ' ', $agents{$_};

        $cnt_articles += $agents{$_};
    }

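    # Guard against division by zero when no agent header was found at all.
    my $w_agent_avg = $cnt_articles
        ? sprintf( "%.3f", ( $cnt_file - $no_agent ) / $cnt_articles )
        : 'n/a';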

    print << "EOF";

    Summary:
    - $cnt_file postings in total
    - $cnt_reader different newsreaders in $cnt_articles distinct postings
    - Average of $w_agent_avg articles per poster (with agent header)
    - $no_agent without User Agent header

    EOF

    sub wanted {
        my $agent_header = qr/^User-Agent:|^X-User-Agent:|^X-Newsreader:|^X-Mailer:/;
        my $from_header = qr/From:.*?([A-Za-z0-9\.]+@[A-Za-z0-9\.]+)\s*.*/;
        my $file_name = $File::Find::name;

        if ( -f $file_name && $file_name !~ /\/\./ ) {
            open FH, "<$file_name" or ( warn "Cannot open $file_name: $!" and return 0);
            ++$cnt_file;

            my ( $email, $reader );
            while (<FH>) {
                chomp;

                if (/$from_header/) {
                    $email = $1;
                }
                elsif (/$agent_header/) {
                    # TODO: Faster general approach to determine the
                    # newsreader
                    my $raw_agent_str = ( split /: / )[1];
                    ( $reader = ( split /\//, $raw_agent_str )[0] ) =~
                        s/( [A-Za-z]*\.*\d+\.\d+.*$)|(\(*?\[+?.*$)|(\[*?\(+?.*$)//o;
                }
                elsif ( $email && $reader ) {
                    if ( !$emails{$email} ) {
                        $agents{$reader}++;
                        $emails{$email}++;
                    }
                    last;
                }
                elsif (/^$/) { # Parse only header lines
                    ++$no_agent;
                    last;
                }
            }
        }
        close FH;

        1;
    }

    __END__


    Thanks & Greets, Rob

    --
    The Enterprise meets God, and it's a child, a computer, or a C program.
     
    Robert Manea, Dec 9, 2004
    #1

  2. AD

    AD Guest

    Two small things.
    1. my $from_header =
    qr/From:.*?([A-Za-z0-9\.]+@[A-Za-z0-9\.]+)\s*.*/;
    This won't catch email addresses containing '_' or '-'.
    More importantly, see perldoc -q 'mail address'.

    2. if ( -f $file_name && $file_name !~ /\/\./ ) {
    Writing a regular expression like "$file_name !~ /\/\./" obfuscates
    the code. You can use other delimiters, such as #.
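
To illustrate the second point: the same dot-file test can be written with a different delimiter, so the slash no longer needs escaping. This is only a minimal sketch; the helper name `is_hidden_path` and the sample paths are made up for the example and are not part of the posted script.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the path contains a "/." sequence, i.e. some
# component after a slash starts with a dot. m{...} is equivalent to /\/\./
# but needs no backslash in front of the slash.
sub is_hidden_path {
    my ($path) = @_;
    return $path =~ m{/\.} ? 1 : 0;
}

# Made-up sample paths for illustration.
for my $p ( '/var/spool/news/comp/lang/perl/misc/1234',
            '/var/spool/news/.overview' ) {
    printf "%-45s %s\n", $p, is_hidden_path($p) ? 'skip' : 'scan';
}
```

Braces are used here instead of `#` so that the pattern cannot be mistaken for a comment, but any delimiter that does not occur in the pattern would do.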
     
    AD, Dec 9, 2004
    #2

  3. On 2004-12-09, Robert Manea scribbled these
    curious markings:
    > Summary:
    > - 2141 postings in total

    ^^^^
    ... What? That's a ridiculously small sampling of posts. My news server
    (news.individual.net, it's free) provides 15k articles for clpm, and
    those articles are numbered between approximately 530k and 550k. One of
    the most basic rules of statistics is to take a reasonably-sized sample.
    Statistics based upon a small sampling represent exactly that: a small
    sample of what's available.

    Best Regards,
    Christopher Nehren
    --
    I abhor a system designed for the "user", if that word is a coded
    pejorative meaning "stupid and unsophisticated". -- Ken Thompson
    If you ask the wrong questions, you get answers like "42" and "God".
    Unix is user friendly. However, it isn't idiot friendly.
     
    Christopher Nehren, Dec 9, 2004
    #3
  4. On 2004-12-09, AD scribbled these
    curious markings:
    > Two small things.
    > 1. my $from_header =
    > qr/From:.*?([A-Za-z0-9\.]+@[A-Za-z0-9\.]+)\s*.*/;
    > This won't catch email addresses having '_', or '-'.
    > More importantly see $perldoc -q 'mail address'


    Or + characters, like mine.

    Best Regards,
    Christopher Nehren
    --
    I abhor a system designed for the "user", if that word is a coded
    pejorative meaning "stupid and unsophisticated". -- Ken Thompson
    If you ask the wrong questions, you get answers like "42" and "God".
    Unix is user friendly. However, it isn't idiot friendly.
     
    Christopher Nehren, Dec 9, 2004
    #4
  5. Robert Manea wrote:

    > ---8<---------------------------------------------------------------8<---
    >
    > Example output for comp.lang.perl.misc:
    >
    > 1. Microsoft Outlook Express : 81
    > 2. Mozilla : 35
    > 3. Mozilla Thunderbird : 33
    > 4. G2 : 32
    > 5. slrn : 25
    > 6. KNode : 21
    > 7. Gnus : 19
    > 8. Pan : 17
    > 9. Forte Agent : 15
    > 10. tin : 12
    > [...]


    Am I the only one who uses XNews?

    --
    A. Sinan Unur
    d
    (remove '.invalid' and reverse each component for email address)
     
    A. Sinan Unur, Dec 9, 2004
    #5
  6. Christopher Nehren wrote:

    > On 2004-12-09, Robert Manea scribbled these
    > curious markings:
    >> Summary:
    >> - 2141 postings in total

    > ^^^^
    > ... What? That's a ridiculously small sampling of posts. My news server
    > (news.individual.net, it's free) provides 15k articles for clpm, and
    > those articles are numbered between approximately 530k and 550k. One of
    > the most basic rules of statistics is to take a reasonably-sized
    > sample. Statistics based upon a small sampling represent exactly that: a
    > small sample of what's available.



    Actually, 2000 postings out of 20000 is a very large sample. The crucial
    factor in determining whether a sample is usable or not is whether it was
    randomly selected, not its size.

    http://college.hmco.com/history/readerscomp/rcah/html/ah_071900_publicopinio.htm
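
As a rough illustration of why 2000 out of 20000 counts as a large sample, provided it was drawn at random: the sketch below computes a 95% margin of error for an observed share, using the usual normal approximation with a finite population correction. The function name `margin_of_error` and the 25% share are assumptions made up for the example; only the 2000 and 20000 come from the post. None of this applies if the postings were not selected randomly.

```perl
use strict;
use warnings;

# Margin of error (95%, normal approximation) for an observed share $p in a
# simple random sample of $n items drawn without replacement from $N items.
sub margin_of_error {
    my ( $p, $n, $N ) = @_;
    my $se  = sqrt( $p * ( 1 - $p ) / $n );       # standard error of a proportion
    my $fpc = sqrt( ( $N - $n ) / ( $N - 1 ) );   # finite population correction
    return 1.96 * $se * $fpc;
}

# 2000 postings out of 20000, as in the post. The 25% share is a made-up
# value for illustration only.
my $moe = margin_of_error( 0.25, 2000, 20_000 );
printf "margin of error: +/- %.1f percentage points\n", 100 * $moe;
```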

    --
    A. Sinan Unur
    d
    (remove '.invalid' and reverse each component for email address)
     
    A. Sinan Unur, Dec 9, 2004
    #6
  7. Robert Manea

    Robert Manea Guest

    Segfault in module "A. Sinan Unur" - dump details are as follows:

    > Am I the only one who uses XNews?


    1. Microsoft Outlook Express : 82
    2. Mozilla : 35
    3. G2 : 35
    4. Mozilla Thunderbird : 33
    5. slrn : 27
    6. KNode : 21
    7. Gnus : 19
    8. Pan : 17
    9. Forte Agent : 15
    10. tin : 12
    11. Xnews : 7 <--- No, you aren't. :)
    12. trn : 6
    13. MT-NewsWatcher : 5
    14. nn : 4
    15. Opera M2 : 4
    16. 40tude_Dialog : 2
    17. Thoth : 2
    18. MicroPlanet-Gravity : 2
    19. Microsoft-Entourage : 1
    20. vBulletin USENET gateway : 1
    21. ProNews : 1
    22. trn : 1
    23. Forte Free Agent : 1
    24. hidden : 1
    25. xrn : 1
    26. Messenger-Pro : 1
    27. Thunderbird : 1
    28. Sylpheed version : 1
    29. NewsLeecher : 1
    30. Sylpheed-Claws : 1

    Summary:
    - 2181 postings in total
    - 30 different newsreaders in 340 distinct postings
    - Average of 5.582 articles per poster (with agent header)
    - 283 without User Agent header


    Done with a slightly fixed $from_header regex.
    my $from_header = qr/From:.*?([A-Za-z0-9\._\-\+]+@[A-Za-z0-9\._\-\+]+)\s*.*/;

    Since I want to catch 'slightly' non-conforming email addresses as well, a
    standards-compliant parser would be rather counterproductive to the
    quality of the results, I guess.

    Anyway, the sole purpose of the address is to serve as a unique
    identifier for each poster. Thus compliance with standards isn't really
    necessary in this case.
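
A minimal sketch of using the address purely as a per-poster key, based on the amended pattern above. The `^` anchor, the lower-casing, the helper name `poster_key` and the sample header lines are all assumptions added for this example, not part of the posted script. Lower-casing in particular treats addresses that differ only in case as the same poster, which is a simplification.

```perl
use strict;
use warnings;

# The amended pattern from the post, anchored at the start of the line
# (the anchor is an addition for this sketch).
my $from_header = qr/^From:.*?([A-Za-z0-9._+\-]+@[A-Za-z0-9._+\-]+)/;

# Hypothetical helper: return a lower-cased address to use as a hash key,
# or undef if the line is not a usable From: header.
sub poster_key {
    my ($line) = @_;
    return $line =~ $from_header ? lc($1) : undef;
}

# Made-up header lines for illustration.
my %seen;
for my $line (
    'From: Jane Doe <jane_doe+news@example.org>',
    'From: jane_doe+news@Example.ORG (Jane Doe)',
    'Subject: not a From line',
) {
    my $key = poster_key($line);
    next unless defined $key;
    $seen{$key}++;
}
print "$_: $seen{$_}\n" for sort keys %seen;
```

Both sample From lines end up under the same key, so that poster is counted once.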


    Greets, Rob

    --
    The Enterprise meets God, and it's a child, a computer, or a C program.
     
    Robert Manea, Dec 9, 2004
    #7
  8. On 2004-12-09, Robert Manea scribbled these
    curious markings:

    [...]

    > 4. Mozilla Thunderbird : 33


    [...]

    > 12. trn : 6


    [...]

    > 22. trn : 1
    > 27. Thunderbird : 1
    > 28. Sylpheed version : 1


    [...]

    > 30. Sylpheed-Claws : 1


    I hope that you like dealing with inconsistencies. :)

    Best Regards,
    Christopher Nehren
    --
    I abhor a system designed for the "user", if that word is a coded
    pejorative meaning "stupid and unsophisticated". -- Ken Thompson
    If you ask the wrong questions, you get answers like "42" and "God".
    Unix is user friendly. However, it isn't idiot friendly.
     
    Christopher Nehren, Dec 9, 2004
    #8
  9. Robert Manea

    Robert Manea Guest

    Segfault in module "Christopher Nehren" - dump details are as follows:

    > On 2004-12-09, Robert Manea scribbled these
    > curious markings:


    > [...]


    >> 4. Mozilla Thunderbird : 33


    > [...]


    >> 12. trn : 6


    > [...]


    >> 22. trn : 1
    >> 27. Thunderbird : 1
    >> 28. Sylpheed version : 1


    > [...]


    >> 30. Sylpheed-Claws : 1


    > I hope that you like dealing with inconsistencies. :)


    Well, since the 'User-Agent' header's format conforms to absolutely no
    standard or RFC, one has to compromise.

    I understand that this piece of information is the most important one in
    my program, but I really can't come up with anything significantly better.

    Maybe you have a better RegEx or method to extract it? (see TODO)

    ( Some kind of comparison function which detects common words in
    different strings would be a solution, but still not the right
    one.

    E.g.: 'Mozilla Thunderbird' and 'Thunderbird' should be the same
    but 'Mozilla' and 'Mozilla Thunderbird' must not.

    How could one solve this without hardcoding the names? )

    > Best Regards,
    > Christopher Nehren


    Greets, Rob

    --
    The Enterprise meets God, and it's a child, a computer, or a C program.
     
    Robert Manea, Dec 9, 2004
    #9
  10. Anno Siegel

    Anno Siegel Guest

    Robert Manea wrote in comp.lang.perl.misc:
    > Segfault in module "Christopher Nehren" - dump details are as follows:
    >
    > > On 2004-12-09, Robert Manea scribbled these
    > > curious markings:

    >
    > > [...]

    >
    > >> 4. Mozilla Thunderbird : 33

    >
    > > [...]

    >
    > >> 12. trn : 6

    >
    > > [...]

    >
    > >> 22. trn : 1
    > >> 27. Thunderbird : 1
    > >> 28. Sylpheed version : 1

    >
    > > [...]

    >
    > >> 30. Sylpheed-Claws : 1

    >
    > > I hope that you like dealing with inconsistencies. :)

    >
    > Well, since the 'User-Agent' header's format conforms to absolutely no
    > standard or RFC, one has to compromise.
    >
    > I understand that this piece of information is the most important one in
    > my program, but I really can't come up with anything significantly better.
    >
    > Maybe you have a better RegEx or method to extract it? (see TODO)
    >
    > ( Some kind of comparison function which detects common words in
    > different strings would be a solution, but still not the right
    > one.
    >
    > E.g.: 'Mozilla Thunderbird' and 'Thunderbird' should be the same
    > but 'Mozilla' and 'Mozilla Thunderbird' must not.
    >
    > How could one solve this without hardcoding the names? )


    Use a (set of) rule(s) that gets most cases right and hold exceptions
    in a config file. That ought to be maintainable -- if you really need
    that kind of precision.

    Anno
     
    Anno Siegel, Dec 10, 2004
    #10
  11. Anno Siegel

    Anno Siegel Guest

    Robert Manea wrote in comp.lang.perl.misc:
    > Segfault in module "Christopher Nehren" - dump details are as follows:
    >
    > > On 2004-12-09, Robert Manea scribbled these
    > > curious markings:

    >
    > > [...]

    >
    > >> 4. Mozilla Thunderbird : 33

    >
    > > [...]

    >
    > >> 12. trn : 6


    Ah, so I'm not the only trn user here...

    > > [...]

    >
    > >> 22. trn : 1


    ...or am I?

    > >> 27. Thunderbird : 1
    > >> 28. Sylpheed version : 1

    >
    > > [...]

    >
    > >> 30. Sylpheed-Claws : 1

    >
    > > I hope that you like dealing with inconsistencies. :)

    >
    > Well, since the 'User-Agent' header's format conforms to absolutely no
    > standard or RFC, one has to compromise.
    >
    > I understand that this piece of information is the most important one in
    > my program, but I really can't come up with anything significantly better.
    >
    > Maybe you have a better RegEx or method to extract it? (see TODO)
    >
    > ( Some kind of comparison function which detects common words in
    > different strings would be a solution, but still not the right
    > one.
    >
    > E.g.: 'Mozilla Thunderbird' and 'Thunderbird' should be the same
    > but 'Mozilla' and 'Mozilla Thunderbird' must not.
    >
    > How could one solve this without hardcoding the names? )


    Use a (set of) rule(s) that gets most cases right and hold exceptions
    in a config file. That ought to be maintainable -- if you really need
    that kind of precision.
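
The suggestion above could be sketched roughly as follows: one general rule that strips version information, plus a small table of exceptions for the names the rule gets wrong. Everything here is an assumption made for illustration: the "alias = canonical name" format, the helper names `parse_exceptions` and `normalize_agent`, the simplified stripping rule (not the regex from the posted script) and the sample header values. In practice the exception lines would be read from a config file instead of a here-document.

```perl
use strict;
use warnings;

# Exceptions the general rule gets wrong (invented "alias = canonical" format).
my $exceptions_text = <<'EOF';
# alias          = canonical name
Thunderbird      = Mozilla Thunderbird
Sylpheed version = Sylpheed
EOF

# Parse the exception lines into a hash keyed by lower-cased alias.
sub parse_exceptions {
    my ($text) = @_;
    my %map;
    for my $line ( split /\n/, $text ) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        my ( $alias, $canonical ) = $line =~ /^\s*(.+?)\s*=\s*(.+?)\s*$/
            or next;
        $map{ lc $alias } = $canonical;
    }
    return \%map;
}

# General rule first, then the exception table.
sub normalize_agent {
    my ( $raw, $exceptions ) = @_;
    ( my $name = $raw ) =~ s{/.*$}{};      # drop everything from the first slash
    $name =~ s/\s+v?\d+(?:\.\d+)*.*$//;    # drop a trailing " 1.2.3 ..." part
    $name =~ s/^\s+|\s+$//g;               # trim surrounding whitespace
    my $key = lc $name;
    return exists $exceptions->{$key} ? $exceptions->{$key} : $name;
}

my $exceptions = parse_exceptions($exceptions_text);

# Made-up header values for illustration.
my %count;
for my $raw (
    'Mozilla Thunderbird 0.9 (Windows/20041103)',
    'Thunderbird 0.8',
    'Mozilla/5.0 (X11; U; Linux i686)',
    'slrn/0.9.8.1 (Linux)',
) {
    $count{ normalize_agent( $raw, $exceptions ) }++;
}
printf "%-20s %d\n", $_, $count{$_} for sort keys %count;
```

With this table 'Thunderbird' is folded into 'Mozilla Thunderbird' while plain 'Mozilla' stays separate. The names are still hardcoded, but they live in data that can be edited without touching the script.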

    Anno
     
    Anno Siegel, Dec 10, 2004
    #11