Reading poorly structured data

Discussion in 'Perl Misc' started by Alan Mead, Dec 8, 2004.

  1. Alan Mead

    Alan Mead Guest

    I have five files of contact info (one for each year of a conference).
    All five have slightly different fairly unstructured formats. One looks
    like this:

    Bush, George, President, 1 White House Way, Washington,
    DC 00000;
    Kerry, John, 1 Main, Detroit, MI 00000;
    Williams, Robin, 2 Main, Burbank, CA 00000
    Newman, Paul, President and Principal Spokesperson,
    Paul Newmans's Own Brand Foods, 123 Main Street,
    Olympia Fields, WY 00000;
    Blair, Tony, 1 Downing Street, London, UK 0000000
    .... etc..

    So the fields are comma-separated, except for the email, which follows a
    semicolon and may be absent, and a record may be split over two or three
    lines.

    In a later file dozens of records appear on the same line.

    I'd like to output

    lname=Bush
    fname=George
    address=President, 1 White House Way, Washington, DC 00000
    email=

    Any ideas how to parse this using Perl? So far I can parse about 60% of
    the records with the hack below. It gets tripped up when the number
    of commas in a record is large (some people have five lines of
    address with embedded commas), in which case it parses the
    first half of the record fairly well and then tries to parse the
    second half as a new record.

    -Alan

    my $i = 0;
    while ($i <= $count) {
        $i++;
        my ($lname, $fname, $address, $email) = ('', '', '', '');
        my $line = $lines{$i};
        if ($line =~ /[,;]$/) {    # clearly more on next line
            $lines{$i+1} = "$line $lines{$i+1}";
            next;
        }
        # a proper name and address will have at least 5 parts
        if ( (scalar split /,/, $line) > 4 ) {
            if ($line =~ /@/) {
                # email is last element when split on semicolons, so save it
                my @bits = split(/;/, $line);
                $email = pop(@bits);
                # put line back together (just in case there's more than
                # one semicolon in the record)
                $line = join(';', @bits);
            }
            my @bits = split(/,/, $line);    # now split on commas
            $lname = shift @bits;            # lname is first bit
            $fname = shift @bits;            # followed by fname
            $address = join(',', @bits);     # the rest is the address
        } else {
            $lines{$i+1} = "$line $lines{$i+1}";
            next;
        }
        ....
    }
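
The joining step in the loop above could also be sketched with an array and a buffer, rather than pushing text forward into the next hash slot. This is only a sketch under assumptions that are not in the post: the lines arrive as a list, a trailing comma is taken to mean "more on the next line", a record counts as complete once it has at least five comma-separated parts, and the helper name `join_records` is made up for illustration. The parts are split into an array before counting because, in the Perl versions current in 2004, `split` in scalar context was deprecated (it split into `@_`).

```perl
use strict;
use warnings;

# Hypothetical helper: glue continuation lines together into whole records.
# Assumptions: a trailing comma means the record continues on the next line,
# and a complete record has at least five comma-separated parts.
sub join_records {
    my @lines = @_;
    my @records;
    my $buffer = '';
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;                 # trim both ends
        next unless length $line;
        $buffer = length $buffer ? "$buffer $line" : $line;
        next if $buffer =~ /,$/;                 # clearly more on the next line
        my @parts = split /,/, $buffer;          # split in list context, then count
        next if @parts < 5;                      # too short, keep accumulating
        push @records, $buffer;
        $buffer = '';
    }
    push @records, $buffer if length $buffer;    # keep any unfinished tail
    return @records;
}

my @records = join_records(
    'Bush, George, President, 1 White House Way, Washington,',
    'DC 00000;',
    'Kerry, John, 1 Main, Detroit, MI 00000;',
    'Williams, Robin, 2 Main, Burbank, CA 00000',
);
print "$_\n" for @records;
```

This does not address files where many records share one line; those still need a way to find the end of each address.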
    Alan Mead, Dec 8, 2004
    #1

  2. Alan Mead wrote:

    > I have five files of contact info (one for each year of a conference).
    > All five have slightly different fairly unstructured formats. One looks
    > like this:
    >
    > Bush, George, President, 1 White House Way, Washington,
    > DC 00000;
    > Kerry, John, 1 Main, Detroit, MI 00000;
    > Williams, Robin, 2 Main, Burbank, CA 00000
    > Newman, Paul, President and Principal Spokesperson,
    > Paul Newmans's Own Brand Foods, 123 Main Street,
    > Olympia Fields, WY 00000;
    > Blair, Tony, 1 Downing Street, London, UK 0000000
    > ... etc..


    Here is somewhat of a kludge that "works" for the snippet you posted. Hope
    this helps.

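    #! perl

    use strict;
    use warnings;

    use File::Slurp;

    my $input = read_file(\*DATA);
    $input =~ tr/\n/ /;    # treat the whole file as one long line

    my @records;

    while ($input =~ /[^;\s]/) {
        $input =~ s/^[;\s]+//;    # drop the separator left by the previous record
        my %record;
        $record{lname} = grab_name($input);
        $record{fname} = grab_name($input);
        # Kludge: "two capitals, space, digits" marks the end of the address.
        last unless $input =~ /[A-Z]{2} \d+/g;
        $record{address} = substr $input, 0, pos($input);
        $input = substr $input, pos($input);
        if ($input =~ /^;\s*(\w+\@\w+\.\w+)\s*/g) {
            $record{email} = $1;
            $input = substr $input, pos $input;
        }
        push @records, \%record;
    }

    use Data::Dumper;
    print Dumper \@records;

    # Remove and return everything up to the first comma of the argument.
    sub grab_name {
        my $off = index $_[0], ',';
        return '' if $off < 0;    # no comma left, leave the input alone
        my $name = substr $_[0], 0, $off;
        $_[0] = substr $_[0], $off + 2;    # skip the comma and the space
        return $name;
    }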

    __DATA__
    Bush, George, President, 1 White House Way, Washington,
    DC 00000;
    Kerry, John, 1 Main, Detroit, MI 00000;
    Williams, Robin, 2 Main, Burbank, CA 00000
    Newman, Paul, President and Principal Spokesperson,
    Paul Newmans's Own Brand Foods, 123 Main Street,
    Olympia Fields, WY 00000;
    Blair, Tony, 1 Downing Street, London, UK 0000000
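
The script above prints its records with Data::Dumper. The layout requested in the first post (one `key=value` line per field) could be produced with a small formatter. A minimal sketch, assuming each record is a hash reference with the keys `lname`, `fname`, `address` and an optional `email`, as in the script; the name `format_record` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: format one record hash in the layout from the first post.
# A missing field is printed with an empty value.
sub format_record {
    my ($rec) = @_;
    return join '', map {
        my $value = defined $rec->{$_} ? $rec->{$_} : '';
        "$_=$value\n";
    } qw(lname fname address email);
}

my $text = format_record({
    lname   => 'Bush',
    fname   => 'George',
    address => 'President, 1 White House Way, Washington, DC 00000',
});
print $text;
```

With the script's `@records` array, the same helper could be called once per element in place of the `Dumper` call.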
    A. Sinan Unur, Dec 8, 2004
    #2

  3. Alan Mead

    Alan Mead Guest

    On Wed, 08 Dec 2004 04:04:53 +0000, A. Sinan Unur wrote:

    > Here is somewhat of a kludge that "works" for the snippet you posted. Hope
    > this helps.
    >
    > #! perl
    > use strict;
    > use warnings;
    > use File::Slurp;
    > my $input = read_file(\*DATA);
    > $input =~ tr/\n/ /;
    > my @records;
    > while(length $input) {
    >     my %record;
    >     $record{lname} = grab_name($input);
    >     $record{fname} = grab_name($input);
    >     $input =~ /[A-Z]{2} \d+/g;
    >     $record{address} = substr $input, 0, pos($input);
    >     $input = substr $input, pos($input);
    >     if($input =~ /^;\s*(\w+\@\w+\.\w+)\s*/g) {
    >         $record{email} = $1;
    >         $input = substr $input, pos $input;
    >     }
    >     push @records, \%record;
    > }

    [...]

    And so it does very nicely. I think you are making use of the fact that
    these all had a pair of capital letters near the end (including the
    convenient UK) but there is a 'D.C.' in my data and some other
    addresses outside the US (that lack this feature). I should have included
    a better sample. But this may get me to 95% ... The way you've slurped the
    file makes this perfectly applicable to the rest of the files, which is a
    REALLY BIG help.

    Thanks!

    -Alan
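
One way the end-of-address pattern might be loosened so that a dotted code such as 'D.C.' is accepted alongside 'DC' or 'UK' is sketched below. The pattern and the helper name `cut_address` are assumptions made for illustration, not something posted in the thread, and the sketch does nothing for addresses that lack a code-plus-digits ending altogether.

```perl
use strict;
use warnings;

# Assumption: the state or country code is either two capitals ("DC", "UK")
# or two dotted capitals ("D.C."), followed by a space and a numeric code.
my $END_OF_ADDRESS = qr/(?:[A-Z]{2}|[A-Z]\.[A-Z]\.) \d+/;

# Hypothetical helper: return (address, remainder), or an empty list when
# no end-of-address marker is found.
sub cut_address {
    my ($text) = @_;
    return unless $text =~ /$END_OF_ADDRESS/g;   # /g in scalar context sets pos()
    my $end = pos $text;                         # offset just past the match
    return (substr($text, 0, $end), substr($text, $end));
}

my ($address, $rest) = cut_address('1 Main, Washington, D.C. 00000; Kerry, John');
print "address=$address\n";
print "rest=$rest\n";
```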
    Alan Mead, Dec 8, 2004
    #3
  4. Alan Mead wrote:

    > On Wed, 08 Dec 2004 04:04:53 +0000, A. Sinan Unur wrote:
    >
    >> $input =~ /[A-Z]{2} \d+/g;

    ....

    > And so it does very nicely. I think you are making use of the fact
    > that these all had a pair of capital letters near the end (including
    > the convenient UK) but there is a 'D.C.' in my data and some other
    > addresses outside the US (that lack this feature).


    Actually, that is a standing for some kind of Country/State Code with
    numeric postal code match because all your addresses seemed to end with
    that.

    The "two capital letters followed by some digits as end of mailing address
    indicator" was one of the things that made the code kludgy.
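
To illustrate why that indicator is fragile, the sketch below runs the pattern against a made-up address containing 'RR 2' (a rural route), which matches before the real state code does. The stricter variant is an assumption of this sketch rather than anything proposed in the thread: it accepts the code only when a semicolon, the end of the input, or something that looks like the next record's "Lastname," follows it. The sample address and the helper name `address_up_to` are hypothetical.

```perl
use strict;
use warnings;

# The pattern from the post, and a stricter variant (an assumption, not from
# the thread) that only accepts the code when a record boundary follows it:
# a semicolon, a "Lastname," that starts a new record, or the end of input.
my $LOOSE  = qr/[A-Z]{2} \d+/;
my $STRICT = qr/[A-Z]{2} \d+(?=;|\s+[A-Z][\w'-]*,|\s*\z)/;

# Hypothetical helper: return the text up to and including the first match,
# or undef when the pattern does not match at all.
sub address_up_to {
    my ($text, $pattern) = @_;
    return $text =~ /^(.*?$pattern)/ ? $1 : undef;
}

my $sample = 'Apt 4, RR 2, Ames, IA 00000; Kerry, John, 1 Main, Detroit, MI 00000';
my $loose  = address_up_to($sample, $LOOSE);
my $strict = address_up_to($sample, $STRICT);
print "loose:  $loose\n";     # stops early, at "RR 2"
print "strict: $strict\n";    # runs through "IA 00000"
```

The stricter pattern is still a heuristic: it depends on every record starting with a capitalised last name followed by a comma.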

    I am sure others will provide better ways once the sun comes up. Good luck.

    Sinan.
    A. Sinan Unur, Dec 8, 2004
    #4
  5. "A. Sinan Unur" <> wrote in
    news:Xns95B8F3ED5DCB9asu1cornelledu@132.236.56.8:

    > Actually, that is a standing for some kind of Country/State Code with
                          ^^^^^^^^
    I meant 'stand-in'. Sorry.

    Sinan
    A. Sinan Unur, Dec 8, 2004
    #5
