Help sought Perl with a bit of REGEX

Discussion in 'Perl' started by Chris Newman, Jul 22, 2006.

  1. Chris Newman

    Chris Newman Guest

    I am working on a script to process a large number of old electoral records.
    There are about 100,000 records in all, but here is a representative sample
    (BTW, hd = household duties):


    ALLISON, Winifred hd
    BRACKENREG, Helen & James hd & lands officer
    MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver

    Note that the first names are in the same sequence as the occupations. An
    occupation may consist of one or two words, e.g. 'hd' or 'tractor driver'. The
    last of these sample records has 3 'Marshalls' (Margaret, Charles and Herbert),
    though other records include up to six family members. In all cases there is
    a pattern:

    1 person . . . the occupation is immediately followed by a line return
    (naturally)
    2 people . . . the first occupation is followed by an '&', the last
    occupation by a line return
    3 or more people . . . every occupation before the second last is followed
    by a comma, and the last two occupations follow the 2-people pattern
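
    The separator rules above can be sketched as a single split. This is only
    an illustration, assuming the string passed in already holds just the
    occupations (or just the first names) of one record; the helper name
    split_items is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: split a names-only or occupations-only string on the
# separators described above (commas between early items, '&' before the last).
sub split_items {
    my ($text) = @_;
    # A comma or an ampersand, with any surrounding whitespace, ends an item.
    return grep { length } split /\s*[,&]\s*/, $text;
}

print join('|', split_items('hd, ganger & tractor driver')), "\n";
# prints: hd|ganger|tractor driver
```

    Because ',' and '&' are treated alike, the same call handles one, two or
    six family members; it does not by itself decide where the first names end
    and the occupations begin.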


    My initial thoughts:
    Use a global REGEX that would step through and match the next occupation, but
    it has not proved that easy. I need a way to move the matching point forward
    to an ampersand, comma or line return depending on context. Could anyone
    provide some insight into whether REs can provide this level of control, or
    point me to a more appropriate solution?


    Here is the relevant code snippet:


    # preceding code to do with last name, addresses etc. This part works well.

    @matches = (m/\s([A-Z][a-z]+\s)/g); # holds all first names for a record

    foreach $FirstName (@matches) {

        (m/(\s[a-z]+(.*?)(&|,|$))/g); # fails to match any but the first occupation

        $Occupation = $1; # stores the next matching occupation with each successive loop

        print("\"$FirstName\",\"$Occupation\"\n");

    }
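
    On the question of moving the matching point forward: Perl keeps a match
    position per string (see pos), and a scalar-context m//g combined with the
    \G anchor resumes from that position. The following is a minimal sketch of
    that idea, not a fix for the snippet above. It assumes the string holds
    only the occupations of one record, and the helper name step_occupations
    is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: walk an occupations-only string, consuming one
# occupation and its trailing separator (',', '&' or end of string) per step.
sub step_occupations {
    my ($text) = @_;
    my @found;
    # \G anchors each attempt where the previous match ended; /g in scalar
    # context advances pos($text), and /c keeps pos when a match fails.
    while ($text =~ /\G\s*([a-z]+(?: [a-z]+)*)\s*(?:[,&]|$)/gc) {
        push @found, $1;
    }
    return @found;
}

print join('|', step_occupations('hd, ganger & tractor driver')), "\n";
# prints: hd|ganger|tractor driver
```

    Each pass through the loop yields exactly one occupation, so the loop body
    is a natural place to pair it with the next first name.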
    Chris Newman, Jul 22, 2006
    #1

  2. Chris Newman

    Mumia W. Guest

    On 07/22/2006 02:56 AM, Chris Newman wrote:
    > My initial thoughts:
    > Use a global REGEX that would step through and match the next occupation, but
    > it has not proved that easy. I need a way to move the matching point forward
    > to an ampersand, comma or line return depending on context. Could anyone
    > provide some insight into whether REs can provide this level of control, or
    > point me to a more appropriate solution?


    The newsgroup comp.lang.perl is defunct. Comp.lang.perl.misc
    is where the action is.

    I like to break problems into pieces and eat away at them
    piece-by-piece. For this problem, I'd use the s/// operator to
    match and remove parts of the string that I'm looking for.

    Your strings are organized like so: <family-name>
    <first-names> <occupations>. So I'd suggest stripping off
    (while matching) the family-names first, followed by the
    first-names, followed by the occupations. And since '&' seems
    to have a function that's the same as the comma, I'd convert
    all &'s to commas before doing the real work, e.g.

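    use strict;
    use warnings;
    use Data::Dumper;

    my $data = q{
    ALLISON, Winifred hd
    BRACKENREG, Helen & James hd & lands officer
    MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver
    };

    # Read the sample records from an in-memory file.
    open(my $fh, "<", \$data) or die("Couldn't open in-memory file: $!\n");

    while (my $line = <$fh>) {
        $_ = $line;
        s/^\s+//;           # trim leading whitespace
        s/\s+$//;           # trim trailing whitespace
        next if m/^$/;      # skip blank lines

        my ($fam, @names, @occup);
        s/\s*&\s*/, /g;     # treat '&' as just another comma
        if (s/^([A-Z]+),\s*//) { $fam = $1 }
        # Strip the first names (capitalised words), then the occupations
        # (one or more lowercase words), leaving no stray whitespace in either.
        while (s/^([A-Z][a-z]+)\s*(?:,\s*)?//) { push @names, $1 }
        while (s/^([a-z]+(?: [a-z]+)*)\s*(?:,\s*)?//) { push @occup, $1 }

        print Data::Dumper->Dump([$fam, \@names, \@occup],
                                 [qw(family names occupations)]);
    }

    close $fh;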
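
    The original snippet was aiming for quoted "name","occupation" output.
    Once a record has been split into a family name, a list of first names and
    a list of occupations, pairing them by position is one way to get there.
    This is a rough sketch only: the helper name csv_rows is hypothetical, and
    it assumes both lists are in the same order and of equal length, as the
    original post says they are.

```perl
use strict;
use warnings;

# Hypothetical helper: pair names with occupations by position and return
# one quoted CSV line per person. Assumes both lists have the same length.
sub csv_rows {
    my ($fam, $names, $occup) = @_;
    warn "name/occupation count mismatch for $fam\n" if @$names != @$occup;
    my @rows;
    for my $i (0 .. $#$names) {
        # Fall back to an empty field if an occupation is missing.
        my $job = defined $occup->[$i] ? $occup->[$i] : '';
        push @rows, qq{"$fam","$names->[$i]","$job"};
    }
    return @rows;
}

print "$_\n" for csv_rows('MARSHALL', [qw(Margaret Charles Herbert)],
                          ['hd', 'ganger', 'tractor driver']);
```

    The warning on a count mismatch is there because, with about 100,000
    records, some lines are likely to break the expected pattern, and it is
    easier to review those by hand than to guess.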
    Mumia W., Jul 22, 2006
    #2
