regex search - suggestions?

Discussion in 'Perl Misc' started by Sara, Jul 24, 2004.

  1. Sara

    Sara Guest

    Hi All,
    I have a string (a paragraph) without newlines, with organization
    names and their abbreviations in brackets like...

    $tmp = "... was proposed by World Health Organisation (WHO) in ...";

    I have the following code segment:

    $tmp =~ s/\)/\)\n<brk>/g; # because we have . in regex and
                              # there is no \n in $tmp
    my ($abbr,$org) = "";
    my (%orgs) = ();
    foreach my $line (split (/\n/, $tmp)) {
        if ($line =~ /\b([A-Z])(\w+[ forand]*) ([A-Z])(.*?) \((\1\3[A-Z]*)\)/) {
            $abbr = $5; $org = "$1$2 $3$4";
            $orgs{$abbr} = $org;
        }
    }
    I added [ forand]* in regex to include 'for', 'of', 'and' that might
    appear after the first word.
    Can anyone help me to improve the accuracy of this search, especially
    the [ forand]* part.
    Thanks in advance.
    Sara, Jul 24, 2004
    #1

  2. On 2004-07-24, Sara <> wrote:
    > Hi All,
    > I have a string (a paragraph) without newlines, with organization
    > names and their abbreviations in brackets like...
    >
    > $tmp = "... was proposed by World Health Organisation (WHO) in ...";


    ...and you want to extract the organization names and abbreviations?

    my @tmp = split /\s*\(([A-Z]+)\)/, $tmp;
    pop @tmp;

    my %orgs;
    while (my ($str, $abbr) = splice(@tmp, 0, 2)) {
        (my $re = $abbr) =~ s/(.)/$1[a-z\\W]*/g;
        $str =~ /.*($re)$/s or warn "Can't expand $abbr!\n" and next;
        $orgs{$abbr} = $1;
    }
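
To make the substitution on $abbr easier to follow, here is a minimal sketch that prints the pattern it generates for the WHO example and applies it to the text in front of the bracket. The string is the sample from the original post with the "(WHO) in ..." part already split off; $name is a hypothetical variable added for illustration, and ${1} is written with braces only so that Perl cannot read "$1[" as an array subscript.

```perl
use strict;
use warnings;

my $abbr = 'WHO';

# Every letter of the abbreviation may be followed by any run of
# lowercase letters or non-word characters (spaces, punctuation).
(my $re = $abbr) =~ s/(.)/${1}[a-z\\W]*/g;
print "$re\n";    # W[a-z\W]*H[a-z\W]*O[a-z\W]*

# Text that precedes "(WHO)" in the sample string.
my $str = '... was proposed by World Health Organisation';

# The greedy .* pushes the match as far right as possible, and $ forces
# the expansion to end where the bracketed abbreviation began.
my $name;
if ($str =~ /.*($re)$/s) {
    $name = $1;
    print "$name\n";    # World Health Organisation
}
```

Because the pattern only allows lowercase letters and non-word characters between the initials, any uppercase letter or digit inside the name stops it from matching.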


    > Can anyone help me to improve the accuracy of this search, especially


    If you could provide more sample data, I could do some more thorough
    testing. My code works for your example case, and probably quite many
    others. Some cases where it fails for various reasons include:

    World Wide Web Consortium (W3C)
    PlayStation 2 (PS2)
    Church of Scientology (CoS)
    Skip if Equal (SEQ)
    Decrement and Jump if Not Zero (DJN)
    Deutscher Jugendbund für Naturbeobachtung (DJN)
    GNU's Not Unix (GNU)

    Most of those can be fixed, although idiosyncratic abbreviations like
    W3C are probably not worth the effort.

    --
    Ilmari Karonen
    If replying by e-mail, please replace ".invalid" with ".net" in address.
    Ilmari Karonen, Jul 24, 2004
    #2

  3. Sara <> wrote:

    > I added [ forand]* in regex to include 'for', 'of', 'and' that might
    > appear after the first word.



    That will match exactly the same strings as:

    [adfnor ]*

    It would match:

    aaaaaa
    afafafaf

    etc.

    A character class matches a _character_, not a string.


    > Can anyone help me to improve the accuracy of this search, especially
    > the [ forand]* part.



    (for|of|and)
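
The difference can be checked with a short sketch. The test strings and the %result hash are made up for illustration, and the alternation is written as a non-capturing group with a leading space, which is one possible way to use it rather than the only one.

```perl
use strict;
use warnings;

# Character class: any mix of the characters ' ', f, o, r, a, n, d.
my $class = qr/^[ forand]*$/;
# Alternation: zero or more whole words, each preceded by one space.
my $alt   = qr/^(?: (?:for|of|and))*$/;

my %result;
for my $s ('afafafaf', ' of', ' for and') {
    $result{$s} = {
        class => ($s =~ $class ? 'yes' : 'no'),
        alt   => ($s =~ $alt   ? 'yes' : 'no'),
    };
    printf "%-12s class=%-3s alt=%s\n", "'$s'",
        $result{$s}{class}, $result{$s}{alt};
}
```

The class accepts all three strings, including the nonsense 'afafafaf', while the alternation accepts only the two that are built from whole words.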


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Jul 24, 2004
    #3
  4. Sara

    Sara Guest

    Ilmari Karonen wrote in message
    >...and you want to extract the organization names and abbreviations?

    Yes, forgot to mention that :-o

    >If you could provide more sample data, I could do some more thorough
    >testing. My code works for your example case, and probably quite many

    I have got organization names like ...
    European Process Safety Centre (EPSC)
    Association of British Chemical Manufacturers (ABCM)
    Safety and Reliability Directorate (SRD)
    # The next one was not found by your code
    Health and Safety at Work etc. Act 1974 (HSWA)
    Advisory Committee on Major Hazards (ACMH)
    Center for Chemical Process Safety (CCPS)
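
A likely reason the "Health and Safety at Work etc. Act 1974" entry is missed: the pattern built in post #2 allows only lowercase letters and non-word characters after each initial, so the digits in "1974" cannot be consumed before the end of the string. The sketch below compares that pattern with a hypothetical variant that also allows digits; the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $str  = 'Health and Safety at Work etc. Act 1974';
my $abbr = 'HSWA';

# Pattern as built in post #2: lowercase letters and non-word characters.
(my $plain_re = $abbr) =~ s/(.)/${1}[a-z\\W]*/g;
# Hypothetical variant: digits are allowed between the initials as well.
(my $digit_re = $abbr) =~ s/(.)/${1}[a-z\\d\\W]*/g;

my $found_plain = $str =~ /.*($plain_re)$/s ? $1 : undef;
my $found_digit = $str =~ /.*($digit_re)$/s ? $1 : undef;

print 'plain: ', (defined $found_plain ? $found_plain : 'no match'), "\n";
print 'digit: ', (defined $found_digit ? $found_digit : 'no match'), "\n";
```

Allowing digits makes the pattern looser, so it may also pick up numbers that are not part of a name; whether that trade-off is acceptable depends on the data.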

    >Most of those can be fixed, although idiosyncratic abbreviations like
    >W3C are probably not worth the effort.

    I agree, I don't want to work for it either


    Tad McClellan wrote in message
    > That will match exactly the same strings as:
    > [adfnor ]*
    >
    > > Can anyone help me to improve the accuracy of this search, especially
    > > the [ forand]* part.

    >
    > (for|of|and)


    That was almost exactly what I tried first:
    $line =~ /\b([A-Z])(\w+)( for| of| and)? ([A-Z])(.*?) \((\1\4[A-Z]*)\)/;
    $abbr = $6; $org = "$1$2$3 $4$5";
    $orgs{$abbr} = $org;

    since 'for','of','and' don't get included in abbreviations, but won't
    it produce 'Use of uninitialized value in ...' for those which don't
    have 'for','of','and'? Is that ignorable?
    Thanks,
    Sara
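
On the warning: when an optional group such as ( for| of| and)? does not take part in the match, $3 is undef, and interpolating it under warnings does produce "Use of uninitialized value". One way to avoid it, sketched below as an assumption rather than a tested fix for the full data set, is to move the ? inside the capturing group so that $3 is always defined (an empty string when no connective is present). The two sample lines come from the list above.

```perl
use strict;
use warnings;

my %orgs;
my @lines = (
    'European Process Safety Centre (EPSC)',
    'Association of British Chemical Manufacturers (ABCM)',
);

for my $line (@lines) {
    # (?: for| of| and)? is optional, but the capture group around it
    # always participates, so $3 is '' instead of undef.
    if ($line =~ /\b([A-Z])(\w+)((?: for| of| and)?) ([A-Z])(.*?) \((\1\4[A-Z]*)\)/) {
        $orgs{$6} = "$1$2$3 $4$5";
    }
}

print "$_ => $orgs{$_}\n" for sort keys %orgs;
```

This still covers only a connective directly after the first word, as in the original pattern.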
    Sara, Jul 26, 2004
    #4
