regex search - suggestions?

Sara

Hi All,
I have a string (a paragraph) without newlines, with organization
names and their abbreviations in brackets like...

$tmp = "... was proposed by World Health Organisation (WHO) in ...";

I have the following code segment:

$tmp =~ s/\)/\)\n<brk>/g;   # because we have . in regex and
                            # there is no \n in $tmp
my ($abbr, $org) = ("", "");
my %orgs = ();
foreach my $line (split /\n/, $tmp) {
    if ($line =~ /\b([A-Z])(\w+[ forand]*) ([A-Z])(.*?) \((\1\3[A-Z]*)\)/) {
        $abbr = $5; $org = "$1$2 $3$4";
        $orgs{$abbr} = $org;
    }
}
I added [ forand]* in regex to include 'for', 'of', 'and' that might
appear after the first word.
Can anyone help me to improve the accuracy of this search, especially
the [ forand]* part.
Thanks in advance.
 
Ilmari Karonen

Sara said:
Hi All,
I have a string (a paragraph) without newlines, with organization
names and their abbreviations in brackets like...

$tmp = "... was proposed by World Health Organisation (WHO) in ...";

...and you want to extract the organization names and abbreviations?

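# Split on "(ABBR)"; the captured abbreviations are interleaved with the text.
my @tmp = split /\s*\(([A-Z]+)\)/, $tmp;
pop @tmp;    # drop the text after the last abbreviation

my %orgs;
while (my ($str, $abbr) = splice(@tmp, 0, 2)) {
    # Each letter of the abbreviation may be followed by lowercase
    # letters or non-word characters, e.g. W[a-z\W]*H[a-z\W]*O[a-z\W]*
    (my $re = $abbr) =~ s/(.)/$1[a-z\\W]*/g;
    $str =~ /.*($re)$/s or warn "Can't expand $abbr!\n" and next;
    $orgs{$abbr} = $1;
}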

Sara said:
Can anyone help me to improve the accuracy of this search, especially
the [ forand]* part.

If you could provide more sample data, I could do some more thorough
testing. My code works for your example case, and probably quite many
others. Some cases where it fails for various reasons include:

World Wide Web Consortium (W3C)
PlayStation 2 (PS2)
Church of Scientology (CoS)
Skip if Equal (SEQ)
Decrement and Jump if Not Zero (DJN)
Deutscher Jugendbund für Naturbeobachtung (DJN)
GNU's Not Unix (GNU)

Most of those can be fixed, although idiosyncratic abbreviations like
W3C are probably not worth the effort.
 
Tad McClellan

Sara said:
I added [ forand]* in regex to include 'for', 'of', 'and' that might
appear after the first word.


That will match exactly the same strings as:

[adfnor ]*

It would match:

aaaaaa
afafafaf

etc.

A character class matches a _character_, not a string.

Can anyone help me to improve the accuracy of this search, especially
the [ forand]* part.


(for|of|and)
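
A quick sketch of the difference, in case it helps. The test strings and variable names here are made up for illustration and are not from the original data. The character class accepts any run of the characters a, d, f, n, o, r and space, while the alternation accepts only one of the three whole words:

```perl
use strict;
use warnings;

# Hypothetical test strings, not taken from the thread's data.
my @samples = ('for', 'of', 'and', 'afafafaf', 'door');

# Character class: any run of the characters a, d, f, n, o, r or space.
my $class = qr/^[ forand]*$/;

# Alternation: exactly one of the three whole words.
my $words = qr/^(?:for|of|and)$/;

my (%class_hit, %word_hit);
for my $s (@samples) {
    $class_hit{$s} = ($s =~ $class) ? 1 : 0;
    $word_hit{$s}  = ($s =~ $words) ? 1 : 0;
    printf "%-10s class=%d words=%d\n", $s, $class_hit{$s}, $word_hit{$s};
}
```

Both patterns accept 'for', 'of' and 'and', but only the character class also accepts 'afafafaf' and 'door'.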
 
Sara

Ilmari Karonen wrote in message
...and you want to extract the organization names and abbreviations?
Yes, forgot to mention that :-o
If you could provide more sample data, I could do some more thorough
testing. My code works for your example case, and probably quite many

I have got organization names like ...
European Process Safety Centre (EPSC)
Association of British Chemical Manufacturers (ABCM)
Safety and Reliability Directorate (SRD)
# The next one was not found by your code
Health and Safety at Work etc. Act 1974 (HSWA)
Advisory Committee on Major Hazards (ACMH)
Center for Chemical Process Safety (CCPS)
Most of those can be fixed, although idiosyncratic abbreviations like
W3C are probably not worth the effort.
I agree, I don't want to work for it either.


Tad McClellan wrote in message
That will match exactly the same strings as:
[adfnor ]*
Can anyone help me to improve the accuracy of this search, especially
the [ forand]* part.

(for|of|and)

That was almost exactly what I tried first:
$line =~ /\b([A-Z])(\w+)( for| of| and)? ([A-Z])(.*?) \((\1\4[A-Z]*)\)/;
$abbr = $6; $org = "$1$2$3 $4$5";
$orgs{$abbr} = $org;

since 'for','of','and' don't get included in abbreviations, but won't
it produce 'Use of uninitialized value in ...' for those which don't
have 'for','of','and'? Is that ignorable?
Thanks,
Sara
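
On the warning question: under `use warnings`, an optional capture group that does not take part in the match leaves its variable undefined, so interpolating $3 does produce the warning. A minimal sketch of one way around it, assuming the optional part is moved into a non-capturing group inside a capture group that always participates (so $3 becomes an empty string instead of undef). The name `$re` is only illustrative, and the two sample lines are taken from the list above:

```perl
use strict;
use warnings;

# The inner (?:...)? is optional, but the outer capture group always
# participates, so $3 is '' rather than undef when no connector is present.
my $re = qr/\b([A-Z])(\w+)((?: for| of| and)?) ([A-Z])(.*?) \((\1\4[A-Z]*)\)/;

my %orgs;
for my $line ('European Process Safety Centre (EPSC)',
              'Center for Chemical Process Safety (CCPS)') {
    if ($line =~ $re) {
        $orgs{$6} = "$1$2$3 $4$5";    # no warning: $3 is always defined
    }
}
print "$_ => $orgs{$_}\n" for sort keys %orgs;
```

This only addresses the warning. It does not handle names where the connector appears later than the second word, such as "Advisory Committee on Major Hazards".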
 
