Push regex search result into hash with multiple values

F

fmassion

I have 2 lists:

List of words:
cat
dog
List of phrases:
This is a cat
This is another cat
This is a dog
This is a cat and not a dog

I wand have a hash with all phrases (=values) matching the word "cat" (key) or "dog" (other key)

In my code I only get the last value of each search. Obvisouly I am doing something wrong here. Any suggestions?

Here my code:

#!/usr/bin/perl -w
# Open words file
open(WORDLIST,$ARGV[0]) || die("Cannot open $ARGV[0]!\n");
@words = <WORDLIST>;
# Close words file
close(WORDLIST);
# Open phrases file
open(PHRASELIST,$ARGV[1]) || die("Cannot open $ARGV[1])!\n");
@phrase = <PHRASELIST>;
# Close phrases file
close(PHRASELIST);
# Create empty hash for results
%phrasefound = ();
foreach $word (@words) {
for($phrasecount=0 ; $phrasecount <= $#phrase ; $phrasecount++) { # Counts from 0 to last array entry
$phrase = $phrase[$phrasecount];
chomp $word;
chomp $phrase;
if ($phrase =~ m/$word/i) {
# push into hash
$phrasefound{$word} = $phrase;
print $word."-->".$phrasefound{$word}."\n"; #this is to check if it works. I get here all values
}}}
# output hash
print "Hash result:\n----------\n";
foreach $word (keys %phrasefound) {
print "$word --> $phrasefound{$word}\n"; #I get only the last match
}
 
D

Dr.Ruud

I have 2 lists:

List of words:
cat
dog
List of phrases:
This is a cat
This is another cat
This is a dog
This is a cat and not a dog

I wand have a hash with all phrases (=values) matching the word "cat" (key) or "dog" (other key)


catdog.pl
- - - - - - - - - - - - - - -
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

my @keys = qw(
cat
dog
);

my %found;

for my $phrase (<DATA>) {
chomp $phrase;

for my $key (@keys) {
$key = quotemeta($key);

push @{ $found{ $key } }, $phrase
if $phrase =~ /\b$key\b/;
}
}

print Dumper( \%found );

__DATA__
This is a cat
This is another cat
This is a dog
This is a cat and not a dog
- - - - - - - - - - - - - - -


$VAR1 = {
'dog' => [
'This is a dog',
'This is a cat and not a dog'
],
'cat' => [
'This is a cat',
'This is another cat',
'This is a cat and not a dog'
]
};
 
G

gamo

El 19/05/14 08:40, (e-mail address removed) escribió:
I have 2 lists:

List of words:
cat
dog
List of phrases:
This is a cat
This is another cat
This is a dog
This is a cat and not a dog

I wand have a hash with all phrases (=values) matching the word "cat" (key) or "dog" (other key)

In my code I only get the last value of each search. Obvisouly I am doing something wrong here. Any suggestions?

Here my code:

#!/usr/bin/perl -w
# Open words file
open(WORDLIST,$ARGV[0]) || die("Cannot open $ARGV[0]!\n");
@words = <WORDLIST>;
# Close words file
close(WORDLIST);
# Open phrases file
open(PHRASELIST,$ARGV[1]) || die("Cannot open $ARGV[1])!\n");
@phrase = <PHRASELIST>;
# Close phrases file
close(PHRASELIST);
# Create empty hash for results
%phrasefound = ();
foreach $word (@words) {
for($phrasecount=0 ; $phrasecount <= $#phrase ; $phrasecount++) { # Counts from 0 to last array entry
$phrase = $phrase[$phrasecount];
chomp $word;
chomp $phrase;
if ($phrase =~ m/$word/i) {
# push into hash
$phrasefound{$word} = $phrase;
print $word."-->".$phrasefound{$word}."\n"; #this is to check if it works. I get here all values
}}}
# output hash
print "Hash result:\n----------\n";
foreach $word (keys %phrasefound) {
print "$word --> $phrasefound{$word}\n"; #I get only the last match
}

That's because you overwrite $phrasefound{$word} with $phrase.
Maybe you want $phrasefound{$phrase} = $word;
if you want to store each phrase.

HTH
 
T

Thomas 'PointedEars' Lahn

Dr.Ruud wrote:
^^^^^^^
Please fix.
catdog.pl
- - - - - - - - - - - - - - -
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

my @keys = qw(
cat
dog
);

my %found;

for my $phrase (<DATA>) {
chomp $phrase;

for my $key (@keys) { ^^^^
$key = quotemeta($key);
^^^^
Be extra careful with “for(each)†loops in Perl.

This for-each-loop will make $key an lvalue-iterator so that the assignment
operation *modifies* the elements of the array @keys for each $phrase; not
what one wants here. For example, suppose “@keys[0] eq "foo.bar"â€, then
after the assignment operation it will “eq "foo\\.bar"†for the first
phrase, "foo\\\\\.bar" for the second, "foo\\\\\\\\\\\.bar" for the third,
and so on:

$ perl -e 'use strict; use warnings; my @keys = ("foo.bar", "bar.baz"); for
my $i ((1, 2, 3)) { for my $key (@keys) { $key = quotemeta($key); CORE::say
join(", ", @keys); }}'
foo\.bar, bar.baz
foo\.bar, bar\.baz
foo\\\.bar, bar\.baz
foo\\\.bar, bar\\\.baz
foo\\\\\\\.bar, bar\\\.baz
foo\\\\\\\.bar, bar\\\\\\\.baz

This can be avoided with

for (@keys) {
my $key = quotemeta($_);

# …
}

equivalent to

foreach (@keys) {
my $key = quotemeta($_);

# …
}

so that $key would be a block-scoped variable. See perlsyn(1).

(I find it useful to use “for†for C-style “for†loops and “foreach†for
for-each loops; YMMV.)
push @{ $found{ $key } }, $phrase

But I see no reason to quotemeta($key) for this operation. You would want
the keys of %found to be the original keys, not the (RE-)quoted ones.
if $phrase =~ /\b$key\b/;

Only for this operation $key needs to be quoted (unless one either *wants*
regular expression matching or is certain *not* to have words containing
other ASCII characters than matched by /[A-Za-z_0-9]/; see perlfunc(1)):

# see perlre(1)
if $phrase =~ /\b\Q$key\E\b/;

However, in that case $key only needs to be quoted *once*. Therefore, it
appears to be prudent to create a hash whose keys are the original keys, and
the corresponding values are the quoted keys:

# see perlfunc(1)
my %keys = map { $_ => quotemeta($_) } @keys;

or

my %keys = map { $_ => quotemeta($_) } qw(
cat
dog
);

in the first place, and then

my %found;

for my $phrase (<DATA>) {
chomp $phrase;

while (my ($key, $quoted_key) = each %keys) {
push @{ $found{ $key } }, $phrase
if $phrase =~ /\b$quoted_key\b/;
}
}

$phrase may not need to be chomp()ed here. See perlfunc(1) again.
In summary:

$ perl -e '

use strict;
use warnings;
use Data::Dumper;

my %keys = map { $_ => quotemeta($_) } qw(foo.bar bar.baz);
my @phrases = qw(foo.bar baz foo bar.baz foo.bar.baz);
my %found;

foreach my $phrase (@phrases) {
while (my ($key, $quoted_key) = each %keys) {
push @{ $found{ $key } }, $phrase
if $phrase =~ /\b$quoted_key\b/;
}
}

print Dumper(\%found);

'
$VAR1 = {
'foo.bar' => [
'foo.bar baz',
'foo.bar.baz'
],
'bar.baz' => [
'foo bar.baz',
'foo.bar.baz'
]
};

An interesting experiment to test this approach is to reduce the number of
spaces between the elements of @phrases in the declaration to one.
 
F

fmassion

Thanks for all the suggestions. I have tried to understand everything, but I am not a programmer....

I could use several suggestions, however my aim is to end up with a plain list of matches, i.e. either to process $VAR1 in the examples above or try another approach.

I have tried to push the values into an array and split it at the end. This seems to work fine. Here my code:

#!/usr/bin/perl -w
# Open words file
open(WORDLIST,$ARGV[0]) || die("Cannot open $ARGV[0]!\n");
@words = <WORDLIST>;
# # Close words file
close(WORDLIST);
# # Open phrases file
open(PHRASELIST,$ARGV[1]) || die("Cannot open $ARGV[1])!\n");
@phrase = <PHRASELIST>;
# Close phrases file
close(PHRASELIST);
# Create empty hash for results
for my $phrase (@phrase) {
chomp $phrase;
for my $word (@words) {
chomp $word;
if ($phrase =~ m/$word/i) {
$found = $word."\t".$phrase;
# push into hash
push (@result, $found);
}}}
foreach (@result){ # go through the result array
($word, $phrase) = split(/\t/,$_);
print "$word --> $phrase\n";
}
 
B

Ben Bacarisse

Thanks for all the suggestions. I have tried to understand everything,
but I am not a programmer....

I could use several suggestions,

One that you may not have spotted (or considered as a suggestion if you
did spot it) is to include

use strict;
use warnings;

at the top of your program. Experience shows that these catch many
mistakes and it's a good habit to get into.
however my aim is to end up with a
plain list of matches, i.e. either to process $VAR1 in the examples
above or try another approach.

I have tried to push the values into an array and split it at the
end. This seems to work fine. Here my code:

If the posted code works, then your original post misrepresented what
you want. You said you wanted a hash, and the suggestion was that it
should be used to collect together all the matching phrases. The code
below does not do that.

Dr Ruud presented a solution (with an small error, but that won't affect
you unless your words have peculiar characters in them). the solution
is the right way to do this in Perl, so your best plan is to ask about
it until you follow it. The key line is this:

push @{ $found{ $key } }, $phrase;

The magic is in the outer @{...}. It converts the plain scalar value
$found{$key} into a reference to an array, onto which a new phrase can
be pushed.

Some details:
#!/usr/bin/perl -w

Add:

use strict;
use warnings;

yYou will then have to declare all the undeclared globals (it's not
many, but it's really worthwhile doing).
# Open words file
open(WORDLIST,$ARGV[0]) || die("Cannot open $ARGV[0]!\n");
@words = <WORDLIST>;
# # Close words file
close(WORDLIST);
# # Open phrases file
open(PHRASELIST,$ARGV[1]) || die("Cannot open $ARGV[1])!\n");
@phrase = <PHRASELIST>;
# Close phrases file
close(PHRASELIST);

These comments just create noise. If the reader does not know what the
commented line does, the comment won't really help.
# Create empty hash for results

And this one seems to be wrong. I see no hash being created.
for my $phrase (@phrase) {
chomp $phrase;
for my $word (@words) {
chomp $word;
if ($phrase =~ m/$word/i) {

Dr Ruud's code had two things related to this. First, he added \b at
each end. This matched only are a word boundary. Do you want to match
"cathedral" again the word cat? You don't say, but probably not.
Second, it quotes the special characters in the word, so that ., * and
so on don't have their technical meanings anymore.
$found = $word."\t".$phrase;

What if the phrase has a tab in it? I with I had a pound for every line
of code I've fixed that had a comment like "there'll never be a Ctrl-A
in this data...".
# push into hash

It pushed onto a plain array, not a hash.
 
T

Thomas 'PointedEars' Lahn

Ben said:
(e-mail address removed) writes:
Dr Ruud presented a solution (with an small error, but that won't affect ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
you unless your words have peculiar characters in them).
for my $phrase (@phrase) {
chomp $phrase;
for my $word (@words) {
chomp $word;
if ($phrase =~ m/$word/i) {

Dr Ruud's code had two things related to this. […]
Second, it quotes the special characters in the word, so that ., * and
so on don't have their technical meanings anymore.

And it does that not only *repeatedly* but also for the *keys*, which I do
not consider a “small error, but a logically flawed algorithm. The quoting
is either unnecessary or desastrous (as in “not a solutionâ€) this way.
Hence my correction.
 
D

Dr.Ruud

^^^^
Be extra careful with “for(each)†loops in Perl.

This for-each-loop will make $key an lvalue-iterator so that the assignment
operation *modifies* the elements of the array @keys for each $phrase; not
what one wants here.

Thanks Thomas, good catch. I started limping because I wanted to put
several things in the same example. So I considered quotemeta(),
index(), word boundaries, etc.


- $key = quotemeta($key);
-
push @{ $found{ $key } }, $phrase
- if $phrase =~ /\b$key\b/;
+ if $phrase =~ /\b\Q$key\E\b/

I'll leave it to the OP to decide on the usefulness of the word boundaries.

it appears to be prudent to create a hash whose keys are the
original keys, and the corresponding values are the quoted keys

Or use the compiled regular expressions as values. Also because Perl
keeps moving forward with them.
 
F

fmassion

Thank you, Ben
These comments just create noise. If the reader does not know what the
commented line does, the comment won't really help.
Your're right.
use strict;
use warnings;
I've now done so until all Messages stating that a global symbol requires an explicit package Name disappear. However, I get no result anymore. If I use my previous code (without "my" and without "use strict" I get the results

Here the new code:
#!/usr/bin/perl -w
use strict;
use warnings;
open(WORDLIST,$ARGV[0]) || die("Cannot open $ARGV[0]!\n");
my @words = <WORDLIST>;
close(WORDLIST);
open(PHRASELIST,$ARGV[1]) || die("Cannot open $ARGV[1])!\n");
my @phrase = <PHRASELIST>;
close(PHRASELIST);
for my $phrase (@phrase) {
chomp $phrase;
for my $word (@words) {
chomp $word;
if ($phrase =~ m/\b$word\b/i) {
my $found = $word."\t".$phrase;
push (my @result, $found);
print @result."\n";
}}}
foreach (my @result){ # go through the result array
my ($word, $phrase) = split(/\t/,$_);
print "$word --> $phrase\n";
}
 
T

Thomas 'PointedEars' Lahn

Dr.Ruud said:
Dr.Ruud said:
for my $phrase (<DATA>) {
[…]
for my $key (@keys) { ^^^^
$key = quotemeta($key);
^^^^
[…]
This for-each-loop will make $key an lvalue-iterator so that the
assignment operation *modifies* the elements of the array @keys for each
$phrase; not what one wants here.

Thanks Thomas, good catch. […]

You're welcome.
[…]
push @{ $found{ $key } }, $phrase
- if $phrase =~ /\b$key\b/;
+ if $phrase =~ /\b\Q$key\E\b/
[…]
it appears to be prudent to create a hash whose keys are the
original keys, and the corresponding values are the quoted keys

Or use the compiled regular expressions as values. Also because Perl
keeps moving forward with them.

What do you mean by that?
 
B

Ben Bacarisse

Thank you, Ben
These comments just create noise. If the reader does not know what the
commented line does, the comment won't really help.
Your're right.
use strict;
use warnings;
I've now done so until all Messages stating that a global symbol
requires an explicit package Name disappear. However, I get no result
anymore. If I use my previous code (without "my" and without "use
strict" I get the results

Here the new code:
#!/usr/bin/perl -w
use strict;
use warnings;
open(WORDLIST,$ARGV[0]) || die("Cannot open $ARGV[0]!\n");
my @words = <WORDLIST>;
close(WORDLIST);
open(PHRASELIST,$ARGV[1]) || die("Cannot open $ARGV[1])!\n");
my @phrase = <PHRASELIST>;
close(PHRASELIST);

You need to have one @result array whose scope spans all uses, so put

my @results = ();

here.
for my $phrase (@phrase) {
chomp $phrase;
for my $word (@words) {
chomp $word;
if ($phrase =~ m/\b$word\b/i) {
my $found = $word."\t".$phrase;
push (my @result, $found);

remove the "my" here and the one int the foreach below.
 
D

Dr.Ruud

What do you mean by that?

It is mainly about the speed. In previous versions of Perl, keeping the
simple string version of a regular expression in a variable,
performed faster. So for a /$re/ in big loops, it paid off to use:

my $re = "some-regex";

in stead of:

my $re = qr/some-regex/;


But effort has been, and is still being, put in to make the qr// variant
the right choice in all normal contexts.
 
F

fmassion

Hi,

sorry to reply so late.
my @results = ();
here. (...)
remove the "my" here and the one int the foreach below.

I have changed the code accordingly to :

10 my @results = ();
11 for my $phrase (@phrase) {
12 chomp $phrase;
13 for my $word (@words) {
14 chomp $word;
15 if ($phrase =~ m/\b$word\b/i) {
16 my $found = $word."\t".$phrase;
17 push (@result, $found);
18 print @result."\n";
19 }}}
20 foreach (@result){ # go through the result array
21 my ($word, $phrase) = split(/\t/,$_);
22 print "$word --> $phrase\n";
When I run the script I get:
Global symbol "@result" requires explicit package name at test.pl line 17.
Global symbol "@result" requires explicit package name at test.pl line 18.
Global symbol "@result" requires explicit package name at test.pl line 20.
Execution of test.pl aborted due to compilation errors.
 
J

Justin C

Hi,

sorry to reply so late.
my @results = ();
here. (...)
remove the "my" here and the one int the foreach below.

I have changed the code accordingly to :

10 my @results = ();
[snip]

Global symbol "@result" requires explicit package name at test.pl line 17.

Typo.


Justin.
 
T

Thomas 'PointedEars' Lahn

Dr.Ruud said:
It is mainly about the speed. In previous versions of Perl, keeping the
simple string version of a regular expression in a variable,
performed faster. So for a /$re/ in big loops, it paid off to use:

my $re = "some-regex";

in stead of:

my $re = qr/some-regex/;

How can that be?

,----[perlop(1)
| […]
| Since Perl may compile the pattern at the moment of
| execution of the qr() operator, using qr() may have speed
| advantages in some situations, notably if the result of
| qr() is used standalone:
|
| sub match {
| my $patterns = shift;
| my @compiled = map qr/$_/i, @$patterns;
| grep {
| my $success = 0;
| foreach my $pat (@compiled) {
| $success = 1, last if /$pat/;
| }
| $success;
| } @_;
| }
|
| Precompilation of the pattern into an internal
| representation at the moment of qr() avoids a need to
^^^^^^^^^^^^^^^^
| recompile the pattern every time a match "/$pat/" is
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| attempted. (Perl has many other internal optimizations,
^^^^^^^^^
| but none would be triggered in the above example
| if we did not use qr() operator.)
| […]
`----

and

| m/PATTERN/msixpodualgc
| /PATTERN/msixpodualgc
| […]
| PATTERN may contain variables, which will be interpolated
| every time the pattern search is evaluated, except for
| when the delimiter is a single quote. (Note that $(, $),
| and $| are not interpolated because they look like end-of-
| string tests.) Perl will not recompile the pattern unless
^^^^^^
| an interpolated variable that it contains changes.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By algorithm, the value of $key *always* changes here, so if you use
/\b\Q$key\E\b/ and the like in the loop, IIUC Perl will have to recompile
the pattern *every* time.

The most efficient approach should be to precompile the expression:

$ perl -e '

use strict;
use warnings;
use Data::Dumper;

my %keys = map { $_ => qr/\b$_\b/ } qw(foo.bar bar.baz);
my @phrases = qw(foo.bar baz foo bar.baz foo.bar.baz);
my %found;

print Dumper(\%keys);

foreach my $phrase (@phrases) {
while (my ($key, $key_expr) = each %keys) {
push @{ $found{ $key } }, $phrase
if $phrase =~ $key_expr;
}
}

print Dumper(\%found);'
$VAR1 = {
'bar.baz' => qr/(?^:\bbar.baz\b)/,
'foo.bar' => qr/(?^:\bfoo.bar\b)/
};
$VAR1 = {
'bar.baz' => [
'bar.baz',
'foo.bar.baz'
],
'foo.bar' => [
'foo.bar',
'foo.bar.baz'
]
};
But effort has been, and is still being, put in to make the qr// variant
the right choice in all normal contexts.

-v please
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads

Regex basic question 2
hash of arrays 1
Parse using Text::CSV into Hash 9
Comparing values of multiple hash keys 8
Help with Hash 2
data to hash 1
Uninitialized values in hash 2
Need help with this script 4

Members online

Forum statistics

Threads
473,733
Messages
2,569,440
Members
44,832
Latest member
GlennSmall

Latest Threads

Top