How do I get the text that is found by a regular expression?


scottcabit

Hi,

I am using a perl program I wrote to search MS Word .doc files for regular expressions using pattern matching. But after 3 days of googling, I cannot find any example where someone actually retrieves the text that is found by the pattern matching!
Here is part of my code:

# The following pattern finds all document numbers
$find->{Text} = m/\d{3}-\d{4}-\d{3}/;

if ($find->Execute()) {
print "The search text was found in $File::Find::name\n";
printf TextFile ("%s\n", $File::Find::name);

# my $output = $find->Found;
# printf TextFile ("%s\n",$find->{Text});
printf TextFile ($1."\n");
} else {
print ".";
}


The line printf TextFile ("%s\n",$find->{Text});

will display the text if it is assigned as a string, not with regular expressions. With regular expressions, it only shows me 1 or 0.

The line printf TextFile ($1."\n");

gives me a warning when run saying: Use of uninitialized value $1 in concatenation (.) or string

So what is the syntax for actually printing the text that was found by the search for a regular expression?


Thanks!
 
Rainer Weikusat

I am using a perl program I wrote to search MS Word .doc files for
regular expressions using pattern matching. But after 3 days of
googling, I cannot find any example where someone actually retrieves
the text that is found by the pattern matching!

Here is part of my code:

# The following pattern finds all document numbers
$find->{Text} = m/\d{3}-\d{4}-\d{3}/;

NB: This is a general answer which might be totally useless for you
because you didn't explain what $find is.

This matches against the current value of $_ and assigns the result of
the match to $find->{Text}. In scalar context this result is either 1
(matched) or the empty string (not matched). The matched text itself
could be assigned via

($find->{Text}) = m/(\d{3}-\d{4}-\d{3})/;

The () inside the pattern capture the matched text. The parentheses around
$find->{Text} mean 'this is a list assignment', which causes the first bit
of 'captured text' to be assigned to the first variable in the list and
so on, e.g.

perl -ne '($a,$b) = /(.)(.)/; print("$a\t$b\n");'

captures the first two characters of each input line, assigning the
first to $a and the second to $b.

In case the match was successful, the captured text will also be
available via $1 ($2, $3 and so on in case of more than one bracketed
expression in the pattern), so the first example could also be written as

m/(\d{3}-\d{4}-\d{3})/ and $find->{Text} = $1;
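
To make the scalar/list distinction above concrete, here is a minimal sketch. The sample string and the variable names $line, $matched and $docnum are made up for illustration, and the match is bound to $line with =~ rather than run against $_:

```perl
use strict;
use warnings;

# Hypothetical sample text; not taken from the original poster's documents.
my $line = 'See document 123-4567-890 for details';

# Scalar assignment: only records whether the match succeeded.
my $matched = $line =~ m/\d{3}-\d{4}-\d{3}/;

# List assignment: the parentheses on the left force list context,
# so the text captured by the () in the pattern is stored instead.
my ($docnum) = $line =~ m/(\d{3}-\d{4}-\d{3})/;

print "matched: $matched\n";
print "docnum:  $docnum\n";
```

The only difference between the two assignments is the parentheses on the left-hand side and the capture group in the pattern.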
 
Jim Gibson

Hi,

I am using a perl program I wrote to search MS Word .doc files for regular
expressions using pattern matching. But after 3 days of googling, I cannot
find any example where someone actually retrieves the text that is found by
the pattern matching!
Here is part of my code:

# The following pattern finds all document numbers
$find->{Text} = m/\d{3}-\d{4}-\d{3}/;

You want to use the binding operator =~, not simple assignment. As
written, the pattern is matched against the default variable $_, not
against the string in $find->{Text}, and the result of that match is
then assigned to $find->{Text}.

The line printf TextFile ("%s\n",$find->{Text});

will display the text if it is assigned as a string, not with regular
expressions. With regular expressions, it only shows me 1 or 0.

You are assigning the result of the binding operation, not the string
matched. The result of the binding operation in a scalar context is
true if the pattern matched and false if it did not.

The line printf TextFile ($1."\n");

gives me a warning when run saying: Use of uninitialized value $1 in
concatenation (.) or string

So what is the syntax for actually printing the text that was found by the search for a regular expression?

You want to enclose the parts of the regular expression to be captured
in parentheses:

$find->{Text} =~ m/(\d{3}-\d{4}-\d{3})/;

If the string matches, then following this line, $1 will contain the
document number.

You should check to see if the string matched before trying to use the
results:

if( $find->{Text} =~ m/(\d{3}-\d{4}-\d{3})/ ) {
print "The document number is $1\n";
}

See 'perldoc perlre' for details and 'perldoc perlop', searching the
latter for "Regexp Quote-Like Operators".
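
As a small illustration of why the check matters, the following sketch runs the same pattern over two strings, one of which contains no document number. The sample strings and the @report array are hypothetical; $1 is read only inside the branch where the match succeeded, because a failed match does not set it:

```perl
use strict;
use warnings;

# Hypothetical sample strings; the second has no document number.
my @lines = ('Ref 123-4567-890 approved', 'No number on this line');
my @report;

for my $line (@lines) {
    if ($line =~ m/(\d{3}-\d{4}-\d{3})/) {
        # Safe: $1 was set by the match that just succeeded.
        push @report, "found $1";
    }
    else {
        # Do not touch $1 here; this match did not set it.
        push @report, "no match";
    }
}

print "$_\n" for @report;
```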
 
scottcabit

Jim wrote:

You want to enclose the parts of the regular expression to be captured
in parentheses:

$find->{Text} =~ m/(\d{3}-\d{4}-\d{3})/;

Yes, that helps. My code now finds the search text regular expression and puts it in $1, most of the time! There is still an occasion when it performs a find execute and thinks it found the text, only to give me the error: Use of uninitialized value $1 in concatenation (.) or string, even though there are instances of my regular expression in the document it was searching.

Now I need to iterate through my document and find all instances of my regular expression match and print them.

Here is the subroutine I am calling each time File::Find finds a Word document for me to check:

sub rTxt {

# We only want .doc files (no links...)
return unless /\.doc$/ && -f && ! -l;

# Open document
my $doc = $MSWord->Documents->Open({FileName=>$File::Find::name});

# Exit nicely if we couldn't open doc
return unless $doc;

my $content=$doc->Content;
my $find=$content->Find;

# The following pattern finds all document numbers
$find->{Text} = m/(\d{3}-\d{4}-\d{3})/;

if ($find->Execute()) {
print "The search text was found in $File::Find::name\n";
printf TextFile ("%s\n", $File::Find::name);
printf TextFile ($1."\n");
} else {
print ".";
}
# Close document
$doc->Close();
}

Is there any easy way to search the whole document for every occurrence that matches my pattern? Do I have to copy the whole document text first and then search it?

Thanks
 
Rainer Weikusat

Yes, that helps. My code now finds the search text regular
expression and puts it in $1, most of the time! There is still an
occasion when it performs a find execute and thinks it found the
text, only to give me the error: Use of uninitialized value $1 in
concatenation (.) or string, even though there are instances of my
regular expression in the document it was searching.

The code you've quoted below absolutely, certainly doesn't do that as
$_ is matched against this regex and the result is assigned to
$find->{Text}, whatever the purpose of that may be.

[...]

sub rTxt {

# We only want .doc files (no links...)
return unless /\.doc$/ && -f && ! -l;

# Open document
my $doc = $MSWord->Documents->Open({FileName=>$File::Find::name});

# Exit nicely if we couldn't open doc
return unless $doc;

my $content=$doc->Content;
my $find=$content->Find;

# The following pattern finds all document numbers
$find->{Text} = m/(\d{3}-\d{4}-\d{3})/;

[...]
 
$Bill

Here is the subroutine I am calling each time the File::Find finds a word document for me to check:

sub rTxt {

# We only want .doc files (no links...)
return unless /\.doc$/ && -f && ! -l;

# Open document
my $doc = $MSWord->Documents->Open({FileName=>$File::Find::name});

# Exit nicely if we couldn't open doc
return unless $doc;

my $content=$doc->Content;
my $find=$content->Find;

# The following pattern finds all document numbers
$find->{Text} = m/(\d{3}-\d{4}-\d{3})/;

The m// is working on $_. I assume there's something in $_, like the file name?
Are you looking for doc #s in the file name or in the file content?
What's in {Text}, or are you trying to put something in there?
If you had all of the doc text in $_, that would give you a list of them in {Text}.
If $content contains the data with the doc #s, you want to use that instead of $_:

my @docnums = $content =~ /(\d{3}-\d{4}-\d{3})/gs;

would give you all the doc #s in the file.
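
A minimal sketch of that list-context /g match, run on a plain Perl string. This assumes the document body has already been extracted into a string, here the hypothetical $text. In the posted subroutine, $content is the object returned by $doc->Content rather than a string, so the text would first have to be pulled out of it (possibly via a Text property, which is an assumption and not confirmed in this thread):

```perl
use strict;
use warnings;

# Assumption: $text holds the document body as a plain string.
# The sample data below is made up for illustration.
my $text = "Refs: 123-4567-890, 222-3333-444\nand 555-6666-777.";

# In list context, /g returns every captured match in the string.
my @docnums = $text =~ /(\d{3}-\d{4}-\d{3})/g;

# An array in scalar context yields its element count.
my $count = @docnums;

print "$count found: @docnums\n";
```

The /s modifier only changes what "." matches, so it is harmless but not needed for a pattern like this one that contains no ".".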
 
scottcabit

Hi,

The regular expression does not seem to work. Here is what I've tried....

# The following pattern finds all document numbers
$find->{Text} = m/(\d{3}-\d{4}-\w{3})/; #\d{3})/;

if ($find->Execute()) {
my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;
my $docnums_count = @docnums;
print $docnums_count;
}

So, I get into the $find->Execute block, so the expression is being found in the Word document, but once inside,
my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;

never finds any occurrences of the regular expression. I also tried it without the trailing gs. Same result. print $docnums_count always prints 0.

Any ideas?

Thanks
 
Rainer Weikusat

The regular expression does not seem to work. Here is what I've tried....

# The following pattern finds all document numbers
$find->{Text} = m/(\d{3}-\d{4}-\w{3})/; #\d{3})/;

For how much longer do you plan to repost this particular piece of "code
which doesn't make any sense" (preferably without context so that "Happy
guessing hour!" never ends)?

if ($find->Execute()) {
my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;
my $docnums_count = @docnums;
print $docnums_count;
}

So, I get into the $find->Execute block, so the expression is being found in
the Word document,

... but nobody knows what $find->Execute actually does. Judging from the
more complete example you posted last time, it ought to be some kind of
OLE method of some kind of object returned by a 'document' OLE method of
MS-Word. In any case, you're assigning the result of a pattern match
against $_, which contains the filename File::Find currently returned, to
$find->{Text}. This will usually be false (the empty string) but might be
1 in case of 'strange circumstances'. And Microsoft DOES NOT publish
documentation on this, at least not anywhere on the web where it could be
found with a reasonable amount of searching.

but once inside,
my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;

never finds any occurrences of the regular expression.

Nobody knows what $content happens to be but given the broken way in
which you're trying to use this unknown 'presumably find something which
is some sort of text', chances are that the code inside the block is
never executed, anyway.

Any ideas?

"Stop trying".
 
Jim Gibson

Hi,

The regular expression does not seem to work. Here is what I've tried....

# The following pattern finds all document numbers
$find->{Text} = m/(\d{3}-\d{4}-\w{3})/; #\d{3})/;

That needs to be:
$find->{Text} =~ m/(\d{3}-\d{4}-\w{3})/; #\d{3})/;

Note the use of the binding operator '=~' instead of assignment '='

if ($find->Execute()) {
my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;
my $docnums_count = @docnums;
print $docnums_count;
}

So, I get into the $find->Execute block, so the expression is being found in the
Word document, but once inside,
my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;

What is in $content? What relation does $content have with the
previously used $find->{Text}? What is in $find, anyway?

never finds any occurrences of the regular expression. I also tried it
without the trailing gs. Same result. print $docnums_count always prints 0.

Any ideas?

I suggest you separate the tasks of 1) fetching the document and 2)
parsing the document looking for document numbers. Put the content of
your document into a Perl scalar variable (e.g., $content), print that,
and then attempt to extract document numbers from that. That should
take only a short program, which you could post in its entirety here.
Then we wouldn't have to guess what the rest of your program is doing.

Something like this:

#!/usr/bin/perl
use strict;
use warnings;
my $content = 'stuff ... 123-4567-ABC ... more stuff';
print "$content\n";
my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;
print "@docnums\n";
 
Rainer Weikusat

Jim Gibson said:
That needs to be:
$find->{Text} =~ m/(\d{3}-\d{4}-\w{3})/; #\d{3})/;

Note the use of the binding operator '=~' instead of assignment '='

Judging from glimpses of other people's web postings, $find likely
refers to an object which can be used to 'find' something in the
associated document and $find->{Text} seems to be what $find is
supposed to look for (and the OP likely believes that he is "assigning
the regex" to this, not the result of evaluating the match, and that
this would 'magically' cause the OLE-object represented by $find to do
'PCRE-matching' instead of whatever it usually does).
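
The difference between "assigning the regex" and "assigning the result of evaluating the match" can be sketched in plain Perl. Everything here is illustrative: the file name in $_ stands in for what File::Find would supply, $text stands in for extracted document text, and qr// is shown only as the Perl way to store a pattern. Whether the OLE object could do anything useful with such a pattern is not established in this thread:

```perl
use strict;
use warnings;

# Hypothetical file name, standing in for what File::Find puts in $_.
local $_ = 'report.doc';

# What the posted code does: run the match against $_ right now
# and keep only the outcome (false here, since the name has no number).
my $result = m/(\d{3}-\d{4}-\d{3})/;

# Storing the pattern itself for later use needs qr// instead.
my $pattern = qr/(\d{3}-\d{4}-\d{3})/;

# Stand-in for document text obtained some other way.
my $text = 'Document 123-4567-890';
my ($docnum) = $text =~ $pattern;

printf "result of match: '%s'\n", $result;
print  "docnum: $docnum\n";
```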
 
