How do I get the text that is found by a regular expression?

Discussion in 'Perl Misc' started by scottcabit, Apr 30, 2014.

  1. scottcabit

    scottcabit Guest


    I am using a perl program I wrote to search MS Word .doc files for regular expressions using pattern matching. But after 3 days of googling, I cannot find any example where someone actually retrieves the text that is found by the pattern matching!
    Here is part of my code:

    # The following pattern finds all document numbers
    $find->{Text} = m/\d{3}-\d{4}-\d{3}/;

    if ($find->Execute()) {
    print "The search text was found in $File::Find::name\n";
    printf TextFile ("%s\n", $File::Find::name);

    # my $output = $find->Found;
    # printf TextFile ("%s\n",$find->{Text});
    printf TextFile ($1."\n");
    } else {
    print ".";

    The line printf TextFile ("%s\n",$find->{Text});

    will display the text if it is assigned as a string, not with regular expressions. With regular expressions, it only shows me 1 or 0.

    The line printf TextFile ($1."\n");

    gives me a warning when run saying: Use of uninitialized value $1 in concatenation (.) or string

    So what is the syntax for actually printing the text that was found by the search for a regular expression?

    scottcabit, Apr 30, 2014

  2. NB: This is a general answer which might be totally useless for you
    because you didn't explain what $find is.

    This matches against the current value of $_ and assigns the result of
    the match to $find->{Text}. This result is either 1 (matched) or the
    empty string (not matched). The matched text itself could be assigned via

    ($find->{Text}) = m/(\d{3}-\d{4}-\d{3})/;

    The () inside the pattern capture the matched text. The parentheses around
    $find->{Text} mean 'this is a list assignment', which causes the first bit
    of 'captured text' to be assigned to the first variable in the list and
    so on, e.g.

    perl -ne '($a,$b) = /(.)(.)/; print("$a\t$b\n");'

    captures the first two characters of each input line, assigning the
    first to $a and the second to $b.

    In case the match was successful, the captured text will also be
    available via $1 ($2, $3 and so on in case of more than one bracketed
    expression in the pattern), so the first example could also be written as

    m/(\d{3}-\d{4}-\d{3})/ and $find->{Text} = $1;
    Rainer Weikusat, Apr 30, 2014
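
    The difference between scalar and list assignment described above can be seen in a minimal sketch. It assumes a hypothetical sample string in $line and binds the match to it explicitly with =~ instead of matching against $_; the variable names are illustrative and not taken from the original program.

```perl
use strict;
use warnings;

# Hypothetical sample input, standing in for whatever text is being searched.
my $line = 'See document 123-4567-890 for details';

# Scalar context: the assignment stores only the success flag of the match.
my $flag = $line =~ m/\d{3}-\d{4}-\d{3}/;

# List context with a capture group: the assignment stores the captured text.
my ($number) = $line =~ m/(\d{3}-\d{4}-\d{3})/;

print "flag:   $flag\n";
print "number: $number\n";
```

    With this input, $flag holds 1 and $number holds the document number itself, which is the distinction the original question ran into.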

  3. scottcabit

    Jim Gibson Guest

    You want to use the binding operator =~, not simple assignment. You are
    assigning the result of a regular expression match with the default
    variable $_, not the string in $find->{Text}.

    You are assigning the result of the binding operation, not the string
    matched. The result of the binding operation in a scalar context is
    true if the pattern matched and false if it did not.
    You want to enclose the parts of the regular expression to be captured
    in parentheses:

    $find->{Text} =~ m/(\d{3}-\d{4}-\d{3})/;

    If the string matches, then following this line, $1 will contain the
    document number.

    You should check to see if the string matched before trying to use $1:

    if( $find->{Text} =~ m/(\d{3}-\d{4}-\d{3})/ ) {
    print "The document number is $1\n";
    }

    See 'perldoc perlre' for details and 'perldoc perlop', searching the
    latter for "Regexp Quote-Like Operators".
    Jim Gibson, Apr 30, 2014
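
    As a small illustration of checking the match before touching $1, here is a sketch that runs the same pattern over two hypothetical strings, only one of which contains a document number. The sample data and the @report array are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical sample strings; only the first contains a document number.
my @samples = ('Ref 123-4567-890 attached', 'no number here');

my @report;
for my $text (@samples) {
    if ($text =~ m/(\d{3}-\d{4}-\d{3})/) {
        # $1 is only used inside the branch where the match succeeded.
        push @report, "found $1";
    } else {
        push @report, 'no match';
    }
}
print "$_\n" for @report;
```

    Using $1 only inside the successful branch avoids the "Use of uninitialized value $1" warning, because $1 is never read after a failed match.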
  4. scottcabit

    scottcabit Guest

    Jim wrote:

    You want to enclose the parts of the regular expression to be captured
    in parentheses:

    $find->{Text} =~ m/(\d{3}-\d{4}-\d{3})/;

    Yes, that helps. My code now finds the search text regular expression and puts it in $1, most of the time! There is still an occasion when it performs a find execute and thinks it found the text, only to give me the error: Use of uninitialized value $1 in concatenation (.) or string, even though there are instances of my regular expression in the document it was searching.

    Now I need to iterate through my document and find all instances of my regular expression match and print them.

    Here is the subroutine I am calling each time File::Find finds a Word document for me to check:

    sub rTxt {

    # We only want .doc files (no links...)
    return unless /\.doc$/ && -f && ! -l;

    # Open document
    my $doc = $MSWord->Documents->Open({FileName=>$File::Find::name});

    # Exit nicely if we couldn't open doc
    return unless $doc;

    my $content=$doc->Content;
    my $find=$content->Find;

    # The following pattern finds all document numbers
    $find->{Text} = m/(\d{3}-\d{4}-\d{3})/;

    if ($find->Execute()) {
    print "The search text was found in $File::Find::name\n";
    printf TextFile ("%s\n", $File::Find::name);
    printf TextFile ($1."\n");
    } else {
    print ".";
    # Close document

    Is there any easy way to search the whole document for every occurrence that matches my pattern? Do I have to copy the whole document text first and then search it?

    scottcabit, Apr 30, 2014
  5. The code you've quoted below absolutely, certainly doesn't do that as
    $_ is matched against this regex and the result is assigned to
    $find->{Text}, whatever the purpose of that may be.


    Rainer Weikusat, Apr 30, 2014
  6. scottcabit

    $Bill Guest

    The m// is working on $_ - I assume there's something in $_ like the file name?
    Are you looking for doc #s in the file name or file content?
    What's in {Text}, or are you trying to put something in there?
    If you had all of the doc text in $_ that would give you a list of them in {Text}.
    If $content contains the data with the doc #s, you want to use that instead of $_:

    my @docnums = $content =~ /(\d{3}-\d{4}-\d{3})/gs;

    would give you all the doc #s in the file.
    $Bill, May 1, 2014
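
    To make the /g behaviour concrete, the following sketch collects every match from a hypothetical multi-line string, first in list context and then one match at a time in scalar context. The sample text and the @offsets array are assumptions for illustration. The /s modifier is left out because it only changes what . matches, and this pattern contains no dot.

```perl
use strict;
use warnings;

# Hypothetical document text spanning two lines.
my $text = "First 111-2222-333, then 444-5555-666.\nLast 777-8888-999.";

# List context with /g returns every captured substring at once.
my @docnums = $text =~ /(\d{3}-\d{4}-\d{3})/g;

# Scalar context with /g walks the string one match at a time.
my @offsets;
while ($text =~ /(\d{3}-\d{4}-\d{3})/g) {
    push @offsets, pos($text) - length($1);   # start offset of this match
}

print "@docnums\n";
print "@offsets\n";
```

    The list form is enough when only the matched strings are needed; the while loop is useful when each match needs extra handling, such as recording where it occurred.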
  7. scottcabit

    scottcabit Guest


    The regular expression does not seem to work. Here is what I've tried....

    # The following pattern finds all document numbers
    $find->{Text} = m/(\d{3}-\d{4}-\w{3})/; #\d{3})/;

    if ($find->Execute()) {
    my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;
    my $docnums_count = @docnums;
    print $docnums_count;

    So, I get into the $find->Execute block, so the expression is being found in the Word document, but once inside,
    my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;

    never finds any occurrences of the regular expression. I also tried it without the trailing gs. Same result. print $docnums_count always prints 0.

    Any ideas?

    scottcabit, May 2, 2014
  8. For how much longer do you plan to repost this particular piece of "code
    which doesn't make any sense" (preferably without context so that "Happy
    guessing hour!" never ends)?
    .... but nobody knows what $find->Execute actually does. (Judging from the
    more complete example you posted last time, it ought to be some kind of
    OLE method of some kind of object returned by a 'document' OLE method of
    MS-Word.) In any case, you're assigning the result of a pattern match
    against $_, which contains the filename File::Find currently returned, to
    $find->{Text}. This will usually be the empty string but might be 1 in case of
    'strange circumstances'. And Microsoft DOES NOT publish documentation on
    this, at least not anywhere on the web where it could be found with a
    reasonable amount of searching.
    Nobody knows what $content happens to be but given the broken way in
    which you're trying to use this unknown 'presumably find something which
    is some sort of text', chances are that the code inside the block is
    never executed, anyway.
    "Stop trying".
    Rainer Weikusat, May 2, 2014
  9. scottcabit

    Jim Gibson Guest

    That needs to be:
    $find->{Text} =~ m/(\d{3}-\d{4}-\w{3})/; #\d{3})/;

    Note the use of the binding operator '=~' instead of assignment '='

    What is in $content? What relation does $content have with the
    previously used $find->{Text}? What is in $find, anyway?

    I suggest you separate the tasks of 1) fetching the document and 2)
    parsing the document looking for document numbers. Put the content of
    your document into a Perl scalar variable (e.g., $content), print that,
    and then attempt to extract document numbers from that. That should
    take only a short program, which you could post in its entirety here.
    Then we wouldn't have to guess what the rest of your program is doing.

    Something like this:

    use strict;
    use warnings;
    my $content = 'stuff ... 123-4567-ABC ... more stuff';
    print "$content\n";
    my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;
    print "@docnums\n";
    Jim Gibson, May 2, 2014
  10. Judging from 'glimpses on other people's web postings', $find likely
    refers to an object which can be used to 'find' something in the
    associated document and $find->{Text} seems to be what $find is
    supposed to look for (and the OP likely believes that he is "assigning
    the regex" to this, not the result of evaluating the match, and that
    this would 'magically' cause the OLE-object represented by $find to do
    'PCRE-matching' instead of whatever it usually does).
    Rainer Weikusat, May 2, 2014
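
    One possible explanation for the zero count reported earlier in the thread is that $content is an object rather than a string, so a Perl regex bound to it is matched against the stringified reference instead of the document text. The sketch below illustrates this with a hypothetical stand-in class (FakeRange, a plain blessed hash); it is not the real Win32::OLE object, and the assumption that the real range exposes its text through a Text property is not confirmed anywhere in this thread.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the Word range object: a blessed hash whose
# Text field holds the document text. This is not the real Win32::OLE class.
my $content = bless { Text => 'Rev A 123-4567-ABC, Rev B 987-6543-XYZ' }, 'FakeRange';

# Matching against the object itself matches against its stringified
# reference (something like "FakeRange=HASH(0x...)"), not the document text.
my @from_object = $content =~ /(\d{3}-\d{4}-\w{3})/g;

# Matching against the extracted text finds the document numbers.
my $text = $content->{Text};
my @from_text = $text =~ /(\d{3}-\d{4}-\w{3})/g;

printf "from object: %d\n", scalar @from_object;
printf "from text:   %d\n", scalar @from_text;
print "$_\n" for @from_text;
```

    This follows the suggestion made above to separate fetching the document from parsing it: copy the text into an ordinary Perl scalar first, then apply the regular expression to that scalar.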
