How do I get the text that is found by a regular expression?

Discussion in 'Perl Misc' started by scottcabit@gmail.com, Apr 30, 2014.

  1. Guest

    Hi,

    I am using a perl program I wrote to search MS Word .doc files for regular expressions using pattern matching. But after 3 days of googling, I cannot find any example where someone actually retrieves the text that is found by the pattern matching!
    Here is part of my code:

    # The following pattern finds all document numbers
    $find->{Text} = m/\d{3}-\d{4}-\d{3}/;

    if ($find->Execute()) {
    print "The search text was found in $File::Find::name\n";
    printf TextFile ("%s\n", $File::Find::name);

    # my $output = $find->Found;
    # printf TextFile ("%s\n",$find->{Text});
    printf TextFile ($1."\n");
    } else {
    print ".";
    }


    The line printf TextFile ("%s\n",$find->{Text});

    will display the text if it is assigned as a string, not with regular expressions. With regular expressions, it only shows me 1 or 0.

    The line printf TextFile ($1."\n");

    gives me a warning when run saying: Use of uninitialized value $1 in concatenation (.) or string

    So what is the syntax for actually printing the text that was found by the search for a regular expression?


    Thanks!
     
    , Apr 30, 2014
    #1

  2. writes:
    > I am using a perl program I wrote to search MS Word .doc files for
    > regular expressions using pattern matching. But after 3 days of
    > googling, I cannot find any example where someone actually retrieves
    > the text that is found by the pattern matching!
    >
    > Here is part of my code:
    >
    > # The following pattern finds all document numbers
    > $find->{Text} = m/\d{3}-\d{4}-\d{3}/;


    NB: This is a general answer which might be totally useless for you
    because you didn't explain what $find is.

    This matches against the current value of $_ and assigns the result of
    the match to $find->{Text}. This result is either 1 (matched) or an
    empty, false value (not matched). The matched text itself could be assigned via

    ($find->{Text}) = m/(\d{3}-\d{4}-\d{3})/;

    The () inside the pattern capture the matched text. The parentheses around
    $find->{Text} mean 'this is a list assignment', which causes the first bit
    of 'captured text' to be assigned to the first variable in the list and
    so on, e.g.

    perl -ne '($a,$b) = /(.)(.)/; print("$a\t$b\n");'

    captures the first two characters of each input line, assigning the
    first to $a and the second to $b.

    In case the match was successful, the captured text will also be
    available via $1 ($2, $3 and so on in case of more than one bracketed
    expression in the pattern), so the first example could also be written as

    m/(\d{3}-\d{4}-\d{3})/ and $find->{Text} = $1;
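
    To make the difference between the scalar and the list assignment concrete, here is a minimal sketch. The sample string in $_ and the variable names $flag and $number are made up for illustration; they are not from the original program.

```perl
use strict;
use warnings;

# Hypothetical sample line standing in for whatever $_ holds.
$_ = 'See document 123-4567-890 for details';

# Scalar assignment: stores the success flag of the match, not the text.
my $flag = m/\d{3}-\d{4}-\d{3}/;

# List assignment: the parentheses on the left force list context, so the
# first capture group is what gets stored.
my ($number) = m/(\d{3}-\d{4}-\d{3})/;

print "flag:   $flag\n";     # flag:   1
print "number: $number\n";   # number: 123-4567-890
```

    The only differences between the two statements are the parentheses on the left-hand side and the capture group in the pattern.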
     
    Rainer Weikusat, Apr 30, 2014
    #2

  3. Jim Gibson Guest

    In article <>,
    <> wrote:

    > Hi,
    >
    > I am using a perl program I wrote to search MS Word .doc files for regular
    > expressions using pattern matching. But after 3 days of googling, I cannot
    > find any example where someone actually retrieves the text that is found by
    > the pattern matching!
    > Here is part of my code:
    >
    > # The following pattern finds all document numbers
    > $find->{Text} = m/\d{3}-\d{4}-\d{3}/;


    You want to use the binding operator =~, not simple assignment. You are
    assigning the result of a regular expression match with the default
    variable $_, not the string in $find->{Text}.


    > The line printf TextFile ("%s\n",$find->{Text});
    >
    > will display the text if it is assigned as a string, not with regular
    > expressions. With regular expressions, it only shows me 1 or 0.


    You are assigning the result of the binding operation, not the string
    matched. The result of the binding operation in a scalar context is
    true if the pattern matched and false if it did not.
    >
    > The line printf TextFile ($1."\n");
    >
    > gives me a warning when run saying: Use of uninitialized value $1 in
    > concatenation (.) or string
    >
    > So what is the syntax for actually printing the text that was found by the search for a regular expression?


    You want to enclose the parts of the regular expression to be captured
    in parentheses:

    $find->{Text} =~ m/(\d{3}-\d{4}-\d{3})/;

    If the string matches, then following this line, $1 will contain the
    document number.

    You should check to see if the string matched before trying to use the
    results:

    if( $find->{Text} =~ m/(\d{3}-\d{4}-\d{3})/ ) {
    print "The document number is $1\n";
    }
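
    The same check works when looping over several strings. This is only a sketch: @lines is hypothetical input and @report is an illustrative name. The point is that $1 is read only in the branch where the match succeeded, which avoids the 'uninitialized value $1' warning.

```perl
use strict;
use warnings;

my $pattern = qr/(\d{3}-\d{4}-\d{3})/;
my @lines   = ('ref 123-4567-890', 'no number here');   # hypothetical input

my @report;
for my $line (@lines) {
    if ($line =~ $pattern) {
        # $1 is only meaningful after a successful match.
        push @report, "found $1";
    } else {
        # No match: $1 is deliberately not read here.
        push @report, 'no match';
    }
}
print "$_\n" for @report;
```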

    See 'perldoc perlre' for details and 'perldoc perlop', searching the
    latter for "Regexp Quote-Like Operators".

    --
    Jim Gibson
     
    Jim Gibson, Apr 30, 2014
    #3
  4. Guest

    Jim wrote:

    You want to enclose the parts of the regular expression to be captured
    in parentheses:

    $find->{Text} =~ m/(\d{3}-\d{4}-\d{3})/;

    Yes, that helps. My code now finds the search text regular expression and puts it in $1, most of the time! There is still an occasion when it performs a find execute and thinks it found the text, only to give me the error: Use of uninitialized value $1 in concatenation (.) or string, even though there are instances of my regular expression in the document it was searching.

    Now I need to iterate through my document and find all instances of my regular expression match and print them.

    Here is the subroutine I am calling each time File::Find finds a Word document for me to check:

    sub rTxt {

    # We only want .doc files (no links...)
    return unless /\.doc$/ && -f && ! -l;

    # Open document
    my $doc = $MSWord->Documents->Open({FileName=>$File::Find::name});

    # Exit nicely if we couldn't open doc
    return unless $doc;

    my $content=$doc->Content;
    my $find=$content->Find;

    # The following pattern finds all document numbers
    $find->{Text} = m/(\d{3}-\d{4}-\d{3})/;

    if ($find->Execute()) {
    print "The search text was found in $File::Find::name\n";
    printf TextFile ("%s\n", $File::Find::name);
    printf TextFile ($1."\n");
    } else {
    print ".";
    }
    # Close document
    $doc->Close();
    }

    Is there any easy way to search the whole document for every occurrence that matches my pattern? Do I have to copy the whole document text first and then search it?

    Thanks
     
    , Apr 30, 2014
    #4
  5. writes:
    > Yes, that helps. My code now finds the search text regular
    > expression and puts it in $1, most of the time! There is still an
    > occasion when it performs a find execute and thinks it found the
    > text, only to give me the error: Use of uninitialized value $1 in
    > concatenation (.) or string, even though there are instances of my
    > regular expression in the document it was searching.


    The code you've quoted below absolutely, certainly doesn't do that as
    $_ is matched against this regex and the result is assigned to
    $find->{Text}, whatever the purpose of that may be.

    [...]


    > sub rTxt {
    >
    > # We only want .doc files (no links...)
    > return unless /\.doc$/ && -f && ! -l;
    >
    > # Open document
    > my $doc = $MSWord->Documents->Open({FileName=>$File::Find::name});
    >
    > # Exit nicely if we couldn't open doc
    > return unless $doc;
    >
    > my $content=$doc->Content;
    > my $find=$content->Find;
    >
    > # The following pattern finds all document numbers
    > $find->{Text} = m/(\d{3}-\d{4}-\d{3})/;


    [...]
     
    Rainer Weikusat, Apr 30, 2014
    #5
  6. $Bill Guest

    On 4/30/2014 11:51, wrote:
    >
    > Here is the subroutine I am calling each time the File::Find finds a word document for me to check:
    >
    > sub rTxt {
    >
    > # We only want .doc files (no links...)
    > return unless /\.doc$/ && -f && ! -l;
    >
    > # Open document
    > my $doc = $MSWord->Documents->Open({FileName=>$File::Find::name});
    >
    > # Exit nicely if we couldn't open doc
    > return unless $doc;
    >
    > my $content=$doc->Content;
    > my $find=$content->Find;
    >
    > # The following pattern finds all document numbers
    > $find->{Text} = m/(\d{3}-\d{4}-\d{3})/;


    The m// is working on $_. I assume there's something in $_ like the file name?
    Are you looking for doc #s in the file name or file content?
    What's in {Text}, or are you trying to put something in there?
    If you had all of the doc text in $_ that would give you a list of them in {Text}.
    If $content contains the data with the doc #s, you want to use that instead of $_:

    my @docnums = $content =~ /(\d{3}-\d{4}-\d{3})/gs;

    would give you all the doc #s in the file.
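
    A small self-contained sketch of that idea, assuming the text is already in an ordinary Perl string. $text and its contents are made up for illustration. In list context /g returns all captures at once; in scalar context it can be used in a while loop to visit the matches one by one.

```perl
use strict;
use warnings;

# Hypothetical text; in the real program this would be the document text.
my $text = "Refs: 123-4567-890, 222-3333-444\nand 555-6666-777.";

# List context with /g returns every captured substring at once.
my @docnums = $text =~ /(\d{3}-\d{4}-\d{3})/g;
print scalar(@docnums), " found: @docnums\n";

# Scalar context with /g walks the matches one at a time.
my @offsets;
while ($text =~ /(\d{3}-\d{4}-\d{3})/g) {
    push @offsets, $-[1];   # start offset of capture group 1
}
print "offsets: @offsets\n";
```

    The /s modifier is left out here because it only changes what '.' matches, and this pattern contains no '.'.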

    > if ($find->Execute()) {
    > print "The search text was found in $File::Find::name\n";
    > printf TextFile ("%s\n", $File::Find::name);
    > printf TextFile ($1."\n");
    > } else {
    > print ".";
    > }
    > # Close document
    > $doc->Close();
    > }
    >
    > Is there any easy way to search the whole document for every occurrence that matches my pattern? Do I have to copy the whole document text first and then search it?
    >
    > Thanks
    >
     
    $Bill, May 1, 2014
    #6
  7. Guest

    Hi,

    The regular expression does not seem to work. Here is what I've tried....

    # The following pattern finds all document numbers
    $find->{Text} = m/(\d{3}-\d{4}-\w{3})/; #\d{3})/;

    if ($find->Execute()) {
    my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;
    my $docnums_count = @docnums;
    print $docnums_count;
    }

    So, I get into the $find->Execute so the expression is being found in the word document, but once inside,
    my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;

    never finds any occurrences of the regular expression. I also tried it without the trailing gs. Same result. print $docnums_count always prints 0.

    Any ideas?

    Thanks
     
    , May 2, 2014
    #7
  8. writes:
    > The regular expression does not seem to work. Here is what I've tried....
    >
    > # The following pattern finds all document numbers
    > $find->{Text} = m/(\d{3}-\d{4}-\w{3})/; #\d{3})/;


    For how much longer do you plan to repost this particular piece of "code
    which doesn't make any sense" (preferably without context so that "Happy
    guessing hour!" never ends)?

    > if ($find->Execute()) {
    > my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;
    > my $docnums_count = @docnums;
    > print $docnums_count;
    > }
    >
    > So, I get into the $find->Execute so the expression is being found in
    > the word document,


    ... but nobody knows what $find->Execute actually does. (Judging from the
    more complete example you posted last time, it ought to be some kind of
    OLE method of some kind of object returned by a 'document' OLE method of
    MS-Word.) In any case, you're assigning the result of a pattern match
    against $_, which contains the filename File::Find currently returned, to
    $find->{Text}. This will usually be false but might be 1 in case of
    'strange circumstances'. And Microsoft DOES NOT publish documentation on
    this, at least not anywhere on the web where it could be found with a
    reasonable amount of searching.

    > but once inside,
    > my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;
    >
    > never finds any occurrences of the regular expression.


    Nobody knows what $content happens to be but given the broken way in
    which you're trying to use this unknown 'presumably find something which
    is some sort of text', chances are that the code inside the block is
    never executed, anyway.

    > Any ideas?


    "Stop trying".
     
    Rainer Weikusat, May 2, 2014
    #8
  9. Jim Gibson Guest

    In article <>,
    <> wrote:

    > Hi,
    >
    > The regular expression does not seem to work. Here is what I've tried....
    >
    > # The following pattern finds all document numbers
    > $find->{Text} = m/(\d{3}-\d{4}-\w{3})/; #\d{3})/;


    That needs to be:
    $find->{Text} =~ m/(\d{3}-\d{4}-\w{3})/; #\d{3})/;

    Note the use of the binding operator '=~' instead of assignment '='


    > if ($find->Execute()) {
    > my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;
    > my $docnums_count = @docnums;
    > print $docnums_count;
    > }
    >
    > So, I get into the $find->Execute so the expression is being found in the
    > word document, but once inside,
    > my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;


    What is in $content? What relation does $content have with the
    previously used $find->{Text}? What is in $find, anyway?


    > never finds any occurrences of the regular expression. I also tried it
    > without the trailing gs. Same result. print $docnums_count always prints 0.
    >
    > Any ideas?


    I suggest you separate the tasks of 1) fetching the document and 2)
    parsing the document looking for document numbers. Put the content of
    your document into a Perl scalar variable (e.g., $content), print that,
    and then attempt to extract document numbers from that. That should
    take only a short program, which you could post in its entirety here.
    Then we wouldn't have to guess what the rest of your program is doing,

    Something like this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    my $content = 'stuff ... 123-4567-ABC ... more stuff';
    print "$content\n";
    my @docnums = $content =~ /(\d{3}-\d{4}-\w{3})/gs;
    print "@docnums\n";

    --
    Jim Gibson
     
    Jim Gibson, May 2, 2014
    #9
  10. Jim Gibson <> writes:
    > In article <>,
    > <> wrote:
    >
    >> Hi,
    >>
    >> The regular expression does not seem to work. Here is what I've tried....
    >>
    >> # The following pattern finds all document numbers
    >> $find->{Text} = m/(\d{3}-\d{4}-\w{3})/; #\d{3})/;

    >
    > That needs to be:
    > $find->{Text} =~ m/(\d{3}-\d{4}-\w{3})/; #\d{3})/;
    >
    > Note the use of the binding operator '=~' instead of assignment '='


    Judging from 'glimpses of other people's web postings', $find likely
    refers to an object which can be used to 'find' something in the
    associated document and $find->{Text} seems to be what $find is
    supposed to look for (and the OP likely believes that he is "assigning
    the regex" to this, not the result of evaluating the match, and that
    this would 'magically' cause the OLE-object represented by $find to do
    'PCRE-matching' instead of whatever it usually does).
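
    If that guess is right, one way around it is to skip the Find object entirely: copy the document body into a plain Perl string and let Perl do the matching. What follows is only a sketch under several assumptions: it needs Windows, an installed MS Word and the Win32::OLE module, and word_text() and extract_docnums() are hypothetical helper names, not part of the original program. It also assumes the Content range exposes its text through the Text property, as the earlier posts suggest.

```perl
use strict;
use warnings;

# Pure-Perl step: pull every document number out of a plain string.
# Call this in list context to get all matches.
sub extract_docnums {
    my ($text) = @_;
    return $text =~ /(\d{3}-\d{4}-\d{3})/g;
}

# Word step (assumes Windows, MS Word and Win32::OLE are available).
# word_text() is a hypothetical helper, not part of the original program.
sub word_text {
    my ($path) = @_;
    require Win32::OLE;
    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Cannot start Word: ", Win32::OLE->LastError();
    my $doc = $word->Documents->Open($path)
        or die "Cannot open $path: ", Win32::OLE->LastError();
    my $text = $doc->Content->{Text};   # plain string copy of the document body
    $doc->Close(0);                     # 0 = do not save changes
    return $text;
}

# Usage: perl script.pl C:\path\to\file.doc
if (@ARGV) {
    print "$_\n" for extract_docnums(word_text($ARGV[0]));
}
```

    Keeping the regex step separate from the Word step means the matching can be tried on an ordinary string first, without Word being involved at all.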
     
    Rainer Weikusat, May 2, 2014
    #10
