Pattern Matching problem!

Discussion in 'Perl Misc' started by Francis Sylvester, Nov 14, 2005.

  1. Hi,

    I'm a Perl newbie and am having a nightmare trying to get the code below
    working. I'm trying to fetch a webpage and if a link within the page matches
    the search criterion - return the text after the link. It doesn't seem to be
    working and I'm wondering if it's because the pattern match is within the
    while loop. If anybody can shed some light I'd be eternally grateful!

    Cheers,
    Francis

    # --------------------------
    use LWP::Simple;
    use HTML::TokeParser;

    my $document = get("http://www.anexamplesite.com");
    my $mymatch = "searchstring";

    my $parser = HTML::TokeParser->new(\$document);

    while ($token = $parser->get_tag("a")) {
        if ($token->[1]->{"href"} =~ /$mymatch/) {
            # print $server.$token->[1]->{href}."\n";
            $document =~ /$searchstring(.+?)someidentifier/;
            print "$1";
        }
    }
     
    Francis Sylvester, Nov 14, 2005
    #1

  2. "Francis Sylvester" <> wrote in
    news:AH8ef.16551$:

    > I'm a Perl newbie and am having a nightmare trying to get the code
    > below working. I'm trying to fetch a webpage and if a link within the
    > page matches the search criterion - return the text after the link. It
    > doesn't seem to be working and I'm wondering


    As it is, we have no idea what "doesn't seem to be working" means. Please
    read the posting guidelines to find out how you can help yourself, and,
    in the process, help others help you.

    use strict;
    use warnings;

    missing.
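
To illustrate why these two pragmas matter here: the posted snippet assigns $mymatch but interpolates $searchstring, which is never declared in the code shown. The following is a minimal sketch, assuming the rest of the script does not declare it either; the helper name strict_error is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a code string under strict and warnings,
# returning the error message, or an empty string if it compiled and ran.
sub strict_error {
    my ($code) = @_;
    my $ok = eval "use strict; use warnings; $code; 1";
    return $ok ? '' : $@;
}

# The variable names are from the original post; only $mymatch is declared.
my $snippet = q{
    my $mymatch = "searchstring";
    my $re = qr/$searchstring(.+?)someidentifier/;
};

my $error = strict_error($snippet);
print $error ne '' ? "strict says: $error" : "compiled cleanly\n";
```

Without strict, the undeclared variable interpolates as an empty string, so the pattern silently becomes /(.+?)someidentifier/ and can match far more than intended.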

    > use LWP::Simple;
    > use HTML::TokeParser;
    >
    > my $document = get("http://www.anexamplesite.com");
    > my $mymatch = "searchstring";
    >
    > my $parser = HTML::TokeParser->new(\$document);
    >
    > while ($token = $parser->get_tag("a")) {
    > if ($token->[1]->{"href"} =~ /$mymatch/) {
    > # print $server.$token->[1]->{href}."\n";
    > $document =~ /$searchstring(.+?)someidentifier/;


    The exact contents of $mymatch, $searchstring and whatever
    someidentifier is might have something to do with what's actually
    being matched, no?

    > print "$1";


    You are not capturing anything, why do you expect there to be anything
    valid in $1?

    Sinan
    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Nov 14, 2005
    #2

  3. Francis Sylvester wrote:
    > Hi,
    >
    > I'm a Perl newbie and am having a nightmare trying to get the code below
    > working. I'm trying to fetch a webpage and if a link within the page matches
    > the search criterion - return the text after the link. It doesn't seem to be
    > working and I'm wondering if it's because the pattern match is within the
    > while loop. If anybody can shed some light I'd be eternally grateful!
    >
    > Cheers,
    > Francis
    >
    > # --------------------------
    > use LWP::Simple;
    > use HTML::TokeParser;
    >
    > my $document = get("http://www.anexamplesite.com");
    > my $mymatch = "searchstring";
    >
    > my $parser = HTML::TokeParser->new(\$document);
    >
    > while ($token = $parser->get_tag("a")) {
    > if ($token->[1]->{"href"} =~ /$mymatch/) {


    try:
    if ( $token->[1]{href} =~ /$mymatch/o ) {

    > # print $server.$token->[1]->{href}."\n";
    > $document =~ /$searchstring(.+?)someidentifier/;
    > print "$1";
    > }
    > }
     
    it_says_BALLS_on_your forehead, Nov 14, 2005
    #3
  4. it_says_BALLS_on_your forehead wrote:
    > Francis Sylvester wrote:
    >>
    >>while ($token = $parser->get_tag("a")) {
    >> if ($token->[1]->{"href"} =~ /$mymatch/) {

    >
    > try:
    > if ( $token->[1]{href} =~ /$mymatch/o ) {


    I fail to see why that would make a difference. Could you please explain
    why you think it would?

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Nov 15, 2005
    #4
  5. Gunnar Hjalmarsson wrote:
    > it_says_BALLS_on_your forehead wrote:
    > > Francis Sylvester wrote:
    > >>
    > >>while ($token = $parser->get_tag("a")) {
    > >> if ($token->[1]->{"href"} =~ /$mymatch/) {

    > >
    > > try:
    > > if ( $token->[1]{href} =~ /$mymatch/o ) {

    >
    > I fail to see why that would make a difference. Could you please explain
    > why you think it would?
    >


    I looked up HTML::TokeParser in CPAN.

    The first Example displayed illustrated that the way to get the href
    was:

    my $url = $token->[1]{href} || "-";

    ....i noticed that the OP did not use the same syntax. I didn't know if
    this was causing his problem. the 'o' at the end of the pattern was
    just to optimize the pattern match, since it doesn't seem like the OP
    needed to recompile the regex every time...
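
The /o flag only asks Perl to interpolate the variable once; it does not change what the pattern matches. A common alternative for a pattern that stays fixed inside a loop is to precompile it with qr//. The following is a rough sketch of that idea; the helper name matching_hrefs and the sample URLs are invented for illustration.

```perl
use strict;
use warnings;

my $mymatch = "searchstring";    # value from the original post
my $link_re = qr/$mymatch/;      # compiled once, reused below

# Hypothetical helper: keep only the hrefs that match the precompiled pattern.
sub matching_hrefs {
    my ($re, @hrefs) = @_;
    return grep { defined($_) && $_ =~ $re } @hrefs;
}

my @hits = matching_hrefs(
    $link_re,
    'http://www.anexamplesite.com/searchstring/1',
    'http://www.anexamplesite.com/other',
    undef,    # stands in for an <a> tag without an href attribute
);
print "$_\n" for @hits;
```

Either way, the set of matching links is the same as with the plain /$mymatch/ test, so this would not by itself explain the unwanted results.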
     
    it_says_BALLS_on_your forehead, Nov 15, 2005
    #5
  6. it_says_BALLS_on_your forehead wrote:
    > Gunnar Hjalmarsson wrote:
    >>it_says_BALLS_on_your forehead wrote:
    >>>Francis Sylvester wrote:
    >>>>
    >>>>while ($token = $parser->get_tag("a")) {
    >>>> if ($token->[1]->{"href"} =~ /$mymatch/) {
    >>>
    >>>try:
    >>>if ( $token->[1]{href} =~ /$mymatch/o ) {

    >>
    >>I fail to see why that would make a difference. Could you please explain
    >>why you think it would?

    >
    > I looked up HTML::TokeParser in CPAN.


    That's a good start, I suppose. :)

    > The first Example displayed illustrated that the way to get the href
    > was:
    >
    > my $url = $token->[1]{href} || "-";
    >
    > ...i noticed that the OP did not use the same syntax. I didn't know if
    > this was causing his problem.


    The reason why I asked is that I thought that

    $token->[1]->{"href"}

    is always the same as

    $token->[1]{href}

    following Perl's syntax for references and data structures.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Nov 15, 2005
    #6
  7. Gunnar Hjalmarsson wrote:
    > it_says_BALLS_on_your forehead wrote:
    > > Gunnar Hjalmarsson wrote:
    > >>it_says_BALLS_on_your forehead wrote:
    > >>>Francis Sylvester wrote:
    > >>>>
    > >>>>while ($token = $parser->get_tag("a")) {
    > >>>> if ($token->[1]->{"href"} =~ /$mymatch/) {
    > >>>
    > >>>try:
    > >>>if ( $token->[1]{href} =~ /$mymatch/o ) {
    > >>
    > >>I fail to see why that would make a difference. Could you please explain
    > >>why you think it would?

    > >
    > > I looked up HTML::TokeParser in CPAN.

    >
    > That's a good start, I suppose. :)
    >
    > > The first Example displayed illustrated that the way to get the href
    > > was:
    > >
    > > my $url = $token->[1]{href} || "-";
    > >
    > > ...i noticed that the OP did not use the same syntax. I didn't know if
    > > this was causing his problem.

    >
    > The reason why I asked is that I thought that
    >
    > $token->[1]->{"href"}
    >
    > is always the same as
    >
    > $token->[1]{href}
    >
    > following Perl's syntax for references and data structures.


    ahh, i think you're right. pg. 254 Programming Perl 3rd ed.

    "The arrow is optional between brackets or braces, or between a closing
    bracket or brace and a parenthesis for an indirect function call."
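
A small check of the rule quoted above, using a hand-built array reference shaped like the start-tag token that get_tag returns (tag name first, attribute hash second). The URL is a placeholder, not taken from a real page.

```perl
use strict;
use warnings;

# Hand-built stand-in for a token: [ $tag, \%attributes ]
my $token = [ 'a', { href => 'http://www.anexamplesite.com/searchstring' } ];

my $long  = $token->[1]->{"href"};   # explicit arrow, quoted key
my $short = $token->[1]{href};       # arrow omitted between subscripts, bareword key

# Both expressions address the very same hash element.
my $same_slot = \$token->[1]->{"href"} == \$token->[1]{href};

print $long eq $short && $same_slot ? "identical\n" : "different\n";
```

Only the arrow between adjacent subscripts can be dropped; the first arrow after $token is still required because $token itself is a reference.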
     
    it_says_BALLS_on_your forehead, Nov 15, 2005
    #7
  8. Abigail <> wrote in
    news::

    > A. Sinan Unur () wrote on MMMMCDLVIII
    > September MCMXCIII in
    > <URL:news:Xns970EBA81F5AA9asu1cornelledu@127.0.0.1>:
    >:) "Francis Sylvester" <> wrote in
    >:) news:AH8ef.16551$:


    ....

    >:) > $document =~ /$searchstring(.+?)someidentifier/;
    >:)
    >:) The exact contents of $mymatch, $searchstring and whatever
    >:) someidentifier might have something to do with what's actually
    >:) being matched, no?
    >:)
    >:) > print "$1";
    >:)
    >:) You are not capturing anything, why do you expect there to be
    >:) anything valid in $1?



    > Not capturing? I'd say the parens in
    > /$searchstring(.+?)someidentifier/ capture (if the match is
    > successful), or there's a bug in perl.


    Arrgh! Thank you very much for catching that.

    Sinan
    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Nov 15, 2005
    #8
  9. Francis Sylvester wrote:
    > I'm a Perl newbie and am having a nightmare trying to get the code below
    > working. I'm trying to fetch a webpage and if a link within the page matches
    > the search criterion - return the text after the link.
    >
    > use LWP::Simple;
    > use HTML::TokeParser;


    Yes, using a module for parsing an HTML document is a good idea.

    > my $document = get("http://www.anexamplesite.com");
    > my $mymatch = "searchstring";
    >
    > my $parser = HTML::TokeParser->new(\$document);
    >
    > while ($token = $parser->get_tag("a")) {
    > if ($token->[1]->{"href"} =~ /$mymatch/) {
    > # print $server.$token->[1]->{href}."\n";
    > $document =~ /$searchstring(.+?)someidentifier/;


    What's that? After you have possibly found your search string, you let
    the program search the whole document using a simple regex. Doing so
    makes no sense to me.

    Either stick to a simple regex and skip the parsing module, or
    (better) take advantage of the module you are using and do
    something like:

    while ( my $token = $parser->get_tag('a') ) {
        if ($token->[1]{href} =~ /$mymatch/) {
            print $parser->get_text('a')."\n";
        }
    }

    (I'm not sure if that's what you're looking for, but hopefully you get
    the idea.)
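
If the goal is specifically the text that follows a matching link, one possible variation on the loop above is to step past the closing tag and then read up to the next tag. The following is a sketch only: the sample HTML and the function name text_after_matching_links are made up, and it assumes the wanted text sits directly after </a> with no other markup in between.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up sample page; the original post fetched a real one with LWP::Simple.
my $document = join "\n",
    '<p><a href="/docs/searchstring.html">Manual</a> updated in 2005',
    '<a href="/other.html">Other</a> not relevant</p>';

# Hypothetical helper: for each link whose href matches, collect the
# plain text between its closing </a> and the next tag.
sub text_after_matching_links {
    my ($html, $pattern) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";
    my @found;
    while ( my $token = $parser->get_tag('a') ) {
        my $href = $token->[1]{href};
        next unless defined($href) && $href =~ $pattern;
        $parser->get_tag('/a');                    # move past the closing </a>
        push @found, $parser->get_trimmed_text;    # text up to the next tag
    }
    return @found;
}

print "$_\n" for text_after_matching_links($document, qr/searchstring/);
```

Because the parser keeps its position in the document, the text returned always belongs to the link just tested, which a separate regex over the whole of $document cannot guarantee.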

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Nov 15, 2005
    #9
  10. > Either stick to a simple regex and skip the parsing module, or
    > (better) take advantage of the module you are using and do
    > something like:
    >
    > while ( my $token = $parser->get_tag('a') ) {
    > if ($token->[1]{href} =~ /$mymatch/) {
    > print $parser->get_text('a')."\n";
    > }
    > }
    >
    > (I'm not sure if that's what you're looking for, but hopefully you get the
    > idea.)
    >


    Many thanks for all your replies. I'm sorry, I should have been clearer -
    the code executes without error messages but I sometimes get unwanted
    results in $1. After closer inspection, I think it's because it sometimes
    returns $1 from the earlier pattern match (if ($token->[1]->{"href"} =~
    /$mymatch/)) rather than from the pattern match I wanted ($document =~
    /$searchstring(.+?)someidentifier/;).
    Is there a way to reset the value of $1?

    Many thanks,
    Francis
     
    Francis Sylvester, Nov 15, 2005
    #10
  11. Francis Sylvester <> wrote:

    >> if ($token->[1]{href} =~ /$mymatch/) {



    > I sometimes get unwanted
    > results in $1. After closer inspection, I think it's because sometimes it's
    > returning $1 from the earlier pattern match ( if ($token->[1]->{"href"} =~
    > /$mymatch/)



    Note that that code ensures that the pattern match *succeeded*.


    > rather than the pattern match I wanted ($document =~
    > /$searchstring(.+?)someidentifier/;)



    We don't really know, since you did not quote that part of the code,
    but you should always ensure that the match succeeded before
    using the dollar-digit variables, so:

    Is _your_ pattern match being tested for success?


    > Is there a way to reset the value of $1?



    Yes. They are reset on every _successful_ pattern match.
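
A short sketch of the point above: a failed match leaves $1 as it was, so $1 is only trustworthy inside a branch that tested the match. The function names and sample strings are invented for illustration.

```perl
use strict;
use warnings;

# Shows the trap: the second match fails, so $1 still holds the
# value captured by the first, successful match.
sub stale_capture_demo {
    my $first  = "price: 42" =~ /(\d+)/;   # succeeds, $1 is "42"
    my $second = "no digits" =~ /(\d+)/;   # fails, $1 is left alone
    return $1;
}

# Safer pattern: use $1 only when the match succeeded.
sub capture_between {
    my ($text, $start, $end) = @_;
    if ( $text =~ /\Q$start\E(.+?)\Q$end\E/s ) {
        return $1;
    }
    return;    # no match: report that instead of reusing an old capture
}

print "stale value: ", stale_capture_demo(), "\n";
my $found = capture_between("link text [wanted] trailer", "[", "]");
print defined $found ? "captured: $found\n" : "no match\n";
```

Applied to the original loop, this would mean wrapping the $document match in an if and printing $1 only inside that branch.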


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Nov 15, 2005
    #11
  12. Francis Sylvester wrote:
    >>Either stick to a simple regex and skip the parsing module, or
    >>(better) take advantage of the module you are using and do
    >>something like:
    >>
    >> while ( my $token = $parser->get_tag('a') ) {
    >> if ($token->[1]{href} =~ /$mymatch/) {
    >> print $parser->get_text('a')."\n";
    >> }
    >> }
    >>
    >>(I'm not sure if that's what you're looking for, but hopefully you get the
    >>idea.)

    >
    > the code executes without error messages but I sometimes get unwanted
    > results in $1.


    And that may well be a result of the fact that you don't actually make
    use of the module you are using for parsing HTML...

    Didn't you understand my objection to your code?
    http://groups.google.com/group/comp.lang.perl.misc/msg/60f72a205520c4b1

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Nov 15, 2005
    #12
  13. >>>(I'm not sure if that's what you're looking for, but hopefully you get
    >>>the idea.)

    >>
    >> the code executes without error messages but I sometimes get unwanted
    >> results in $1.

    >
    > And that may well be a result of the fact that you don't actually make use
    > of the module you are using for parsing HTML...
    >
    > Didn't you understand my objection to your code?
    > http://groups.google.com/group/comp.lang.perl.misc/msg/60f72a205520c4b1
    >
    > --


    Thanks Gunnar. I did understand your objection but thought I needed to
    resort to pattern matching for a specific section of the text I'm retrieving
    after the link. Having read your message and looking at the module docs
    again now - I think I might be able to achieve the desired result without
    the pattern match. I'm very grateful to you for the responses - you've
    probably saved me hours!

    Thanks again,
    Francis
     
    Francis Sylvester, Nov 15, 2005
    #13
