regexp

Discussion in 'Perl Misc' started by Jayme Assuncao Casimiro, Jan 30, 2004.

  1. I have this piece of HTML text from Amazon.com:

    <dt><b><a
    href="/exec/obidos/ASIN/0965761762/qid=917872216/sr=1-1/002-1496444-0064804">1
    Business, 2 Approaches : How to Succeed in Internet Business by Employing
    Real-World Strategies</a></b>
    ~ <NOBR><font color=#990033>Usually ships in 2-3 days</font></NOBR><dd>
    Ron Gielgun / Hardcover / Published 1998
    <br>
    Our Price: $13.97 ~ <NOBR><font color =#990033>You Save: $5.98
    (30%)</font></NOBR>
    <br>
    <a
    href="/exec/obidos/ASIN/0965761762/qid=917872216/sr=1-1/002-1496444-0064804"><i>Read
    more about this title...</i></a>
    <p>

    I would like to use only one regexp to extract the title, the price,
    and the discount in percent.

    For the example above, the result would be:
    title = 1 Business, 2 Approaches : How to Succeed in Internet Business by Employing
    Real-World Strategies
    Price = $13.97
    Discount = 30%

    I have used:
    ($title) = $_ =~ m{<a.*?>(.*?)</a>};
    ($price) = $_ =~ m{.*Our Price:\s(\$?[\d\,.]+)};
    ($descount) = $_ =~ m{.*You Save:.*?[\d\,.]+.*?([\d\,.]+)};

    But I would like to use only one regexp.

    Thanks
    +---------------------------------------------+
    | Jayme Assuncao Casimiro                     |
    | Graduate in Computer Science                |
    | Master's student in Computing               |
    | Universidade Federal de Minas Gerais - UFMG |
    +---------------------------------------------+
     
    Jayme Assuncao Casimiro, Jan 30, 2004
    #1

  2. Jayme Assuncao Casimiro wrote:
    > I have used:
    > ($title) = $_ =~ m{<a.*?>(.*?)</a>};
    > ($price) = $_ =~ m{.*Our Price:\s(\$?[\d\,.]+)};
    > ($descount) = $_ =~ m{.*You Save:.*?[\d\,.]+.*?([\d\,.]+)};
    >
    > But I would like to use only one regexp.


    So, what stops you?

    ($title, $price, $discount) = m{...};
    --------------------------------^^^
    (to be filled in with the regex)
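
    As a minimal sketch of the idea (the one-line sample string and the
    variable name $line are assumptions made for illustration, not part of
    the original question), a match in list context hands back all capture
    groups at once:

```perl
use strict;
use warnings;

# Hypothetical one-line sample; the real input is the HTML from the first post.
my $line = 'Our Price: $13.97 ~ You Save: $5.98 (30%)';

# In list context a successful match returns the captured groups in order;
# a failed match returns an empty list, leaving both variables undefined.
my ($price, $discount) =
    $line =~ m{Our\s+Price:\s+(\$[\d.,]+).*?\((\d+%)\)};

print "price: $price, discount: $discount\n" if defined $price;
```

    Adding a third capture group for the title extends the same pattern to
    all three values.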

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jan 30, 2004
    #2

  3. Jayme Assuncao Casimiro wrote:

    > I have this piece of html text from Amazon.com
    >

    [snip HTML]
    >
    > And I would like to use only one regexp to extract the title, the price,
    > and the discount in percent.


    Don't do that. Use one of the modules designed for parsing HTML. Using REs
    to parse HTML is painful and produces easily-broken code.
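
    A rough sketch of the module route, assuming HTML::TokeParser (part of
    the HTML-Parser distribution on CPAN) is installed. The variable names
    are illustrative only. The parser fetches the link text, while the price
    and discount, which sit in plain text rather than in dedicated markup,
    are still picked out with small regexes:

```perl
use strict;
use warnings;
use HTML::TokeParser;    # from the HTML-Parser distribution on CPAN

# Sample markup copied from the first post.
my $html = <<'END_HTML';
<dt><b><a
href="/exec/obidos/ASIN/0965761762/qid=917872216/sr=1-1/002-1496444-0064804">1
Business, 2 Approaches : How to Succeed in Internet Business by Employing
Real-World Strategies</a></b>
~ <NOBR><font color=#990033>Usually ships in 2-3 days</font></NOBR><dd>
Ron Gielgun / Hardcover / Published 1998
<br>
Our Price: $13.97 ~ <NOBR><font color =#990033>You Save: $5.98
(30%)</font></NOBR>
<br>
END_HTML

my $parser = HTML::TokeParser->new(\$html)
    or die "cannot create parser\n";

# Title: the whitespace-normalized text of the first <a> element.
my $title;
if ($parser->get_tag('a')) {
    $title = $parser->get_trimmed_text('/a');
}

# Price and discount are not wrapped in tags of their own,
# so they are matched against the raw text.
my ($price)    = $html =~ m{Our\s+Price:\s+(\$[\d.,]+)};
my ($discount) = $html =~ m{You\s+Save:.*?\((\d+%)\)}s;

print "title: $title\nprice: $price\ndiscount: $discount\n";
```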

    --
    David Wall
     
    David K. Wall, Jan 30, 2004
    #3
  4. David K. Wall wrote:
    > Jayme Assuncao Casimiro wrote:
    >> I have this piece of html text from Amazon.com
    >>
    >> [snip HTML]
    >>
    >> And I would like to use only one regexp to extract the title, the
    >> price, and the discount in percent.

    >
    > Don't do that. Use one of the modules designed for parsing HTML.
    > Using REs to parse HTML is painful and produces easily-broken code.


    For extracting the first link and two other parts that are not
    identified with the help of HTML markup? Please, David, there are more
    colours in this world than black and white. ;-)

    perlfaq9 is less rigid:

    http://www.perldoc.com/perl5.8.0/pod/perlfaq9.html#How-do-I-remove-HTML-from-a-string-

    http://www.perldoc.com/perl5.8.0/pod/perlfaq9.html#How-do-I-extract-URLs-

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jan 30, 2004
    #4
  5. Gunnar Hjalmarsson wrote:

    > David K. Wall wrote:
    >> Jayme Assuncao Casimiro wrote:
    >>> I have this piece of html text from Amazon.com
    >>>
    >>> [snip HTML]
    >>>
    >>> And I would like to use only one regexp to extract the title, the
    >>> price, and the discount in percent.

    >>
    >> Don't do that. Use one of the modules designed for parsing HTML.
    >> Using REs to parse HTML is painful and produces easily-broken code.

    >
    > For extracting the first link and two other parts that are not
    > identified with the help of HTML markup? Please, David, there are more
    > colours in this world than black and white. ;-)


    Yeah, you're right. <insert standard excuses>. Thanks for the reality
    check.

    --
    David Wall
     
    David K. Wall, Jan 30, 2004
    #5
  6. Jayme Assuncao Casimiro wrote:

    > I have this piece of html text from Amazon.com
    >
    > <dt><b><a
    > href="/exec/obidos/ASIN/0965761762/qid=917872216/sr=1-1/002-1496444-0064804">1
    > Business, 2 Approaches : How to Succeed in Internet Business by Employing
    > Real-World Strategies</a></b>
    > ~ <NOBR><font color=#990033>Usually ships in 2-3 days</font></NOBR><dd>
    > Ron Gielgun / Hardcover / Published 1998
    > <br>
    > Our Price: $13.97 ~ <NOBR><font color =#990033>You Save: $5.98
    > (30%)</font></NOBR>
    > <br>
    > <a
    > href="/exec/obidos/ASIN/0965761762/qid=917872216/sr=1-1/002-1496444-0064804"><i>Read
    > more about this title...</i></a>
    > <p>
    >
    > And I would like to use only one regexp to extract the title, the price,
    > and the discount in percent.


    I still think you should use one of the HTML parsing modules to get the
    otherwise unremarkable piece of HTML, but below is one regex that captures
    all three things. Ugly and fragile.

    my ($price, $title, $discount);
    # $html is assumed to hold the markup quoted above
    if ($html =~ m{
            <dt>\s*
            <b>\s*
            <a\s+href\s*=\s*"\S+">
            ([^<]+)                 # title
            </a>\s*
            </b>
            .*?
            Our\s+Price:\s+
            (\S+)                   # price
            .*?
            You\s+Save:\s+\S+\s+
            \((\S+)\)               # discount
        }xs )
    {
        ($title, $price, $discount) = ($1, $2, $3);
        $title =~ s/\s+/ /g;        # collapse the line breaks inside the title

        print "title: $title\n\n";
        print "price: $price\n\n";
        print "discount: $discount\n";
    }
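
    For comparison, here is a sketch of the same single-regex idea using
    named captures. This assumes Perl 5.10 or later, and the %book hash and
    the looser pattern are illustrative choices rather than anything from
    the thread. The else branch reports a failed match explicitly, which is
    where the fragility would show up if the page layout changed:

```perl
use strict;
use warnings;

# Sample markup copied from the first post.
my $html = <<'END_HTML';
<dt><b><a
href="/exec/obidos/ASIN/0965761762/qid=917872216/sr=1-1/002-1496444-0064804">1
Business, 2 Approaches : How to Succeed in Internet Business by Employing
Real-World Strategies</a></b>
~ <NOBR><font color=#990033>Usually ships in 2-3 days</font></NOBR><dd>
Ron Gielgun / Hardcover / Published 1998
<br>
Our Price: $13.97 ~ <NOBR><font color =#990033>You Save: $5.98
(30%)</font></NOBR>
<br>
END_HTML

my %book;
if ($html =~ m{
        <a\s[^>]*> (?<title>[^<]+) </a>
        .*? Our\s+Price:\s+ (?<price>\$[\d.,]+)
        .*? You\s+Save:\s+ \S+ \s+ \( (?<discount>\d+%) \)
    }xs)
{
    %book = %+;                  # copy the named captures before the next match
    $book{title} =~ s/\s+/ /g;   # collapse the line breaks inside the title
    print "$_: $book{$_}\n" for qw(title price discount);
}
else {
    warn "markup did not match; the page layout may have changed\n";
}
```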

    --
    David Wall
     
    David K. Wall, Jan 30, 2004
    #6