How to match more than 1 line?

Discussion in 'Perl Misc' started by Geoff Cox, Dec 7, 2003.

  1. Geoff Cox

    Geoff Cox Guest

    Hello,

    How do I capture text that goes over 2 lines?

    The text could be say

    <TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
    London N500 5JJJ</TD></TR>

    The following code only gets the text up to and including Northgate,

    if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
    print OUT ("$1 \n");
    }

    Ideas please?!

    Thanks

    Geoff
     
    Geoff Cox, Dec 7, 2003
    #1

  2. Jay Tilton

    Jay Tilton Guest

    Geoff Cox <> wrote:

    : How do I capture text that goes over 2 lines?
    :
    : The text could be say
    :
    : <TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
    : London N500 5JJJ</TD></TR>
    :
    : The following code only gets the text up to and including Northgate,
    :
    : if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
    ------------------------------------------^
    The /m switch affects only how ^ and $ match, and your regex contains
    neither of those metacharacters.

    You want the /s switch, which lets . match a newline character.

    : print OUT ("$1 \n");
    : }
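
A minimal sketch of the difference between the two switches, assuming the whole snippet is held in one string. The variable names and the extra `[^>]*>` (which keeps the tag's attributes out of the capture) are illustrative and not part of the original post.

```perl
use strict;
use warnings;

# Assumption: both lines are in one string. If the file is read one line
# at a time, the second line is never in $text and no modifier can help.
my $text = <<'HTML';
<TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
London N500 5JJJ</TD></TR>
HTML

# With /m, "." still stops at the newline, so the closing tag is out of reach.
my $with_m = $text =~ /<TD vAlign=top(.*?)<\/TD>/m ? 1 : 0;

# With /s, "." also matches "\n", so the lazy group can span both lines.
my ($cell) = $text =~ /<TD vAlign=top[^>]*>(.*?)<\/TD>/s;

print "matched with /m: $with_m\n";
print "captured with /s: $cell\n";
```

The captured text still contains the embedded newline; something like `s/\s+/ /g` would flatten it if a single line is wanted.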
     
    Jay Tilton, Dec 7, 2003
    #2

  3. Geoff Cox wrote:
    > How do I capture text that goes over 2 lines?
    >
    > The text could be say
    >
    > <TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
    > London N500 5JJJ</TD></TR>
    >
    > The following code only gets the text up to and including
    > Northgate,
    >
    > if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
    > print OUT ("$1 \n");
    > }
    >
    > Ideas please?!


    Use the right modifier. /m seems not to be what you want. Look up in

    perldoc perlre

    what to use instead.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Dec 7, 2003
    #3
  4. Tintin

    Tintin Guest

    "Geoff Cox" <> wrote in message
    news:...
    > Hello,
    >
    > How do I capture text that goes over 2 lines?
    >
    > The text could be say
    >
    > <TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
    > London N500 5JJJ</TD></TR>
    >
    > The following code only gets the text up to and including Northgate,
    >
    > if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
    > print OUT ("$1 \n");
    > }
    >
    > Ideas please?!


    You've discovered that regexes aren't very robust/easy/flexible when it
    comes to parsing HTML. Use one of the HTML parsers on CPAN.
     
    Tintin, Dec 7, 2003
    #4
  5. Geoff Cox

    Geoff Cox Guest

    On Sun, 07 Dec 2003 09:39:24 GMT, (Jay Tilton)
    wrote:

    >Geoff Cox <> wrote:
    >
    >: How do I capture text that goes over 2 lines?
    >:
    >: The text could be say
    >:
    >: <TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
    >: London N500 5JJJ</TD></TR>
    >:
    >: The following code only gets the text up to and including Northgate,
    >:
    >: if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
    >------------------------------------------^
    >The /m switch affects only how ^ and $ match, and your regex contains
    >neither of those metacharacters.
    >
    >You want the /s switch, which lets . match a newline character.
    >
    >: print OUT ("$1 \n");
    >: }



    Jay,

    thanks for that - I'm still not quite there - I am trying to get the
    name and address only out of following - how should I do this? Geoff

    <TR>
    <TD vAlign=top align=left colSpan=4>
    <H6><IMG height=10 alt=bullet
    src="barnet_files/blue_bullet2.gif"
    width=7>&nbsp;&nbsp;The College</H6></TD></TR>
    <TR>
    <TD align=left width="20%" colSpan=2><B>Head
    Teacher</B></TD>
    <TD vAlign=top width="80%" colSpan=2>Fred Smith</TD></TR>
    <TR>
    <TD align=left width="20%" colSpan=2><B>Address</B></TD>
    <TD vAlign=top width="80%" colSpan=2>Cedar Road, Northgate,
    Sussex N777 5RJ</TD></TR>
     
    Geoff Cox, Dec 7, 2003
    #5
  6. Geoff Cox

    Geoff Cox Guest

    On Sun, 07 Dec 2003 10:52:05 +0100, Gunnar Hjalmarsson
    <> wrote:

    >Geoff Cox wrote:
    >> How do I capture text that goes over 2 lines?
    >>
    >> The text could be say
    >>
    >> <TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
    >> London N500 5JJJ</TD></TR>
    >>
    >> The following code only gets the text up to and including
    >> Northgate,
    >>
    >> if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
    >> print OUT ("$1 \n");
    >> }
    >>
    >> Ideas please?!

    >
    >Use the right modifier. /m seems not to be what you want. Look up in
    >
    > perldoc perlre
    >
    >what to use instead.


    Gunnar,

    I think you are correct about the /m, but could you take a look at my
    other email, which shows the text I am trying to use?

    Thanks

    Geoff
     
    Geoff Cox, Dec 7, 2003
    #6
  7. Geoff Cox

    Geoff Cox Guest

    !

    On Sun, 7 Dec 2003 23:10:48 +1300, "Tintin" <> wrote:

    >
    >"Geoff Cox" <> wrote in message
    >news:...
    >> Hello,
    >>
    >> How do I capture text that goes over 2 lines?
    >>
    >> The text could be say
    >>
    >> <TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
    >> London N500 5JJJ</TD></TR>
    >>
    >> The following code only gets the text up to and including Northgate,
    >>
    >> if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
    >> print OUT ("$1 \n");
    >> }
    >>
    >> Ideas please?!

    >
    >You've discovered that regexes aren't very robust/easy/flexible when it
    >comes to parsing HTML. Use one of the HTML parsers on CPAN.
    >


    There seem to be a large number of them! any recommendation?!

    Cheers

    Geoff
     
    Geoff Cox, Dec 7, 2003
    #7
  8. Geoff Cox wrote:
    > I am trying to get the name and address only out of following - how
    > should I do this? Geoff
    >
    > <TR>
    > <TD vAlign=top align=left colSpan=4>
    > <H6><IMG height=10 alt=bullet
    > src="barnet_files/blue_bullet2.gif"
    > width=7>&nbsp;&nbsp;The College</H6></TD></TR>
    > <TR>
    > <TD align=left width="20%" colSpan=2><B>Head
    > Teacher</B></TD>
    > <TD vAlign=top width="80%" colSpan=2>Fred Smith</TD></TR>
    > <TR>
    > <TD align=left width="20%" colSpan=2><B>Address</B></TD>
    > <TD vAlign=top width="80%" colSpan=2>Cedar Road, Northgate,
    > Sussex N777 5RJ</TD></TR>


    That was quite a different question. This might do what you want:

        if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
                       .+?
                       Address.+?<TD[^>]+>([^<]+)
                      /isx ) {
            print "Name: $1\nAddress: $2\n";
        }

    But don't use it if you don't understand it. And even if you do
    understand it, you may want to use a module for parsing HTML instead.
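
For anyone trying to read the regex above, here is the same pattern spelled out piece by piece as a sketch. The sample HTML is copied from the question; the names `$html`, `$name` and `$address` are made up for the illustration, and the pattern is applied to one string holding the whole fragment.

```perl
use strict;
use warnings;

my $html = <<'HTML';
<TR>
<TD align=left width="20%" colSpan=2><B>Head
Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Smith</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Cedar Road, Northgate,
Sussex N777 5RJ</TD></TR>
HTML

my ($name, $address) = $html =~ m{
    Head\s+Teacher     # label text; the whitespace run covers the line break
    .+?                # lazily skip ahead (with the s flag, across lines)
    <TD[^>]+>          # opening tag of the next cell, attributes included
    ([^<]+)            # first capture: everything up to the next "<"
    .+?
    Address            # second label
    .+?
    <TD[^>]+>
    ([^<]+)            # second capture: the address, line break and all
}isx;

# Collapse the line break inside the address into a single space.
$address =~ s/\s+/ /g if defined $address;

print "Name: $name\nAddress: $address\n" if defined $name;
```

The `x` flag is what allows the layout and the comments, `s` lets `.` cross line breaks, and `i` makes the tag and label names case-insensitive.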

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Dec 7, 2003
    #8
  9. Geoff Cox wrote:

    You are asking the wrong question, but anyway...

    > How do I capture text that goes over 2 lines?
    >
    > The text could be say
    >
    > <TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
    > London N500 5JJJ</TD></TR>
    >
    > The following code only gets the text up to and including Northgate,
    >
    > if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {


    To answer the question you did ask in the subject:
    You are using the wrong modifier. Actually you are using exactly the
    opposite one to the one you need.
    Please "perldoc perlre" about what 'm' and what 's' do.
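
As a small illustration of what 'm' does on its own (a sketch with a made-up two-line string, not taken from the thread): it changes only where `^` and `$` may anchor.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, ^ anchors only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, ^ also anchors right after every embedded newline.
my @multi = $text =~ /^(\w+)/mg;

print "without /m: @plain\n";
print "with /m:    @multi\n";
```

Neither version lets `.` cross a newline; that is the job of 's'.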

    [...]
    >
    > Ideas please?!


    The question you should have asked but didn't ask is: what is the right tool
    to parse HTML?

    And as has been answered a gazillion of times: parsing HTML correctly is
    rocket science and nobody with a sane mind would attempt to do it using REs.
    See 'perldoc -q "remove HTML"' for why and how and what to do instead.

    jue
     
    Jürgen Exner, Dec 7, 2003
    #9
  10. ko

    ko Guest

    Re: !

    Geoff Cox wrote:
    > On Sun, 7 Dec 2003 23:10:48 +1300, "Tintin" <> wrote:
    >
    >
    >>"Geoff Cox" <> wrote in message
    >>news:...


    [snip]

    >>>Ideas please?!

    >>
    >>You've discovered that regexes aren't very robust/easy/flexible when it
    >>comes to parsing HTML. Use one of the HTML parsers on CPAN.

    >
    > There seem to be a large number of them! any recommendation?!


    HTML::Parser. If you're only interested in extracting text, here's an
    example to get you started:

    http://search.cpan.org/src/GAAS/HTML-Parser-3.34/eg/htext

    There are other example scripts in the parent directory.
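
A rough sketch of selective extraction with HTML::Parser (a CPAN module, so it has to be installed first), using its version 3 handler interface. It keeps only the text found inside `<TD>` cells. The variable names and the idea of pairing each label cell with the cell after it are assumptions made for this sample table, not something the module does for you.

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = <<'HTML';
<TR>
<TD align=left width="20%" colSpan=2><B>Head
Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Smith</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Cedar Road, Northgate,
Sussex N777 5RJ</TD></TR>
HTML

my @cells;          # cleaned-up text of every finished <TD>
my $in_td = 0;      # true while the parser is inside a cell
my $buf   = '';     # text collected for the current cell

my $p = HTML::Parser->new(
    api_version => 3,
    # Tag names arrive lower-cased by default.
    start_h => [ sub { if ($_[0] eq 'td') { $in_td = 1; $buf = '' } }, 'tagname' ],
    # Text may arrive in several chunks, so append rather than assign.
    text_h  => [ sub { $buf .= $_[0] if $in_td }, 'dtext' ],
    end_h   => [ sub {
        return unless $_[0] eq 'td' && $in_td;
        $buf =~ s/\s+/ /g;      # fold line breaks into single spaces
        $buf =~ s/^ //;
        $buf =~ s/ $//;
        push @cells, $buf;
        $in_td = 0;
    }, 'tagname' ],
);
$p->parse($html);
$p->eof;

# Assumption: cells alternate label, value, label, value.
my %field = @cells;
print "$_: $field{$_}\n" for sort keys %field;
```

Because the handlers decide what to keep, text outside the cells of interest never reaches the output.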

    HTH - keith
     
    ko, Dec 7, 2003
    #10
  11. Geoff Cox

    Geoff Cox Guest

    On Sun, 07 Dec 2003 12:06:07 +0100, Gunnar Hjalmarsson
    <> wrote:

    >Geoff Cox wrote:
    >> I am trying to get the name and address only out of following - how
    >> should I do this? Geoff
    >>
    >> <TR>
    >> <TD vAlign=top align=left colSpan=4>
    >> <H6><IMG height=10 alt=bullet
    >> src="barnet_files/blue_bullet2.gif"
    >> width=7>&nbsp;&nbsp;The College</H6></TD></TR>
    >> <TR>
    >> <TD align=left width="20%" colSpan=2><B>Head
    >> Teacher</B></TD>
    >> <TD vAlign=top width="80%" colSpan=2>Fred Smith</TD></TR>
    >> <TR>
    >> <TD align=left width="20%" colSpan=2><B>Address</B></TD>
    >> <TD vAlign=top width="80%" colSpan=2>Cedar Road, Northgate,
    >> Sussex N777 5RJ</TD></TR>

    >
    >That was quite a different question. This might do what you want:
    >
    > if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
    > .+?
    > Address.+?<TD[^>]+>([^<]+)
    > /isx ) {
    > print "Name: $1\nAddress: $2\n";
    > }
    >
    >But don't use it if you don't understand it. And even if you do
    >understand it, you may want to use a module for parsing HTML instead.


    Gunnar,

    I have tried an HTML parser - I do get the text OK but would like to
    understand your regex ... what does the [^<] stand for?

    Geoff
     
    Geoff Cox, Dec 7, 2003
    #11
  12. Geoff Cox wrote:
    > what does the [^<] stand for?


    It's a character class representing any character but '<'.
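
A short sketch of the negated class in action, using one cell from the sample data (the variable names are illustrative):

```perl
use strict;
use warnings;

my $cell = '<TD vAlign=top width="80%" colSpan=2>Fred Smith</TD>';

# [^>]+ runs through the tag's attributes and stops before its closing ">";
# [^<]+ then takes the cell text and stops before the "<" of "</TD>".
my ($text) = $cell =~ /<TD[^>]+>([^<]+)/;

# A single [^<] matches exactly one character that is not "<".
my $count = () = $cell =~ /[^<]/g;

print "text: $text\n";
print "characters other than '<': $count\n";
```

The `+` after the class is what turns "one character that is not <" into "a run of such characters".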

    If you want to learn regular expressions, you need to study

    http://www.perldoc.com/perl5.8.0/pod/perlre.html

    Not once, not twice, but over and over again. The answer to your
    question, and most other questions about Perl regular expressions, can
    be found there.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Dec 7, 2003
    #12
  13. Geoff Cox

    Geoff Cox Guest

    On Sun, 07 Dec 2003 11:26:13 GMT, "Jürgen Exner"
    <> wrote:


    >And as has been answered a gazillion of times: parsing HTML correctly is
    >rocket science and nobody with a sane mind would attempt to do it using REs.
    >See 'perldoc -q "remove HTML"' for why and how and what to do instead.


    OK ! will go for the HTML parser!

    Cheers

    Geoff

    >
    >jue
    >
     
    Geoff Cox, Dec 7, 2003
    #13
  14. Geoff Cox <> wrote:

    > but could you take a look at my
    > other email



    This is not email.

    This is a Usenet newsgroup.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Dec 7, 2003
    #14
  15. Geoff Cox wrote:
    > On Sun, 07 Dec 2003 12:06:07 +0100, Gunnar Hjalmarsson
    > <> wrote:

    [...]
    >> if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)

    [...]
    > understand your regex ... what does the [^<] stand for?


    From "perldoc perlre":
    In particular the following metacharacters have their standard
    *egrep*-ish meanings:
    [...]
    [] Character class

    However for some unknown reason there is no explanation of the meaning of ^
    in the docs.
    Only for POSIX classes the docs mention

    You can negate the [::] character classes by prefixing the class name
    with a '^'.

    From this you have to infer that you can negate a non-POSIX class, too.

    To answer the original question:
    [^<] stands for the character class which contains every character except
    the less-than sign.

    jue
     
    Jürgen Exner, Dec 7, 2003
    #15
  16. Geoff Cox <> wrote:

    > what does the [^<] stand for?



    It doesn't "stand for" anything, it "matches" something though.

    It matches any single character that is not the "<" character.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Dec 7, 2003
    #16
  17. Jürgen Exner wrote:
    > Geoff Cox wrote:
    >> what does the [^<] stand for?

    >
    > From "perldoc perlre":
    > In particular the following metacharacters have their
    > standard *egrep*-ish meanings:
    > [...]
    > [] Character class
    >
    > However for some unknown reason there is no explanation of the
    > meaning of ^ in the docs.


    Hmm.. You are right. Shouldn't somebody better do something about
    that? After all, it's one of the most common constructs in Perl
    regular expressions.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Dec 7, 2003
    #17
  18. On Sun, 7 Dec 2003, Gunnar Hjalmarsson wrote:

    > > However for some unknown reason there is no explanation of the
    > > meaning of ^ in the docs.

    >
    > Hmm.. You are right. Shouldn't somebody better do something about
    > that?


    FWIW: I hadn't gained a close acquaintance with regexes before I
    started on Perl, and I recall also being a bit disappointed that the
    Perl documentation seemed to be written for readers who already would
    have a working acquaintance with regexes and were chiefly looking for
    details of the specific Perl embodiment.

    I noticed more recently that the Cambridge PCRE library (Perl
    compatible regular expressions) has a general presentation of this
    regular expression syntax, which (as the name implies) is deliberately
    close to Perl. It starts about halfway down the composite page
    http://www.pcre.org/pcre.txt - below the heading:

    PCRE REGULAR EXPRESSION DETAILS

    which some readers might find to be a useful adjunct to the Perl
    documentation. Hope this helps a bit.
     
    Alan J. Flavell, Dec 7, 2003
    #18
  19. Geoff Cox

    Geoff Cox Guest

    On Sun, 07 Dec 2003 13:39:47 +0100, Gunnar Hjalmarsson
    <> wrote:

    >Geoff Cox wrote:
    >> what does the [^<] stand for?

    >
    >It's a character class representing any character but '<'.


    Gunnar,

    OK thanks for that - I have printed off the perlre pages!

    Having tried the HTML Parser module it gives me too much text ... am I
    able to use it selectively?

    Geoff




    >If you want to learn regular expressions, you need to study
    >
    > http://www.perldoc.com/perl5.8.0/pod/perlre.html
    >
    >Not once, not twice, but over and over again. The answer to your
    >question, and most other questions about Perl regular expressions, can
    >be found there.
     
    Geoff Cox, Dec 7, 2003
    #19
  20. Geoff Cox

    Geoff Cox Guest

    On Sun, 07 Dec 2003 12:44:29 GMT, "Jürgen Exner"
    <> wrote:

    >Geoff Cox wrote:
    >> On Sun, 07 Dec 2003 12:06:07 +0100, Gunnar Hjalmarsson
    >> <> wrote:

    >[...]
    >>> if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)

    >[...]
    >> understand your regex ... what does the [^<] stand for?

    >
    >From "perldoc perlre":
    > In particular the following metacharacters have their standard
    > *egrep*-ish meanings:
    >[...]
    > [] Character class
    >
    >However for some unknown reason there is no explanation of the meaning of ^
    >in the docs.
    >Only for POSIX classes the docs mention
    >
    > You can negate the [::] character classes by prefixing the class name
    > with a '^'.
    >
    >From this you have to infer that you can negate a non-POSIX class, too.
    >
    >To answer the original question:
    >[^<] stands for the character class which contains every character except
    >the less-than sign.
    >
    >jue


    Thanks Jue ...

    Cheers

    Geoff


    >
     
    Geoff Cox, Dec 7, 2003
    #20