How to match more than 1 line?

Geoff Cox

Hello,

How do I capture text that goes over 2 lines?

The text could be say

<TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
London N500 5JJJ</TD></TR>

The following code only gets the text up to and including Northgate,

if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
print OUT ("$1 \n");
}

Ideas please?!

Thanks

Geoff
 
Jay Tilton

: How do I capture text that goes over 2 lines?
:
: The text could be say
:
: <TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
: London N500 5JJJ</TD></TR>
:
: The following code only gets the text up to and including Northgate,
:
: if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
------------------------------------------^
The /m switch affects only how ^ and $ match, and your regex contains
neither of those metacharacters.

You want the /s switch, which lets . match a newline character.

: print OUT ("$1 \n");
: }
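
A minimal sketch of the difference, assuming the whole cell (both lines) is already held in one string. The variable names are made up for illustration, and the pattern is slightly adapted so that the tag's attributes stay out of the capture. If the input is read one line at a time, no modifier helps, because the second line is never in the string being matched.

```perl
use strict;
use warnings;

# Hypothetical input: the two-line cell from the question, held in ONE string.
my $html = qq{<TD vAlign=top width="80%" colSpan=2>White Road, Northgate,\n}
         . qq{London N500 5JJJ</TD></TR>\n};

# /m only changes where ^ and $ match; "." still stops at the newline,
# so the closing tag on the second line is never reached.
my $with_m = $html =~ /<TD vAlign=top(.*?)<\/TD>/m ? 1 : 0;

# /s lets "." match the newline, so the match can span both lines.
# The extra .*?> skips the rest of the opening tag (an adaptation).
my ($cell) = $html =~ /<TD vAlign=top.*?>(.*?)<\/TD>/s;

print "with /m: ", ($with_m ? "matched" : "no match"), "\n";
print "with /s: $cell\n" if defined $cell;
```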
 
Gunnar Hjalmarsson

Geoff said:
How do I capture text that goes over 2 lines?

The text could be say

<TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
London N500 5JJJ</TD></TR>

The following code only gets the text up to and including
Northgate,

if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
print OUT ("$1 \n");
}

Ideas please?!

Use the right modifier. /m seems not to be what you want. Look up in

perldoc perlre

what to use instead.
 
Tintin

Geoff Cox said:
Hello,

How do I capture text that goes over 2 lines?

The text could be say

<TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
London N500 5JJJ</TD></TR>

The following code only gets the text up to and including Northgate,

if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
print OUT ("$1 \n");
}

Ideas please?!

You've discovered that regexes aren't very robust/easy/flexible when it
comes to parsing HTML. Use one of the HTML parsers on CPAN.
 
Geoff Cox

: How do I capture text that goes over 2 lines?
:
: The text could be say
:
: <TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
: London N500 5JJJ</TD></TR>
:
: The following code only gets the text up to and including Northgate,
:
: if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
------------------------------------------^
The /m switch affects only how ^ and $ match, and your regex contains
neither of those metacharacters.

You want the /s switch, which lets . match a newline character.

: print OUT ("$1 \n");
: }


Jay,

thanks for that - I'm still not quite there - I am trying to get the
name and address only out of the following - how should I do this? Geoff

<TR>
<TD vAlign=top align=left colSpan=4>
<H6><IMG height=10 alt=bullet
src="barnet_files/blue_bullet2.gif"
width=7>&nbsp;&nbsp;The College</H6></TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Head
Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Smith</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Cedar Road, Northgate,
Sussex N777 5RJ</TD></TR>
 
Geoff Cox

Use the right modifier. /m seems not to be what you want. Look up in

perldoc perlre

what to use instead.

Gunnar,

I think you are correct about the m, but could you take a look at my
other email, which shows the text I am trying to use?

Thanks

Geoff
 
Geoff Cox

You've discovered that regexes aren't very robust/easy/flexible when it
comes to parsing HTML. Use one of the HTML parsers on CPAN.

There seem to be a large number of them! Any recommendation?

Cheers

Geoff
 
Gunnar Hjalmarsson

Geoff said:
I am trying to get the name and address only out of following - how
should I do this? Geoff

<TR>
<TD vAlign=top align=left colSpan=4>
<H6><IMG height=10 alt=bullet
src="barnet_files/blue_bullet2.gif"
width=7>&nbsp;&nbsp;The College</H6></TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Head
Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Smith</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Cedar Road, Northgate,
Sussex N777 5RJ</TD></TR>

That was quite a different question. This might do what you want:

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print "Name: $1\nAddress: $2\n";
}

But don't use it if you don't understand it. And even if you do
understand it, you may want to use a module for parsing HTML instead.
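
For readers who want to take the pattern apart, here is a sketch of the same regex run against the sample rows, with /x comments on each piece. It assumes the rows have been read into a single string (called $html here, a made-up name); the whitespace clean-up on the address is an addition, not part of the pattern above.

```perl
use strict;
use warnings;

# The sample rows from the question, as one string (assumed read in whole).
my $html = <<'END_HTML';
<TR>
<TD align=left width="20%" colSpan=2><B>Head
Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Smith</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Cedar Road, Northgate,
Sussex N777 5RJ</TD></TR>
END_HTML

my ($name, $address);
if ( $html =~ /
        Head\s+Teacher   # the label; \s+ also covers the line break
        .+?              # skip ahead as little as possible
        <TD[^>]+>        # the next opening TD tag with its attributes
        ([^<]+)          # first capture: everything up to the next "<"
        .+?
        Address          # the second label
        .+?
        <TD[^>]+>
        ([^<]+)          # second capture: the address, newline included
    /isx ) {
    ($name, $address) = ($1, $2);
    $address =~ s/\s+/ /g;   # added here: fold the line break into a space
    print "Name: $name\nAddress: $address\n";
}
```

The /s modifier is what allows each .+? to run across line ends, /x allows the layout and comments, and /i makes the labels and tag names case-insensitive.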
 
Jürgen Exner

Geoff Cox wrote:

You are asking the wrong question, but anyway...
How do I capture text that goes over 2 lines?

The text could be say

<TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
London N500 5JJJ</TD></TR>

The following code only gets the text up to and including Northgate,

if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {

To answer the question you did ask in the subject:
You are using the wrong modifier. Actually you are using exactly the
opposite one to the one you need.
Please "perldoc perlre" about what 'm' and what 's' do.

[...]
Ideas please?!

The question you should have asked but didn't ask is: what is the right tool
to parse HTML?

And as has been answered a gazillion times: parsing HTML correctly is
rocket science and nobody with a sane mind would attempt to do it using REs.
See 'perldoc -q "remove HTML"' for why and how and what to do instead.

jue
 
ko

Geoff said:
[snip]
You've discovered that regexes aren't very robust/easy/flexible when it
comes to parsing HTML. Use one of the HTML parsers on CPAN.

There seem to be a large number of them! any recommendation?!

HTML::Parser. If you're only interested in extracting text, here's an
example to get you started:

http://search.cpan.org/src/GAAS/HTML-Parser-3.34/eg/htext

There are other example scripts in the parent directory.

HTH - keith
 
Geoff Cox

Geoff said:
I am trying to get the name and address only out of following - how
should I do this? Geoff

<TR>
<TD vAlign=top align=left colSpan=4>
<H6><IMG height=10 alt=bullet
src="barnet_files/blue_bullet2.gif"
width=7>&nbsp;&nbsp;The College</H6></TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Head
Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Smith</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Cedar Road, Northgate,
Sussex N777 5RJ</TD></TR>

That was quite a different question. This might do what you want:

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print "Name: $1\nAddress: $2\n";
}

But don't use it if you don't understand it. And even if you do
understand it, you may want to use a module for parsing HTML instead.

Gunnar,

I have tried an HTML parser - I do get the text OK but would like to
understand your regex ... what does the [^<] stand for?

Geoff
 
Geoff Cox

And as has been answered a gazillion times: parsing HTML correctly is
rocket science and nobody with a sane mind would attempt to do it using REs.
See 'perldoc -q "remove HTML"' for why and how and what to do instead.

OK ! will go for the HTML parser!

Cheers

Geoff
 
Jürgen Exner

Geoff said:
On Sun, 07 Dec 2003 12:06:07 +0100, Gunnar Hjalmarsson
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
[...]
understand your regex ... what does the [^<] stand for?

From "perldoc perlre":
In particular the following metacharacters have their standard
*egrep*-ish meanings:
[...]
[] Character class

However for some unknown reason there is no explanation of the meaning of ^
in the docs.
Only for POSIX classes the docs mention

You can negate the [::] character classes by prefixing the class name
with a '^'.

From this you have to infer that you can negate a non-POSIX class, too.

To answer the original question:
[^<] stands for the character class which contains every character except
the less-than sign.

jue
 
Tad McClellan

Geoff Cox said:
what does the [^<] stand for?


It doesn't "stand for" anything, it "matches" something though.

It matches any single character that is not the "<" character.
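
A small sketch of that behaviour; the sample strings and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $text = 'Fred Smith</TD>';

# [^<] matches one character that is not "<"; the + repeats it,
# so the capture runs up to, but not including, the first "<".
my ($before) = $text =~ /^([^<]+)/;
print "$before\n";

# The ^ negates a class only when it comes first inside the brackets.
# Anywhere else in the class it is just a literal caret.
my @found = 'a^b' =~ /([a^])/g;
print "@found\n";
```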
 
Gunnar Hjalmarsson

Jürgen Exner said:
Geoff said:
what does the [^<] stand for?

From "perldoc perlre":
In particular the following metacharacters have their
standard *egrep*-ish meanings:
[...]
[] Character class

However for some unknown reason there is no explanation of the
meaning of ^ in the docs.

Hmm.. You are right. Shouldn't somebody better do something about
that? After all, it's one of the most common constructs in Perl
regular expressions.
 
Alan J. Flavell

Hmm.. You are right. Shouldn't somebody better do something about
that?

FWIW: I hadn't gained a close acquaintance with regexes before I
started on Perl, and I recall also being a bit disappointed that the
Perl documentation seemed to be written for readers who already would
have a working acquaintance with regexes and were chiefly looking for
details of the specific Perl embodiment.

I noticed more recently that the Cambridge PCRE library (Perl
compatible regular expressions) has a general presentation of this
regular expression syntax, which (as the name implies) is deliberately
close to Perl. It starts about halfway down the composite page
http://www.pcre.org/pcre.txt - below the heading:

PCRE REGULAR EXPRESSION DETAILS

which some readers might find to be a useful adjunct to the Perl
documentation. Hope this helps a bit.
 
Geoff Cox

Geoff said:
what does the [^<] stand for?

It's a character class representing any character but '<'.

Gunnar,

OK thanks for that - I have printed off the perlre pages!

Having tried the HTML Parser module it gives me too much text ... am I
able to use it selectively?

Geoff
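
One way to use it selectively is to collect text only while the parser is inside a TD element. The following is a sketch, not a definitive solution: it assumes HTML::Parser is installed from CPAN, that the rows are in one string, and that the cells alternate label, value as in the sample. The variable names are made up.

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = <<'END_HTML';
<TR>
<TD align=left width="20%" colSpan=2><B>Head
Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Smith</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Cedar Road, Northgate,
Sussex N777 5RJ</TD></TR>
END_HTML

my $in_td = 0;   # true while the parser is inside a TD element
my @cells;       # text of each TD, in document order

my $parser = HTML::Parser->new(
    api_version => 3,
    # Tag names arrive in lower case by default.
    start_h => [ sub { my ($tag) = @_;
                       if ($tag eq 'td') { $in_td = 1; push @cells, ''; } },
                 'tagname' ],
    end_h   => [ sub { my ($tag) = @_; $in_td = 0 if $tag eq 'td'; },
                 'tagname' ],
    # dtext is the text with entities decoded; keep it only inside a TD.
    text_h  => [ sub { my ($text) = @_; $cells[-1] .= $text if $in_td; },
                 'dtext' ],
);
$parser->parse($html);
$parser->eof;

s/\s+/ /g for @cells;   # fold line breaks inside a cell into spaces

# Assumption: cells come in label, value pairs.
my %field = @cells;
print "Name: $field{'Head Teacher'}\n";
print "Address: $field{'Address'}\n";
```

Text outside the TD elements is skipped because the text handler only appends while $in_td is set.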
 
Geoff Cox

Geoff said:
On Sun, 07 Dec 2003 12:06:07 +0100, Gunnar Hjalmarsson
if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
[...]
understand your regex ... what does the [^<] stand for?

From "perldoc perlre":
In particular the following metacharacters have their standard
*egrep*-ish meanings:
[...]
[] Character class

However for some unknown reason there is no explanation of the meaning of ^
in the docs.
Only for POSIX classes the docs mention

You can negate the [::] character classes by prefixing the class name
with a '^'.

From this you have to infer that you can negate a non-POSIX class, too.

To answer the original question:
[^<] stands for the character class which contains every character except
the less-than sign.

jue

Thanks Jue ...

Cheers

Geoff
 
