How to match more than 1 line?

Geoff Cox · Dec 7, 2003

Hello,

How do I capture text that goes over 2 lines?

The text could be say

<TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
London N500 5JJJ</TD></TR>

The following code only gets the text up to and including Northgate,

if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
    print OUT ("$1 \n");
}

Ideas please?!

Thanks

Geoff

Jay Tilton · Dec 7, 2003

: How do I capture text that goes over 2 lines?
:
: The text could be say
:
: <TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
: London N500 5JJJ</TD></TR>
:
: The following code only gets the text up to and including Northgate,
:
: if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
                                          ^
------------------------------------------^
The /m switch affects only how ^ and $ match, and your regex contains
neither of those metacharacters.

You want the /s switch, which lets . match a newline character.

: print OUT ("$1 \n");
: }
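
The difference between the two switches can be checked with a small sketch. It assumes the whole cell, including its embedded newline, is already held in one string; the variable names $chunk, $with_m and $with_s are made up for illustration.

```perl
use strict;
use warnings;

# Assumption: both lines of the cell are in a single string.
my $chunk = qq{<TD vAlign=top width="80%" colSpan=2>White Road, Northgate,\n}
          . qq{London N500 5JJJ</TD></TR>\n};

# /m only changes where ^ and $ can match, so . still stops at the newline.
my ($with_m) = $chunk =~ /<TD vAlign=top(.*?)<\/TD>/m;

# /s lets . match a newline, so the capture can run across both lines.
my ($with_s) = $chunk =~ /<TD vAlign=top(.*?)<\/TD>/s;

print defined $with_m ? "with /m: matched\n" : "with /m: no match\n";
print defined $with_s ? "with /s: $with_s\n" : "with /s: no match\n";
```

Note that the capture starts right after "vAlign=top", so it also picks up the rest of the tag's attributes. If $line is filled by reading the file one line at a time, no modifier can help, because the second line is not in the string yet; the input would first need to be read in larger pieces (for example the whole file).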

Gunnar Hjalmarsson · Dec 7, 2003

Geoff said:
How do I capture text that goes over 2 lines?

The text could be say

<TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
London N500 5JJJ</TD></TR>

The following code only gets the text up to and including
Northgate,

if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
print OUT ("$1 \n");
}

Ideas please?!

Use the right modifier. /m seems not to be what you want. Look up in

perldoc perlre

what to use instead.

Tintin · Dec 7, 2003

Geoff Cox said:
Hello,

How do I capture text that goes over 2 lines?

The text could be say

<TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
London N500 5JJJ</TD></TR>

The following code only gets the text up to and including Northgate,

if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
print OUT ("$1 \n");
}

Ideas please?!

You've discovered that regexes aren't very robust/easy/flexible when it
comes to parsing HTML. Use one of the HTML parsers on CPAN.

Geoff Cox · Dec 7, 2003

: How do I capture text that goes over 2 lines?
:
: The text could be say
:
: <TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
: London N500 5JJJ</TD></TR>
:
: The following code only gets the text up to and including Northgate,
:
: if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {
                                          ^
------------------------------------------^
The /m switch affects only how ^ and $ match, and your regex contains
neither of those metacharacters.

You want the /s switch, which lets . match a newline character.

: print OUT ("$1 \n");
: }

Jay,

thanks for that - I'm still not quite there - I am trying to get the
name and address only out of following - how should I do this? Geoff

<TR>
<TD vAlign=top align=left colSpan=4>
<H6><IMG height=10 alt=bullet
src="barnet_files/blue_bullet2.gif"
width=7>  The College</H6></TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Head
Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Smith</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Cedar Road, Northgate,
Sussex N777 5RJ</TD></TR>

Geoff Cox · Dec 7, 2003

Use the right modifier. /m seems not to be what you want. Look up in

perldoc perlre

what to use instead.

Gunnar,

think you are correct about the m but could you take a look at my
other email which shows the text I am trying to use..?

Thanks

Geoff

Geoff Cox · Dec 7, 2003

You've discovered that regexes aren't very robust/easy/flexible when it
comes to parsing HTML. Use one of the HTML parsers on CPAN.

There seem to be a large number of them! any recommendation?!

Cheers

Geoff

Gunnar Hjalmarsson · Dec 7, 2003

Geoff said:
I am trying to get the name and address only out of following - how
should I do this? Geoff

<TR>
<TD vAlign=top align=left colSpan=4>
<H6><IMG height=10 alt=bullet
src="barnet_files/blue_bullet2.gif"
width=7>  The College</H6></TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Head
Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Smith</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Cedar Road, Northgate,
Sussex N777 5RJ</TD></TR>

That was quite a different question. This might do what you want:

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
               .+?
               Address.+?<TD[^>]+>([^<]+)
              /isx ) {
    print "Name: $1\nAddress: $2\n";
}

But don't use it if you don't understand it. And even if you do
understand it, you may want to use a module for parsing HTML instead.
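
For anyone who wants to try the regex above, here is a self-contained sketch that runs it against the sample rows from the question. It assumes the markup is already in one string; $html, $name and $address are illustrative names, and the whitespace clean-up at the end is an addition that is not part of the original suggestion.

```perl
use strict;
use warnings;

# Sample markup copied from the thread, held in one string (an assumption).
my $html = <<'END_HTML';
<TR>
<TD align=left width="20%" colSpan=2><B>Head
Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Smith</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Cedar Road, Northgate,
Sussex N777 5RJ</TD></TR>
END_HTML

my ($name, $address);
# /i ignores case, /s lets . cross newlines, /x allows the layout and comments.
if ( $html =~ /Head\s+Teacher .+? <TD[^>]+> ([^<]+)   # cell after the "Head Teacher" label
               .+?
               Address        .+? <TD[^>]+> ([^<]+)   # cell after the "Address" label
              /isx ) {
    ($name, $address) = ($1, $2);
    s/\s+/ /g for $name, $address;   # fold the embedded newline into a space
    print "Name: $name\nAddress: $address\n";
}
```

Each `<TD[^>]+>` skips over one opening tag whatever its attributes are, and each `([^<]+)` then captures the text up to the next tag. The pattern still depends on the labels and cells appearing in this order, which is one reason a parser module may be the safer choice.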

Jürgen Exner · Dec 7, 2003

Geoff Cox wrote:

You are asking the wrong question, but anyway...

How do I capture text that goes over 2 lines?

The text could be say

<TD vAlign=top width="80%" colSpan=2>White Road, Northgate,
London N500 5JJJ</TD></TR>

The following code only gets the text up to and including Northgate,

if ($line =~ /<TD vAlign=top(.*?)<\/TD>/m) {

To answer the question you did ask in the subject:
You are using the wrong modifier. Actually you are using exactly the
opposite one to the one you need.
Please "perldoc perlre" about what 'm' and what 's' do.

[...]

Ideas please?!

The question you should have asked but didn't ask is: what is the right tool
to parse HTML?

And as has been answered a gazillion of times: parsing HTML correctly is
rocket science and nobody with a sane mind would attempt to do it using REs.
See 'perldoc -q "remove HTML"' for why and how and what to do instead.

jue

ko · Dec 7, 2003

Geoff said:

[snip]

You've discovered that regexes aren't very robust/easy/flexible when it
comes to parsing HTML. Use one of the HTML parsers on CPAN.

There seem to be a large number of them! any recommendation?!

HTML::Parser. If you're only interested in extracting text, here's an
example to get you started:

http://search.cpan.org/src/GAAS/HTML-Parser-3.34/eg/htext

There are other example scripts in the parent directory.

HTH - keith
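
As a rough sketch of using HTML::Parser selectively, the following keeps only the text found inside table cells and pairs each label cell with the value cell after it. It assumes HTML::Parser is installed from CPAN and that the markup is in one string; the names $html, @cells, $in_td and %field are invented for this example, and the label/value pairing is an assumption about the page layout shown in the thread.

```perl
use strict;
use warnings;
use HTML::Parser ();   # CPAN module, assumed to be installed

# Sample markup from the thread (assumed to be in one string).
my $html = <<'END_HTML';
<TR>
<TD align=left width="20%" colSpan=2><B>Head
Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Smith</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Cedar Road, Northgate,
Sussex N777 5RJ</TD></TR>
END_HTML

my @cells;        # text of each <td>, in document order
my $in_td = 0;    # true while the parser is inside a <td> element

my $parser = HTML::Parser->new(
    api_version => 3,
    # Open a new cell on <td>; tag names are reported in lower case.
    start_h => [ sub { if ($_[0] eq 'td') { $in_td = 1; push @cells, '' } }, 'tagname' ],
    end_h   => [ sub { $in_td = 0 if $_[0] eq 'td' }, 'tagname' ],
    # Keep text only while inside a cell; "dtext" is entity-decoded text.
    text_h  => [ sub { $cells[-1] .= $_[0] if $in_td }, 'dtext' ],
);
$parser->parse($html);
$parser->eof;

# Tidy whitespace, then treat the cells as label => value pairs.
for (@cells) { s/\s+/ /g; s/^ //; s/ $//; }
my %field = @cells;
print "Name: $field{'Head Teacher'}\nAddress: $field{'Address'}\n";
```

The text handler appends rather than assigns because the parser may deliver the text of one cell in several pieces.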

Geoff Cox · Dec 7, 2003

Gunnar Hjalmarsson said:
Geoff said:

I am trying to get the name and address only out of following - how
should I do this? Geoff

<TR>
<TD vAlign=top align=left colSpan=4>
<H6><IMG height=10 alt=bullet
src="barnet_files/blue_bullet2.gif"
width=7>  The College</H6></TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Head
Teacher</B></TD>
<TD vAlign=top width="80%" colSpan=2>Fred Smith</TD></TR>
<TR>
<TD align=left width="20%" colSpan=2><B>Address</B></TD>
<TD vAlign=top width="80%" colSpan=2>Cedar Road, Northgate,
Sussex N777 5RJ</TD></TR>

That was quite a different question. This might do what you want:

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)
.+?
Address.+?<TD[^>]+>([^<]+)
/isx ) {
print "Name: $1\nAddress: $2\n";
}

But don't use it if you don't understand it. And even if you do
understand it, you may want to use a module for parsing HTML instead.

Gunnar,

I have tried an HTML parser - I do get the text OK but would like to
understand your regex ... what does the [^<] stand for?

Geoff

Gunnar Hjalmarsson · Dec 7, 2003

Geoff said:
what does the [^<] stand for?

It's a character class representing any character but '<'.

If you want to learn regular expressions, you need to study

http://www.perldoc.com/perl5.8.0/pod/perlre.html

Not once, not twice, but over and over again. The answer to your
question, and most other questions about Perl regular expressions, can
be found there.

Geoff Cox · Dec 7, 2003

And as has been answered a gazillion of times: parsing HTML correctly is
rocket science and nobody with a sane mind would attempt to do it using REs.
See 'perldoc -q "remove HTML"' for why and how and what to do instead.

OK ! will go for the HTML parser!

Cheers

Geoff

Tad McClellan · Dec 7, 2003

Geoff Cox said:
but could you take a look at my
other email

This is not email.

This is a Usenet newsgroup.

Jürgen Exner · Dec 7, 2003

Geoff said:
On Sun, 07 Dec 2003 12:06:07 +0100, Gunnar Hjalmarsson

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)

[...]
understand your regex ... what does the [^<] stand for?

From "perldoc perlre":
In particular the following metacharacters have their standard
*egrep*-ish meanings:
[...]
[] Character class

However for some unknown reason there is no explanation of the meaning of ^
in the docs.
Only for POSIX classes the docs mention

You can negate the [::] character classes by prefixing the class name
with a '^'.

From this you have to infer that you can negate a non-POSIX class, too.

To answer the original question:
[^<] stands for the character class which contains every character except
the less-than sign.

jue

Tad McClellan · Dec 7, 2003

Geoff Cox said:
what does the [^<] stand for?

It doesn't "stand for" anything, it "matches" something though.

It matches any single character that is not the "<" character.
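
A short sketch of that behaviour, using a made-up string in $cell. It also shows that ^ only negates when it is the first character inside the square brackets; outside them it is the usual start-of-string anchor. (More recent Perl releases document bracketed classes, including negation, in perlrecharclass.)

```perl
use strict;
use warnings;

my $cell = 'Fred Smith</TD></TR>';

# [^<] matches one character that is not "<"; the + repeats it,
# so the capture runs up to, but not including, the first "<".
my ($text) = $cell =~ /^([^<]+)/;

# Outside square brackets ^ is an anchor, not a negation.
my $starts_with_lt = $cell =~ /^</ ? 1 : 0;

print "captured: $text\n";
print "starts with '<': $starts_with_lt\n";
```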

Gunnar Hjalmarsson · Dec 7, 2003

Jürgen Exner said:
Geoff said:

what does the [^<] stand for?

From "perldoc perlre":
In particular the following metacharacters have their
standard *egrep*-ish meanings:
[...]
[] Character class

However for some unknown reason there is no explanation of the
meaning of ^ in the docs.

Hmm.. You are right. Shouldn't somebody better do something about
that? After all, it's one of the most common constructs in Perl
regular expressions.

Alan J. Flavell · Dec 7, 2003

Hmm.. You are right. Shouldn't somebody better do something about
that?

FWIW: I hadn't gained a close acquaintance with regexes before I
started on Perl, and I recall also being a bit disappointed that the
Perl documentation seemed to be written for readers who already would
have a working acquaintance with regexes and were chiefly looking for
details of the specific Perl embodiment.

I noticed more recently that the Cambridge PCRE library (Perl
compatible regular expressions) has a general presentation of this
regular expression syntax, which (as the name implies) is deliberately
close to Perl. It starts about halfway down the composite page
http://www.pcre.org/pcre.txt - below the heading:

PCRE REGULAR EXPRESSION DETAILS

which some readers might find to be a useful adjunct to the Perl
documentation. Hope this helps a bit.

Geoff Cox · Dec 7, 2003

Gunnar Hjalmarsson said:
Geoff said:

what does the [^<] stand for?

It's a character class representing any character but '<'.

Gunnar,

OK thanks for that - I have printed off the perlre pages!

Having tried the HTML Parser module it gives me too much text ... am I
able to use it selectively?

Geoff

Geoff Cox · Dec 7, 2003

Jürgen Exner said:
Geoff said:

On Sun, 07 Dec 2003 12:06:07 +0100, Gunnar Hjalmarsson

if ( $line =~ /Head\s+Teacher.+?<TD[^>]+>([^<]+)

[...]
understand your regex ... what does the [^<] stand for?

From "perldoc perlre":
In particular the following metacharacters have their standard
*egrep*-ish meanings:
[...]
[] Character class

However for some unknown reason there is no explanation of the meaning of ^
in the docs.
Only for POSIX classes the docs mention

You can negate the [::] character classes by prefixing the class name
with a '^'.

From this you have to infer that you can negate a non-POSIX class, too.

To answer the original question:
[^<] stands for the character class which contains every character except
the less-than sign.

jue

Thanks Jue ...

Cheers

Geoff
