Match regular expression from LEFT to right

fritz-bayer

Hi,

Let's say I have the following string:

<tr> dfsdfre <tr>fsdsfd35gd <tr>khf758 <tr>afdga654jhuotj <input
type="text"> 67kfbs356</tr>sh tu65 </tr> hbrubs</tr>

and I would like to capture the text before the <input...> until the
first <tr> and the text after until the first <tr> so that I get

<tr>afdga654jhuotj <input type="text"> 67kfbs356</tr>

How would I do this?

=~ m!<tr>.*?<input type="text">.*?<tr>!

will only work capturing the first <tr> after the <input..>. The
problem is that I have to find an expression that starts looking from
the right to the left of <input...>.

Fritz
 
anno4000

Your explanation is confused about whether the closing part should
be <tr> or </tr>. Please clear that up.

Anno
 
Paul Lalli

> Let's say I have the following string:
>
> <tr> dfsdfre <tr>fsdsfd35gd <tr>khf758 <tr>afdga654jhuotj <input
> type="text"> 67kfbs356</tr>sh tu65 </tr> hbrubs</tr>

I can't imagine why you'd have such a thing, as it's massively
incorrectly formatted HTML, but okay.

> and I would like to capture the text before the <input...> until the
> first <tr> and the text after until the first <tr> so that I get
>
> <tr>afdga654jhuotj <input type="text"> 67kfbs356</tr>
>
> How would I do this?
>
> =~ m!<tr>.*?<input type="text">.*?<tr>!

I assume you meant to include a / in the last <tr>.

> will only work capturing the first <tr> after the <input..>. The
> problem is that I have to find an expression that starts looking from
> the right to the left of <input...>.

One idea might be to use a negative look-ahead assertion. Basically
say "match a <tr> that's not followed by anything that includes <tr>".


$ perl -le'
$_ = q{<tr> dfsdfre <tr>fsdsfd35gd <tr>khf758 <tr>afdga654jhuotj <input type="text"> 67kfbs356</tr>sh tu65 </tr> hbrubs</tr>};
m#(<tr>(?!.*<tr>.*).*<input type="text">.*?</tr>)#;
print $1;
'
<tr>afdga654jhuotj <input type="text"> 67kfbs356</tr>


I assume there are other ways as well.

Paul Lalli
 
fritz-bayer

> Your explanation is confused about whether the closing part should
> be <tr> or </tr>. Please clear that up.
>
> Anno

Hi Anno,

this is just an example. Actually I'm not looking for a concrete
solution for this. It's just an example to illustrate my problem.

Say I have a long text, and somewhere in the
middle I have the phrase "this is the center of the text".
How can I capture this sentence plus the 10 words preceding and
following it?

Or how could I capture this sentence plus any text which precedes it
UNTIL the word "stopword" is matched? A "stopword.*?this is
the center of the text" can fail if the word "stopword" is a common
word which appears several times.

Maybe in this context my phrase matching from "right to left" becomes
more meaningful?

Fritz
 
anno4000

> Hi Anno,
>
> this is just an example. Actually I'm not looking for a concrete
> solution for this. It's just an example to illustrate my problem.

Right. Because it is meant to illustrate the problem it is important
that you make it consistent.

Concrete solutions are all I have to offer. I don't think there is
a general solution to your backwards-matching problem. There may
well be individual solutions to special cases of it.

> Say I have a long text, and somewhere in the
> middle I have the phrase "this is the center of the text".
> How can I capture this sentence plus the 10 words preceding and
> following it?

I'll use a somewhat simplistic definition of "word": any sequence of
non-spaces, optionally followed by a space, /\S+ ?/ in regex. This
will match $n words around "this is the center ":

my $text = "stop this and stop that and " .
"this is the center " .
"and stop stop again";
my $n = 3;
my ( $extr) = $text =~
/((?:\S+ ?){$n}this is the center (?:\S+ ?){$n})/;
print $extr || '-failed-', "\n";

That prints three words on both sides of "this is the center "

stop that and this is the center and stop stop

> Or how could I capture this sentence plus any text which precedes it
> UNTIL the word "stopword" is matched? A "stopword.*?this is
> the center of the text" can fail if the word "stopword" is a common
> word which appears several times.

Using "stop" for "stopword" and the same text from above:

( $extr) = $text =~ /.*(stop.*this is the center.*?stop)/;
print $extr || '-failed-', "\n";

prints the nearest pair of "stop"s surrounding "this is the center"
plus intervening text:

stop that and this is the center and stop

These are rough solutions which may be good enough for some
applications but not for others. Refining them while sticking
to the principle that one regex must do it is usually *not*
worth the while. If a robust, flexible solution is needed,
it is better to do the work in several steps.
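
As a rough sketch of what doing the work in several steps could look like (the variable names are made up for illustration, and the text is the same sample as above): find the center phrase with index, search leftwards from it with rindex for the nearest "stop", then search rightwards with index for the next one.

```perl
use strict;
use warnings;

my $text   = "stop this and stop that and this is the center and stop stop again";
my $center = "this is the center";
my $stop   = "stop";

my $extract = '-failed-';

# Step 1: locate the center phrase.
my $mid = index($text, $center);
if ($mid >= 0) {
    # Step 2: rindex scans towards the start of the string, so it finds
    # the last "stop" that begins at or before the center phrase.
    my $left = rindex($text, $stop, $mid);

    # Step 3: ordinary index finds the first "stop" after the phrase.
    my $right = index($text, $stop, $mid + length $center);

    if ($left >= 0 && $right >= 0) {
        $extract = substr($text, $left, $right + length($stop) - $left);
    }
}
print "$extract\n";
```

Because rindex really does search from right to left, no greedy or lazy quantifier tricks are needed here, at the cost of a few more lines.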

Anno
 
robic0

Finding text from the phrase back to the keyword is indeed hard, since
searches proceed left to right in general.
HTML/XML is indeed easier to parse because of its markup, and indeed
one of the hardest things to do correctly.

Some other alternatives:

- Method 1 is a negated character class with one character, '<'.
<> are very powerful delimiters.

- Method 2 is an alternative to the negative assertion construct (?!...)
that I believe was mentioned by another poster. I believe the method below
is a close approximation of negative assertions.
I'm not at all comfortable with negative assertions; however, logically it is the only way.

I made all the tags start tags, and narrowed down the regex to the range
of interest, the start/end text:


use strict;
use warnings;

my $string =
'<tr> dfsdfre <tr>fsdsfd35gd <tr>khf758 <tr>afdga654jhuotj <input type="text"> 67kfbs356<tr>sh tu65 <tr> hbrubs<tr>';

# -- method 1 --

my ($capt) = $string =~ m!(<tr>[^<]*<input type="text">)!;
print "found: $capt\n";


# -- method 2 --

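while ($string =~ /<tr>(.*?)(?:(<tr>)|<input type="text">)/g)
# $1 is the text after a <tr>; $2 is set only when another <tr> came first
{
    if (defined $2)
    {
        # Another <tr> came first: back up 4 characters (the length of
        # '<tr>') so the next attempt starts at that closer <tr>.
        pos($string) = pos($string) - 4;
        next;
    }
    print "found: $1\n";
}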

__END__

found: <tr>afdga654jhuotj <input type="text">
found: afdga654jhuotj
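
Since searches proceed left to right, another way to look leftwards from the phrase is to reverse everything: reverse the string, match the reversed phrase followed lazily by the reversed keyword, and reverse the capture back. This is only a sketch, not one of the two methods above; it reuses the same $string, the variable names are invented for illustration, and if the phrase occurs more than once it picks the last occurrence.

```perl
use strict;
use warnings;

my $string =
'<tr> dfsdfre <tr>fsdsfd35gd <tr>khf758 <tr>afdga654jhuotj <input type="text"> 67kfbs356<tr>sh tu65 <tr> hbrubs<tr>';

my $phrase = '<input type="text">';
my $stop   = '<tr>';

# Assigning to a scalar puts reverse in scalar context,
# so it reverses the characters of the string.
my $rev_string = reverse $string;
my $rev_phrase = reverse $phrase;
my $rev_stop   = reverse $stop;

my $found = '-failed-';

# In the reversed string, "to the left of the phrase" becomes
# "to the right", so a plain lazy match reaches the nearest keyword.
if ($rev_string =~ /(\Q$rev_phrase\E.*?\Q$rev_stop\E)/s) {
    $found = reverse $1;
}
print "found: $found\n";
```

On this sample it captures the same text as method 1.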
 
fritz-bayer

robic0 said:
> Finding text from the phrase back to the keyword is indeed hard, since
> searches proceed left to right in general.

That is actually getting to the heart of my question. If somebody could
tell me how to do this, then my problem would be solved.

I have looked in the O'Reilly regular expressions book but could not find a
topic on it, though I only skimmed through the chapters.

I had a feeling that lookaheads could be helpful, because they are
just used to mark a position. But this I guess still leaves me with
defining where this position is, so I figured they aren't the answer to
my question.

> HTML/XML is indeed easier to parse because of its markup, and indeed
> one of the hardest things to do correctly.

Actually the real example consists of HTML. So there are all kinds of
different tags and they of course can vary. I want to grab a group of
radio boxes, which are contained inside a table. But I only want to
grab the first and last row of the table and everything within.

The back of the expression is easy, but searching from the first radio
box to the left is difficult, because up to this point the document
contains all kinds of tags, words and so on, so you always catch
something in the front. That's why I would like to look from the right
to the left. Then I could ignore all the noise before it.

> Some other alternatives:
>
> - Method 1 is a negated character class with one character, '<'.
> <> are very powerful delimiters.

This would fail of course, because it will capture many other tags in
front of my radio button group.

> - Method 2 is an alternative to the negative assertion construct (?!...)
> that I believe was mentioned by another poster. I believe the method below
> is a close approximation of negative assertions.
> I'm not at all comfortable with negative assertions; however, logically it is the only way.
>
> I made all the tags start tags, and narrowed down the regex to the range
> of interest, the start/end text:

This could work if I let it run through until the very end. However,
I'm not sure; I have to try.
 
fritz-bayer

> Right. Because it is meant to illustrate the problem it is important
> that you make it consistent.

Hi Anno, sorry, you are right. The thing is my real example contains so
much text that I did not want to post it here. But of course, if I
don't, I'm likely to get the right answer to the wrong question.

So let me explain this in words, as I did below. I'm trying to capture a
group of radio buttons which resides inside a table in the middle of an
HTML document, which contains lots of tags and text.

Capturing the back of the HTML table after the radio buttons is easy, as
a ".*?</table>" will do the job. However, capturing the first <table>
tag before the radio button group is more difficult, because there are
plenty of table tags before.

Actually I only want to get the rows in the table, which contain the
radio buttons, but I guess once I get the table I can just strip the
table tags off.
 
Tad McClellan

> Actually the real example consists of HTML.


Then you probably should not be trying to process it with
regular expressions.

You should use a module that understands HTML for processing HTML data.
 
anno4000

> [...]
>
> So let me explain this in words, as I did below. I'm trying to capture a
> group of radio buttons which resides inside a table in the middle of an
> HTML document, which contains lots of tags and text.
>
> Capturing the back of the HTML table after the radio buttons is easy, as
> a ".*?</table>" will do the job. However, capturing the first <table>
> tag before the radio button group is more difficult, because there are
> plenty of table tags before.

It isn't so hard. Did you look at the "stop" example I gave?

Allowing an arbitrary greedy match before capturing the leading "stop"
eats as much text as possible while still allowing the match. So it
finds the "stop" nearest to the center text with no more intervening
"stop"s. That's what you want, isn't it?

Anno
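
Applied to the table case, the same greedy-prefix idea might look like the sketch below. The HTML sample is invented for illustration, and it assumes the radio buttons are written as <input type="radio" ...>; real pages may differ. If several tables contain radio buttons, this picks the last one.

```perl
use strict;
use warnings;

# Hypothetical document: three tables, only the middle one has radio buttons.
my $html = join "\n",
    '<table><tr><td>navigation</td></tr></table>',
    '<p>some text</p>',
    '<table><tr><td><input type="radio" name="c" value="1"> one</td></tr>',
    '<tr><td><input type="radio" name="c" value="2"> two</td></tr></table>',
    '<table><tr><td>footer</td></tr></table>';

# The leading greedy .* swallows as much as it can and only gives text
# back until a <table that is still followed by a radio button fits.
# That is the <table nearest to the radio group on its left.
my ($table) = $html =~ m{.*(<table\b.*?<input\s+type="radio".*?</table>)}si;

print defined $table ? $table : '-failed-', "\n";
```

The /s modifier lets the dot cross line breaks, which real HTML will almost certainly need.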
 
Xicheng Jia

Don't know if this helps:

1) If there are no other <input type="text"> tags, then it's easy:

m#(<tr>(?:(?!<tr>).)*?<input\s+type="text">.*?</tr>)#sgi

The (?:(?!<tr>).) construct makes sure there is not another <tr>
between the opening <tr> and your <input> tag.

Add the s modifier to allow the dot to match a newline.

2) If there are more than 2 <input type="text"> tags within the table,
then try adding some more constraints. Here I add another anchor:

m{ <table>
   # anything except <input type="text">; \s+ is needed because the x
   # modifier ignores literal spaces in the pattern
   (?:(?!<input\s+type="text">).)*?
   # start of capture group 1
   (
     <tr>
     # anything but <tr>
     (?:(?!<tr>).)*?
     <input\s+type="text">.*?
     </tr>
   # end of capture group 1
   )
}sxi

You can use a similar way to get another one..
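
To see the (?:(?!<tr>).) construct from option 1 at work, here is a small sketch run against the string from the original post, line break included. The variable names are mine, chosen for illustration.

```perl
use strict;
use warnings;

# The string from the first post, with its line break after "<input".
my $string = qq{<tr> dfsdfre <tr>fsdsfd35gd <tr>khf758 <tr>afdga654jhuotj <input\n}
           . qq{type="text"> 67kfbs356</tr>sh tu65 </tr> hbrubs</tr>};

# (?:(?!<tr>).)*? advances one character at a time, but refuses to step
# onto the start of another <tr>. An attempt that begins at one of the
# earlier <tr> tags therefore fails, and only the <tr> closest to the
# <input> can open the match.
my ($row) = $string =~ m#(<tr>(?:(?!<tr>).)*?<input\s+type="text">.*?</tr>)#si;

print defined $row ? $row : '-failed-', "\n";
```

The \s+ between "input" and "type" is what lets the pattern match across the line break in the sample.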

Good luck
Xicheng
 
Xicheng Jia

Xicheng said:
> 2) If there are more than 2 <input type="text"> tags within the table,
> then try adding some more constraints. Here I add another anchor:

I should have said:

2) If there are more <input type="text"> tags outside or inside the
table, then try to add more constraints. And it's not an anchor that I
added here, just some more context. Sorry for my English. :-(

Good luck
Xicheng
 
fritz-bayer

anno4000 wrote:
> It isn't so hard. Did you look at the "stop" example I gave?
>
> Allowing an arbitrary greedy match before capturing the leading "stop"
> eats as much text as possible while still allowing the match. So it
> finds the "stop" nearest to the center text with no more intervening
> "stop"s. That's what you want, isn't it?
>
> Anno

Yep you are right...
 
robic0

fritz-bayer wrote:
> anno4000 wrote:
>> It isn't so hard. Did you look at the "stop" example I gave?
>> my $text = "stop this and stop that and " .
>> "this is the center " .
>> "and stop stop again";
>> ( $extr) = $text =~ /.*(stop.*this is the center.*?stop)/;
>> print $extr || '-failed-', "\n";
>>
>> Allowing an arbitrary greedy match before capturing the leading "stop"
>> eats as much text as possible while still allowing the match. So it
>> finds the "stop" nearest to the center text with no more intervening
>> "stop"s. That's what you want, isn't it?
>>
>> Anno
>
> Yep you are right...

Yeah, but it's wrong. Don't patronize anybody here, dude; your credibility dollar
has about .02 left. Post your problem text and stop playing games.

I've held off telling you the true story about searches/matches/anchors because
it's about ten goddamned miles over your head, and I want to be nice. Stop
thinking you have just discovered the truth of the universe. You haven't;
any neophyte Perl programmer discovers this paradigm.

Wake up and smell the coffee.......
 
