Pattern matching [newbie]



vivek_12315

I'm working on my Perl regex code, where I have to parse an HTML line like:

<a href="/question?id=15422849"><p>MY text here 1</p><p>MY text here 2</p><p>MY text here 3</p></a>

I am doing something like:
$string =~ m/(.*)href(.*)/;

But this does not give me what I want. I want something closer to the following text:

"MY text here 1 MY text here 2 MY text here 3"

Can someone give me some ideas?
 


Jürgen Exner

Your Question used to be Asked Frequently. Please see

perldoc -q "remove html"

jue
 

brian d foy

http://search.cpan.org/dist/HTML-Strip/Strip.pm
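
For reference, a minimal sketch of how that module is typically used on the line from the question, assuming HTML::Strip is installed from CPAN (it is not part of core Perl). The whitespace cleanup at the end is an added assumption about the desired output, not something the module requires:

```perl
use strict;
use warnings;
use HTML::Strip;   # CPAN module, must be installed separately

my $string = '<a href="/question?id=15422849"><p>MY text here 1</p>'
           . '<p>MY text here 2</p><p>MY text here 3</p></a>';

my $hs   = HTML::Strip->new();
my $text = $hs->parse($string);   # returns the input with the markup removed
$hs->eof;                         # reset the parser before reusing it

# Assumption: collapse runs of whitespace and trim both ends,
# so the result is a single tidy line.
$text =~ s/\s+/ /g;
$text =~ s/^\s+|\s+$//g;

print "$text\n";
```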
 


Jürgen Exner

Henry Law said:
I appreciate that you call yourself a newbie, and to you what I'm about
to suggest may seem complicated and difficult; but that's the way we all
learn ...

Have you thought of parsing the HTML properly, using a module like
HTML::Tree or HTML::TreeBuilder? The hardest part is choosing the
module; after that you should find it moderately easy to use it to do
what you want, since it's pretty simple. And once you've done it, it
will probably be a lot better than hand-cranked parsing code.

Note to all concerned: I'm not joining in the "you can't parse HTML with
regexes" thread. In this case, at least, I'm sure that's perfectly
possible.

Actually for this particular example it is almost trivial(*):
s/<.*?>//g;
Of course this is going to fail as soon as the HTML code becomes a tiny
bit more complex.

*: almost because it doesn't add the space characters between the
individual paragraph elements.
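
A rough sketch of that substitution with the missing spaces handled, using only core Perl. It assumes the input is exactly as simple as the line in the question (no '>' inside attribute values, no comments, no entities); replacing each tag with a space and then tidying the whitespace is one possible workaround, not the only one:

```perl
use strict;
use warnings;

my $string = '<a href="/question?id=15422849"><p>MY text here 1</p>'
           . '<p>MY text here 2</p><p>MY text here 3</p></a>';

# Work on a copy so the original line stays intact.
my $text = $string;

# Replace every tag with a single space instead of deleting it,
# so adjacent paragraphs do not run together.
$text =~ s/<.*?>/ /g;

# Collapse the resulting runs of whitespace and trim both ends.
$text =~ s/\s+/ /g;
$text =~ s/^\s+|\s+$//g;

print "$text\n";   # prints: MY text here 1 MY text here 2 MY text here 3
```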

jue
 
