Regular Expression help for parsing html tables

steve551979

Hello,

I am having some difficulty creating a regular expression for the
following situation in HTML. I want to find the table that contains
a specific piece of text and then extract the HTML for just that
innermost table.

The string would look something like this:

....stuff here...
<table>
....stuff here...
<table>
....stuff here...
<table>
....
text i'm searching for
....
</table>
....stuff here...
</table>
....stuff here...
</table>
....stuff here...


My question: is there a way in an RE to say, "when I find the text I'm
looking for, search backwards for the nearest instance of the string
"<table>", and then search forwards for the nearest instance of the
string "</table>"?

Any help is appreciated.

Steve.
 
Stefan Behnel

Hi Steve,

> I am having some difficulty creating a regular expression for the
> following string situation in html. I want to find a table that has
> specific text in it and then extract the html just for that immediate
> table.

Any reason why you can't use a real HTML parser and API (e.g. the one provided
by lxml)? That can really make things easier here.

http://codespeak.net/lxml/
http://codespeak.net/lxml/api.html#parsers
http://codespeak.net/lxml/api.html#trees-and-documents
http://effbot.org/zone/element-index.htm
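
As a rough sketch of the parser-based approach: the example below uses the standard library's html.parser instead of lxml, purely so that it runs without extra packages. The idea is the same, namely to let a parser keep track of table nesting rather than asking a regex to do it. The class name InnermostTableFinder and the sample string are made up for illustration, and the sketch assumes the text being searched for sits inside a single text node (not split across tags).

```python
from html.parser import HTMLParser


class InnermostTableFinder(HTMLParser):
    """Find the source of the innermost <table> whose own text contains needle."""

    def __init__(self, html, needle):
        super().__init__()
        self.html = html
        self.needle = needle
        # Absolute offset of each line start, to convert getpos() to an index.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(html) if ch == "\n"]
        self.open_tables = []  # one [start_offset, contains_needle] per open table
        self.result = None
        self.feed(html)
        self.close()

    def _offset(self):
        line, column = self.getpos()  # line is 1-based, column is 0-based
        return self.line_starts[line - 1] + column

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.open_tables.append([self._offset(), False])

    def handle_data(self, data):
        # Mark only the innermost table that is open at this point.
        if self.open_tables and self.needle in data:
            self.open_tables[-1][1] = True

    def handle_endtag(self, tag):
        if tag == "table" and self.open_tables:
            start, contains_needle = self.open_tables.pop()
            if contains_needle and self.result is None:
                # The parser position is at the "<" of the closing tag.
                end = self.html.index(">", self._offset()) + 1
                self.result = self.html[start:end]


# Hypothetical sample mirroring the nesting in the question.
sample = "\n".join([
    "stuff", "<table>", "stuff", "<table>", "stuff", "<table>",
    "...", "text i'm searching for", "...", "</table>",
    "stuff", "</table>", "stuff", "</table>", "stuff",
])
found = InnermostTableFinder(sample, "text i'm searching for").result
print(found)
```

Because the parser knows which table is open when the text is seen, this also copes with tables that carry attributes (e.g. <table border="1">), which a literal "<table>" pattern would miss.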

Stefan
 
Odalrick

It would have been easier if you'd said what the text you are looking
for is, but I think:

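import re
# (?!<table>) keeps the match from starting at an outer table.
regex = re.compile( r'<table>((?:(?!<table>).)*?text you are looking for.*?)</table>',
re.DOTALL )
match = regex.search( html_string )
found_table = match.group( 1 ) if match else None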

would work.
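
The "search backwards, then search forwards" idea from the question can also be written directly with plain string methods, without a regex. This is only a minimal sketch: the function name innermost_table and the sample string are invented for illustration, and it assumes lowercase tags and that no sibling table closes between the nearest opening tag and the text.

```python
def innermost_table(html, needle, open_tag="<table", close_tag="</table>"):
    """Return the table markup immediately surrounding needle, or None."""
    hit = html.find(needle)
    if hit == -1:
        return None
    # Nearest opening tag before the text ("<table" also allows attributes).
    start = html.rfind(open_tag, 0, hit)
    # Nearest closing tag after the text.
    end = html.find(close_tag, hit)
    if start == -1 or end == -1:
        return None
    return html[start:end + len(close_tag)]


# Hypothetical sample mirroring the nesting in the question.
sample = "\n".join([
    "stuff", "<table>", "stuff", "<table>", "stuff", "<table>",
    "...", "text i'm searching for", "...", "</table>",
    "stuff", "</table>", "stuff", "</table>", "stuff",
])
print(innermost_table(sample, "text i'm searching for"))
```

If the page might contain something like <table>...</table> followed by the text inside the same outer table, this simple version picks the wrong opening tag, and a real parser is the safer choice.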

/Odalrick
 
Paddy

Might searching the output of BeautifulSoup(html).prettify() make
things easier?

http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing HTML

- Paddy
 
