Extract information from HTML table

Ulysse

Hello,

I'm trying to extract the data from an HTML table. Here is part of
the HTML source:
"""
<tr>
<td class="tdn" valign="top">
<input name="x44553130" value="y"
type="checkbox"></td>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:24:00</td>
<td class="tdn">
<a href="http://s2.bitefight.fr/bite/
bericht.php?q=01bf0ba7258ad976d890379f987d444e&amp;beid=2628033">Vous
avez tendu une embuscade à votre victime !</a></td>
</tr>
<tr>
<td class="tdn" valign="top">
<input name="x44553032" value="y"
type="checkbox"></td>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:14:35</td>
<td class="tdn">
<a href="http://s2.bitefight.fr/bite/
bericht.php?q=01bf0ba7258ad976d890379f987d444e&amp;beid=2628007">Vous
avez tendu une embuscade à votre victime !</a></td>
</tr>
<tr>
<td class="tdn" valign="top">
<input name="x44552991" value="y"
type="checkbox"></td>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:11:39</td>
<td class="tdn"> Vous avez bien accompli votre
tâche de Gardien de Cimetière et vous vous
voyez remis votre salaire comme récompense.
Vous recevez 320
<img src="messages-bite_fichiers/res2.gif"
alt="Or" align="absmiddle" border="0">
et collectez 3 d'expérience !</td>
</tr>
"""

I would like to transform this into the following:

Date : Sat, 31.03.2007 - 20:24:00
ContainType : Link
LinkText : Vous avez tendu une embuscade à votre victime !
LinkURL : http://s2.bitefight.fr/bite/bericht.php?q=01bf0ba7258ad976d890379f987d444e&amp;beid=2628033

Date : Sat, 31.03.2007 - 20:14:35
ContainType : Link
LinkText : Vous avez tendu une embuscade à votre victime !
LinkURL : http://s2.bitefight.fr/bite/bericht.php?q=01bf0ba7258ad976d890379f987d444e&amp;beid=2628007

Date : Sat, 31.03.2007 - 20:11:39
ContainType : Text
Contain : Vous avez bien accompli votre tâche de Gardien de Cimetière
et vous vous
voyez remis votre salaire comme récompense.
Vous recevez 320 et collectez 3 d'expérience !

.....

Do you know a way to do it?

Thanks
 
placid

You can use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/

See this page for how to search for tags and then retrieve their
contents:

http://www.crummy.com/software/BeautifulSoup/documentation.html#Searching Within the Parse Tree

Cheers
 
irstas

Beautiful Soup is an easy way to parse HTML (that may be broken).
http://www.crummy.com/software/BeautifulSoup/

Here's a start of a parser for your HTML:

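    from bs4 import BeautifulSoup  # Beautiful Soup 4; older releases used "from BeautifulSoup import BeautifulSoup"

    soup = BeautifulSoup(txt, 'html.parser')
    for tr in soup('tr'):                  # soup('tr') is shorthand for soup.find_all('tr')
        dateTd, textTd = tr('td')[1:]      # skip the first (checkbox) cell
        print('Date :', dateTd.contents[0].strip())
        print(textTd)                      # element still needs parsing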

where txt is the string in your message.
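
As a rough sketch of how the remaining parsing step could be finished, the snippet below turns each row into the fields named in the question. It assumes Beautiful Soup 4 (the bs4 package) is installed; the function name parse_rows, the shortened sample HTML and the example.invalid URL are made up for illustration and are not from the thread.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Shortened, hypothetical stand-in for the HTML posted in the question.
txt = """
<tr>
<td class="tdn" valign="top"><input name="x1" value="y" type="checkbox"></td>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:24:00</td>
<td class="tdn">
<a href="http://example.invalid/bericht.php?q=abc&amp;beid=1">Vous
avez tendu une embuscade !</a></td>
</tr>
<tr>
<td class="tdn" valign="top"><input name="x2" value="y" type="checkbox"></td>
<td class="tdn" valign="top" width="30%">
Sat, 31.03.2007 - 20:11:39</td>
<td class="tdn"> Vous recevez 320
<img src="res2.gif" alt="Or" align="absmiddle" border="0">
et collectez 3 d'expérience !</td>
</tr>
"""


def parse_rows(html):
    """Return one dict per <tr>, using the field names from the question."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 3:
            continue  # not a message row
        date_td, text_td = cells[1], cells[2]
        record = {"Date": date_td.get_text(strip=True)}
        link = text_td.find("a")
        if link is not None:
            record["ContainType"] = "Link"
            # Collapse line breaks and repeated spaces inside the link text.
            record["LinkText"] = " ".join(link.get_text().split())
            record["LinkURL"] = link["href"]
        else:
            record["ContainType"] = "Text"
            record["Contain"] = " ".join(text_td.get_text().split())
        records.append(record)
    return records


for record in parse_rows(txt):
    for key, value in record.items():
        print(key, ":", value)
    print()
```

Note that the parser decodes &amp; in the href to a plain &. If the href itself is wrapped across lines, as in the posted source, the line break would still have to be stripped from LinkURL.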
 
Ulysse

I have read the Beautiful Soup online help and tried to apply it to
my problem, but it seems a little bit hard. I would rather try to
do this with regular expressions...
 
Dotan Cohen

If you think that Beautiful Soup is difficult, then wait until you try
to do this with regexes. Granted, knowing the exact format of the HTML
you are scraping will help, but if you ever need to parse HTML from an
unknown source, then Beautiful Soup is the only way to go. Not all HTML
authors close their td and tr tags, and sometimes those tags carry
attributes. If you plan on ever reusing the code, or if the format of
the HTML may change, then you are best off sticking with Beautiful
Soup.

Dotan Cohen

http://lyricslist.com/
http://what-is-what.com/
 
anjesh

Have you tried HTMLParser? It can do the task you want to perform:
http://docs.python.org/lib/module-HTMLParser.html

-anjesh
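
A minimal sketch of the HTMLParser approach follows. It uses html.parser, the Python 3 name of the HTMLParser module. The class RowParser, the sample row and the example.invalid URL are hypothetical, and the sketch assumes every tr is explicitly closed.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class RowParser(HTMLParser):
    """Collect the date and message cell of each <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.records = []
        self._cells = None    # list of text chunks per <td> in the current row
        self._href = None     # href of a link seen in the current row
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._href = [], None
        elif tag == "td" and self._cells is not None:
            self._cells.append([])
            self._in_td = True
        elif tag == "a" and self._in_td:
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._cells is not None:
            self._finish_row()

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1].append(data)

    def _finish_row(self):
        # Join the chunks of each cell and collapse runs of whitespace.
        cells = [" ".join("".join(chunks).split()) for chunks in self._cells]
        self._cells = None
        self._in_td = False
        if len(cells) < 3:
            return  # not a message row
        record = {"Date": cells[1]}
        if self._href is not None:
            record.update(ContainType="Link", LinkText=cells[2], LinkURL=self._href)
        else:
            record.update(ContainType="Text", Contain=cells[2])
        self.records.append(record)


# Shortened, made-up row in the same shape as the posted HTML.
sample = """<tr>
<td class="tdn"><input name="x1" value="y" type="checkbox"></td>
<td class="tdn">Sat, 31.03.2007 - 20:24:00</td>
<td class="tdn"><a href="http://example.invalid/bericht.php?q=abc&amp;beid=1">Vous
avez tendu une embuscade !</a></td>
</tr>"""

parser = RowParser()
parser.feed(sample)
parser.close()
for record in parser.records:
    print(record)
```

Compared with Beautiful Soup, this needs no third-party package, but the row and cell bookkeeping has to be written by hand.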
 
Cameron Laird

Yes, except that these last two follow-ups UNDERstate the difficulty--in
fact, the impossibility--of achieving adequate results on this problem
with regular expressions. We'll help with the documentation for HTMLParser
and BeautifulSoup. REs are an invitation to madness.

<URL: http://www.unixreview.com/documents/s=10121/ur0702e/ > might amuse
those who want to think more about REs.
 
Ulysse

r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2})</td>\W*?<td class="tdn">\W*?<a href="(.*?)">(.*?)</a>.*?</td>'

r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2}).*?player\.php.*?>(.*?)</a>.*?<textarea.*?>(.*?)</textarea>'

r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2})</td>\W*?<td class="tdn">\W*?Message au clan de :([a-zA-Z0-9_\-]+?)\W*<br>(.*?)</th>'

These three REs extract all the data I need. They do not apply exactly
to the string I posted.
I read the article, but I didn't understand why REs are an invitation to
madness...
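
One way to see the fragility that the earlier replies warn about is to run the first pattern against a row and then against the same row with a harmless markup change. The sketch below does that. The re.DOTALL flag is an assumption (the post does not say which flags were used), and the sample row with its example.invalid URL is made up.

```python
import re

# First pattern from the post above, joined onto one line.
# re.DOTALL is an assumption: the post does not say which flags were used.
pattern = re.compile(
    r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2})</td>\W*?<td class="tdn">'
    r'\W*?<a href="(.*?)">(.*?)</a>.*?</td>',
    re.DOTALL,
)

# Made-up row in the same shape as the posted HTML.
row = (
    '<td class="tdn" valign="top" width="30%">\n'
    'Sat, 31.03.2007 - 20:24:00</td>\n'
    '<td class="tdn">\n'
    '<a href="http://example.invalid/bericht.php?beid=1">Vous\n'
    'avez tendu une embuscade !</a></td>\n'
)

# The same row after one harmless change (hypothetical): the message cell
# gains an attribute in front of class="tdn".
changed = row.replace('<td class="tdn">', '<td valign="top" class="tdn">')

print(pattern.search(row).groups())  # date, URL and link text are captured
print(pattern.search(changed))       # None: the pattern no longer matches
```

A browser renders both rows identically, and an HTML parser returns the same cells for both, while the pattern has to be rewritten for each such variation.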
 
