Parsing HTML :: output to comma-delimited text

samuels

Hello All,

I am a total Python newbie, and I need help writing a script.

This is what I want to do:

There is a list of links at http://www.rentalhq.com/fulllist.asp. Each
link goes to a page like
http://www.rentalhq.com/store.asp?id=907/272-4425 that contains a
company name, address, phone, and fax. I want to fetch each page, parse
this information, and export it to a comma-delimited (or tab-delimited)
text file. The important information in each page is:

<table border="0" cellpadding="0" cellspacing="0"
style="border-collapse: collapse" bordercolor="#111111" width="100%"
id="AutoNumber1">
<tr>
<td width="100%" colspan="2">
<h2 style="text-align: center; margin-top:2; margin-bottom:2;
line-height:14px" class="title">
<font size="4">United Rentals Inc.</font>
</h2>

<h3 style="text-align: center; margin-top:4;
margin-bottom:4">3401&nbsp;Commercial&nbsp;Dr.&nbsp;
Anchorage&nbsp;AK,&nbsp;99501-3024
</h3>
<p style="text-align: center; margin-top:4; margin-bottom:4">
<a target="_blank"
href="http://maps.google.com/maps?q=3401+Commercial+Dr. Anchorage AK
99501-3024 ">
<!-- <a target="_blank"
href="http://www.mapquest.com/maps/map.ad...Commercial+Dr.&zip=99501-3024&country=&zoom=8">-->
<img height="15" src="Scraps/Rental_Images/map.gif" width="33"
border="0"></a>
</p>
</td>
</tr>
<tr>
<td width="50%" valign="top">
<p style="text-align: center; line-height:100%; margin-top:0;
margin-bottom:0">&nbsp;
</p>
<p style="text-align: center; line-height: 100%; margin-top:0;
margin-bottom:0">
<b>Phone</b> - 907/272-4425<br>
<b>Fax</b> - 907/272-9683 </p>

So from that I want output like:

United Rentals Inc.,3401 Commercial Dr.,Anchorage,AK,"995013024","9072724425","9072729683"

or

United Rentals Inc. 3401 Commercial Dr. Anchorage AK 995013024 9072724425 9072729683


I have been messing around with Beautiful Soup
(http://www.crummy.com/software/BeautifulSoup/index.html) but haven't
gotten very far, especially because the HTML is so sloppy.

Any help would be really appreciated! Just point me in the right
direction, what to use, examples... Thanks!

-Sam
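
A minimal sketch of the parsing step, using only Python's standard library (html.parser) rather than Beautiful Soup. It assumes each store page has exactly one <h2> (the company name) and one <h3> (the address), and that the phone and fax appear as "Phone - ..." and "Fax - ..." in the page text, as in the fragment quoted above. The names StoreParser, clean, and extract_store are made up for this illustration.

```python
import re
from html.parser import HTMLParser


class StoreParser(HTMLParser):
    """Collect the text of <h2>/<h3> plus all page text (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &nbsp; arrives as "\xa0"
        self.heading = None                      # heading tag we are inside, if any
        self.headings = {"h2": [], "h3": []}
        self.text = []                           # every text chunk on the page

    def handle_starttag(self, tag, attrs):
        if tag in self.headings:
            self.heading = tag

    def handle_endtag(self, tag):
        if tag == self.heading:
            self.heading = None

    def handle_data(self, data):
        self.text.append(data)
        if self.heading is not None:
            self.headings[self.heading].append(data)


def clean(text):
    """Turn non-breaking spaces into spaces and collapse whitespace runs."""
    return " ".join(text.replace("\xa0", " ").split())


def extract_store(page_html):
    """Return name, address, phone, and fax as cleaned-up strings."""
    parser = StoreParser()
    parser.feed(page_html)
    parser.close()
    body = clean("".join(parser.text))
    # Assumption: the labels look like "Phone - 907/272-4425" in the page text.
    phone = re.search(r"Phone\s*-\s*([\d/-]+)", body)
    fax = re.search(r"Fax\s*-\s*([\d/-]+)", body)
    return {
        "name": clean("".join(parser.headings["h2"])),
        "address": clean("".join(parser.headings["h3"])),
        "phone": phone.group(1) if phone else "",
        "fax": fax.group(1) if fax else "",
    }
```

Calling extract_store() on the HTML of one store page returns a dict with the keys "name", "address", "phone", and "fax". Pages that break the one-<h2>, one-<h3> assumption would need a more careful parser.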
 
William Park

I'm sure others will give a proper Python solution. But here, the shell
is not a bad tool.

lynx -dump 'http://www.rentalhq.com/store.asp?id=907/272-4425' | \
awk '/Return to List of Rental Stores/,/To reserve an item/' | \
sed -n -e '3p;5p;10p;11p'

gives me

United Rentals Inc.
3401 Commercial Dr. Anchorage AK, 99501-3024
Phone - 907/272-4425
Fax - 907/272-9683

--
William Park, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
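
The four lines of lynx output above are not yet in the comma-delimited form asked for. A rough Python sketch of that last step follows. It assumes the address always ends in "CITY ST, ZIP" and, more fragile, that the city is a single word (multi-word cities would be split wrongly). ADDRESS_RE, digits, lines_to_row, and to_csv_line are names invented for this example.

```python
import csv
import io
import re

# Assumption: "<street> <one-word city> <2-letter state>, <zip>"
ADDRESS_RE = re.compile(
    r"^(?P<street>.+)\s+(?P<city>\S+)\s+(?P<state>[A-Z]{2}),\s*(?P<zip>[\d-]+)$"
)


def digits(text):
    """Keep only the digits, e.g. 'Phone - 907/272-4425' -> '9072724425'."""
    return re.sub(r"\D", "", text)


def lines_to_row(lines):
    """Turn [name, address, phone line, fax line] into a list of CSV fields."""
    name, address, phone, fax = (line.strip() for line in lines)
    match = ADDRESS_RE.match(address)
    if match is None:
        raise ValueError("unrecognized address: %r" % address)
    return [
        name,
        match["street"],
        match["city"],
        match["state"],
        digits(match["zip"]),
        digits(phone),
        digits(fax),
    ]


def to_csv_line(row):
    """Format one row with the csv module so embedded commas get quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(row)
    return buffer.getvalue()
```

Using the csv module instead of ",".join() means a company name that contains a comma is quoted rather than silently splitting into two fields.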
 
samuels

Thanks for the replies, I'll post here when/if I finally get it
working.

So, now I know how to extract the links from the big page and extract
the text from an individual page. What I really need to find out is
how to run the script on each individual page automatically and get the
output in comma-delimited format. Thanks for solving the two problems,
though :)

-Sam
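
One way to tie the steps together, sketched with the standard library only: collect the store links from the list page, fetch each one, and write one CSV row per page. This assumes the links on the list page have an href containing "store.asp?id=" (inferred from the example URL, not checked against the site) and that the pages can be decoded as Latin-1. The parse_page argument is a placeholder for whatever function turns one page's HTML into a list of fields; LinkCollector, fetch, store_links, and scrape are illustrative names.

```python
import csv
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that contain a marker string."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            self.links.append(href)


def fetch(url):
    """Download a page; Latin-1 is an assumption about the site's encoding."""
    with urlopen(url) as response:
        return response.read().decode("latin-1")


def store_links(list_url, list_html, marker="store.asp?id="):
    """Return absolute URLs of the store pages linked from the list page."""
    collector = LinkCollector(marker)
    collector.feed(list_html)
    return [urljoin(list_url, href) for href in collector.links]


def scrape(list_url, parse_page, out, fetch=fetch):
    """Write one CSV row per store page to the file-like object `out`."""
    writer = csv.writer(out)
    count = 0
    for url in store_links(list_url, fetch(list_url)):
        writer.writerow(parse_page(fetch(url)))
        count += 1
    return count
```

To write to a real file, open it with open("stores.csv", "w", newline="") and pass it as out; for tab-delimited output, csv.writer accepts delimiter="\t". The fetch parameter can be swapped for a stub to try the loop without touching the network.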
 
