Parsing html :: output to comma delimited

samuels · Jul 16, 2005

Hello All,

I am a total Python newbie, and I need help writing a script.

This is what I want to do:

There is a list of links at http://www.rentalhq.com/fulllist.asp. Each
link goes to a page like
http://www.rentalhq.com/store.asp?id=907/272-4425 that contains a
company name, address, phone, and fax. I want to extract each page, parse
this information, and export it to a comma-delimited or tab-delimited
text file. The important information in each page is:

<table border="0" cellpadding="0" cellspacing="0"
style="border-collapse: collapse" bordercolor="#111111" width="100%"
id="AutoNumber1">
<tr>
<td width="100%" colspan="2">
<h2 style="text-align: center; margin-top:2; margin-bottom:2;
line-height:14px" class="title">
<font size="4">United Rentals Inc.</font>
</h2>

<h3 style="text-align: center; margin-top:4;
margin-bottom:4">3401 Commercial Dr. 
Anchorage AK, 99501-3024
</h3>
<p style="text-align: center; margin-top:4; margin-bottom:4">
<a target="_blank"
href="http://maps.google.com/maps?q=3401+Commercial+Dr. Anchorage AK
99501-3024 ">

<img height="15" src="Scraps/Rental_Images/map.gif" width="33"
border="0"></a>
</p>
</td>
</tr>
<tr>
<td width="50%" valign="top">
<p style="text-align: center; line-height:100%; margin-top:0;
margin-bottom:0"> 
</p>
<p style="text-align: center; line-height: 100%; margin-top:0;
margin-bottom:0">
<b>Phone</b> - 907/272-4425<br>
<b>Fax</b> - 907/272-9683 </p>

So from that I want output like:

United Rentals Inc.,3401 Commercial Dr.,Anchorage,AK,"995013024","9072724425","9072729683"

or

United Rentals Inc. 3401 Commercial Dr. Anchorage AK 995013024 9072724425 9072729683

I have been messing around with Beautiful Soup
(http://www.crummy.com/software/BeautifulSoup/index.html) but haven't
gotten very far (especially because the HTML is so sloppy).

Any help would be really appreciated! Just point me in the right
direction, what to use, examples... Thanks!

-Sam

William Park · Jul 17, 2005

I'm sure others will give a proper Python solution. But the shell is
not a bad tool here.

lynx -dump 'http://www.rentalhq.com/store.asp?id=907/272-4425' | \
awk '/Return to List of Rental Stores/,/To reserve an item/' | \
sed -n -e '3p;5p;10p;11p'

gives me

United Rentals Inc.
3401 Commercial Dr. Anchorage AK, 99501-3024
Phone - 907/272-4425
Fax - 907/272-9683
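
For the comma-delimited step, here is a minimal Python sketch that turns those four lines into one CSV row. The names `to_row` and `to_csv` are made up for illustration. It assumes the address always ends in "City ST, ZIP" and that the city is a single word, which will not hold for something like "Los Angeles", so treat the regular expression as a starting point only.

```python
import csv
import io
import re

# Assumption: "<street> <one-word city> <two-letter state>, <zip>".
ADDRESS_RE = re.compile(
    r"^(?P<street>.+)\s+(?P<city>\S+)\s+(?P<state>[A-Z]{2}),?\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?)$"
)


def digits(text):
    """Keep only the digits, e.g. 'Phone - 907/272-4425' -> '9072724425'."""
    return re.sub(r"\D", "", text)


def to_row(lines):
    """Split the four lines (name, address, phone, fax) into seven fields."""
    name, address, phone, fax = (line.strip() for line in lines)
    m = ADDRESS_RE.match(address)
    if m is None:
        raise ValueError("unrecognised address: %r" % address)
    return [name, m["street"], m["city"], m["state"],
            digits(m["zip"]), digits(phone), digits(fax)]


def to_csv(lines):
    """Format one record as a single CSV line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(to_row(lines))
    return buf.getvalue()
```

The `csv` module only adds quotes where a field needs them, so the numeric fields come out unquoted. Passing `quoting=csv.QUOTE_ALL` to `csv.writer` would quote every field instead.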

--
William Park, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/

Paul McGuire · Jul 17, 2005

Pyparsing includes a sample program for extracting URLs from web pages.
You should be able to adapt it to this problem.

Download pyparsing at http://pyparsing.sourceforge.net

-- Paul

samuels · Jul 18, 2005

Thanks for the replies. I'll post here when/if I finally get it
working.

So now I know how to extract the links from the big page, and how to
extract the text from an individual page. What I still need to find out
is how to run the script on each individual page automatically and get
the output in comma-delimited format. Thanks for solving the first two
problems, though.

-Sam
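
One possible way to tie the steps together, sketched with only the Python standard library: collect the store links from the list page, fetch each one, pull out the fields, and write one CSV row per store. Every name here (`LinkCollector`, `StorePage`, `parse_store`, `extract_links`, `fetch`, `scrape`) is hypothetical. The sketch assumes the list page links to `store.asp?id=...`, that the store name sits in the `<h2>` and the address in the `<h3>` with the street and the city on separate lines as in the excerpt above, and that the pages decode as UTF-8. None of that has been checked against the live site.

```python
import csv
import re
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumption: the last address line looks like "Anchorage AK, 99501-3024".
CITY_RE = re.compile(r"(.+?)\s+([A-Z]{2}),?\s+(\d{5}(?:-\d{4})?)\s*$")


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that point at store.asp pages."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and "store.asp" in href:
            self.links.append(href)


class StorePage(HTMLParser):
    """Collect the text inside <h2> and <h3>, plus all text on the page."""

    def __init__(self):
        super().__init__()
        self._tag = None
        self.parts = {"h2": [], "h3": []}
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.parts:
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        self.text.append(data)
        if self._tag:
            self.parts[self._tag].append(data)


def digits(text):
    """Keep only the digits of a phone number or ZIP code."""
    return re.sub(r"\D", "", text)


def labelled_number(label, text):
    """Find 'Phone - 907/272-4425' style text and return its digits."""
    m = re.search(label + r"\s*-\s*([\d/-]+)", text)
    return digits(m.group(1)) if m else ""


def parse_store(html):
    """Return [name, street, city, state, zip, phone, fax] for one page."""
    page = StorePage()
    page.feed(html)
    page.close()
    name = " ".join("".join(page.parts["h2"]).split())
    lines = [s.strip() for s in "".join(page.parts["h3"]).splitlines()
             if s.strip()]
    street = " ".join(lines[:-1])
    m = CITY_RE.match(lines[-1]) if lines else None
    city, state, zip_code = m.groups() if m else ("", "", "")
    text = "".join(page.text)
    return [name, street, city, state, digits(zip_code),
            labelled_number("Phone", text), labelled_number("Fax", text)]


def extract_links(html, base_url):
    """Return absolute URLs of the store pages linked from the list page."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.links]


def fetch(url):
    """Download a page; the UTF-8 decoding is an assumption."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def scrape(list_url, out_path):
    """Visit every store page and write one CSV row per store."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for url in extract_links(fetch(list_url), list_url):
            writer.writerow(parse_store(fetch(url)))
```

Calling `scrape("http://www.rentalhq.com/fulllist.asp", "stores.csv")` would then produce the comma-delimited file. For tab-delimited output, `csv.writer(f, delimiter="\t")` is the only change needed. Pages that do not match the assumed layout produce rows with empty fields rather than an error, so the output is worth spot-checking.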
