Parsing html :: output to comma delimited

Discussion in 'Python' started by samuels, Jul 16, 2005.

  1. samuels

    samuels Guest

    Hello All,

    I am a total Python newbie, and I need help writing a script.

    This is what I want to do:

    There is a list of links at http://www.rentalhq.com/fulllist.asp. Each
    link goes to a page like
    http://www.rentalhq.com/store.asp?id=907/272-4425, which contains a
    company name, address, phone, and fax. I want to extract each page, parse
    this information, and export it to a comma-delimited or tab-delimited
    text file. The important information in each page is:

    <table border="0" cellpadding="0" cellspacing="0"
    style="border-collapse: collapse" bordercolor="#111111" width="100%"
    id="AutoNumber1">
    <tr>
    <td width="100%" colspan="2">
    <h2 style="text-align: center; margin-top:2; margin-bottom:2;
    line-height:14px" class="title">
    <font size="4">United Rentals Inc.</font>
    </h2>

    <h3 style="text-align: center; margin-top:4;
    margin-bottom:4">3401&nbsp;Commercial&nbsp;Dr.&nbsp;
    Anchorage&nbsp;AK,&nbsp;99501-3024
    </h3>
    <p style="text-align: center; margin-top:4; margin-bottom:4">
    <a target="_blank"
    href="http://maps.google.com/maps?q=3401+Commercial+Dr%2E Anchorage AK
    99501-3024 ">
    <!-- <a target="_blank"
    href="http://www.mapquest.com/maps/map.adp?city=Anchorage&state=AK&address=3401+Commercial+Dr.&zip=99501-3024&country=&zoom=8">-->
    <img height="15" src="Scraps/Rental_Images/map.gif" width="33"
    border="0"></a>
    </p>
    </td>
    </tr>
    <tr>
    <td width="50%" valign="top">
    <p style="text-align: center; line-height:100%; margin-top:0;
    margin-bottom:0">&nbsp;
    </p>
    <p style="text-align: center; line-height: 100%; margin-top:0;
    margin-bottom:0">
    <b>Phone</b> - 907/272-4425<br>
    <b>Fax</b> - 907/272-9683 </p>

    So from that I want output like :

    United Rentals Inc.,3401 Commercial Dr.,Anchorage,AK,"995013024","9072724425","9072729683"

    or

    United Rentals Inc. 3401 Commercial Dr. Anchorage AK 995013024 9072724425 9072729683


    I have been messing around with Beautiful Soup
    (http://www.crummy.com/software/BeautifulSoup/index.html) but haven't
    gotten very far (especially because the HTML is so sloppy).

    Any help would be really appreciated! Just point me in the right
    direction, what to use, examples... Thanks!

    -Sam
     
    samuels, Jul 16, 2005
    #1

  2. William Park

    William Park Guest

    I'm sure others will give a proper Python solution, but the shell is
    not a bad tool here.

    lynx -dump 'http://www.rentalhq.com/store.asp?id=907%2F272%2D4425' | \
    awk '/Return to List of Rental Stores/,/To reserve an item/' | \
    sed -n -e '3p;5p;10p;11p'

    gives me

    United Rentals Inc.
    3401 Commercial Dr. Anchorage AK, 99501-3024
    Phone - 907/272-4425
    Fax - 907/272-9683
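
The four lines that the pipeline prints still need to be split into the fields samuels asked for. A minimal Python sketch of that step follows, assuming the lines always arrive in this order (name, address, phone, fax) and that the city is a single word, which holds for Anchorage but not for every city. `parse_store_lines`, `digits_only`, and `ADDRESS_RE` are illustrative names, not from the thread.

```python
import re

# Assumption: "<street> <one-word city> <two-letter state>, <ZIP or ZIP+4>".
# Multi-word cities (e.g. "Los Angeles") would need a different rule.
ADDRESS_RE = re.compile(
    r"^(?P<street>.+?)\s+(?P<city>\S+)\s+(?P<state>[A-Z]{2}),"
    r"\s*(?P<zip>\d{5}(?:-\d{4})?)$"
)


def digits_only(text):
    """Drop every non-digit character, e.g. 'Phone - 907/272-4425' -> '9072724425'."""
    return re.sub(r"\D", "", text)


def parse_store_lines(lines):
    """Turn the four text lines for one store into a list of seven fields."""
    name, address, phone, fax = [line.strip() for line in lines]
    match = ADDRESS_RE.match(address)
    if match is None:
        raise ValueError("unrecognized address line: %r" % address)
    return [
        name,
        match.group("street"),
        match.group("city"),
        match.group("state"),
        digits_only(match.group("zip")),
        digits_only(phone),
        digits_only(fax),
    ]


# The sample output shown above, fed back in as input.
sample_lines = [
    "United Rentals Inc.",
    "3401 Commercial Dr. Anchorage AK, 99501-3024",
    "Phone - 907/272-4425",
    "Fax - 907/272-9683",
]
row = parse_store_lines(sample_lines)
```

Raising `ValueError` on an address that does not fit the pattern is a deliberate choice in this sketch: a page with an unexpected layout is reported rather than written out with fields in the wrong columns.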

    --
    William Park, Toronto, Canada
    ThinFlash: Linux thin-client on USB key (flash) drive
    http://home.eol.ca/~parkw/thinflash.html
    BashDiff: Super Bash shell
    http://freshmeat.net/projects/bashdiff/
     
    William Park, Jul 17, 2005
    #2

  3. Paul McGuire

    Paul McGuire Guest

    Pyparsing includes a sample program for extracting URLs from web pages.
    You should be able to adapt it to this problem.

    Download pyparsing at http://pyparsing.sourceforge.net
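
For comparison, the link-extraction step can also be sketched with only the Python standard library (`html.parser`) instead of pyparsing. This is a swapped-in technique, not the pyparsing sample program. The sketch assumes the store links on the list page are `<a>` tags whose `href` contains `store.asp`, which is a guess based on the example URL in the first post. `StoreLinkParser` and `extract_store_links` are illustrative names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StoreLinkParser(HTMLParser):
    """Collect absolute URLs of <a href=...> links that point at store.asp."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: store pages are the links whose href mentions store.asp.
        if href and "store.asp" in href:
            self.links.append(urljoin(self.base_url, href))


def extract_store_links(html, base_url):
    """Return every store link found in the given HTML text."""
    parser = StoreLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical fragment of the list page, used only to exercise the parser.
sample_html = (
    '<a href="store.asp?id=907/272-4425">United Rentals Inc.</a> '
    '<a href="about.asp">About</a>'
)
links = extract_store_links(sample_html, "http://www.rentalhq.com/fulllist.asp")
```

`urljoin` resolves relative links against the list page's address, so the result is usable whether the page uses relative or absolute `href` values.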

    -- Paul
     
    Paul McGuire, Jul 17, 2005
    #3
  4. samuels

    samuels Guest

    Thanks for the replies, I'll post here when/if I finally get it
    working.

    So, now I know how to extract the links from the big page, and extract
    the text from the individual page. Really what I need to find out is
    how to run the script on each individual page automatically, and get the
    output in comma-delimited format. Thanks for solving the two problems
    though :)

    -Sam
     
    samuels, Jul 18, 2005
    #4
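
The remaining step samuels asks about, running the extraction over every store page and writing comma-delimited output, could be sketched in Python roughly as follows. The page-fetching and page-parsing steps are passed in as functions so the loop stays independent of how they are implemented. `write_store_rows`, `fetch_page`, and the `fetch`/`parse` parameters are illustrative names, and the Latin-1 decoding in `fetch_page` is an assumption about the site's encoding.

```python
import csv
import io
from urllib.request import urlopen


def fetch_page(url):
    """Download one page as text (assumes Latin-1, which never fails to decode)."""
    with urlopen(url) as response:
        return response.read().decode("latin-1")


def write_store_rows(urls, fetch, parse, out_file, delimiter=","):
    """Fetch each URL, parse the page into a list of fields, and write one
    delimited line per page. Returns the number of rows written."""
    writer = csv.writer(out_file, delimiter=delimiter)
    written = 0
    for url in urls:
        try:
            row = parse(fetch(url))
        except Exception as exc:  # one bad page should not stop the whole run
            print("skipping %s: %s" % (url, exc))
            continue
        writer.writerow(row)
        written += 1
    return written


# Offline demonstration with stand-in fetch/parse functions (hypothetical data).
fake_pages = {"page-1": "Store A|555", "page-2": "Store B|556"}
buffer = io.StringIO()
count = write_store_rows(
    ["page-1", "page-2"],
    fetch=fake_pages.__getitem__,
    parse=lambda text: text.split("|"),
    out_file=buffer,
)
```

For a real run, `fetch_page` would be passed as `fetch` and the output file opened with `newline=""`, as the `csv` module documentation recommends. By default `csv.writer` quotes only fields that need it; `delimiter="\t"` gives the tab-delimited variant.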