Converting HTML to ASCII

Discussion in 'Python' started by gf gf, Feb 25, 2005.

  1. gf gf

    gf gf Guest

    Hans,

    Thanks for the tip. I took a look at Beautiful Soup,
    and it looked like it was a framework to parse HTML.
    I'm not really interested in going through it tag by
    tag - I just want it converted to ASCII. How can I do
    this with B. Soup?

    --Thanks

    PS William - thanks for the reference to lynx, but I
    need a Python solution - forking and execing for each
    file I need to convert is too slow for my application.
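
For an in-process route that avoids both forking and a tag-by-tag walk in user code, one rough sketch uses only the standard library's `html.parser`, which keeps going on unclosed or stray tags rather than raising. Nobody in this thread posted this; the names `TextExtractor` and `html_to_ascii`, the set of tags treated as line breaks, and the choice to replace non-ASCII characters with `?` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents are never shown to a reader.
SKIP_TAGS = {"script", "style"}
# Hand-picked tags that should start a new line in the text output (an assumption).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "pre", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect the visible text of a document, ignoring the markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &amp; etc. arrive decoded
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_ascii(html):
    """Return the text of `html` as ASCII, one block element per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse whitespace inside each line and drop the empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    text = "\n".join(line for line in lines if line)
    # Anything outside ASCII becomes "?"; a transliteration table could go here.
    return text.encode("ascii", "replace").decode("ascii")
```

Calling `html_to_ascii(open(path, encoding="utf-8", errors="replace").read())` once per file keeps everything in one process. It does no table layout or link numbering, so it is much cruder than what a text browser produces.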


    Hans wrote:
    Try Beautiful Soup!

    > 1) Be able to handle badly formed, or illegal, HTML,
    > as best as possible.

    From the description:
    "It won't choke if you give it ill-formed markup:
    it'll just give you access to
    a correspondingly ill-formed data structure."

    > Can anyone direct me to something which could help me
    > for this?

    http://www.crummy.com/software/BeautifulSoup/

    Hans Christian



     
    gf gf, Feb 25, 2005
    #1

  2. Jorgen Grahn

    Jorgen Grahn Guest

    On Fri, 25 Feb 2005 10:51:47 -0800 (PST), gf gf wrote:
    > Hans,
    >
    > Thanks for the tip. I took a look at Beautiful Soup,
    > and it looked like it was a framework to parse HTML.


    This is my understanding, too.

    > I'm not really interested in going through it tag by
    > tag - just to get it converted to ASCII. How can I do
    > this with B. Soup?


    You should probably do what some other poster suggested -- download lynx or
    some other text-only browser and make your code execute it in -dump mode to
    get the text-formatted html. You'll get that working in an hour or so, and
    then you can see if you need something more complicated.
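
The approach suggested here can be sketched from Python roughly as follows. It assumes lynx (or w3m) is installed and on PATH; `dump_html` is a hypothetical helper name, and the flag sets are one plausible choice, not the only one. Note that it starts one process per document, which is the cost the original poster was worried about.

```python
import shutil
import subprocess

# Assumed command lines: "-stdin" makes lynx read the page from standard input,
# "-nolist" suppresses the numbered link list at the end of the dump.
LYNX_CMD = ("lynx", "-dump", "-nolist", "-stdin")
# w3m reads standard input by default; "-T text/html" tells it the input is HTML.
W3M_CMD = ("w3m", "-dump", "-T", "text/html")


def dump_html(html, cmd=LYNX_CMD, timeout=30):
    """Pipe `html` through a text-mode browser and return the rendered text."""
    cmd = list(cmd)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} is not installed or not on PATH")
    result = subprocess.run(
        cmd,
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    # Character-set handling is left to the tool's defaults in this sketch.
    return result.stdout.decode("utf-8", errors="replace")
```

Switching renderers is then a matter of passing `cmd=W3M_CMD`, which makes it easy to compare output quality and speed before deciding whether something more complicated is needed.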

    /Jorgen

    --
    // Jorgen Grahn <jgrahn@      Ph'nglui mglw'nafh Cthulhu
    \X/     algonet.se>           R'lyeh wgah'nagl fhtagn!
     
    Jorgen Grahn, Feb 25, 2005
    #2

  3. Paul Rubin

    Paul Rubin Guest

    Jorgen Grahn writes:
    > You should probably do what some other poster suggested -- download
    > lynx or some other text-only browser and make your code execute it
    > in -dump mode to get the text-formatted html. You'll get that
    > working in an hour or so, and then you can see if you need something
    > more complicated.


    Lynx is pathetically slow for large files. It seems to use a
    quadratic algorithm for remembering where the links point, or
    something. I wrote a very crude but very fast renderer in C that I
    can post if someone wants it, which is what I use for this purpose.
     
    Paul Rubin, Feb 26, 2005
    #3
  4. Jorgen Grahn

    Jorgen Grahn Guest

    On 26 Feb 2005 02:36:31 -0800, Paul Rubin wrote:
    > Jorgen Grahn writes:
    >> You should probably do what some other poster suggested -- download
    >> lynx or some other text-only browser and make your code execute it
    >> in -dump mode to get the text-formatted html. You'll get that
    >> working in an hour or so, and then you can see if you need something
    >> more complicated.

    >
    > Lynx is pathetically slow for large files. It seems to use a
    > quadratic algorithm for remembering where the links point, or
    > something. I wrote a very crude but very fast renderer in C that I
    > can post if someone wants it, which is what I use for this purpose.


    That may be so, but it's fast enough for all the people who use it as a
    general html->plaintext tool, so it's probably good enough for the OP.

    w3m and links are other options. They provide better formatting than lynx,
    and at least w3m has the -dump option.

    I wouldn't mind a reusable library for rendering HTML to text that could be
    called from various languages. I'd also like to see a CSS-aware one for
    rendering to troff or PostScript.

    /Jorgen

    --
    // Jorgen Grahn <jgrahn@      Ph'nglui mglw'nafh Cthulhu
    \X/     algonet.se>           R'lyeh wgah'nagl fhtagn!
     
    Jorgen Grahn, Feb 27, 2005
    #4
  5. On 2005-02-26, Paul Rubin wrote:
    > Jorgen Grahn writes:
    >> You should probably do what some other poster suggested -- download
    >> lynx or some other text-only browser and make your code execute it
    >> in -dump mode to get the text-formatted html. You'll get that
    >> working in an hour or so, and then you can see if you need something
    >> more complicated.

    >
    > Lynx is pathetically slow for large files.


    First, make it work. Then make it work right. Then worry
    about how fast it is.

    "Premature optimization..."

    > It seems to use a quadratic algorithm for remembering where
    > the links point, or something. I wrote a very crude but very
    > fast renderer in C that I can post if someone wants it, which
    > is what I use for this purpose.


    If lynx really is too slow, try w3m or links. Both do a better
    job of rendering anyway.

    --
    Grant Edwards                   grante             Yow!  I know how to do
                                      at               SPECIAL EFFECTS!!
                                   visi.com
     
    Grant Edwards, Feb 27, 2005
    #5
  6. Grant Edwards wrote:
    > First, make it work. Then make it work right. Then worry
    > about how fast it is.


    > "Premature optimization..."


    That could be - but then again, most of the comments I've seen for that
    particular issue are for rather old releases.

    >> It seems to use a quadratic algorithm for remembering where
    >> the links point, or something. I wrote a very crude but very
    >> fast renderer in C that I can post if someone wants it, which
    >> is what I use for this purpose.


    > If lynx really is too slow, try w3m or links. Both do a better
    > job of rendering anyway.


    They lay out tables more or less as expected (though navigation in tables
    for links seems to be an afterthought).

    --
    Thomas E. Dickey
    http://invisible-island.net
    ftp://invisible-island.net
     
    Thomas Dickey, Feb 27, 2005
    #6