Unstructured HTML extraction

Discussion in 'XML' started by dayzman@hotmail.com, Dec 7, 2004.

  1. Guest

    Hi,

    I'm looking for a program that extracts the structure of unstructured
    HTML documents. It should make good estimates about the different font
    styles used to represent headings. For example, some pages use
    <font size = 24> for headings and others use <h1>; both should end up
    with the same structure in the output. The output can be XML or
    another format, and manual intervention should be minimal. Does anyone
    know of such a program (preferably open source)?

    Cheers,
    Michael
     
    Dec 7, 2004
    #1
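
To make the request above concrete, here is a minimal sketch of the kind of normalization being asked for, using only Python's standard `html.parser`. It is not an existing tool and not a full solution: the class name `HeadingNormalizer`, the function `to_xml`, the `<document>`/`<heading>`/`<text>` output vocabulary, and the `HEADING_FONT_SIZE` threshold are all assumptions made for illustration. Real pages would need more heuristics (bold text, CSS, relative font sizes, nesting levels).

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Assumed threshold: <font size=N> with N at or above this counts as a heading.
HEADING_FONT_SIZE = 18
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def _is_large_font(attrs):
    """Return True if a <font> tag's size attribute meets the threshold."""
    size = dict(attrs).get("size") or ""
    try:
        return int(size) >= HEADING_FONT_SIZE
    except ValueError:
        # Covers missing, relative ("+2"-style is parsed as 2) or junk values.
        return False


class HeadingNormalizer(HTMLParser):
    """Collects (kind, text) pairs, where kind is "heading" or "text"."""

    def __init__(self):
        super().__init__()
        self.items = []   # output in document order
        self.stack = []   # one flag per open <hN>/<font>: is it a heading?

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.stack.append(True)
        elif tag == "font":
            self.stack.append(_is_large_font(attrs))

    def handle_endtag(self, tag):
        if (tag in HEADING_TAGS or tag == "font") and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            kind = "heading" if any(self.stack) else "text"
            self.items.append((kind, text))


def to_xml(html):
    """Render the normalized structure as a small XML document."""
    parser = HeadingNormalizer()
    parser.feed(html)
    parser.close()
    lines = ["<document>"]
    for kind, text in parser.items:
        lines.append(f"  <{kind}>{escape(text)}</{kind}>")
    lines.append("</document>")
    return "\n".join(lines)
```

With this sketch, `to_xml('<font size = 24>Intro</font><p>Body</p>')` and `to_xml('<h1>Intro</h1><p>Body</p>')` produce the same output, which is the behaviour the post asks for. The fixed size threshold is the weakest part: a more robust approach would probably rank the font styles found in each document and treat the largest, least frequent ones as headings.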