Unstructured HTML extraction

Discussion in 'XML' started by dayzman@hotmail.com, Dec 7, 2004.

  1. Guest

    Hi,

    I'm looking for a program that extracts the structure of unstructured
    HTML documents. It should make good guesses about the different font
    styles used to represent headings. For example, some documents use
    <font size = 24> for headings and others use <h1>; both should produce
    the same output structure. The output can be in XML or another format,
    and manual intervention should be minimal. Does anyone know of such a
    program (preferably open-source)?

    Cheers,
    Michael
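
To make the request above concrete, here is a minimal sketch of the kind of normalization being asked for, using only Python's standard `html.parser`. It is not an existing tool and not a complete solution. The names `StyledTextCollector` and `outline` are made up for illustration, and the heuristic is an assumption: the `<font>` size that covers the most text is treated as body text, and any larger sizes are ranked into heading levels, largest first.

```python
from collections import Counter
from html.parser import HTMLParser

HEADING_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class StyledTextCollector(HTMLParser):
    """Records text found inside <h1>..<h6> and <font size=N> elements."""

    def __init__(self):
        super().__init__()
        self.open = []   # stack of [kind, value, text_parts] for open elements
        self.runs = []   # finished (kind, value, text) tuples, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.open.append(["h", int(tag[1]), []])
        elif tag == "font":
            size = dict(attrs).get("size") or ""
            # Relative or non-numeric sizes such as "+2" are ignored in this sketch.
            self.open.append(["font", int(size) if size.isdigit() else None, []])

    def handle_data(self, data):
        # Text is credited to the innermost open element only.
        if self.open:
            self.open[-1][2].append(data)

    def handle_endtag(self, tag):
        kind = "h" if tag in HEADING_TAGS else "font" if tag == "font" else None
        if kind and self.open and self.open[-1][0] == kind:
            _, value, parts = self.open.pop()
            text = " ".join("".join(parts).split())
            if text and value is not None:
                self.runs.append((kind, value, text))


def outline(html):
    """Return a list of (level, text) pairs for the headings guessed in html."""
    collector = StyledTextCollector()
    collector.feed(html)
    collector.close()

    # Assumption: the font size covering the most characters is body text.
    chars_per_size = Counter()
    for kind, value, text in collector.runs:
        if kind == "font":
            chars_per_size[value] += len(text)
    body_size = chars_per_size.most_common(1)[0][0] if chars_per_size else 0

    # Larger sizes become heading levels 1, 2, ... in descending size order.
    heading_sizes = sorted((s for s in chars_per_size if s > body_size), reverse=True)
    level_of_size = {size: i + 1 for i, size in enumerate(heading_sizes)}

    result = []
    for kind, value, text in collector.runs:
        if kind == "h":
            result.append((value, text))
        elif value in level_of_size:
            result.append((level_of_size[value], text))
    return result
```

With this sketch, a page marked up as `<h1>Intro</h1> ... <h2>Details</h2>` and a page using `<font size = 24>Intro</font>`, a long `<font size=3>` body run, and `<font size=18>Details</font>` both yield the same two-level outline. The heuristic fails when a page has no body text inside `<font>` tags, and it ignores bold, CSS, and nested styling, so a real tool would need considerably more evidence than font size alone.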
     
