Unstructured HTML extraction

Discussion in 'XML' started by dayzman@hotmail.com, Dec 7, 2004.

  1. Guest

    Hi,

    I'm interested in a program that extracts the structure of unstructured
    HTML documents. The program should be able to make good estimates about
    the different font styles used to represent headings. For example, some
    pages may use <font size = 24> for headings and some may use <h1>; in
    the end, both should produce the same structure. The output can be in
    XML or another format. Manual intervention should remain minimal. Does
    anyone know of such a program (preferably open-source)?

    Cheers,
    Michael
    , Dec 7, 2004
    #1

  2. Nick Kew Guest

    Michael writes:
    > Hi,
    >
    > I'm interested in a program that extracts the structure of unstructured
    > HTML documents. The program should be able to make good estimates about
    > different font styles used to represent headings, for example, some may
    > use <font size = 24> for headings and some may use <h1>, in the end,


    Not quite sure what you mean, but does http://valet.webthing.com/access/
    seem relevant to you?

    > both should output the same structure. The output can be in XML or
    > other formats.


    Output is XML. What you see on the Web is transformed with XSLT to
    HTML or RDF.

    --
    Nick Kew
    Nick Kew, Dec 7, 2004
    #2

  3. Guest

    Hi,

    Sorry for not making myself clear. What I'm trying to find is a program
    that handles the inconsistent styles used on webpages to imply, e.g.,
    headings -- where some people may use large fonts, some may use the
    <h1> tag, and some may use block letters etc. The program needs to be
    able to recognise the implication of the different styles used,
    "clean up" the structure, and output it in a readable format.
    I hope that is clearer this time.
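
    To make the request concrete, here is a minimal sketch of the kind of
    normaliser being described, using only Python's standard library. The
    class name, the <doc>/<heading>/<para> output vocabulary, and the
    font-size threshold are all assumptions made for illustration; real
    pages would need far more careful heuristics than this.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingNormalizer(HTMLParser):
    """Collect text blocks and label each one 'heading' or 'para'."""

    def __init__(self, min_font_size=5):
        super().__init__()
        self.min_font_size = min_font_size  # assumed threshold, tune per site
        self.blocks = []                    # list of (kind, text) pairs
        self.kind = "para"
        self.text = []

    def flush(self):
        """Close the current block and start a fresh one."""
        text = " ".join("".join(self.text).split())
        if text:
            kind = self.kind
            # Heuristic: a block written entirely in capitals is a heading.
            if kind == "para" and text.isupper():
                kind = "heading"
            self.blocks.append((kind, text))
        self.kind, self.text = "para", []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS or tag == "p":
            self.flush()
        if tag in HEADING_TAGS:
            self.kind = "heading"
        elif tag == "font":
            # Crude heuristic: any large <font> marks the whole block.
            size = (dict(attrs).get("size") or "").strip()
            if size.isdigit() and int(size) >= self.min_font_size:
                self.kind = "heading"

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS or tag == "p":
            self.flush()

    def handle_data(self, data):
        self.text.append(data)


def to_xml(html):
    """Return a small XML document with normalised headings."""
    parser = HeadingNormalizer()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text not closed by a tag
    lines = ["<doc>"]
    for kind, text in parser.blocks:
        lines.append(f"  <{kind}>{escape(text)}</{kind}>")
    lines.append("</doc>")
    return "\n".join(lines)
```

    With this sketch, an <h1> block, a paragraph wrapped in a large <font>,
    and an all-capitals paragraph would all come out as <heading> elements,
    while ordinary paragraphs come out as <para>.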

    Cheers,
    Michael
    , Dec 7, 2004
    #3
  5. Nick Kew Guest

    Michael writes:

    > Sorry for not making myself clear. What I'm trying to find is a program
    > that handles the inconsistent styles used on webpages to imply, e.g.,
    > headings -- where some people may use large fonts, some may use the
    > <h1> tag, and some may use block letters etc. The program needs to be


    That's a hard AI problem. Incompetent authors abuse presentational
    markup when they mean a heading, but someone might also legitimately
    present something that isn't a heading as big and bold. That's why
    tools like AccessValet highlight cases that look like a bogus
    heading (e.g. <p><big>FOO</big></p>) for a human user to decide.
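
    As an illustration only (this is not AccessValet's actual
    implementation), the "flag it and let a human decide" approach could be
    sketched in Python as follows. The set of presentational tags and the
    rule "a paragraph whose entire text is inside such tags" are
    assumptions chosen to match the <p><big>FOO</big></p> example.

```python
from html.parser import HTMLParser

# Assumed list of tags commonly misused to fake a heading.
PRESENTATIONAL = {"big", "b", "strong", "font"}


class BogusHeadingFinder(HTMLParser):
    """Collect paragraphs whose whole text is presentationally styled."""

    def __init__(self):
        super().__init__()
        self.candidates = []  # text of suspicious paragraphs
        self.in_p = False
        self.depth = 0        # nesting depth of presentational tags
        self.styled = []      # text found inside presentational tags
        self.plain = []       # text found outside them

    def end_paragraph(self):
        # Flag only if every piece of text in the paragraph was styled.
        if self.in_p and self.styled and not self.plain:
            self.candidates.append(" ".join(self.styled))
        self.in_p, self.depth = False, 0
        self.styled, self.plain = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.end_paragraph()  # copes with an unclosed previous <p>
            self.in_p = True
        elif tag in PRESENTATIONAL and self.in_p:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self.end_paragraph()
        elif tag in PRESENTATIONAL and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.in_p and data.strip():
            target = self.styled if self.depth else self.plain
            target.append(data.strip())


def find_bogus_headings(html):
    """Return paragraph texts that a human should review as headings."""
    finder = BogusHeadingFinder()
    finder.feed(html)
    finder.close()
    finder.end_paragraph()  # handle a final unclosed <p>
    return finder.candidates
```

    The function only reports candidates; it deliberately makes no change
    to the markup, since a big, bold paragraph may be perfectly legitimate.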

    This topic has been done to death on the appropriate newsgroups, as
    Google would have told you. Redirecting this.

    P.S. Please fix your news client not to post everything twice.

    --
    Nick Kew
    Nick Kew, Dec 7, 2004
    #5
