How to get the DOM from an XML page

Discussion in 'Perl Misc' started by novostik@googlemail.com, Nov 27, 2006.

  1. Guest

    Hello guys,
    I want to get the DOM of an XML page. For example, an XML
    page converted from HTML using Tidy is:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
    <html>
    <head>
    <meta name="generator" content=
    "HTML Tidy for Windows (vers 14 February 2006), see www.w3.org">
    <title></title>
    </head>
    <body>
    </body>
    </html>

    It should print out html--head--meta--title.

    I have used the following Perl code:
    -------------------------------------------------------------------------------------------------------------------------------------
    use strict;
    use warnings;
    use XML::DOM;    # exports node type constants such as ELEMENT_NODE

    # The class name is XML::DOM::Parser (capital P).
    my $parser = XML::DOM::Parser->new;
    my $doc    = $parser->parsefile("ig.xml");
    my $root   = $doc->getDocumentElement();
    print "\n";
    print $root->getNodeName();
    print "--";
    find($root->getChildNodes());

    # Recursively print the name of every element node, depth first.
    sub find
    {
        foreach my $node (@_)
        {
            if ($node->getNodeType == ELEMENT_NODE)
            {
                print $node->getNodeName();
                print "--";
            }
            find($node->getChildNodes());
        }
    }

    # Avoid memory leaks - cleanup circular references for garbage collection
    $doc->dispose;
    ---------------------------------------------------------------------------------------------------------------------------------------------


    The problem is that it gives output for some files but gives an
    error message for others, like the Google and Yahoo homepages.
    Could you please help me out with this? I was not able to fix it.
    Why does it work for some pages and not for others?
    Could you please provide a solution?
     
    novostik@googlemail.com, Nov 27, 2006
    #1

  2. John Bokma Guest

    novostik@googlemail.com wrote:

    > The problem is that it gives output for some files but gives an
    > error message for others, like the Google and Yahoo homepages.
    > Could you please help me out with this? I was not able to fix it.
    > Why does it work for some pages and not for others?
    > Could you please provide a solution?


    I am guessing here, but XHTML is widely used and widely misused. Most
    people using it have no clue what XHTML means, so they write it like
    HTML and end up with documents that are not well-formed. If you want to
    parse stuff that's out on the web, use something like HTML::TreeBuilder.
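
    A minimal sketch of that approach, assuming HTML::TreeBuilder is
    installed from CPAN. The helper name element_names and the inline
    sample document are only for illustration. The tree builder accepts
    markup that is not well-formed, and the recursion collects element
    names depth first, skipping text nodes and pseudo-elements such as
    ~comment or ~declaration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return the tag names of $node and all of its descendants, depth first.
# (element_names is a hypothetical helper, not part of HTML::TreeBuilder.)
sub element_names {
    my ($node) = @_;

    # Pseudo-elements (comments, declarations, ...) have tags starting with "~".
    my @names = $node->tag =~ /^~/ ? () : ($node->tag);

    for my $child ($node->content_list) {
        # Text children are plain strings; element children are objects.
        push @names, element_names($child) if ref $child;
    }
    return @names;
}

# Sample input; a real script would use HTML::TreeBuilder->new_from_file($path).
my $html = '<html><head><meta name="generator" content="x">'
         . '<title></title></head><body></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);
print join('--', element_names($tree)), "\n";

# Break circular references so the tree can be garbage collected.
$tree->delete;
```

    For a page saved to disk, new_from_file takes the file name instead of
    the markup string.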

    If you make your own XHTML pages, you might want to think again, twice
    even.

    --
    John Experienced Perl programmer: http://castleamber.com/

    Perl help, tutorials, and examples: http://johnbokma.com/perl/
     
    John Bokma, Nov 27, 2006
    #2

  3. On Nov 27, 11:54 am, novostik@googlemail.com wrote:
    > Hello guys,
    > I want to get the DOM of an XML page. For example, an XML
    > page converted from HTML using Tidy is:
    >
    > <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
    > <html>
    > <head>
    > <meta name="generator" content=
    > "HTML Tidy for Windows (vers 14 February 2006), see www.w3.org">
    > <title></title>
    > </head>
    > <body>
    > </body>
    > </html>


    Excuse me for stating the obvious, but that's not XML, it's HTML. It's
    tidy HTML but still HTML. IIRC it's possible to instruct "tidy" to emit
    XHTML (which is XML).
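
    A rough sketch of that route, assuming the tidy command-line program
    is on the PATH and accepts the -asxml, -numeric, -quiet and -output
    options. The helper names tidy_command and parse_tidied are made up
    for illustration. -numeric is there on the assumption that named
    entities such as &nbsp; would otherwise trip the XML parser, since it
    does not read the XHTML DTD by default.

```perl
use strict;
use warnings;
use XML::DOM;

# Build the tidy command line (hypothetical helper).
# -asxml asks for XHTML output, -numeric writes entities as numbers.
sub tidy_command {
    my ($in, $out) = @_;
    return ('tidy', '-quiet', '-asxml', '-numeric', '-output', $out, $in);
}

# Convert $in to XHTML in $out, then parse the result with XML::DOM.
sub parse_tidied {
    my ($in, $out) = @_;
    system(tidy_command($in, $out));

    # tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
    die "tidy failed for $in\n"
        if $? == -1 || ($? & 127) || ($? >> 8) > 1;

    return XML::DOM::Parser->new->parsefile($out);
}

# Usage: perl script.pl input.html output.xml
if (@ARGV == 2) {
    my $doc = parse_tidied(@ARGV);
    print $doc->getDocumentElement->getNodeName, "\n";
    $doc->dispose;
}
```

    Pages that tidy cannot repair will still fail here, so this only helps
    with input that tidy can turn into well-formed XHTML.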
     
    Brian McCauley, Nov 28, 2006
    #3
