Converting HTML to XHTML (JTidy, OpenXML, Xerces)

Discussion in 'Java' started by anupamjain@gmail.com, Mar 23, 2006.

  1. Guest

    Hi,

    After two weeks of searching and trial and error, I finally decided to
    turn to the group for a solution to my problem (something I should have
    done much earlier).

    This is the deal :

    On a JSP page, I want to fetch a URL, parse and change its HTML, and send
    the result back to the JSP page. I take the URL from the user in a textbox
    (not the browser location box).

    In the Java class (which I import into the JSP), I first tried the
    Xerces parser, until I realised it only accepts well-formed XML.

    So I switched to OpenXML, which supports HTML. But it took about 10
    minutes to parse a page and then still gave me an out-of-memory
    error, even after I increased the buffer size of Tomcat considerably
    and even for a page as simple as www.google.com.
    If I don't use the DOCUMENT_HTML option in OpenXML and just treat
    the HTML as a normal XML file, it does parse properly (maybe it skips
    the unterminated tags), but there's no way to return the XML to
    the browser, because doc.getDocumentElement().toString() returns
    'HTML: 1 nodes'.
    
    So then I switched to JTidy and tried to convert the HTML to XHTML. But it
    seems the Document type returned by JTidy doesn't support most standard
    document methods (including converting the XML to a string using
    doc.getDocumentElement().toString()), leaving me back where I started.
    
    Can anybody suggest a good way to approach my problem?
    All I want to do is grab a URL's HTML, add some tags to it (a
    couple of appendChild() calls), and then send the HTML back to the user
    to be displayed (interpreted) in the browser.
    
    I'll be really thankful for your help!
    Anupam
     
    , Mar 23, 2006
    #1

  2. wrote:
    > All I want to do is grab a URL's HTML, add some tags to it (a
    > couple of appendChild() calls), and then send the HTML back to the user
    > to be displayed (interpreted) in the browser.
    
    Hi,
    
    I did exactly the same thing with NekoHTML: parsing the HTML to XML,
    then selecting some nodes with XPath, appending/replacing some nodes,
    and transforming or serializing the result back to HTML.
    http://people.apache.org/~andyc/neko/doc/html/index.html
    (a nice tool)
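
    The select-then-replace step described above might look roughly like this in
    plain Java. This is only a sketch using the JDK's own XPath and DOM classes,
    and it assumes the markup has already been made well-formed (for example by
    NekoHTML or Tidy, which is not shown here). The class name, the method name
    and the "corps"/"menu_box" class values are placeholders for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of NekoHTML or any other library.
class XPathEditSketch {

    /** Wraps the first <div class="corps"> in a new <div class="menu_box">. */
    static String wrapCorps(String wellFormedXhtml) {
        try {
            // Assumption: the input is already well-formed and has no namespace.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXhtml)));

            // Select the node to edit with XPath.
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node target = (Node) xpath.evaluate(
                    "//div[@class='corps']", doc, XPathConstants.NODE);

            if (target != null) {
                // Put a wrapper where the node was, then move the node into it.
                Element wrapper = doc.createElement("div");
                wrapper.setAttribute("class", "menu_box");
                target.getParentNode().replaceChild(wrapper, target);
                wrapper.appendChild(target);
            }

            // Serialize with an identity transform instead of toString().
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not edit document", e);
        }
    }
}
```

    Calling XPathEditSketch.wrapCorps(...) on a small well-formed page returns
    the edited markup as a String, which can then be written to the response.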
    
    --------------------------------------------
    
    Have you considered a full XML solution?
    
    With Active Tags, I used some tags/actions to achieve this. For this
    purpose you could use RefleX on top of Tomcat:
    http://reflex.gforge.inria.fr/
    (a nice tool too)
    RefleX comes with a servlet that can run Active Tags.
    
    Your code would then look like this:
    <web:service
        xmlns:web="http://www.inria.fr/xml/active-tags/web"
        xmlns:io="http://www.inria.fr/xml/active-tags/io"
        xmlns:xcl="http://www.inria.fr/xml/active-tags/xcl"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
        <!--understand it as a HTTP service-->

        <!--things that are performed when the server starts-->
        <web:init>
            <!--share a stylesheet with all HTTP requests-->
            <xcl:parse-stylesheet name="ralyx.xsl"
                source="web:///WEB-INF/xslt/ralyx.xsl" scope="shared"/>
        </web:init>

        <!--map the URL-path with a regexp-->
        <web:mapping
            match="^/(\d{4})/Fiches/([\p{Lower}\d\-_+]+)/\2\.(html|xml)$"
            method="GET" mime-type="">
            <!--use an HTML parser because the documents are not
                well-formed ; <xcl:parse-html> uses NekoHTML-->
            <xcl:parse-html name="fiche"
                source="http://www.inria.fr/recherche/equipes/{ $web:match/node()[ 2 ] }.en.html"/>
            <xcl:set name="corps" value="{ $fiche//xhtml:DIV[@class='corps'] }"/>
            <xcl:set name="about" value="{ $corps/xhtml:TABLE//xhtml:TD[2] }"/>
            <xcl:replace referent="{ $about }">
                <td width="200px" align="right" class="projet">
                    <div class="menu_box">{ $about/node() }</div>
                </td>
            </xcl:replace>

            <!--rebuild a new document-->
            <xcl:document name="projet">
                <projet xml="xml" title="{ string( $corps/preceding-sibling::xhtml:H1 ) }">
                    { $corps }
                </projet>
            </xcl:document>

            <!--relativizing URLs in <A href> and <IMG src>-->
            <xcl:for-each name="link" select="{ $projet//xhtml:A[@href] }">
                <xcl:attribute referent="{ $link }" name="href"
                    value="{ io:resolve-uri('http://www.inria.fr/recherche/equipes/', string( $link/@href ) ) }"/>
            </xcl:for-each>
            <xcl:for-each name="link" select="{ $projet//xhtml:IMG[@src] }">
                <xcl:attribute referent="{ $link }" name="src"
                    value="{ io:resolve-uri('http://www.inria.fr/recherche/equipes/', string( $link/@src ) ) }"/>
            </xcl:for-each>

            <!--selecting the stylesheet-->
            <xcl:set xcl:if="{ $web:match/node()[ 3 ] != 'xml' }"
                name="xslt" value="{ $ralyx.xsl }"/>
            <!--back to the browser-->
            <xcl:transform
                output="{ value( $web:response/@web:output ) }"
                source="{ $projet }"
                stylesheet="{ $xslt }"
            />
        </web:mapping>
    </web:service>
    
    The result is a new HTML document that contains an updated part of
    another HTML document (this mapping acts almost like a proxy); it is
    used in a real application deployed at INRIA.
    
    To use it, simply declare the ReflexServlet in Tomcat's web.xml:
    <web-app>
        <display-name>RefleX application</display-name>
        <description>My RefleX application</description>
        <servlet>
            <servlet-name>ReflexServlet</servlet-name>
            <display-name>RefleX servlet</display-name>
            <description>Runs an Active Sheet</description>
            <servlet-class>org.inria.reflex.ReflexServlet</servlet-class>
            <init-param>
                <param-name>activeSheetPath</param-name>
                <param-value>web:///WEB-INF/active-sheet.xml</param-value>
            </init-param>
            <load-on-startup>1</load-on-startup>
        </servlet>
        <servlet-mapping><!--custom mappings-->
            <servlet-name>default</servlet-name>
            <url-pattern>*.gif</url-pattern>
        </servlet-mapping>
        <servlet-mapping><!--RefleX mapping-->
            <servlet-name>ReflexServlet</servlet-name>
            <url-pattern>/</url-pattern>
        </servlet-mapping>
    </web-app>
    
    When downloading RefleX, check the dependencies and make sure that
    NekoHTML 0.9.5 is in the full distribution. For the moment, the latest
    available version of RefleX (0.1.2) uses NekoHTML 0.9.4, which has a bug
    regarding namespace URIs; this is fixed in NekoHTML 0.9.5, which is
    available online and will be included in RefleX 0.1.3 (coming soon).
    
    Enjoy :)
    
    --
    Cordialement,
    
    ///
    (. .)
    --------ooO--(_)--Ooo--------
    |      Philippe Poulard       |
    -----------------------------
    http://reflex.gforge.inria.fr/
    Have the RefleX !
     
    Philippe Poulard, Mar 23, 2006
    #2

  3. >> seems the Document type returned by JTidy doesnt support most standard
    >> document methods (including converting XML to string using
    >> doc.getDocumentElement().toString()) leaving me at the same place where
    >> I started from.


    Please note that using .toString() to get the XML is *NOT* part of the
    W3C DOM spec; it's a feature of one specific DOM implementation.

    See DOM Level 3's serialization API, or see the documentation that comes
    with a particular DOM for information about what serialization tools it
    provides/recommends.
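
    As a rough illustration of the DOM Level 3 route, here is a sketch that
    appends a node and then serializes through the Load and Save (LS) API
    instead of toString(). It assumes the DOM implementation in use actually
    supports LS (the one bundled with recent JDKs does; a DOM from another tool
    may not) and that the input is already well-formed with a body element.
    The class and method names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical helper showing DOM Level 3 Load and Save serialization.
class Level3SerializeSketch {

    /** Appends <p>text</p> to the body and returns the document as a String. */
    static String appendAndSerialize(String wellFormedXhtml, String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXhtml)));

            // The kind of edit asked about in the thread: a plain appendChild().
            Node body = doc.getElementsByTagName("body").item(0);
            if (body == null) {
                throw new IllegalArgumentException("no body element found");
            }
            Element p = doc.createElement("p");
            p.setTextContent(text);
            body.appendChild(p);

            // Ask the implementation for its LS feature; null means unsupported.
            Object feature = doc.getImplementation().getFeature("LS", "3.0");
            if (!(feature instanceof DOMImplementationLS)) {
                throw new IllegalStateException("this DOM has no Load/Save support");
            }
            LSSerializer serializer = ((DOMImplementationLS) feature).createLSSerializer();
            // Leave out the <?xml ...?> declaration for output sent as HTML.
            serializer.getDomConfig().setParameter("xml-declaration", Boolean.FALSE);
            return serializer.writeToString(doc);
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize document", e);
        }
    }
}
```

    If getFeature returns null for the DOM you are using, that DOM simply does
    not offer Level 3 serialization and one of the other serializers is needed.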


    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
     
    Joe Kesselman, Mar 23, 2006
    #3
  4. BTW, the W3C's Tidy tool can output serialized XHTML/XML directly rather
    than as a DOM; is there a reason you're reinventing that wheel?

     
    Joe Kesselman, Mar 23, 2006
    #4
  5. Guest

    Joe Kesselman wrote:
    > BTW, the W3C's Tidy tool can output serialized XHTML/XML directly rather
    > than as a DOM; is there a reason you're reinventing that wheel?


    Because I want to 'edit' the XHTML returned, by adding a couple of tags
    here and there (using DOM methods like appendChild()), and once I have
    the DOM structure I want, I want to return it as a string.

    - Anupam
     
    , Mar 23, 2006
    #5
  6. wrote:
    > Because I want to 'edit' the XHTML returned, by adding a couple of tags
    > here and there (using DOM methods like appendChild() )and after I get
    > my desired DOM structure I want to return it as a string.


    OK; in that case you either want the DOM Level 3 serializer methods (if
    they're supported) or an off-the-shelf DOM serializer... or, possibly,
    to write your editing operations as a stylesheet, pass the DOM to an
    XSLT processor, and let _its_ serializer deal with the problem.
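
    The stylesheet option could be sketched as follows: an identity template
    copies everything, and one extra template adds an element at the end of
    the body, so the XSLT processor's own serializer produces the string.
    This uses only the JDK's javax.xml.transform classes. The stylesheet text,
    the "footer" id and the class and method names are invented for the
    example, and the input is assumed to be well-formed and without a namespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: the edit is expressed as XSLT rather than DOM calls.
class XsltEditSketch {

    // Identity transform plus one rule that appends a <div> to <body>.
    private static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='body'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/>"
        + "<div id='footer'>added by stylesheet</div>"
        + "</xsl:copy></xsl:template>"
        + "</xsl:stylesheet>";

    /** Parses the markup into a DOM, runs the stylesheet on it, returns a String. */
    static String appendFooter(String wellFormedXhtml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // XSLT processors expect a namespace-aware DOM
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXhtml)));

            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not transform document", e);
        }
    }
}
```

    The trade-off is that every edit has to be written as a template, but in
    return no separate serialization step is needed.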

    --
    Joe Kesselman / Beware the fury of a patient man. -- John Dryden
     
    Joseph Kesselman, Mar 23, 2006
    #6
  7. Guest

    
    Thanks so much Philippe. I'll try it and get back
    
    Thanks again,
    Anupam
     
    , Mar 24, 2006
    #7
  8. Guest

    I am not able to build NekoHTML properly. After installing everything
    it requires and moving all the jar files to its lib folder, it gives
    me this error when I try to build it:


    >build -f build-html.xml


    Buildfile: build-html.xml

    version-init:
    [mkdir] Created dir: C:\Documents and Settings\Anupam Jain\Desktop\nekohtml-0.9.5\bin\html\src\org\cyberneko\html

    version:
    [echo] Generating bin/html/src/org/cyberneko/html/Version.java
    [echo] Generating bin/html/src/MANIFEST_html

    compile:
    [javac] Compiling 26 source files to C:\Documents and Settings\Anupam Jain\Desktop\nekohtml-0.9.5\bin\html
    [javac] C:\Documents and Settings\Anupam Jain\Desktop\nekohtml-0.9.5\src\html\org\cyberneko\html\HTMLScanner.java:89: org.cyberneko.html.HTMLScanner is not abstract and does not override abstract method getXMLVersion() in org.apache.xerces.xni.XMLLocator
    [javac] public class HTMLScanner
    [javac] ^
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
    [javac] 1 error

    BUILD FAILED
    C:\Documents and Settings\Anupam Jain\Desktop\nekohtml-0.9.5\build-html.xml:51: Compile failed; see the compiler error output for details.

    Total time: 16 seconds


    So basically the error is : org.cyberneko.html.HTMLScanner is not
    abstract and does not override abstract method getXMLVersion() in
    org.apache.xerces.xni.XMLLocator

    - Anupam





     
    , Mar 24, 2006
    #8
  9. wrote:
    > I am not able to build nekohtml properly. After installing everything
    > it required and moving all the jar files to it's lib folder, it gives
    > me this error when i try to build it :


    Hi,

    I'm sure you'll get some help by contacting the developer.

    However, you could also try using the .jar available in the
    package directly.

    > Philippe Poulard wrote:
    >
    >> wrote:
    >>
    >>
    >>hi,
    >>
    >>I did exactly the same thing with NekoHTML: parsing the HTML to XML,
    >>then selecting some nodes with XPath, appending/replacing some nodes,
    >>and transforming or serializing it back to HTML:
    >>http://people.apache.org/~andyc/neko/doc/html/index.html
    >>(a nice tool)
    >>
    >>--------------------------------------------
    >>
    >>Have you considered a full XML solution?
    >>
    >>With Active Tags I used some tags/actions to achieve this. For this
    >>purpose you could use RefleX on top of Tomcat:
    >>http://reflex.gforge.inria.fr/
    >>(a nice tool too)
    >>RefleX comes with a servlet that can run Active Tags.
    >>
    >>Your code would then look like this:
    >><web:service
    >>     xmlns:web="http://www.inria.fr/xml/active-tags/web"
    >>     xmlns:io="http://www.inria.fr/xml/active-tags/io"
    >>     xmlns:xcl="http://www.inria.fr/xml/active-tags/xcl"
    >>     xmlns:xhtml="http://www.w3.org/1999/xhtml">
    >><!--understand it as a HTTP service-->
    >>
    >>     <!--things that are performed when the server starts-->
    >>     <web:init>
    >>         <!--share a stylesheet with all HTTP requests-->
    >>         <xcl:parse-stylesheet name="ralyx.xsl"
    >>source="web:///WEB-INF/xslt/ralyx.xsl" scope="shared"/>
    >>     </web:init>
    >>
    >>     <!--map the URL-path with a regexp-->
    >>     <web:mapping
    >>match="^/(\d{4})/Fiches/([\p{Lower}\d\-_+]+)/\2\.(html|xml)$"
    >>method="GET" mime-type="">
    >>         <!--use an HTML parser because the documents are not
    >>well-formed ; <xcl:parse-html> uses NekoHTML-->
    >>         <xcl:parse-html name="fiche"
    >>source="http://www.inria.fr/recherche/equipes/{ $web:match/node()[ 2 ]
    >>}.en.html"/>
    >>         <xcl:set name="corps" value="{
    >>$fiche//xhtml:DIV[@class='corps'] }"/>
    >>         <xcl:set name="about" value="{ $corps/xhtml:TABLE//xhtml:TD[2] }"/>
    >>         <xcl:replace referent="{ $about }">
    >>             <td width="200px" align="right" class="projet">
    >>                 <div class="menu_box">{ $about/node() }</div>
    >>             </td>
    >>         </xcl:replace>
    >>
    >>         <!--rebuild a new document-->
    >>         <xcl:document name="projet">
    >>             <projet xml="xml" title="{ string(
    >>$corps/preceding-sibling::xhtml:H1 ) }">
    >>                 { $corps }
    >>             </projet>
    >>         </xcl:document>
    >>
    >>         <!--relativizing URLs in <A href> and <IMG src>-->
    >>         <xcl:for-each name="link" select="{ $projet//xhtml:A[@href] }">
    >>             <xcl:attribute referent="{ $link }" name="href" value="{
    >>io:resolve-uri('http://www.inria.fr/recherche/equipes/', string(
    >>$link/@href ) ) }"/>
    >>         </xcl:for-each>
    >>         <xcl:for-each name="link" select="{ $projet//xhtml:IMG[@src] }">
    >>             <xcl:attribute referent="{ $link }" name="src" value="{
    >>io:resolve-uri('http://www.inria.fr/recherche/equipes/', string(
    >>$link/@src ) ) }"/>
    >>         </xcl:for-each>
    >>
    >>         <!--selecting the stylesheet-->
    >>         <xcl:set xcl:if="{ $web:match/node()[ 3 ] != 'xml' }"
    >>name="xslt" value="{ $ralyx.xsl }"/>
    >>         <!--back to the browser-->
    >>         <xcl:transform
    >>             output="{ value( $web:response/@web:output ) }"
    >>             source="{ $projet }"
    >>             stylesheet="{ $xslt }"
    >>         />
    >>     </web:mapping>
    >></web:service>
    >>
    >>The result is a new HTML document that contains an updated part of
    >>another HTML document (this mapping acts almost like a proxy); it is
    >>used in a real application deployed at INRIA.
    >>
    >>To use it, simply declare the ReflexServlet in Tomcat:
    >><web-app>
    >>   <display-name>RefleX application</display-name>
    >>   <description>My RefleX application</description>
    >>   <servlet>
    >>     <servlet-name>ReflexServlet</servlet-name>
    >>     <display-name>RefleX servlet</display-name>
    >>     <description>Runs an Active Sheet</description>
    >>     <servlet-class>org.inria.reflex.ReflexServlet</servlet-class>
    >>     <init-param>
    >>       <param-name>activeSheetPath</param-name>
    >>       <param-value>web:///WEB-INF/active-sheet.xml</param-value>
    >>     </init-param>
    >>     <load-on-startup>1</load-on-startup>
    >>   </servlet>
    >>   <servlet-mapping><!--custom mappings-->
    >>     <servlet-name>default</servlet-name>
    >>     <url-pattern>*.gif</url-pattern>
    >>   </servlet-mapping>
    >>   <servlet-mapping><!--RefleX mapping-->
    >>     <servlet-name>ReflexServlet</servlet-name>
    >>     <url-pattern>/</url-pattern>
    >>   </servlet-mapping>
    >></web-app>
    >>
    >>When downloading RefleX, check the dependencies and ensure that NekoHTML
    >>0.9.5 is in the full distribution: for the moment, the latest available
    >>version of RefleX (0.1.2) uses NekoHTML 0.9.4, which has a bug regarding
    >>namespace URIs; this issue is fixed in NekoHTML 0.9.5, which is
    >>available online and will be in RefleX 0.1.3 (coming soon).
    >>
    >>Enjoy :)
    >>
    >>--
    >>Cordialement,
    >>
    >>               ///
    >>              (. .)
    >>  --------ooO--(_)--Ooo--------
    >>|      Philippe Poulard       |
    >>  -----------------------------
    >>  http://reflex.gforge.inria.fr/
    >>        Have the RefleX !
    
    
    --
    Cordialement,
    
    ///
    (. .)
    --------ooO--(_)--Ooo--------
    |      Philippe Poulard       |
    -----------------------------
    http://reflex.gforge.inria.fr/
    Have the RefleX !
     
    Philippe Poulard, Mar 24, 2006
    #9
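
The `match` attribute of `<web:mapping>` quoted in the post above relies on a back-reference (`\2`), which is easy to misread. The following is only a sketch in plain Java of what that expression accepts; the class name `MappingRegexDemo` and the sample request paths are made up for illustration, and RefleX may of course evaluate the pattern differently internally. The three captured groups appear to correspond to the `$web:match/node()[ n ]` lookups in the Active Sheet, but that is an assumption based on how they are used there.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MappingRegexDemo {

    // Same expression as the match attribute of <web:mapping>,
    // with backslashes doubled for a Java string literal.
    static final Pattern MAPPING = Pattern.compile(
            "^/(\\d{4})/Fiches/([\\p{Lower}\\d\\-_+]+)/\\2\\.(html|xml)$");

    /** Returns {year, name, extension}, or null when the path does not match. */
    static String[] match(String path) {
        Matcher m = MAPPING.matcher(path);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Hypothetical request paths, chosen only to exercise the pattern.
        System.out.println(Arrays.toString(match("/2005/Fiches/reflex/reflex.html")));
        // \2 forces the file name to repeat the directory name, so this one is rejected.
        System.out.println(match("/2005/Fiches/reflex/other.html") == null);
    }
}
```

The back-reference means the last path segment must repeat the directory name exactly, so only the extension (`html` or `xml`) varies between otherwise identical URLs.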
  10. Philippe Poulard wrote:
    > I'm sure you'll get some help by contacting the developer


    You may also find some help on Apache's mailing list for Xerces, since
    NekoHTML is based on Xerces (and its author hangs out on that list).

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
     
    Joe Kesselman, Mar 24, 2006
    #10
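
For the part of the original question about `doc.getDocumentElement().toString()` not returning markup: `toString()` on a DOM node is not specified to serialize it, whichever parser built the tree. A minimal sketch using only the JDK follows, assuming the HTML has already been turned into a well-formed DOM (here a hand-written string stands in for what a tolerant parser such as NekoHTML or JTidy would produce). The class name `DomRoundTrip`, the method `appendParagraph` and the sample markup are all hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DomRoundTrip {

    /** Parses well-formed markup, appends a p element to body, returns the serialized result. */
    static String appendParagraph(String wellFormedMarkup, String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));

            // Locate <body>; fall back to the root element if there is none.
            Node body = doc.getElementsByTagName("body").item(0);
            if (body == null) {
                body = doc.getDocumentElement();
            }

            // The "couple of appendChild()s" from the question.
            Element p = doc.createElement("p");
            p.setTextContent(text);
            body.appendChild(p);

            // An identity transform writes the DOM out as markup.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, "html");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not process markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(appendParagraph(
                "<html><head><title>t</title></head><body><h1>Hi</h1></body></html>",
                "added by the servlet"));
    }
}
```

In a servlet or JSP the `StreamResult` could wrap the response writer instead of a `StringWriter`. The Active Sheet quoted earlier selects upper-case names such as `xhtml:DIV`, so with a tree built by NekoHTML the `"body"` lookup would probably need adjusting to match that parser's element-name and namespace settings.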