Non-correcting library for parsing/modifying broken HTML/PHP files?

Discussion in 'Ruby' started by Markus Fischer, Apr 5, 2011.

  1. Hi,

    does anyone know of a library which can work with broken/malformed
    HTML/PHP and still produce output identical to the input?

    So far I've tried Nokogiri and Hpricot. They're absolutely amazing and
    excel at their purpose, but they fail to meet my requirement that, when
    saving the HTML, nothing I haven't changed through DOM manipulation
    should change in the output.

    The thing is that I have to work with such horribly broken HTML (or
    rather, PHP) documents that those libraries are über-tempted to correct
    them. This is troublesome for me, as I have to fix a few hundred, maybe
    up to thousands of documents, and their version history should really
    only reflect the changes I'm making and not what the library needs to
    change so it can work with them. I searched rubygems but was unable to
    come up with more libraries; did I miss any?

    Many words, here's an example:

    $ cat test.php
    <?php include_once('whatever.php'); ?>
    <html><title> anything</title>
    <?php includeHtmlHeader(' blabla',',')?>
    <body topmargin="0" bgcolor="#ffffff" leftmargin="0"
    link="#003366" marginheight="0" marginwidth="0" vlink="#003366"
    alink="#800000" >
    <?includeFile('/application/templates/whatever.shtml')?>
    <br>
    <?php echo more::code("andsuch"); ?>


    <script type="text/javascript">OAS_AD('Position1');</script>


    $ ruby -v ; gem list|grep nokogi
    ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-linux]
    nokogiri (1.4.4)


    $ ruby -rnokogiri -e 'html = Nokogiri::HTML::Document.parse(
    open("test.php").read) ; open("test2.php", "w") { |f| f.write(
    html.to_html)}'


    $ diff -u test.php test2.php
    --- test.php 2011-04-05 10:50:00.000000000 +0200
    +++ test2.php 2011-04-05 10:52:31.000000000 +0200
    @@ -1,10 +1,11 @@
    -<?php include_once('whatever.php'); ?>
    -<html><title> anything</title>
    - <?php includeHtmlHeader(' blabla',',')?>
    - <body topmargin="0" bgcolor="#ffffff" leftmargin="0"
    link="#003366" marginheight="0" marginwidth="0" vlink="#003366"
    alink="#800000" >
    - <?includeFile('/application/templates/whatever.shtml')?>
    - <br>
    - <?php echo more::code("andsuch"); ?>
    -
    -
    -<script type="text/javascript">OAS_AD('Position1');</script>
    +<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
    "http://www.w3.org/TR/REC-html40/loose.dtd">
    +<?php include_once('whatever.php'); ?><html>
    +<head>
    +<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    +<title> anything</title>
    +<?php includeHtmlHeader(' blabla',',')?>
    +</head>
    +<body topmargin="0" bgcolor="#ffffff" leftmargin="0" link="#003366"
    marginheight="0" marginwidth="0" vlink="#003366" alink="#800000">
    + <?includeFile
    ('/application/templates/whatever.shtml')?><br><?php echo
    more::code("andsuch"); ?><script
    type="text/javascript">OAS_AD('Position1');</script>
    +</body>
    +</html>


    Now with Hpricot:

    $ gem list|grep hpri
    hpricot (0.8.4)


    $ ruby -rhpricot -e 'html = Hpricot( open("test.php").read) ;
    open("test2.php", "w") { |f| f.write( html.to_html)}'


    $ diff -u test.php test2.php
    --- test.php 2011-04-05 10:50:00.000000000 +0200
    +++ test2.php 2011-04-05 10:53:19.000000000 +0200
    @@ -1,10 +1,11 @@
    <?php include_once('whatever.php'); ?>
    <html><title> anything</title>
    <?php includeHtmlHeader(' blabla',',')?>
    - <body topmargin="0" bgcolor="#ffffff" leftmargin="0"
    link="#003366" marginheight="0" marginwidth="0" vlink="#003366"
    alink="#800000" >
    + <body topmargin="0" bgcolor="#ffffff" leftmargin="0"
    link="#003366" marginheight="0" marginwidth="0" vlink="#003366"
    alink="#800000">
    <?includeFile('/application/templates/whatever.shtml')?>
    - <br>
    + <br />
    <?php echo more::code("andsuch"); ?>


    <script type="text/javascript">OAS_AD('Position1');</script>
    +</body></html>
    \ No newline at end of file


    Much better, but still ... as the documents are more complex than this
    sample, the changes made by the libraries grow bigger.

    thanks,
    - Markus
    Markus Fischer, Apr 5, 2011

  2. Re: Non-correcting library for parsing/modifying broken HTML/PHP files?

    On Tue, Apr 5, 2011 at 10:56 AM, Markus Fischer wrote:

    > does anyone know of a library which can work with broken/malformed HTML/PHP
    > and still produce the same output like the input?
    >
    > So far I've tried Nokogiri and Hpricot, they're absolutely amazing and excel
    > in their purpose but fail to meet my requirement that, when saving the HTML,
    > nothing which I haven't changed due DOM manipulation should change in the
    > output.
    >
    > The thing is that I've to work with such horrible broken HTML (or say, PHP)
    > documents that those libraries are über-tempted to correct it. But this is
    > troublesome for me, as I've fix a few hundreds, maybe up to thousands of
    > documents and their versioned history should really only reflect the change
    > I'm doing and not what the library needs to change so it can work with it. I
    > looked up at rubygems but was unable to come up with more libraries, did I
    > miss them?


    What about one initial rework to get proper (X)HTML, submit it to your
    version control and then create those modifications that you need to
    do? That approach has served me quite well for example when enforcing
    a particular source code formatting.

    Cheers

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Apr 5, 2011

  3. Re: Non-correcting library for parsing/modifying broken HTML/PHP files?

    Hi Robert,

    On 05.04.2011 14:59, Robert Klemme wrote:
    > What about one initial rework to get proper (X)HTML, submit it to your
    > version control and then create those modifications that you need to
    > do? That approach has served me quite well for example when enforcing
    > a particular source code formatting.


    I considered this approach too; unfortunately it turns out that it breaks
    the history too much, i.e. the blaming of content. Nothing gets "broken"
    as such, but when you blame/annotate, and we do this, you get irrelevant
    noise in it, which I really try to avoid.

    thanks,
    - Markus
    Markus Fischer, Apr 5, 2011
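
One possible direction, not something proposed in the thread: when a change is small and local (one attribute, one tag), it can be applied to the raw text instead of a parsed DOM, so nothing is re-serialized and every untouched byte stays as it was. The sketch below is a minimal illustration under that assumption. The helper names `replace_attribute` and `changed_line_numbers` are hypothetical, the sample is a shortened version of test.php from the first post, and the regex only handles double-quoted attribute values, so it would mishandle tags that contain a PHP block or a `>` inside an attribute.

```ruby
# Hypothetical helper: rewrite one attribute value directly in the raw source.
# Only the matched value is replaced; every other byte is copied through as-is.
def replace_attribute(source, tag, attr, new_value)
  # Find the first opening tag with this name, then the attribute inside it.
  # [^>] also matches newlines, so attributes wrapped onto later lines are found.
  pattern = /(<#{Regexp.escape(tag)}\b[^>]*?\b#{Regexp.escape(attr)}=")[^"]*(")/i
  # Block form of sub, so backslashes in new_value are not treated as backreferences.
  source.sub(pattern) { "#{$1}#{new_value}#{$2}" }
end

# Hypothetical helper: 1-based numbers of the lines that differ between two
# versions (assumes both versions have the same number of lines).
def changed_line_numbers(before, after)
  before.lines.zip(after.lines).each_with_index
        .reject { |(a, b), _| a == b }
        .map { |_, index| index + 1 }
end

# Shortened sample based on test.php from the first post.
original = <<~'PHP'
  <?php include_once('whatever.php'); ?>
  <html><title> anything</title>
  <body topmargin="0" bgcolor="#ffffff" leftmargin="0"
  link="#003366" marginheight="0" >
  <?includeFile('/application/templates/whatever.shtml')?>
  <br>
PHP

edited = replace_attribute(original, "body", "bgcolor", "#eeeeee")
puts changed_line_numbers(original, edited).inspect
```

Because the edit is a plain string substitution, the only line that differs in this sample is the one holding the `bgcolor` attribute, so blame/annotate output would not pick up unrelated changes. Anything that needs real structural awareness (nested tags, moving elements) is outside what this sketch covers.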
