Markus Fischer
Hi,
does anyone know of a library which can work with broken/malformed
HTML/PHP and still produce output identical to the input?
So far I've tried Nokogiri and Hpricot. They're absolutely amazing and
excel at their purpose, but they fail to meet my requirement that, when
saving the HTML, nothing I haven't changed through DOM manipulation
should change in the output.
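Put as a check, the requirement amounts to roughly the following. This is only a sketch: the helper name round_trip_identical? is made up for illustration, and the block stands in for whatever parse-and-serialize call a library offers.

```ruby
# Hypothetical helper (not part of any library): yields the source to a
# block that parses and re-serializes it, then compares the result with
# the original byte for byte.
def round_trip_identical?(source)
  output = yield(source)
  output.bytes.to_a == source.bytes.to_a
end

# An identity "serializer" passes the check.
puts round_trip_identical?("<br>\n") { |s| s }                       # true
# Anything that normalizes markup, even slightly, fails it.
puts round_trip_identical?("<br>\n") { |s| s.sub("<br>", "<br />") } # false
```

Any library whose serializer fails this on an untouched document is out for my purpose.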
The thing is that I have to work with such horribly broken HTML (or
rather, PHP) documents that those libraries are über-tempted to correct
them. But this is troublesome for me, as I have to fix a few hundred,
maybe up to thousands of, documents, and their version history should
really only reflect the changes I'm making and not what the library
needs to change so it can work with them. I searched rubygems but was
unable to come up with more libraries. Did I miss any?
Many words, here's an example:
$ cat test.php
<?php include_once('whatever.php'); ?>
<html><title> anything</title>
<?php includeHtmlHeader(' blabla',',')?>
<body topmargin="0" bgcolor="#ffffff" leftmargin="0"
link="#003366" marginheight="0" marginwidth="0" vlink="#003366"
alink="#800000" >
<?includeFile('/application/templates/whatever.shtml')?>
<br>
<?php echo more::code("andsuch"); ?>
<script type="text/javascript">OAS_AD('Position1');</script>
$ ruby -v ; gem list|grep nokogi
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-linux]
nokogiri (1.4.4)
$ ruby -rnokogiri -e 'html = Nokogiri::HTML::Document.parse(
open("test.php").read) ; open("test2.php", "w") { |f| f.write(
html.to_html)}'
$ diff -u test.php test2.php
--- test.php 2011-04-05 10:50:00.000000000 +0200
+++ test2.php 2011-04-05 10:52:31.000000000 +0200
@@ -1,10 +1,11 @@
-<?php include_once('whatever.php'); ?>
-<html><title> anything</title>
- <?php includeHtmlHeader(' blabla',',')?>
- <body topmargin="0" bgcolor="#ffffff" leftmargin="0"
link="#003366" marginheight="0" marginwidth="0" vlink="#003366"
alink="#800000" >
- <?includeFile('/application/templates/whatever.shtml')?>
- <br>
- <?php echo more::code("andsuch"); ?>
-
-
-<script type="text/javascript">OAS_AD('Position1');</script>
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
+<?php include_once('whatever.php'); ?><html>
+<head>
+<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+<title> anything</title>
+<?php includeHtmlHeader(' blabla',',')?>
+</head>
+<body topmargin="0" bgcolor="#ffffff" leftmargin="0" link="#003366"
marginheight="0" marginwidth="0" vlink="#003366" alink="#800000">
+ <?includeFile
('/application/templates/whatever.shtml')?><br><?php echo
more::code("andsuch"); ?><script
type="text/javascript">OAS_AD('Position1');</script>
+</body>
+</html>
Now with Hpricot:
$ gem list|grep hpri
hpricot (0.8.4)
$ ruby -rhpricot -e 'html = Hpricot( open("test.php").read) ;
open("test2.php", "w") { |f| f.write( html.to_html)}'
$ diff -u test.php test2.php
--- test.php 2011-04-05 10:50:00.000000000 +0200
+++ test2.php 2011-04-05 10:53:19.000000000 +0200
@@ -1,10 +1,11 @@
<?php include_once('whatever.php'); ?>
<html><title> anything</title>
<?php includeHtmlHeader(' blabla',',')?>
- <body topmargin="0" bgcolor="#ffffff" leftmargin="0"
link="#003366" marginheight="0" marginwidth="0" vlink="#003366"
alink="#800000" >
+ <body topmargin="0" bgcolor="#ffffff" leftmargin="0"
link="#003366" marginheight="0" marginwidth="0" vlink="#003366"
alink="#800000">
<?includeFile('/application/templates/whatever.shtml')?>
- <br>
+ <br />
<?php echo more::code("andsuch"); ?>
<script type="text/javascript">OAS_AD('Position1');</script>
+</body></html>
\ No newline at end of file
Much better, but still ... as documents get more complex than this
sample, the changes made by the libraries grow bigger.
thanks,
- Markus