piping input to an external script

Tim Arnold · May 11, 2009

Hi, I have some html files that I want to validate by using an external
script 'validate'. The html files need a doctype header attached before
validation. The files are in utf8 encoding. My code:
---------------
import os,sys
import codecs,subprocess
HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'

filename = 'mytest.html'
fd = codecs.open(filename,'rb',encoding='utf8')
s = HEADER + fd.read()
fd.close()

p = subprocess.Popen(['validate'],
                     stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)
validate = p.communicate(unicode(s,encoding='utf8'))
print validate
---------------

I get lots of lines like this:
Error at line 1, character 66:\tillegal character number 0
etc etc.

But I can give the command in a terminal 'cat mytest.html | validate' and
get reasonable output. My subprocess code must be wrong, but I could use
some help to see what the problem is.

python2.5.1, freebsd6
thanks,
--Tim

Steve Howell · May 12, 2009

Newline missing after the header is my guess.

norseman · May 12, 2009

If you search through the recent Python-List for UTF-8 things you might
get the same understanding I have come to.

The problem is the use of Python's 'print' statement, or whatever it
is. It 'cooks' things and someone decided that it would only handle 1/2
of a byte (in the x'00' to x'7f' range) and ignore or send error messages
against anything else. I guess the person doing the deciding read the
part that says ASCII printables are in the 7 bit range and chose to
ignore the part about the rest of the byte being undefined. That is
undefined, not disallowed. Means the high bit half can be used as
wanted since it isn't already taken. Nor did whoever it was take a look
around the computer world and realize the conflict that was going to be
generated by using only 1/2 of a byte in a 1byte+ world.

If you can modify your code to use read and write you can bypass print
and be OK. Or just have Python do the 'cat mytest.html | validate' for
you. Apply a var for html and let Python accomplish the equivalent
of Unix's:
for f in *.html; do cat "$f" | validate; done
or
for f in *.html; do validate "$f"; done #file name available this way

If you still have problems, take a look at os.popen2 (and its sibling popen3).
Also take a look at os.spawn* et al.

HTH

Steve
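
The shell loop above can be sketched in Python as follows. This is a minimal Python 3 sketch (an assumption; the thread uses Python 2.5). The names `pipe_file`, `pipe_all` and `STAND_IN` are hypothetical, and `STAND_IN` is a small child process standing in for the real 'validate' script, which is not available here. Handing the open file to the child as its stdin passes the raw bytes through untouched, much like `cat file | validate`.

```python
import glob
import subprocess
import sys

# Hypothetical stand-in for the 'validate' script: a child Python process
# that prints how many bytes arrived on its stdin.
STAND_IN = [sys.executable, "-c",
            "import sys; print(len(sys.stdin.buffer.read()))"]


def pipe_file(path, command=STAND_IN):
    """Feed the raw bytes of `path` to `command` on stdin and return its output."""
    with open(path, "rb") as handle:
        result = subprocess.run(command, stdin=handle,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
    return result.stdout.decode("utf-8", errors="replace")


def pipe_all(pattern="*.html", command=STAND_IN):
    """Return {filename: command output} for every file matching `pattern`."""
    return {path: pipe_file(path, command)
            for path in sorted(glob.glob(pattern))}
```

To try it against a real tool, pass something like `command=['validate']`; no decoding or re-encoding of the file content happens on the Python side.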

Steve Howell · May 12, 2009

Wow. Unicode and subprocessing and printing can have dark corners,
but common sense does apply in MOST situations.

If you send the header, add the newline.

But you do not need the header if you can cat the input file sans
header and get sensible output.

Finally, if you are concerned about adding the header, then it belongs
in the original input file; otherwise, you are creating a false
positive.

norseman · May 12, 2009

Yep! The problem is with 'print'

Steve

Steve Howell · May 12, 2009

Huh? Print is printing exactly what you expect it to print.

Dave Angel · May 12, 2009

The usual rule in debugging: split the problem into two parts, and test
each one separately, starting with the one you think most likely to be
the culprit.

In this case the obvious place to split is with the data you're passing
to the communicate call. I expect it's already wrong, long before you
hand it to the subprocess. So write it to a file instead, and inspect
it with a binary file viewer. And of course test it manually with your
validate program. Is validate really expecting a Unicode stream on stdin?
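
The byte inspection suggested above can be tried without the validator at all. The following is a minimal Python 3 sketch (an assumption; the thread uses Python 2.5). The helpers `nul_count` and `hex_preview` and the sample document are hypothetical. It shows that the same text contains no zero bytes when encoded as UTF-8, but many when encoded in a wide encoding such as UTF-32. That is one plausible way to end up with "illegal character number 0", though the thread does not confirm which bytes actually reached validate.

```python
HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'


def nul_count(text, encoding):
    """Count the zero bytes in `text` once it is encoded with `encoding`."""
    return text.encode(encoding).count(b"\x00")


def hex_preview(data, limit=16):
    """Return the first `limit` bytes of `data` as space-separated hex pairs."""
    return " ".join("%02x" % byte for byte in data[:limit])


# Made-up sample: every character is below U+0100.
sample = HEADER + "<html><body>caf\u00e9</body></html>\n"

print(nul_count(sample, "utf-8"))                         # 0
print(nul_count(sample, "utf-32-le") == 3 * len(sample))  # True
print(hex_preview(sample.encode("utf-8"), 4))             # 3c 21 44 4f
print(hex_preview(sample.encode("utf-32-le"), 4))         # 3c 00 00 00
```

A hex viewer on a dumped file shows the same thing: zero bytes interleaved with the ASCII characters mean the text was not written as UTF-8.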

norseman · May 12, 2009

My apologies.

Tim: Using what you posted:
Is the third char of the first line read from the file a TAB?

Just curious. len(HEADER) is 63, and the error is at char 66, character
number 0. That doesn't seem quite consistent math-wise.
63 + cr + lf gives 65. But, as another noted, you don't have those.
In "...66:\tillegal...", is '\t' a tab on screen, or byte 1 or 3 of the file?
If you have mc available, highlight the file in it and press Shift-F3, then
F4. 09 is TAB.

</title> is a closing tag and should not exist as an opener.
<html> can be an opener. Did the h somehow become a '\'?
(Still, that would put x'09' at byte 2 of the file.)

Most validate programs I have used will let me know the header is
missing if in fact it is, and give me a choice of how to process (XML,
XHTML, HTML 1.1, ...) or quit.

Is HEADER ('<!DOC...>') itself already in utf-8?
Or are you mixing things?

Last but not least: if you have the source of the validate process, check
that over carefully. The numbers don't work for me.

Just thinking on paper. No need to respond.

Steve
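
The arithmetic above can be checked directly. This is a small Python 3 sketch with a hypothetical helper, `locate`. It assumes the validator counts characters from 1 and that nothing, not even a newline, sits between the header and the file content, as in the posted code.

```python
HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'


def locate(position, header=HEADER):
    """Map a 1-based position in header + body to (part, 0-based index in part)."""
    index = position - 1
    if index < len(header):
        return ("header", index)
    return ("body", index - len(header))


print(len(HEADER))  # 63
print(locate(66))   # ('body', 2)
```

Under those assumptions, character 66 is the third character read from the file, which is the position asked about above.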

Tim Arnold · May 12, 2009

Good advice from everyone. The example was simpler than my actual situation,
but it did show the problem. Dave's final question was the right one: I
needed to pass the html content as a byte string, not a unicode object:

HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'

filename = 'mytest.html'
fd = codecs.open(filename,'rb',encoding='utf8')
s = HEADER + fd.read().encode('utf8') # <- made the difference
fd.close()

p = subprocess.Popen(['validate',],
                     stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)
validate = p.communicate(s)
print validate
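
For readers on Python 3, the same idea can be sketched as below. This is not the code from the thread: `run_with_header` and `STAND_IN` are hypothetical names, and `STAND_IN` is a child process standing in for 'validate' that reports the byte count and the number of zero bytes it received. The key step is unchanged: build the text, encode it to UTF-8 bytes once, and hand those bytes to `communicate`.

```python
import subprocess
import sys

HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'

# Hypothetical stand-in for 'validate': prints the number of bytes read
# from stdin, followed by how many of them were zero bytes.
STAND_IN = [sys.executable, "-c",
            "import sys; d = sys.stdin.buffer.read(); print(len(d), d.count(0))"]


def run_with_header(html_text, command=STAND_IN):
    """Prepend HEADER, encode to UTF-8 bytes, and pipe the result to `command`."""
    payload = (HEADER + html_text).encode("utf-8")
    proc = subprocess.Popen(command,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    output, _ = proc.communicate(payload)
    return output.decode("utf-8", errors="replace")


print(run_with_header("<html>\u00e9</html>"))
```

In Python 3, `communicate` rejects a `str` when the pipes are opened in binary mode, so a missing encode step shows up as an immediate error rather than as garbage sent to the child.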

Steve Howell · May 12, 2009

See the suggested debugging tip inline in your program:

Hi, I have some html files that I want to validate by using an external
script 'validate'. The html files need a doctype header attached before
validation. The files are in utf8 encoding. My code:
---------------
import os,sys
import codecs,subprocess
HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'

filename = 'mytest.html'
fd = codecs.open(filename,'rb',encoding='utf8')
s = HEADER + fd.read()

# Try inserting lines like below, to see what characters are actually
# near char 66.
print '---'
print repr(s[65])
print repr(s[66])
print repr(s[:70])
print repr(unicode(s,encoding='utf8')[:70])
print '---'

fd.close()

p = subprocess.Popen(['validate'],
                     stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)
validate = p.communicate(unicode(s,encoding='utf8'))
print validate
---------------

I get lots of lines like this:
Error at line 1, character 66:\tillegal character number 0
etc etc.

See above, it's pretty easy to see what the 66th character of "s" is.

But I can give the command in a terminal 'cat mytest.html | validate' and
get reasonable output. My subprocess code must be wrong, but I could use
some help to see what the problem is.

Your disconnect is that in your program you are NOT actually
simulating the sending of mytest.html to the validate program, so you
are comparing apples and oranges.

The fact that you can send mytest.html to the validate program without
a header from the shell suggests to me that it is equally unnecessary
in your Python program, or maybe you just haven't thought through what
you're really trying to accomplish here.
