bad data from urllib when run from MS .bat file


Stuart McGraw

I just spent a $*#@!*&^&% hour registering at ^$#@#%^
SourceForge and trying to submit a Python bug report
but it still won't let me. I give up. Maybe someone who
cares will see this post, or maybe it will save time for
someone else who runs into this problem...

================================================

Environment:
- Microsoft Windows 2000 Pro
- Python 2.3.4
- urllib (version shipped with Python-2.3.4)

Problem:
urllib returns corrupted data when reading an EUC-JP encoded
web page, from a python script run from a MS Windows .BAT
file, but not when the same script is run from the command line.

Note: To reproduce this problem, it helps to have East Asian font
support installed on the test system. In Windows 2000:
  Control Panel,
  Regional Options, General tab,
  put a check mark next to Japanese in the "Language settings..." area.
Python also needs either the cjkcodecs package (http://cjkpython.berlios.de/)
or Tamito KAJIYAMA's Japanese codecs
(http://www.asahi-net.or.jp/~rd6t-kjym/python/)
installed.

To reproduce the problem...

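1. Create a Python file, test.py:
----------------
import sys, urllib, cjkcodecs

# Fetch the page named on the command line.
f = urllib.urlopen(sys.argv[1])
for ln in f:
    # Decode each EUC-JP line to unicode, then re-encode it as UTF-8.
    ln = ln.decode("cjkcodecs.euc-jp")
    # The trailing comma suppresses print's own newline.
    print ln.encode("utf-8"),
----------------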

2. Create a batch file that will run test.py:
test.bat:
----------------
python test.py http://etext.lib.virginia.edu/cgi-local/breen/wwwjdic?1W%BF%A9%A4%D9%A4%EB_v1
----------------

3. In a cmd.exe window run the following two commands:
python test.py http://etext.lib.virginia.edu/cgi-local/breen/wwwjdic?1W%BF%A9%A4%D9%A4%EB_v1 >out1.txt
test.bat >out2.txt

4. out1.txt and out2.txt should be identical. But they are not.

The URL used returns an EUC-JP encoded page with some Japanese
characters in it. test.py reads the page line by line, decodes
the lines to unicode, re-encodes them as UTF-8, and prints them to
stdout, which is redirected to a file.
Thus the output file should be a UTF-8 version of the EUC-JP web page.
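
For readers on a current interpreter, the same decode-then-re-encode step can be sketched in Python 3, where the `euc_jp` codec ships with the standard library. This is an illustration only, not the script from the post: `transcode_lines` is a hypothetical helper name, and it accepts any iterable of byte lines so it can be tried without network access.

```python
def transcode_lines(lines, source_encoding="euc_jp"):
    """Yield each EUC-JP byte line re-encoded as UTF-8 bytes."""
    for raw in lines:
        # Decode to str first, then encode; bad input raises UnicodeDecodeError.
        yield raw.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    # Sample line built from the EUC-JP bytes quoted in the post.
    sample = [b"<td>\xbf\xa9\xa4\xd9\xa4\xeb</td>\n"]
    for out in transcode_lines(sample):
        print(out)
```

Because the helper never touches stdout encoding or the network, any difference between two runs on the same input would have to come from somewhere else.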

The first command runs test.py directly. The second command runs
the identical command from a Windows batch file. One should expect
out1.txt and out2.txt to be identical.

out1.txt (created by running test.py from the command line) is
correct (verify by opening out1.txt in Notepad and selecting a
Japanese-capable font, e.g. Lucida Sans Unicode). The string in
the first cell of the HTML table is the three Japanese characters
for the word "taberu".

But in out2.txt (created by running test.py from a Windows .bat
file), instead of Japanese characters there, we see the ASCII text
string "A9D9EB". (The EUC-JP bytes of the actual Japanese characters
that should be there are \xBF\xA9\xA4\xD9\xA4\xEB, so the printed
hex digits seem to come from alternate bytes of the EUC-JP string.)
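
The "alternate bytes" observation can be checked with a few lines of Python 3. This is just arithmetic on the byte string quoted above; the variable names are made up for the illustration.

```python
# EUC-JP bytes quoted in the post for the word "taberu".
raw = b"\xbf\xa9\xa4\xd9\xa4\xeb"

# Hex digits of every second byte, starting with the second one.
alternate = raw[1::2].hex().upper()

# Hex digits of the bytes that do not show up in out2.txt.
dropped = raw[0::2].hex().upper()

print(alternate)  # A9D9EB
print(dropped)    # BFA4A4
```

The same hex pairs also occur in the percent-escaped query string of the URL (`%BF%A9%A4%D9%A4%EB`), which may or may not be a coincidence.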

In other lines with Japanese characters a similar effect is seen:
the first two Japanese characters are replaced with a string of
hex digits. Strangely, the remaining Japanese characters on the line
are not corrupted.

Running with a debugger shows that the corruption is in the text
received from urllib; it is not a result of the euc-jp decoding,
UTF-8 encoding, or writing to the output file.

So it looks like some bad mojo between urllib and the Windows
batch environment.
 

John J. Lee

Stuart McGraw said:
2. Create a batch file that will run test.py:
test.bat:
----------------
python test.py http://etext.lib.virginia.edu/cgi-local/breen/wwwjdic?1W%BF%A9%A4%D9%A4%EB_v1
----------------

3. In a cmd.exe window run the following two commands:
python test.py http://etext.lib.virginia.edu/cgi-local/breen/wwwjdic?1W%BF%A9%A4%D9%A4%EB_v1 >out1.txt
test.bat >out2.txt

4. out1.txt and out2.txt should be identical. But they are not.
[...]
Running with a debugger shows that the corruption is in the text
received from urllib; it is not a result of the euc-jp decoding,
UTF-8 encoding, or writing to the output file.

Hmm...

Stuart McGraw said:
So it looks like some bad mojo between urllib and the Windows
batch environment.

Just a guess, without actually bothering to think about the numerology
in detail:

test.bat:
----------------
python -u test.py http://etext.lib.virginia.edu/cgi-local/breen/wwwjdic?1W%BF%A9%A4%D9%A4%EB_v1
----------------

Note the -u switch (for 'unbuffered', but also 'um, binary mode'
<wink>).


John
 

Stuart McGraw

John J. Lee said:
Just a guess, without actually bothering to think about the numerology
in detail:

test.bat:

Did you try doing that? Did it work for you? I just tried here, and
still have the same problem.

Even worse, in the original script that the test script is derived from
I encountered a new problem. Intermixed with the web page data
returned by urllib are bits and pieces (10-20 characters long) of local
file and directory names. It only happens when reading some web pages
(EUC-JP encoded, as with the original problem), but I'm wondering
if there are some single-byte/double-byte character issues with urllib.
That would be surprising to me: given that urllib is shipped with the
Python distribution, I would think that any core libs would be pretty
bombproof. (Am I being naive? :) Of course, it's still possible I hosed
something in my script, so I will double check...
 

Bengt Richter

Stuart McGraw said:
Problem:
urllib returns corrupted data when reading an EUC-JP encoded
web page, from a python script run from a MS Windows .BAT
file, but not when the same script is run from the command line.

Just a thought: in case your command line is being interpreted
by cmd.exe and .bat by something else (command.com?) you could
check if it makes a difference, e.g.,

copy test.bat test.cmd

and try it again? (explicitly as test.cmd, not just test, since any
same-name .com or .exe or .bat may have priority over .cmd)
You can probably investigate the latter by something like

[21:54] C:\pywk\junk>echo %pathext%
.COM;.EXE;.BAT;.CMD
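
The same lookup order can be inspected from Python. The sketch below is written for Python 3 and uses hypothetical helper names (`extension_order`, `tried_before`); it only parses a PATHEXT-style string and makes no claim about how cmd.exe itself resolves a bare command name.

```python
import os


def extension_order(pathext):
    """Split a PATHEXT-style string into lower-cased extensions, in order."""
    return [ext.lower() for ext in pathext.split(";") if ext]


def tried_before(pathext, first, second):
    """Return True if extension `first` is listed ahead of `second`."""
    order = extension_order(pathext)
    return order.index(first) < order.index(second)


if __name__ == "__main__":
    # PATHEXT is normally only set on Windows; fall back to the value shown above.
    value = os.environ.get("PATHEXT", ".COM;.EXE;.BAT;.CMD")
    print(extension_order(value))
    print(tried_before(".COM;.EXE;.BAT;.CMD", ".bat", ".cmd"))
```

With the value shown in the post, `.bat` is listed ahead of `.cmd`, which is why the copy has to be invoked explicitly as test.cmd.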

Regards,
Bengt Richter
 

Stuart McGraw

Bengt Richter said:
Just a thought: in case your command line is being interpreted
by cmd.exe and .bat by something else (command.com?) you could
check if it makes a difference, e.g.,

copy test.bat test.cmd

and try it again? (explicitly as test.cmd, not just test, since any
same-name .com or .exe or .bat may have priority over .cmd)
You can probably investigate the latter by something like

[21:54] C:\pywk\junk>echo %pathext%
.COM;.EXE;.BAT;.CMD

Well, I'm pretty sure cmd.exe was executing it, but I tried your
suggestion to make absolutely sure. Same results :-(
Given the other (seeming) urllib problem I mentioned in another
post in this thread, which appeared without any involvement
of batch scripts, I am getting more and more suspicious that
urllib is buggy, at least with non-single byte data.
 

Bengt Richter

Stuart McGraw said:
Well, I'm pretty sure cmd.exe was executing it, but I tried your
suggestion to make absolutely sure. Same results :-(
Given the other (seeming) urllib problem I mentioned in another
post in this thread, which appeared without any involvement
of batch scripts, I am getting more and more suspicious that
urllib is buggy, at least with non-single byte data.

Hm, what happens if you make a test2.py and pass it the name of an output
file instead of redirecting the output from print? In fact, eliminate the
encoding and the line iteration and everything, and just let test2 copy the entire
server data in one single read and write it in binary. I.e.,
open(sys.argv[2],'wb').write(urllib.urlopen(...).read())

That should show whether Python is seeing the identical input from the server.
Then you could do it line-wise (not with a print statement ending in ",", but with
a binary file write). That would say whether line-by-line chunking on input
was doing anything to the data, in case urllib is buffering/chunking
differently for interactive vs. bat file. Just grasping at straws, but eliminating
chunking, piping, re-encoding, and binary vs. text mode doubts from the test should
show why interactive vs. .bat is different, IWT.
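
A minimal sketch of that test2.py idea follows, written for Python 3 (`urllib.request`) rather than the Python 2.3.4 `urllib` used in this thread. The function name `save_raw` is an assumption made for the example, not something from the original script.

```python
import sys
import urllib.request


def save_raw(url, out_path):
    """Fetch `url` in a single read and write the bytes unchanged to `out_path`."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    # "wb" keeps Windows text-mode newline translation out of the picture.
    with open(out_path, "wb") as out:
        out.write(data)
    return len(data)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python test2.py URL OUTPUT_FILE
    print(save_raw(sys.argv[1], sys.argv[2]), "bytes written")
```

Running it once from the prompt and once from the batch file, then comparing the two output files byte for byte, would separate "the server sent different data" from "the data was changed on the way out".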

Also, your mention of two-character errors made me wonder about spurious BOMs
or such from encoding file substrings as though they were entire files.
Would a final print for a final '\n' do anything that might trigger a final flush
differently, with potential cooking consequences? (Why the print with a space instead of a newline, BTW?)
What if you just do your own file.write output in binary and control everything?

Just some additional thoughts. Sorry the cmd vs. bat thing didn't do anything.
BTW, what command line options are in use to start your interactive session
(it is a console, not IDLE, right?). You didn't seem to have any (e.g. -u) in test.py.
Could the .BAT file be seeing a different environment? Could the http://.. need quoting?
I.e., could the server be seeing a glitched URL tail and be sending the same file but with some
different option?
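
One way to check the "glitched URL tail" idea is to have the script report exactly what it received in sys.argv before doing anything else. The following is a Python 3 sketch; `percent_escapes` and `ESCAPE` are hypothetical names introduced for this illustration.

```python
import re
import sys

# Matches one percent-escaped byte such as %BF.
ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def percent_escapes(url):
    """Return the percent-escaped bytes found in `url`, in order."""
    return ESCAPE.findall(url)


if __name__ == "__main__" and len(sys.argv) > 1:
    received = sys.argv[1]
    # repr() makes stray quotes, spaces, or missing characters visible.
    print(repr(received), file=sys.stderr)
    print(percent_escapes(received), file=sys.stderr)
```

Comparing the stderr output of a run from the prompt with a run from test.bat would show whether the URL reaches Python intact. Since `%` has a special meaning inside batch files, the escapes are a plausible place for the two runs to differ.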

Hope something gives you a useful idea. That's all I can think of for the moment ;-)

Regards,
Bengt Richter
 

John J. Lee

Stuart McGraw said:
Just a guess
[...]
Did you try doing that?

No. It was a guess.


[...]
That would be surprising to me given that urllib is shipped with the
Python distribution, I would think that any core libs would be pretty
bombproof. (Am I being naive? :)
[...]

Possibly:

http://article.gmane.org/gmane.comp.python.devel/63911


But I have a fairly strong suspicion that this *isn't* a bug in urllib
or Python: I think urllib regards HTTP response data simply as a
binary string (as opposed to the case of URLs, where things are... uh,
complicated).

*I'm* certainly more naive about encodings, character sets &c. than [...]

Stuart McGraw said:
Of course, still possible I hosed [...]

Yes, that's possible :)


John
 

Fredrik Lundh

John said:
But I have a fairly strong suspicion that this *isn't* a bug in urllib
or Python: I think urllib regards HTTP response data simply as a
binary string

that's correct.

the bug is most likely elsewhere; probably in the interaction between
print and cmd redirection, or perhaps in the interaction between cmd
and sys.argv. to test it, get rid of all the junk you can get rid of: that
is, embed the URL in the test script, and write output to a binary file,
and use a hexdump tool to look at the file (use "debug" if you don't
have any better tool).
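
For anyone without a hexdump tool at hand, a rough stand-in can be sketched in a few lines of Python 3. `hexdump` here is a made-up helper for illustration, not a standard-library function, and the sample bytes are the EUC-JP sequence quoted earlier in the thread.

```python
def hexdump(data, width=16):
    """Return an offset / hex / ASCII dump of `data` as a string."""
    pad = width * 3 - 1  # width of a full row of "xx " groups, minus the last space
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(hexdump(b"<td>\xbf\xa9\xa4\xd9\xa4\xeb</td>"))
```

Dumping the binary output of both runs this way shows at a glance whether the bytes BF A9 A4 D9 A4 EB are present or whether ASCII digits (41 39 44 39 45 42 for "A9D9EB") appear in their place.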

</F>
 

Stuart McGraw

Thanks everyone for all the suggestions. I will follow up on them,
but not right now. I am about to move halfway around the world
for a few months so I will need to get settled in before I have
time to look into this more.
 
