Binary data, command output, and Ruby

Phrogz

I have a script that pulls pages from our wiki server. It was working
using Net::HTTP and open-uri with basic_authentication, but our
sysadmin disabled basic authentication and left NTLM as the only
authentication method.

Instead of trying to figure out how to use the Ruby NTLM library, I
decided to just use curl. It was working nicely for the HTML pages
using this form:

def fetch_http_ntlm( url )
  `curl #{url} --ntlm -# -u #{USER}:#{PASS}`
end

However, the above fails for binary files (for example, images
embedded in pages), so I had to switch it to this:

def fetch_http_ntlm( url )
  # Have curl write to a temporary file, then read it back in binary mode
  file_name = "C:\\tmp_#{Time.new.to_i}"
  `curl #{url} --ntlm -# -u #{USER}:#{PASS} -o #{file_name}`
  raw = File.open( file_name, 'rb' ){ |f| f.read }
  File.delete( file_name )
  raw
end

In other words, I have curl write the output to a file, and then read
in the file using binary mode, and delete the file.

Should I have to do this? Is it a general problem that commands can't
cleanly return binary data to the 'console', and hence can't be
captured using backticks? Or is curl on Windows at fault, and should
it be doing something different? Or is Ruby on Windows at fault? Or
is Windows itself at fault?
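
One possible way around the temporary file, offered as a sketch under assumptions rather than something tested against this wiki: open the pipe explicitly in binary mode with IO.popen, since backticks take no mode argument. The helper names curl_ntlm_command and read_command_binary are made up for illustration, -s is swapped in for the -# progress bar, and the array form of IO.popen needs Ruby 1.9 or later. The array form also bypasses the shell, so the URL and password need no quoting.

```ruby
require 'rbconfig'

# Build the curl argument list (hypothetical helper; the flags are the ones
# used in the post, with -s replacing the -# progress bar).
def curl_ntlm_command(url, user, pass)
  ['curl', '-s', '--ntlm', '-u', "#{user}:#{pass}", url]
end

# Run a command given as an argv array and return its stdout as raw bytes.
# The 'rb' mode asks Ruby to read the pipe in binary mode.
def read_command_binary(argv)
  IO.popen(argv, 'rb') { |io| io.read }
end

def fetch_http_ntlm(url, user, pass)
  read_command_binary(curl_ntlm_command(url, user, pass))
end

# Self-check that needs no network: a child Ruby emits bytes that text-mode
# streams tend to disturb (CR, LF, 0x1A, NUL) and the parent reads them back.
SAMPLE_BYTES = [13, 10, 26, 0, 129, 250, 70, 111, 111]
child_script = "$stdout.binmode; $stdout.write(#{SAMPLE_BYTES.inspect}.pack('C*'))"
received = read_command_binary([RbConfig.ruby, '-e', child_script])
p received.bytes == SAMPLE_BYTES
```

Whether this cures the truncation on the Windows box is exactly what would need checking there; the self-check only shows the round trip is byte-exact on the machine it runs on.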


Also - I didn't try using the Tempfile library for the above, since
the documentation for Tempfile.new says:
'Creates a temporary file of mode 0600 in the temporary directory
whose name is basename.pid.n and opens with mode "w+".' If this
documentation is correct, does this mean that the Tempfile library
doesn't work for binary files on Windows?
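
On the Tempfile question: a Tempfile object delegates to an ordinary File, so as far as I know it can be switched to binary mode by calling binmode right after creation, whatever mode string it was opened with. A minimal sketch follows; the method name binary_roundtrip and the basename are placeholders, and the sample bytes are arbitrary.

```ruby
require 'tempfile'

# Write bytes to a temp file in binary mode and read them back.
# (Hypothetical helper, only meant to show the binmode call.)
def binary_roundtrip(data)
  tmp = Tempfile.new('fetch_http_ntlm')
  begin
    tmp.binmode          # switch the underlying File to binary mode
    tmp.write(data)
    tmp.rewind
    tmp.read
  ensure
    tmp.close!           # close and delete the temp file
  end
end

payload = [13, 10, 26, 0, 255, 70].pack('C*')
copy = binary_roundtrip(payload)
p copy == payload, copy.bytesize
```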
 
Phrogz

Followup - this does not seem to be a core problem of terminal
commands returning binary data, or a core failing of Ruby. From my OS
X box at home:

Slim2:~/Desktop phrogz$ cat send_bytes.rb
print [13,7,129,250,0,70,111,111].map{ |b| b.chr }.join

Slim2:~/Desktop phrogz$ cat get_bytes.rb
result = `ruby send_bytes.rb`
p result.length, result

Slim2:~/Desktop phrogz$ ruby get_bytes.rb
8
"\r\a\201\372\000Foo"

This is also not a problem with curl (at least on *nix):

Slim2:~/Desktop phrogz$ curl -s -O http://phrogz.net/tmp/gkhead.jpg
Slim2:~/Desktop phrogz$ irb
irb(main):001:0> good = IO.read( 'gkhead.jpg' ); good.length
=> 21443
irb(main):002:0> url = 'http://phrogz.net/tmp/gkhead.jpg'
=> "http://phrogz.net/tmp/gkhead.jpg"
irb(main):003:0> test = `curl -s #{url}`; test.length
=> 21443
irb(main):004:0> test == good
=> true

Tomorrow I'll see which of the above fails back on my Windows box.
Glad this isn't a fundamental Ruby or shell workflow problem, anyhow.
 
Phrogz

Here are the results from Windows. Binary output per se doesn't fail,
but capturing curl's output does: only 2010 of the 21443 bytes come
back, and the data diverges somewhere in bytes 700...800.

Any suggestions on how to further pare this down to see if this is a
Ruby-Windows problem, a Windows shell problem, or a Curl-Windows
problem?


c:\>type send_bytes.rb
print [13,7,129,250,0,70,111,111].map{ |b| b.chr }.join

c:\>type get_bytes.rb
result = `ruby send_bytes.rb`
p result.length, result

c:\>ruby get_bytes.rb
8
"\r\a\201\372\000Foo"


c:\>curl -s -O http://phrogz.net/tmp/gkhead.jpg

c:\>irb
irb(main):001:0> good = File.open( 'gkhead.jpg', 'rb' ){ |f| f.read }; good.length
=> 21443

irb(main):002:0> url = 'http://phrogz.net/tmp/gkhead.jpg'
=> "http://phrogz.net/tmp/gkhead.jpg"

irb(main):003:0> test = `curl -s #{url}`; test.length
=> 2010

irb(main):008:0> 0.step( test.length, 100 ){ |i|
irb(main):009:1* range = i...(i+100)
irb(main):010:1> if good[ range ] != test[ range ]
irb(main):011:2> p good[ range ], test[ range ], range
irb(main):012:2> break
irb(main):013:2> end
irb(main):014:1> }
"\000\000\000\004\000\000\000\0008BIM\004\032\006Slices
\000\000\000\000m
\000\000\000\006\000\000\000\000\000\000\000\000\000\000\001\276\000\000\001\231\000\000\000\006\000g
\000k\000h\000e\000a\000d
\000\000\000\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\000\000\001\231\000\000"
"\000\000\000\004\000\000\000\0008BIM\004$\023\222\vDW$\026\020EG
\377\320\346\177\335q9}K\236:{5C\357L\026\372\330\251\207\261W>
\372\301v\346O\222b\373\027/\276p\310\372\351\370\246\036\314\327~
\366\260\\\t\037\002\236\253\356X\373\267\237\346)\352{\221\221\367I
\352\177\322\2223z`\227\335W"
700...800
 
Phrogz

OK, so this seems like a Ruby-on-Windows problem:

C:\>curl -s -O http://phrogz.net/tmp/gkhead.jpg
C:\>curl -s http://phrogz.net/tmp/gkhead.jpg > test.jpg
C:\>irb
irb(main):001:0> good = File.open( 'gkhead.jpg', 'rb' ){ |f| f.read }; good.length
=> 21443
irb(main):002:0> test = File.open( 'test.jpg', 'rb' ){ |f| f.read }; test.length
=> 21443
irb(main):003:0> suck = `curl -s http://phrogz.net/tmp/gkhead.jpg`; suck.length
=> 2010
=> 2010


good = File.open( 'gkhead.jpg', 'rb' ){ |f| f.read }
test = `curl -s http://phrogz.net/tmp/gkhead.jpg`

0.upto( test.length-1 ){ |i|
  if test[ i ] != good[ i ]
    s1 = good[ (i-5)..(i+2) ]
    s2 = test[ (i-5)..(i+2) ]
    p s1, s2
    puts
    [ s1, s2 ].each{ |str|
      puts str.unpack( 'B8'*str.length ).join('|')
    }
    break
  end
}

#=> "8BIM\004\032\006S"
#=> "8BIM\004$\023\222"
#=>
#=> 00111000|01000010|01001001|01001101|00000100|00011010|00000110|01010011
#=> 00111000|01000010|01001001|01001101|00000100|00100100|00010011|10010010


The Windows console can properly redirect binary command output to a
file, but when Ruby captures the same output with backticks it gets
munged data back (after a certain point, or after a certain byte
sequence?).

I'll take this to ruby-core unless someone can point out why this flaw
isn't Ruby's.
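
For anyone repeating the comparison on a current Ruby, where str[i] returns a one-character string rather than an integer, here is a rough equivalent of the loop above. The helper names first_difference and hex_dump are invented for this sketch, and the two sample strings are just the eight bytes printed in the output above. One thing that may be a clue: the first byte that differs in the good data is 0x1A (octal \032, Ctrl-Z), which text-mode reads on Windows treat as an end-of-file marker.

```ruby
# Return the index of the first byte at which two strings differ,
# or nil if they are byte-for-byte identical.
def first_difference(expected, actual)
  shorter = [expected.bytesize, actual.bytesize].min
  index = (0...shorter).find { |i| expected.getbyte(i) != actual.getbyte(i) }
  return index if index
  expected.bytesize == actual.bytesize ? nil : shorter
end

# Render bytes as hex for side-by-side comparison.
def hex_dump(str)
  str.bytes.map { |b| format('%02x', b) }.join(' ')
end

good = [0x38, 0x42, 0x49, 0x4d, 0x04, 0x1a, 0x06, 0x53].pack('C*') # "8BIM\004\032\006S"
test = [0x38, 0x42, 0x49, 0x4d, 0x04, 0x24, 0x13, 0x92].pack('C*') # "8BIM\004$\023\222"

at = first_difference(good, test)
puts "first difference at offset #{at}"   # offset 5 within this 8-byte window
puts hex_dump(good)
puts hex_dump(test)
```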
 
Phrogz

For my last post on this topic, a simpler test case showing Ruby on OS
X behaving as expected, and Ruby on Windows...not.

====

Darwin Slim2.local 8.10.1 Darwin Kernel Version 8.10.1: Wed May 23
16:33:00 PDT 2007; root:xnu-792.22.5~1/RELEASE_I386 i386 i386
ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-darwin8.9.1]

Slim2:~/Desktop phrogz$ cat put_bytes.rb
File.open( 'gkhead.jpg', 'rb' ){ |f| print f.read }

Slim2:~/Desktop phrogz$ cat get_bytes.rb
raw_bytes = File.open( 'gkhead.jpg', 'rb' ){ |f| f.read }
rcv_bytes = `ruby put_bytes.rb`
p raw_bytes.length, rcv_bytes.length

Slim2:~/Desktop phrogz$ ruby get_bytes.rb
21443
21443

====

Windows XP SP 2 (Microsoft Windows XP [Version 5.1.2600])
ruby 1.8.6 (2007-03-13 patchlevel 0) [i386-mswin32] (latest one-click
installer)

C:\Documents and Settings\gavin.kistner\Desktop>type put_bytes.rb
File.open( 'gkhead.jpg', 'rb' ){ |f| print f.read }

C:\Documents and Settings\gavin.kistner\Desktop>type get_bytes.rb
raw_bytes = File.open( 'gkhead.jpg', 'rb' ){ |f| f.read }
rcv_bytes = `ruby put_bytes.rb`
p raw_bytes.length, rcv_bytes.length

C:\Documents and Settings\gavin.kistner\Desktop>ruby get_bytes.rb
21443
5159
 
Daniel Sheppard

I have a script that pulls pages from our wiki server. It was working
using Net::HTTP and open-uri with basic_authentication, but our
sysadmin disabled basic authentication and left NTLM as the only
authentication method.

Install http://ntlmaps.sourceforge.net/ and direct Net::HTTP through
that as a proxy.
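
A sketch of what pointing Net::HTTP at such a local proxy might look like. Everything specific here is an assumption: the proxy address and port (5865 is what I believe ntlmaps listens on by default, so check its config), the wiki host name, and the helper names wiki_connection and fetch_page.

```ruby
require 'net/http'

# Assumptions: the proxy runs locally on port 5865 and the wiki host name
# is a placeholder. Adjust both to the real setup.
PROXY_HOST = '127.0.0.1'
PROXY_PORT = 5865
WIKI_HOST  = 'wiki.example.com'

# Net::HTTP.new takes the proxy address and port as its 3rd and 4th arguments.
def wiki_connection
  Net::HTTP.new(WIKI_HOST, 80, PROXY_HOST, PROXY_PORT)
end

# Fetch one page body through the proxy (this is the part that needs a network).
def fetch_page(path)
  wiki_connection.start { |http| http.get(path).body }
end

conn = wiki_connection # no network traffic happens until #start
p conn.proxy?, conn.proxy_address, conn.proxy_port
```

The proxy handles the NTLM handshake, so the Ruby side stays plain HTTP and the response body never passes through a shell pipe.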
 
Phrogz

I would hazard a guess that if you took that 'b' off of the File.open,
you'd get the same bytes `` is returning?

I doubt it, but will try when I get into work. My understanding was
that (on Windows) opening a file without 'b' "helpfully" converts \n
bytes to \r\n pairs; the 'b' is needed to say "Hey, don't be munging
my data!".

But like I said, I'll give it a shot.
 
Phrogz

OK, so this has nothing to do with reading files from disk. The crazy
thing is that it isn't even deterministic! See the following:

C:\>type put_bytes.rb
print (0..12000).map{ |i| ((i % 255) + 1).chr }.join
$stdout.flush
sleep 1
$stdout.flush

C:\>type get_bytes.rb
p `ruby put_bytes.rb`.length

C:\>type multiget.bat
@echo off
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb

C:\>multiget.bat
944
696
944
1192
944
919
1192
1192
944
944
1192
1192
944
1167
1192
1192
944
1192
1192
1192

Note that it also does the above with or without the sleep, and with
or without the $stdout.flush calls.

What is going on here?!
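
A variation on the test above that might help narrow things down, written as a sketch and not yet run on the Windows box in question. It puts both ends in binary mode ($stdout.binmode in the child, 'rb' on the reading side via IO.popen) and repeats the run a few times, so any remaining nondeterminism shows up as differing lengths. The constant and method names are invented here, and it needs Ruby 2.3 or later for the <<~ heredoc.

```ruby
require 'rbconfig'

EXPECTED_LENGTH = 12_001 # (0..12000) yields 12001 bytes

# Child program: same byte pattern as put_bytes.rb, but with stdout switched
# to binary mode before writing.
CHILD = <<~'RUBY'
  $stdout.binmode
  $stdout.write((0..12000).map { |i| (i % 255) + 1 }.pack('C*'))
RUBY

# Run the child several times and collect the length received on each run.
def received_lengths(runs)
  Array.new(runs) do
    IO.popen([RbConfig.ruby, '-e', CHILD], 'rb') { |io| io.read }.bytesize
  end
end

lengths = received_lengths(5)
p lengths
p lengths.all? { |n| n == EXPECTED_LENGTH }
```

If every run reports 12001 with both ends in binary mode but not with one end left in text mode, that would point at text-mode translation on that end rather than at curl or the console.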
 
Peña, Botp

From: Phrogz
# OK, so this has nothing to do with reading files
# from disk. The crazy thing is that it isn't even
# deterministic! See the following:
# <snip>
#...
# What is going on here?!

can't help you there, but mine has a different yet consistent output...

C:\family\ruby>type put_bytes.rb
print (0..12000).map{ |i| ((i % 255) + 1).chr }.join
$stdout.flush
sleep 1
$stdout.flush

C:\family\ruby>type get_bytes.rb
p `ruby put_bytes.rb`.length

C:\family\ruby>type multi_get.bat
@echo off
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb
ruby get_bytes.rb

C:\family\ruby> multi_get.bat
348
348
348
348
348
348
348
348
348
348
348
348
348
348
348
348
348
348
348
348

C:\family\ruby>ver

Microsoft Windows XP [Version 5.1.2600]

C:\family\ruby>ruby -v
ruby 1.8.6 (2007-09-23 patchlevel 110) [i386-mswin32]

maybe we differ on the patchlevel?

kind regards -botp
 
