iconv transfer code

Pen Ttt

On my computer (Ubuntu 9.1 + Ruby 1.9):
pt@pt-laptop:~$ irb
irb(main):001:0> require 'iconv'
=> true
irb(main):002:0> str = Iconv.iconv('GBK', 'UTF-8', '我说').to_s
=> "[\"��˵\"]"

On my friend's computer (Ubuntu 9.1 + Ruby 1.9):
$ irb
irb(main):001:0> require 'iconv'
=> true
irb(main):002:0> str = Iconv.iconv('GBK', 'UTF-8', '我说').to_s
=> "\316\322\313\265"
irb(main):003:0> puts Iconv.iconv('UTF-8', 'GBK', str).to_s
我说
=> nil

What's wrong with my system?
 
Brian Candler

Pen said:
On my computer (Ubuntu 9.1 + Ruby 1.9):
pt@pt-laptop:~$ irb
irb(main):001:0> require 'iconv'
=> true
irb(main):002:0> str = Iconv.iconv('GBK', 'UTF-8', '我说').to_s
=> "[\"��˵\"]"

On my friend's computer (Ubuntu 9.1 + Ruby 1.9):
$ irb
irb(main):001:0> require 'iconv'
=> true
irb(main):002:0> str = Iconv.iconv('GBK', 'UTF-8', '我说').to_s
=> "\316\322\313\265"
irb(main):003:0> puts Iconv.iconv('UTF-8', 'GBK', str).to_s
我说
=> nil

What's wrong with my system?

One of the joys of ruby 1.9 is that the same program run on two
different machines can behave differently. That's even if the two
machines have identical versions of ruby and OS *and* you are feeding in
the same input data.

My advice is to stick with ruby 1.8.x, where the behaviour is both sane
and predictable. However there are other people who will vociferously
tell you that I am doing the entire ruby community a disservice by
recommending this to you. It's up to you whose advice to follow.

If you want to persevere with ruby 1.9, I suggest the following:

* Check you have exactly identical versions of 1.9 (check the
RUBY_DESCRIPTION constant) on both machines. The behaviour is subtle,
and a lot of it has changed.

* Look at str.bytes.to_a to see if the byte sequence is correct or not.
That is, the fact that irb displays the string wrongly or rightly
doesn't mean anything; don't trust what you see.

* Instead of using irb, write a .rb script, and run it from the command
line directly.

* Check the environments are the same on both. You could try
experimenting with setting LANG and/or LC_ALL environment variables
before starting ruby.

* I tried to understand how this all works, and I documented what I
found at http://github.com/candlerb/string19/blob/master/string19.rb

There are about 200 cases of encoding behaviour described there.

Also, it's possible to do what you're trying to do in ruby 1.9 without
using Iconv, but instead tagging str with its correct encoding, and then
using encode! to convert it to another. Whether it appears correctly on
the terminal or not, especially within irb, is still not something to
trust. Again, use str.bytes.to_a to see if it is the expected sequence
of bytes in the new encoding.
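
A minimal sketch of the Iconv-free approach described above, assuming Ruby 1.9 or later with the GBK transcoder available. The method names utf8_to_gbk and gbk_bytes_to_utf8 and the constant TEXT are made up for illustration; the checks go through bytes.to_a rather than terminal output, as suggested.

```ruby
# Build the text from Unicode escapes so the example does not depend on
# the source file's encoding. "\u6211\u8BF4" is the string from the
# original post.
TEXT = "\u6211\u8BF4"

# Convert a UTF-8 string to GBK with String#encode instead of Iconv.
def utf8_to_gbk(str)
  str.encode("GBK")
end

# Tag raw bytes as GBK (the bytes themselves are untouched), then
# transcode them to UTF-8.
def gbk_bytes_to_utf8(raw)
  raw.dup.force_encoding("GBK").encode("UTF-8")
end

gbk = utf8_to_gbk(TEXT)
p gbk.encoding      # the encoding tag after conversion
p gbk.bytes.to_a    # inspect the bytes instead of trusting the terminal
p gbk_bytes_to_utf8(gbk.bytes.to_a.pack("C*")) == TEXT
```

The byte list should be 206, 210, 203, 181, which is the friend's "\316\322\313\265" written in decimal.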

Good luck,

Brian.
 
Benoit Daloze

Hi,
One of the joys of ruby 1.9 is that the same program run on two
different machines can behave differently. That's even if the two
machines have identical versions of ruby and OS *and* you are feeding in
the same input data.

Please don't be so pessimistic without a real reason :)
(That said, show some code that gives a different result under the conditions
you described.)

Maybe what you're describing is caused by different revisions, but that
also happened in 1.8, no?

* Look at str.bytes.to_a to see if the byte sequence is correct or not.
That is, the fact that irb displays the string wrongly or rightly
doesn't mean anything; don't trust what you see.

Yes, that's true; irb still often gets encodings wrong.

B.D.
 
James Edward Gray II

One of the joys of ruby 1.9 is that the same program run on two
different machines can behave differently. That's even if the two
machines have identical versions of ruby and OS *and* you are feeding in
the same input data.

I'm pretty sure that's true with Ruby 1.8 as well. For example, don't
the encodings available to iconv vary depending on the platform?

James Edward Gray II
 
Brian Candler

Benoit said:
Please don't be so pessimistic without a real reason :)
(That said, show some code that gives a different result under the conditions
you described.)

Sure. Here's a simple one:

File.open("myfile.txt") do |f|
  line = f.gets
  line =~ /./
end

You can run this script on two machines, with the same version of OS and
ruby and the same myfile.txt but with different environment variable
settings, and get it to crash on one but not the other. (One way: if the
default external encoding on one machine is US-ASCII and myfile.txt
contains any byte with the top bit set.)

Maybe what you're describing is caused by different revisions, but that
also happened in 1.8, no?

This is intentional behaviour in ruby 1.9.
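
A hedged sketch of the failure mode described in this post, made independent of the environment by passing the external encoding explicitly instead of relying on LANG or LC_ALL. The helper names first_line and safe_match? are hypothetical, and the temporary file stands in for myfile.txt. It assumes Ruby 2.1 or later for Tempfile.create.

```ruby
require "tempfile"

# Read the first line of path with an explicit external encoding, so the
# result does not depend on the locale environment variables.
def first_line(path, encoding)
  File.open(path, "r:#{encoding}") { |f| f.gets }
end

# Returns true if the regexp match runs, false if Ruby rejects the
# string because its bytes are invalid for its encoding tag.
def safe_match?(line)
  line =~ /./
  true
rescue ArgumentError # e.g. "invalid byte sequence in US-ASCII"
  false
end

Tempfile.create("myfile") do |tmp|
  tmp.binmode
  tmp.write("\xE6\x88\x91\n") # three bytes, all with the top bit set
  tmp.flush

  ascii = first_line(tmp.path, "US-ASCII")
  utf8  = first_line(tmp.path, "UTF-8")
  p ascii.valid_encoding? # top-bit bytes are not valid US-ASCII
  p utf8.valid_encoding?
  p safe_match?(ascii)
  p safe_match?(utf8)
end
```

The same bytes are read both times; only the encoding tag differs, and that tag decides whether the match can run.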
 
Brian Candler

James said:
I'm pretty sure that's true with Ruby 1.8 as well. For example, don't
the encodings available to iconv vary depending on the platform?

Perhaps, but I was talking about an identical platform, O/S, and
installation of ruby - but different configured locale (such as LANG,
LC_CTYPE or LC_ALL environment variables).

Unless you write your ruby script defensively, it will behave
differently depending on those environment settings when everything else
is identical.
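
One way to read "defensively" here, as a sketch only: report the settings that can differ between two otherwise identical machines, and pin the default external encoding at start-up so later reads do not follow the locale. The method names encoding_report and pin_default_external! are invented for this example, and UTF-8 is an assumed choice.

```ruby
# Snapshot of the settings that can differ between two otherwise
# identical machines.
def encoding_report
  {
    "LANG"             => ENV["LANG"],
    "LC_CTYPE"         => ENV["LC_CTYPE"],
    "LC_ALL"           => ENV["LC_ALL"],
    "locale_charmap"   => Encoding.locale_charmap,
    "default_external" => Encoding.default_external.name
  }
end

# Fix the default external encoding for the rest of the process instead
# of inheriting it from the environment.
def pin_default_external!(name = "UTF-8")
  Encoding.default_external = name
end

p encoding_report          # compare this output across the two machines
pin_default_external!
p Encoding.default_external
```

Passing an explicit encoding to each File.open call is the narrower alternative if changing a process-wide default is not acceptable.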
 
James Edward Gray II

Perhaps, but I was talking about an identical platform, O/S, and
installation of ruby - but different configured locale (such as LANG,
LC_CTYPE or LC_ALL environment variables)

So your main complaint is that Ruby honors the settings of your environment?

James Edward Gray II
 
Benoit Daloze

So your main complaint is that Ruby honors the settings of your
environment?

James Edward Gray II

Beautiful, that one :D (I couldn't come up with a cool answer, so I waited
for somebody else to answer.)

Yeah, I think it's normal that it saves in an encoding that depends on the
environment.
And if you want something that doesn't depend on the environment, there are
many possibilities.

The easiest with File: File.open("myfile.ext", "w:UTF-8")
 
Brian Candler

Benoit said:
The easiest with File: File.open("myfile.ext", "w:UTF-8")

This is a poor example of the point in question, although a good example
of how hard ruby 1.9 is to understand.

In fact: the default external encoding is nil for files opened for
write, and does not depend on the environment at all. That is,

File.open("myfile.ext","w") { |f| f.puts str }

just outputs whatever bytes are in str, without meddling with them.
Whereas

File.open("myfile.ext","w:UTF-8") { |f| f.puts str}

will attempt to re-encode str from its current encoding to UTF-8, and
may raise an exception if it cannot do so.

So if you want to write programs which don't crash, the first is
arguably better.

The rules for *reading* from files are completely different, and indeed
"r:UTF-8" is the right thing to do if you are reading from a file which
contains UTF-8 text and you don't want this to be affected by
environment variable magic.
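
The difference between the two write modes can be checked by reading the files back as raw bytes. This is a sketch under stated assumptions: the helper write_and_read_bytes and the constant GBK_TEXT are hypothetical, and the input is deliberately tagged GBK so that transcoding to UTF-8 is possible and visibly changes the bytes.

```ruby
require "tmpdir"

# A string tagged GBK, built from Unicode escapes so the example does not
# depend on the source file's encoding.
GBK_TEXT = "\u6211\u8BF4".encode("GBK")

# Write str to a scratch file using the given open mode and return the
# bytes that actually ended up on disk.
def write_and_read_bytes(mode, str)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "myfile.ext")
    File.open(path, mode) { |f| f.write(str) }
    File.binread(path).bytes.to_a
  end
end

p write_and_read_bytes("w", GBK_TEXT)       # the bytes of str, untouched
p write_and_read_bytes("w:UTF-8", GBK_TEXT) # the bytes after transcoding
```

With "w" the file holds the four GBK bytes; with "w:UTF-8" it holds the six-byte UTF-8 form of the same two characters.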
 
botp

... File.open("myfile.ext","w:UTF-8") { |f| f.puts str}

will attempt to re-encode str from its current encoding to UTF-8, and
may raise an exception if it cannot do so.

good

So if you want to write programs which don't crash, the first is
arguably better.

we disagree there but what do you mean by "crash"?

best regards -botp
 
Brian Candler

Perhaps, but I was talking about an identical platform, O/S, and
So your main complaint is that Ruby honors the settings of your
environment?

My complaints are listed at
http://github.com/candlerb/string19/blob/master/soapbox.rb - but I guess
the main one is what the OP saw. Same program, same data, same ruby,
different behaviour.

Normally when analysing a program you only need to look at the program
and its input, but ruby 1.9 has extra "hidden" input data in the form of
environment variables which can alter your program's behaviour, or not,
depending on the content of the input data as well.

I wonder how many Ruby users are fully aware of which environment
variables influence POSIX locales, and which ones take precedence over
the others?
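
For reference, the POSIX lookup order for the character-type category is LC_ALL, then LC_CTYPE, then LANG, skipping values that are unset or empty. A small sketch of that rule follows; the method name effective_ctype_locale is made up, and the "C" fallback is an assumption, since POSIX leaves the default implementation-defined.

```ruby
# Resolve which locale name governs character handling, following the
# POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
# env is any Hash-like object; it defaults to the real environment.
def effective_ctype_locale(env = ENV)
  %w[LC_ALL LC_CTYPE LANG].each do |name|
    value = env[name]
    return value unless value.nil? || value.empty?
  end
  "C" # assumed fallback when nothing is set
end

p effective_ctype_locale
p effective_ctype_locale("LANG" => "en_US.UTF-8", "LC_ALL" => "C")
```

The second call shows how a stray LC_ALL silently overrides a LANG that looks correct.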

I also note that there is an effort underway to standardise the Ruby
language definition, and this has chosen 1.8.7 as its baseline.
 
Brian Candler

botp said:
we disagree there but what do you mean by "crash"?

I mean "raise an exception". The first example I wrote will never raise
an exception. The second can.

Code to demonstrate:

str = "\xff"
File.open("out1","w") { |f| f.puts str }
File.open("out2","w:UTF-8") { |f| f.puts str }

Line 2 will never raise an exception, regardless of the content or the
encoding of str, and regardless of environment variable settings. It
just writes the string to the file.

Line 3 may raise an exception. It does in this particular program
because str has data tagged as ASCII-8BIT which cannot be transcoded to
UTF-8.
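
A note for readers on later versions: from Ruby 2.0 on, source files default to UTF-8, so a bare "\xff" literal is tagged UTF-8 rather than ASCII-8BIT and the program above may not raise as written. The sketch below forces the tag explicitly to reproduce the behaviour described; the helper try_write and the constant BINARY are hypothetical names.

```ruby
require "tmpdir"

# Force the ASCII-8BIT tag instead of relying on the source encoding.
BINARY = "\xff".dup.force_encoding("ASCII-8BIT")

# Write str with the given mode; return :ok, or the class of the
# encoding-related exception that was raised.
def try_write(path, mode, str)
  File.open(path, mode) { |f| f.puts str }
  :ok
rescue EncodingError => e
  e.class
end

Dir.mktmpdir do |dir|
  p try_write(File.join(dir, "out1"), "w", BINARY)       # raw bytes written
  p try_write(File.join(dir, "out2"), "w:UTF-8", BINARY) # transcoding attempted
end
```

The second call is expected to report Encoding::UndefinedConversionError, because the byte 0xFF has no defined conversion from ASCII-8BIT to UTF-8.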
 
James Edward Gray II

Code to demonstrate:

str = "\xff"
File.open("out1","w") { |f| f.puts str }
File.open("out2","w:UTF-8") { |f| f.puts str }

Line 2 will never raise an exception, regardless of the content or the
encoding of str, and regardless of environment variable settings. It
just writes the string to the file.

That's grossly inaccurate. You may not have write permission to the
file, the volume you are trying to place the file on may be out of
space, etc.

These are more examples of how you could move the same code to a new
machine and have it fail. Ignoring the environment code runs in will
not make it go away.

James Edward Gray II
 
Brian Candler

James said:
That's grossly inaccurate. You may not have write permission to the
file, the volume you are trying to place the file on may be out of
space, etc.

Of course syscalls can fail due to insufficient resources and other
system-level problems. I'm talking about the normal flow of execution.

The point remains: Benoit said that one way to make your program immune
to influence from environment variables was to use
File.open("myfile.ext","w:UTF-8"). I was trying to highlight that advice
is incorrect, because the regular File.open("myfile.ext","w") is immune
to environment variables already. Furthermore, "w:UTF-8" can crash in
the normal flow under more circumstances than "w" - and those
circumstances depend on string contents and encodings, which _can_ be
affected by environment variables.
 
