Reading a CSV file with UTF-16LE encoding

  • Thread starter Daniel de Angelis Cordeiro

Daniel de Angelis Cordeiro

Hi all,


I'm using ruby 1.9.2p0 and I'm trying to read a CSV file encoded in
UTF-16LE using the following script:

# encoding: utf-8
require 'csv'
CSV.foreach("file_path", :col_sep => ";", :encoding => "UTF-16LE:UTF-8") { |row|
  p row
}

When I run this, I get the following exception:

/usr/lib/ruby/1.9.1/csv.rb:2020:in `=~': invalid byte sequence in UTF-8 (ArgumentError)
from /usr/lib/ruby/1.9.1/csv.rb:2020:in `init_separators'
from /usr/lib/ruby/1.9.1/csv.rb:1570:in `initialize'
from /usr/lib/ruby/1.9.1/csv.rb:1335:in `new'
from /usr/lib/ruby/1.9.1/csv.rb:1335:in `open'
from /usr/lib/ruby/1.9.1/csv.rb:1201:in `foreach'
from test.rb:3:in `<main>'

The csv module reads a sample from the file (using IO.read(),
csv.rb:2309) and tries to match it against a Regexp of possible line
endings. This sample has its encoding forced to the encoding I chose
(UTF-8), but the result is sample.valid_encoding? == false. When the
regexp match takes place, it raises the exception shown above.


Am I missing something here, or is this a bug in the csv module?


Thanks in advance,
Daniel
 

James Edward Gray II

It does look like it's probably a bug. I think it only affects the
line-ending guessing, though, so set :row_sep manually to avoid it for
now. Sorry!

James Edward Gray II
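
For readers who want to try the workaround, here is a minimal sketch of setting :row_sep explicitly so that the line-ending guess is skipped. It assumes a current Ruby with the csv library available and uses keyword-style options; the temporary file, its contents, and the "\r\n" separator are made up for illustration and are not from the original report.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical sample data: a small semicolon-separated file written as
# UTF-16LE (no BOM) with CRLF line endings.
path = File.join(Dir.mktmpdir, "utf16le.csv")
File.binwrite(path, "name;city\r\nAna;São Paulo\r\n".encode("UTF-16LE"))

rows = []
# :row_sep is given explicitly, so csv does not need to sample the file
# to guess the line ending. "UTF-16LE:UTF-8" means: read UTF-16LE from
# disk and transcode to UTF-8 before parsing.
CSV.foreach(path, col_sep: ";", row_sep: "\r\n", encoding: "UTF-16LE:UTF-8") do |row|
  rows << row
end

p rows
```

This only helps when the line ending is known in advance; if it is not, it has to be detected some other way before calling CSV.foreach.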
 

Daniel de Angelis Cordeiro

Hi,

Exactly, setting :row_sep manually works.

Since I also don't know which line ending the file has, I was thinking
of using IO.readline() instead (maybe reading only 1024 bytes at a time,
as the csv code does) and looking at the end of the string to see which
line separator the file uses. I don't know if there is a more efficient
solution...

Thanks for the great work on the csv module! :)


Best regards,
Daniel
 

James Edward Gray II

Yeah, I'll fix that code to be more encoding friendly, but that's what
needs doing.

James Edward Gray II
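
The detection idea discussed above could be sketched roughly as follows. This is not the csv library's own code: guess_row_sep is a hypothetical helper, and the default encoding and the 1024-byte chunk size are assumptions taken from the discussion. It reads raw bytes, decodes them from the known source encoding, and looks for the first line ending.

```ruby
require 'tmpdir'

# Hypothetical helper (not part of the csv library): returns "\r\n", "\n",
# or "\r" for the first line ending found, or nil if there is none.
def guess_row_sep(path, encoding = "UTF-16LE", chunk_bytes = 1024)
  sample = String.new # binary buffer for the raw bytes read so far
  File.open(path, "rb") do |io|
    while (chunk = io.read(chunk_bytes))
      sample << chunk
      # Decode the bytes read so far; replace anything undecodable so the
      # regexp below always runs on a valid UTF-8 string.
      text = sample.dup.force_encoding(encoding)
                   .encode("UTF-8", invalid: :replace, undef: :replace)
      # A trailing "\r" may be the first half of "\r\n": read more first.
      next if text.end_with?("\r") && !io.eof?
      sep = text[/\r\n|\n|\r/]
      return sep if sep
    end
  end
  nil
end

# Made-up sample files, one with CRLF and one with LF line endings.
dir = Dir.mktmpdir
crlf_path = File.join(dir, "crlf.csv")
lf_path = File.join(dir, "lf.csv")
File.binwrite(crlf_path, "a;b\r\n1;2\r\n".encode("UTF-16LE"))
File.binwrite(lf_path, "a;b\n1;2\n".encode("UTF-16LE"))

p guess_row_sep(crlf_path)
p guess_row_sep(lf_path)
```

The returned value can then be passed as :row_sep to CSV.foreach. An even chunk size keeps UTF-16 code units from being split across reads; other multi-byte encodings may need a different chunking strategy.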
 
