Dir.entries and UTF-8

Timo Hoepfner

Hi,

What's going on here? This is on Mac OS X 10.4.4. Looks like
Dir#entries returns strings encoded with some encoding I didn't
expect. How can I convert the string to UTF8?

$KCODE='UTF8'
require 'jcode'
s="äöüßÄÖÜ"
puts s.split(//).inspect
# => ["ä", "ö", "ü", "ß", "Ä", "Ö", "Ü"]
test_dir="/tmp/test"
`mkdir #{test_dir}`
`touch #{test_dir}/#{s}`
f=Dir.entries(test_dir).last
puts f.split(//).inspect
# => ["a", "", "o", "", "u", "", "ß", "A", "", "O", "", "U", ""]

Timo
 
Yukihiro Matsumoto

Hi,

You have got a correct UTF-8 string. Unlike Windows XP, Mac OS X
decomposes character components as much as possible (Sorry I forgot
the correct term for this policy). So what you got:
# => ["a", "", "o", "", "u", "", "ß", "A", "", "O", "", "U", ""]

is the decomposed form of your string: a+umlaut, o+umlaut, etc.

matz.
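
To make the decomposition concrete, here is a minimal sketch. It assumes a modern Ruby (2.2 or later), where String#unicode_normalize is available; that method did not exist when this thread was written. The helper name hex_codepoints is made up for illustration.

```ruby
# Assumption: Ruby 2.2+ (String#unicode_normalize postdates this thread).

# Hypothetical helper: lists the code points of a string as "U+XXXX".
def hex_codepoints(str)
  str.codepoints.map { |cp| format("U+%04X", cp) }
end

# "äöüß" written with precomposed characters (NFC).
composed = "\u00E4\u00F6\u00FC\u00DF"

# NFD splits each umlaut into a base letter plus U+0308 (combining diaeresis).
# The sharp s has no canonical decomposition and stays as it is.
decomposed = composed.unicode_normalize(:nfd)

p hex_codepoints(composed)
# => ["U+00E4", "U+00F6", "U+00FC", "U+00DF"]
p hex_codepoints(decomposed)
# => ["U+0061", "U+0308", "U+006F", "U+0308", "U+0075", "U+0308", "U+00DF"]

# The two strings look the same on screen but are different code point sequences.
p composed == decomposed                          # => false
p composed == decomposed.unicode_normalize(:nfc)  # => true
```

The combining mark on its own has no visible base letter, which is why the inspected array appears to contain empty strings between the letters.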
 
A LeDonne

Matz refers to Unicode Normalization Form D (NFD). According to
http://developer.apple.com/technotes/tn/tn1150.html (HFS Plus Volume
Format):

"HFS Plus stores strings fully decomposed and in canonical order. HFS
Plus compares strings in a case-insensitive fashion. Strings may
contain Unicode characters that must be ignored by this comparison.
For more details on these subtleties, see Unicode Subtleties."

-A
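
One practical consequence of decomposed storage is that a file name typed in a script may not compare equal to the name read back from disk. A minimal sketch of a comparison that normalizes both sides first, assuming Ruby 2.2 or later; same_name? is a hypothetical helper, not part of any library, and it ignores the case-insensitivity mentioned in the technote:

```ruby
# Assumption: Ruby 2.2+ for String#unicode_normalize.

# Hypothetical helper: true if both names are equal once normalized to NFC.
def same_name?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

typed   = "\u00FCber.txt"   # precomposed u-umlaut, as typed in source code
on_disk = "u\u0308ber.txt"  # decomposed form, as a decomposing file system returns it

p typed == on_disk            # => false
p same_name?(typed, on_disk)  # => true
```

Normalizing both sides to the same form (NFC here, but NFD would work equally well) is what matters; which form is chosen is a design choice.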
 
Austin Ziegler

You have got a correct UTF-8 string. Unlike Windows XP, Mac OS X
decomposes character components as much as possible (Sorry I forgot
the correct term for this policy). So what you got:

IIRC, that's the correct term. (Decomposed.)

-austin
 
Timo Hoepfner

Hi Matz, Austin and A.

Thanks for the clarification. Unicode is more complex than it seems at
first...

Nevertheless, that doesn't solve my current problem. What I'm trying
to do is to organize files within a directory into subfolders based
on the first N characters of the file name. Here's my code (w/o error
handling), which works fine for 8-bit characters but doesn't work for
e.g. umlauts:

$KCODE = 'UTF8'
require 'jcode'
require 'pathname'
require 'fileutils'

# ARGV[0]: directory to organize; ARGV[1]: prefix length N
wd, len = Pathname.new(ARGV[0]), ARGV[1].to_i
files = wd.children.reject { |f| f.directory? }
files.each do |f|
  # The target folder is named after the first len characters of the file name.
  dir = wd + Pathname.new(f.basename.to_s.split(//)[0..len-1].join)
  dir.mkdir unless dir.exist?
  FileUtils.mv f, dir
end

I guess I have to recompose the decomposed filename somehow. Are
there any tools for that in the standard library or somewhere else?

Thanks for your help,

Timo
 
Timo Hoepfner

Hi,

to answer my own question, here's a solution. Use the 'unicode' gem
and change the line
dir = wd + Pathname.new(f.basename.to_s.split(//)[0..len-1].join)

to

dir = wd + Pathname.new(Unicode::compose(f.basename.to_s).split(//)[0..len-1].join)

Then it works.

Timo
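
For readers on a newer Ruby, the same idea can be sketched without the 'unicode' gem. This is not the code from the post above: it swaps Unicode::compose for the built-in String#unicode_normalize(:nfc), which assumes Ruby 2.2 or later, and prefix_folder_name is a hypothetical helper name.

```ruby
# Assumption: Ruby 2.2+ for String#unicode_normalize.

# Hypothetical helper: recompose the name (NFC), then take its first len characters.
def prefix_folder_name(name, len)
  name.unicode_normalize(:nfc).chars.first(len).join
end

# A decomposed file name such as a decomposing file system might return.
decomposed = "u\u0308bersicht.txt"

# Naive slicing keeps the base letter and its combining mark as two "characters".
p decomposed.chars.first(2).join == "u\u0308"     # => true

# After recomposition, the first two characters are u-umlaut and "b".
p prefix_folder_name(decomposed, 2) == "\u00FCb"  # => true
```

Recomposition only helps where a precomposed character exists. For sequences that have no composed form, slicing by grapheme clusters (String#each_grapheme_cluster in Ruby 2.5 and later) may be a better fit.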
 
