Changing filenames from Greeklish => Greek (subprocess complain)

  • Thread starter Νικόλαος Κούρας

Νικόλαος Κούρας

On Wednesday, 5 June 2013 at 9:32:15 PM UTC+3, MRAB wrote:
' leads to that unknown encoding of this bytestream '\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3'

Actually, you were correct: I was typing Greek, and I saw the filename here in Google Groups as:

So maybe the filenames have to be decoded as Greek ISO, but then again they contain both Greek letters and extensions in English characters, like '.mp3'.

Using Python, I think you could get the filenames using os.listdir,
passing the directory name as a bytestring so that it'll return the
names as bytestrings.

Then, for each name, you could decode from its current encoding and
encode to UTF-8 and rename the file, passing the old and new paths to
os.rename as bytestrings.

I am not sure I follow.

Change this:

# Compute a set of current fullpaths
fullpaths = set()
path = "/home/nikos/public_html/data/apps/"

for root, dirs, files in os.walk(path):
    for fullpath in files:
        fullpaths.add( os.path.join(root, fullpath) )


to what, so that the full URL becomes readable by files.py?
 
Νικόλαος Κούρας


MRAB, can you please explain your idea for a solution more clearly?
 
Νικόλαος Κούρας

Can someone else explain to me what MRAB is trying to tell me?
Is there a way, even if we don't know which encoding turned the filenames into these bytestreams, to still open the Greek filenames?
 
jmfauth

On Wednesday, 5 June 2013 at 8:56:36 AM UTC+3, Steven D'Aprano wrote:

Somehow, I don't know how because I didn't see it happen, you have one or
more files in that directory where the file name as bytes is invalid when
decoded as UTF-8, but your system is set to use UTF-8. So to fix this you
need to rename the file using some tool that doesn't care quite so much
about encodings. Use the bash command line to rename each file in turn
until the problem goes away.

But renaming via shell access, like mv 'Euxi tou Ihsou.mp3' 'Ευχή του Ιησου.mp3', leads to this bytestream in an unknown encoding: '\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3'

But please tell me, Steven, which Linux tool do you think can re-encode the weird filename to a proper UTF-8 'Ευχή του Ιησου.mp3'?

Or we can write a script, as I suggested, that tries decoding the bytestream with all sorts of available charsets until it boils down to the original Greek letters.
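
The "try several charsets" idea can be sketched in Python roughly as follows. This is only an illustration and was not posted in the thread: the helper name candidate_decodings and the list of encodings are assumptions. A successful decode does not prove the guess is right, because single-byte codecs such as Latin-1 accept any byte sequence, so a person still has to look at the candidates and pick the readable one.

```python
# Raw file name bytes, as shown by ls in octal escapes (without the backslash-space).
RAW_NAME = b"\305\365\367\336 \364\357\365 \311\347\363\357\375.mp3"


def candidate_decodings(raw, encodings=("utf-8", "iso-8859-7", "cp1253", "latin-1")):
    """Hypothetical helper: return {encoding: text} for every encoding that decodes raw."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this encoding
    return results


candidates = candidate_decodings(RAW_NAME)
# UTF-8 is rejected outright; the Greek codecs give readable Greek,
# while Latin-1 "succeeds" but yields mojibake.
```

Inspecting candidates shows why the choice cannot be fully automated: several encodings decode without error, and only one of the results is the intended Greek text.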

---------------

see http://bugs.python.org/issue13643, msg149949, author: Antoine Pitrou (pitrou)


Quote:

So, you're complaining about something which works, kind of:

$ touch héhé
$ LANG=C python3 -c "import os; print(os.listdir())"
['h\udcc3\udca9h\udcc3\udca9']
This makes robustly working with non-ascii filenames on different
platforms needlessly annoying, given no modern nix should have problems
just using UTF-8 in these cases.

So why don't these supposedly "modern" systems at least set the
appropriate environment variables for Python to infer the proper
character encoding?
(since these "modern" systems don't have a well-defined encoding...)

Answer: because they are not modern at all, they are antiquated,
inadapted and obsolete pieces of software designed and written by
clueless Anglo-American people. Please report bugs against these
systems. The culprit is not Python, it's the Unix crap and the utterly
clueless attitude of its maintainers ("filesystems are just bytes",
yeah, whatever...).

jmf
 
Νικόλαος Κούρας

Yes, this is a Linux issue, although the locale is set to UTF-8:

root@nikos [~]# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
root@nikos [~]#


Since the locale is set to UTF-8, why does:

mv 'Euxi tou Ihsou.mp3' 'Ευχή του Ιησού.mp3'

lead to this bytestream in an unknown encoding: '\305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3'

which isn't a UTF-8 bytestream, as the locale indicates and Python expects?

How is 'files.py' supposed to read this file now, using:

# Compute a set of current fullpaths
fullpaths = set()
path = "/home/nikos/public_html/data/apps/"

for root, dirs, files in os.walk(path):
    for fullpath in files:
        fullpaths.add( os.path.join(root, fullpath) )

????
 
Heiko Wundram

On 05.06.2013 at 18:44, MRAB wrote:
From the previous posts I guessed that the filename might be encoded
using ISO-8859-7:

'Ευχή\ του\ Ιησού.mp3'

Yes, that looks the same.

Most probably, his terminal is set to ISO-8859-7, so that when he issues
the rename command on the command-line of his shell session, the "mv"
command gets a stream of bytes as the new file name which happens to be
the ISO-8859-7 encoding of the file name he'd like the file to have.
This is what's stored on disk.

So, his biggest problem isn't that the operating system is encoding
agnostic wrt. filenames (i.e., treats them as a stream of bytes), but
rather that he's using an ISO-7 terminal window when having set up UTF-8
as his operating system locale and expects filenames to be encoded in
UTF-8 when he's not passing in UTF-8 byte streams from his client
computer at all.
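
The mismatch described above can be reproduced in a few lines of Python. This is a sketch for illustration only; the variable names and the octal_escape helper are assumptions, not part of anyone's script in this thread.

```python
# What the user meant to type at the shell prompt.
intended = "Ευχή του Ιησού.mp3"

# Bytes an ISO-8859-7 terminal hands to mv, versus what a UTF-8 terminal would send.
sent_by_greek_terminal = intended.encode("iso-8859-7")
sent_by_utf8_terminal = intended.encode("utf-8")


def octal_escape(raw):
    """Render non-ASCII bytes as \\ooo escapes, roughly the way ls shows them."""
    return "".join(chr(b) if b < 0x80 else "\\{:03o}".format(b) for b in raw)


print(octal_escape(sent_by_greek_terminal))

# The ISO-8859-7 bytes are not valid UTF-8, so a UTF-8 system cannot decode them.
try:
    sent_by_greek_terminal.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```

The printed escapes are the same byte values that appear in the directory listing, which supports the ISO-8859-7 terminal explanation.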
 
Mark Lawrence

Sure. You tell me what a proper Unicode rendition of an animated GIF is.

ChrisA

It's obviously one that doesn't use the flawed Python Flexible String
Representation :)

--
"Steve is going for the pink ball - and for those of you who are
watching in black and white, the pink is next to the green." Snooker
commentator 'Whispering' Ted Lowe.

Mark Lawrence
 
Cameron Simpson

| On Wednesday, 5 June 2013 at 9:32:15 PM UTC+3, MRAB wrote:
| > Using Python, I think you could get the filenames using os.listdir,
| > passing the directory name as a bytestring so that it'll return the
| > names as bytestrings.
|
| > Then, for each name, you could decode from its current encoding and
| > encode to UTF-8 and rename the file, passing the old and new paths to
| > os.rename as bytestrings.
|
| Iam not sure i follow:
|
| Change this:
|
| # Compute a set of current fullpaths
| fullpaths = set()
| path = "/home/nikos/public_html/data/apps/"
|
| for root, dirs, files in os.walk(path):
[...]

Have a read of this:

http://docs.python.org/3/library/os.html#os.listdir

The UNIX API accepts bytes for filenames and paths.

Python 3 strs are sequences of Unicode code points. If you try to
open a file or directory on a UNIX system using a Python str, that
string must be converted to a sequence of bytes before being handed
to the OS.

This is done implicitly using your locale settings if you just use a str.

However, if you pass a bytes to open or listdir, this conversion
does not take place. You put bytes in and in the case of listdir
you get bytes out.

You can work on pathnames in bytes and never concern yourself with
encode/decode at all.

In this way you can write code that does not care about the translation
between Unicode and some arbitrary byte encoding.

Of course, the issue will still arise when accepting user input;
your shell has done exactly this kind of thing when you renamed
your MP3 file. But it is possible to write pure utility code that
doesn't care about filenames as Unicode or str if you work purely
in bytes.

Regarding user filenames, the common policy these days is to use
utf-8 throughout. Of course you need to get everything into that
regime to start with.
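
A minimal sketch of the "bytes in, bytes out" behaviour, assuming a Linux filesystem that accepts arbitrary bytes in file names (some other platforms refuse names that are not valid UTF-8). The temporary directory and the sample name are made up for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)
    raw_name = "Ευχή.mp3".encode("iso-8859-7")  # deliberately not valid UTF-8
    with open(os.path.join(tmp_bytes, raw_name), "wb"):
        pass  # create an empty file with that raw name

    names_as_bytes = os.listdir(tmp_bytes)  # bytes path in, bytes names out
    names_as_str = os.listdir(tmp)          # str path in, str names out

print(names_as_bytes)
# ascii() avoids printing problems: the str form may contain lone surrogates.
print([ascii(name) for name in names_as_str])
```

With a bytes path the name comes back untouched, so no decoding step can fail. With a str path Python has to decode the name itself, and os.fsencode() is the documented way to get the original bytes back from that str.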
 
Νικόλαος Κούρας

On Thursday, 6 June 2013 at 11:50:55 AM UTC+3, Heiko Wundram wrote:
On 05.06.2013 at 18:44, MRAB wrote:

Most probably, his terminal is set to ISO-8859-7, so that when he issues
the rename command on the command-line of his shell session, the "mv"
command gets a stream of bytes as the new file name which happens to be
the ISO-8859-7 encoding of the file name he'd like the file to have.
This is what's stored on disk.

So, his biggest problem isn't that the operating system is encoding
agnostic wrt. filenames (i.e., treats them as a stream of bytes), but
rather that he's using an ISO-7 terminal window when having set up UTF-8
as his operating system locale and expects filenames to be encoded in
UTF-8 when he's not passing in UTF-8 byte streams from his client
computer at all.

(e-mail address removed) [~/www/data/apps]# ls -l | file -
/dev/stdin: ASCII text


# Compute a set of current fullpaths
fullpaths = set()
path = "/home/nikos/public_html/data/apps/"

for root, dirs, files in os.walk(path):
    for fullpath in files:
        fullpaths.add( os.path.join(root, fullpath) )

----------------------------
[Thu Jun 06 13:34:19 2013] [error] [client 79.103.41.173] cur.execute('''SELECT url FROM files WHERE url = %s''', fullpath.encode('iso-8859-7') )
[Thu Jun 06 13:34:19 2013] [error] [client 79.103.41.173] File "/usr/local/lib/python3.3/encodings/iso8859_7.py", line 12, in encode
[Thu Jun 06 13:34:19 2013] [error] [client 79.103.41.173] return codecs.charmap_encode(input,errors,encoding_table)
[Thu Jun 06 13:34:19 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'charmap' codec can't encode characters in position 34-37: character maps to <undefined>


[Thu Jun 06 13:27:17 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
[Thu Jun 06 13:27:17 2013] [error] [client 79.103.41.173] File "files.py", line 73, in <module>
[Thu Jun 06 13:27:17 2013] [error] [client 79.103.41.173] cur.execute('''SELECT url FROM files WHERE url = %s''', fullpath.decode('iso-8859-7') )
[Thu Jun 06 13:27:17 2013] [error] [client 79.103.41.173] AttributeError: 'str' object has no attribute 'decode'

The same happens when I encode in Latin.
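
Both tracebacks can be reproduced without touching the disk. The sketch below is an illustration under stated assumptions: it simulates what os.walk() returns for a str path on a system whose filesystem encoding is UTF-8, where each undecodable byte is turned into a lone surrogate (the "surrogateescape" error handler from PEP 383). The shortened sample name is made up.

```python
raw = b"\xc5\xf5\xf7\xde.mp3"  # ISO-8859-7 bytes for a Greek name, as stored on disk

# Roughly what os.walk() yields for a str path under a UTF-8 locale:
# every byte that is not valid UTF-8 becomes a lone surrogate code point.
as_seen_by_python = raw.decode("utf-8", "surrogateescape")

# In Python 3 only bytes has .decode(); str only has .encode().
str_has_decode = hasattr(as_seen_by_python, "decode")

# Lone surrogates are not Greek letters, so ISO-8859-7 cannot encode them.
try:
    as_seen_by_python.encode("iso-8859-7")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

print(ascii(as_seen_by_python), str_has_decode, encode_error)
```

So the AttributeError comes from calling .decode() on a str, and the "character maps to <undefined>" error comes from trying to encode the surrogate placeholders rather than real Greek characters.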
 
Heiko Wundram

On 06.06.2013 at 12:35, Νικόλαος Κούρας wrote:
(e-mail address removed) [~/www/data/apps]# ls -l | file -
/dev/stdin: ASCII text

Did you actually try to understand what I wrote?
 
Νικόλαος Κούρας

Heiko, the SSH client I used to 'mv' the .mp3 was PuTTY. Do you mean that PuTTY is responsible for the encoding mess?


 
Heiko Wundram

On 06.06.2013 at 13:00, Νικόλαος Κούρας wrote:
Heiko, the SSH client I used to 'mv' the .mp3 was PuTTY. Do you mean that PuTTY is responsible for the encoding mess?

Exactly. Check the encoding that putty uses for the terminal session. If
it doesn't use UTF-8, switch your terminal session to UTF-8 and try the
rename again. If it does, try to use another terminal client (I
recommend the Cygwin-Suite).
 
Νικόλαος Κούρας

On Thursday, 6 June 2013 at 1:24:16 PM UTC+3, Cameron Simpson wrote:
[...]
However, if you pass a bytes to open or listdir, this conversion
does not take place. You put bytes in and in the case of listdir
you get bytes out.
[...]

So I need to use os.listdir() to grab those filenames as bytes. Okay.

So, by changing this:

fullpaths = set()
path = "/home/nikos/public_html/data/apps/"

for root, dirs, files in os.walk(path):
    for fullpath in files:
        fullpaths.add( os.path.join(root, fullpath) )

to this:

# Compute a set of current fullpaths
fullpaths = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for fullpath in fullpaths:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (fullpath,) )
        data = cur.fetchone() # URL is unique, so there should only be one


-----------------------------
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173] Original exception was:
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173] File "files.py", line 67, in <module>
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173] cur.execute('''SELECT url FROM files WHERE url = %s''', (fullpath,) )
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173] File "/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py", line 108, in execute
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173] query = query.encode(charset)
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 35: surrogates not allowed
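
The "surrogates not allowed" message means the str handed to the database driver still carries placeholders for bytes that were not valid UTF-8 (here '\udcc5' stands for byte 0xC5). One way to turn such a name into clean text is sketched below. The helper recover_name is hypothetical, and it assumes that every name is really either UTF-8 or ISO-8859-7 and that the filesystem encoding is UTF-8.

```python
def recover_name(name, fallback="iso-8859-7"):
    """Hypothetical helper: turn a surrogate-escaped file name into clean text."""
    # Undo the escaping to get back the exact bytes stored on disk.
    raw = name.encode("utf-8", "surrogateescape")
    try:
        return raw.decode("utf-8")    # name was fine all along
    except UnicodeDecodeError:
        return raw.decode(fallback)   # assume the legacy Greek encoding


# Simulate what os.listdir() with a str path returns for the badly named file.
escaped = b"\xc5\xf5\xf7\xde \xf4\xef\xf5 \xc9\xe7\xf3\xef\xfd.mp3".decode(
    "utf-8", "surrogateescape"
)

clean = recover_name(escaped)             # real Greek text, safe to encode as UTF-8
unchanged = recover_name("Monopoly.exe")  # ordinary names pass through untouched
```

This only cleans up the text used for the database query. The file on disk keeps its old bytes until it is actually renamed.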
 
Νικόλαος Κούρας

On Thursday, 6 June 2013 at 2:09:22 PM UTC+3, Heiko Wundram wrote:
On 06.06.2013 at 13:00, Νικόλαος Κούρας wrote:

Exactly. Check the encoding that putty uses for the terminal session. If
it doesn't use UTF-8, switch your terminal session to UTF-8 and try the
rename again. If it does, try to use another terminal client (I
recommend the Cygwin-Suite).

Okay, indeed it was using the Greek ISO encoding. I changed it to UTF-8 and reopened the terminal session.

(e-mail address removed) [~/www/data/apps]# mv *.mp3 'Ευχή του Ιησού.mp3'
mv: `\305\365\367\336 \364\357\365 \311\347\363\357\375.mp3' and `\305\365\367\336 \364\357\365 \311\347\363\357\375.mp3' are the same file
(e-mail address removed) [~/www/data/apps]# mv *.mp3 'Ευχή του Ιησού.mp33'
(e-mail address removed) [~/www/data/apps]# mv *.mp33 'Ευχή του Ιησού.mp3'
(e-mail address removed) [~/www/data/apps]# ls -l
total 368548
drwxr-xr-x 2 nikos nikos 4096 Jun 6 14:22 ./
drwxr-xr-x 6 nikos nikos 4096 May 26 21:13 ../
-rwxr-xr-x 1 nikos nikos 13157283 Mar 17 12:57 100\ Mythoi\ tou\ Aiswpou.pdf*
-rwxr-xr-x 1 nikos nikos 29524686 Mar 11 18:17 Anekdotologio.exe*
-rw-r--r-- 1 nikos nikos 42413964 Jun 2 20:29 Battleship.exe
-rw-r--r-- 1 nikos nikos 236032 Jun 4 14:10 \323\352\335\370\357\365\ \335\355\341\355\ \341\361\351\350\354\374.exe
-rwxr-xr-x 1 nikos nikos 66896732 Mar 17 13:13 Kosmas\ o\ Aitwlos\ -\ Profiteies.pdf*
-rw-r--r-- 1 nikos nikos 51819750 Jun 2 20:04 Luxor\ Evolved.exe
-rw-r--r-- 1 nikos nikos 60571648 Jun 2 14:59 Monopoly.exe
-rw-r--r-- 1 nikos nikos 3511233 Jun 4 14:11 \305\365\367\336\ \364\357\365\ \311\347\363\357\375.mp3
-rwxr-xr-x 1 nikos nikos 1788164 Mar 14 11:31 Online\ Movie\ Player.zip*
-rw-r--r-- 1 nikos nikos 5277287 Jun 1 18:35 O\ Nomos\ tou\ Merfy\ v1-2-3..zip
-rwxr-xr-x 1 nikos nikos 16383001 Jun 22 2010 Orthodoxo\ Imerologio.exe*
-rw-r--r-- 1 nikos nikos 6084806 Jun 1 18:22 Pac-Man.exe
-rw-r--r-- 1 nikos nikos 25476584 Jun 2 19:50 Scrabble.exe
-rwxr-xr-x 1 nikos nikos 49141166 Mar 17 12:48 To\ 1o\ mou\ vivlio\ gia\ to\ skaki.pdf*
-rwxr-xr-x 1 nikos nikos 3298310 Mar 17 12:45 Vivlos\ gia\ Atheofovous.pdf*
-rw-r--r-- 1 nikos nikos 1764864 May 29 21:50 V-Radio\ v2.4.msi
(e-mail address removed) [~/www/data/apps]# ls *.mp3 | file -
/dev/stdin: ASCII text
(e-mail address removed) [~/www/data/apps]#

Still the same error.
 
Νικόλαος Κούρας

# Compute a set of current fullpaths
fullpaths = os.listdir( '/home/nikos/public_html/data/apps/' )

# Load'em
for fullpath in fullpaths:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', fullpath.encode('utf-8') )
        data = cur.fetchone() # URL is unique, so there should only be one

        print( fullpath.encode('utf-8') )


Now why does this not print out the filenames when iterating in the for loop?
One step forward is that when I run it like this, no error is displayed in the error log.

Please help, I have tried os.listdir() as Cameron suggested.
 
Heiko Wundram

On 06.06.2013 at 13:24, Νικόλαος Κούρας wrote:
(e-mail address removed) [~/www/data/apps]# ls *.mp3 | file -
/dev/stdin: ASCII text

Again, did you actually read (and try to understand) what I wrote? I
said to redo the rename after you change your terminal session to UTF-8.
 
MRAB

On Wednesday, 5 June 2013 at 9:43:18 PM UTC+3, Νικόλαος Κούρας wrote:

MRAB, can you please explain your idea for a solution more clearly?

I was suggesting a way to rename the files so that their names are
encoded in UTF-8 (they appear to be encoded in ISO-8859-7).

You MUST TEST IT thoroughly first, of course, before trying it on the
actual files.

It could go something like this:


import os

# Give the path as a bytestring so that we'll get the names as bytestrings.
root_folder = b"/home/nikos/public_html/data/apps/"

# Setting TESTING to True will make it print out what renamings it will do,
# but not actually do them.
TESTING = True

# Walk through the files.
for root, dirs, files in os.walk(root_folder):
    for name in files:
        try:
            # Is this name encoded in UTF-8?
            name.decode("utf-8")
        except UnicodeDecodeError:
            # Decoding from UTF-8 failed, which means that the name is not
            # valid UTF-8.

            # It appears (from elsewhere) that the names are encoded in
            # ISO-8859-7, so decode from that and re-encode to UTF-8.
            new_name = name.decode("iso-8859-7").encode("utf-8")

            old_path = os.path.join(root, name)
            new_path = os.path.join(root, new_name)
            if TESTING:
                print("Will rename {!r} to {!r}".format(old_path, new_path))
            else:
                print("Renaming {!r} to {!r}".format(old_path, new_path))
                os.rename(old_path, new_path)
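
One way to follow the "test it thoroughly first" advice is to wrap the same logic in a function and run it against a throwaway directory. The sketch below is a variation for illustration, not MRAB's posted code: the function name fix_names, the dry_run flag and the refusal to overwrite existing files are assumptions, and the demo assumes a Linux filesystem that accepts arbitrary bytes in names.

```python
import os
import tempfile


def fix_names(root_folder, source_encoding="iso-8859-7", dry_run=True):
    """Rename files under root_folder (bytes) whose names are not valid UTF-8.

    Returns the list of (old_path, new_path) pairs that were found.
    """
    planned = []
    for root, dirs, files in os.walk(root_folder):
        for name in files:
            try:
                name.decode("utf-8")
                continue  # already valid UTF-8, leave it alone
            except UnicodeDecodeError:
                new_name = name.decode(source_encoding).encode("utf-8")
            old_path = os.path.join(root, name)
            new_path = os.path.join(root, new_name)
            if os.path.exists(new_path):
                continue  # never overwrite an existing file
            planned.append((old_path, new_path))
            if not dry_run:
                os.rename(old_path, new_path)
    return planned


# Try it on a scratch directory before going near the real files.
with tempfile.TemporaryDirectory() as tmp:
    tmp_b = os.fsencode(tmp)
    for raw in ("Ευχή.mp3".encode("iso-8859-7"), b"Monopoly.exe"):
        open(os.path.join(tmp_b, raw), "wb").close()
    preview = fix_names(tmp_b, dry_run=True)   # reports, changes nothing
    done = fix_names(tmp_b, dry_run=False)     # performs the rename
    after = sorted(os.listdir(tmp_b))
```

Comparing preview with done confirms that the dry run reports exactly what the real run does, and after shows that only the non-UTF-8 name was changed.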
 
Νικόλαος Κούρας

First of all, thank you for helping me, MRAB.
After making some alterations to your code, I have this:

----------------------------------------
import os
import sys

# Give the path as a bytestring so that we'll get the filenames as bytestrings
path = b"/home/nikos/public_html/data/apps/"

# Setting TESTING to True will make it print out what renamings it will do, but not actually do them
TESTING = True

# Walk through the files.
for root, dirs, files in os.walk( path ):
    for filename in files:
        try:
            # Is this name encoded in UTF-8?
            filename.decode('utf-8')
        except UnicodeDecodeError:
            # Decoding from UTF-8 failed, which means that the name is not valid UTF-8
            # It appears that the filenames are encoded in ISO-8859-7, so decode from that and re-encode to UTF-8
            new_filename = filename.decode('iso-8859-7').encode('utf-8')

            old_path = os.path.join(root, filename)
            new_path = os.path.join(root, new_filename)
            if TESTING:
                print( '''<br>Will rename {!r} ---> {!r}<br><br>'''.format( old_path, new_path ) )
            else:
                print( '''<br>Renaming {!r} ---> {!r}<br><br>'''.format( old_path, new_path ) )
                os.rename( old_path, new_path )
sys.exit(0)
-------------------------

and the output can be seen here: http://superhost.gr/cgi-bin/files.py

We are in test mode, so I don't know what the encodings will be once the renaming actually takes place.

Shall I switch off test mode and try it for real?
 
Steven D'Aprano

On Tuesday, 4 June 2013 at 11:47:01 AM UTC+3, Steven D'Aprano wrote:
Please run these commands, and show what result they give:
[...]
(e-mail address removed) [~/www/data/apps]# alias ls
alias ls='/bin/ls $LS_OPTIONS'

And what does

echo $LS_OPTIONS


give?

[...]
Seems that the way the system used to actually rename the file matters.

Yes. This is where you get interactions between different systems that
use different encodings, and they don't work well together.

Some day, everything will use UTF-8, and these problems will go away.

Yes, but why are you doing it in 2 steps and not as:

mv *Ο.mp3 'Eυχή του Ιησού.mp3'

I don't remember. I had a reason that made sense at the time, but I can't
remember what it was.


I think I can reproduce your problem. If I open a terminal, set to use
UTF-8, I can do this:

[steve@ando ~]$ cd /tmp
[steve@ando tmp]$ touch '999-Eυχή-του-Ιησού'
[steve@ando tmp]$ ls 999*
999-Eυχή-του-Ιησού


Now if I change the terminal to use Greek ISO-8859-7, and hit UP-ARROW to
grab the previous command line from history, the *displayed* file name
changes, but the actual file being touched remains the same:

[steve@ando tmp]$ touch '999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ'
[steve@ando tmp]$ ls 999*
999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ


In Python 3.3, I can demonstrate the same sort of thing:

py> s = '999-Eυχή-του-Ιησού'
py> bytes_as_utf8 = s.encode('utf-8')
py> t = bytes_as_utf8.decode('iso-8859-7', errors='replace')
py> print(t)
999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ


So that demonstrates part of your problem: even though your Linux system
is using UTF-8, your terminal is probably set to ISO-8859-7. The
interaction between these will lead to strange and disturbing Unicode
errors.


To continue, back in the terminal set to ISO-8859-7, if instead of using
the history line, if I re-copy and paste the file name:

[steve@ando tmp]$ touch '999-Eυχή-του-Ιησού'
[steve@ando tmp]$ ls 999*
999-E???-???-????? 999-EΟΟΞ�-ΟΞΏΟ-ΞΞ·ΟΞΏΟ


So now I end up with two files, one with a file name that is utter
garbage bytes, and one that is only a little better, being mojibake.

Resetting the terminal to use UTF-8 at least now restores the *display*
of the earlier file's name:

[steve@ando tmp]$ ls 999*
999-E???-???-????? 999-Eυχή-του-ΙησοÏ
[steve@ando tmp]$ ls -b 999*
999-E\365\367\336-\364\357\365-\311\347\363\357\375 999-Eυχή-του-ΙησοÏ

but the other file name is still made of garbage bytes.
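
The "garbage bytes" file can be modelled in Python as well. This is a sketch under the assumption that the second file's name was stored as ISO-8859-7 bytes; the variable names are illustrative.

```python
name = "999-Eυχή-του-Ιησού"  # the leading 'E' is a Latin letter, as in the session above

# Bytes written to disk when the name is typed into an ISO-8859-7 terminal.
stored = name.encode("iso-8859-7")

# A UTF-8 terminal cannot decode those bytes, so each one is shown as a placeholder.
shown = stored.decode("utf-8", errors="replace")

# Decoding with the encoding that was really used recovers the intended name.
recovered = stored.decode("iso-8859-7")
```

Here shown contains one replacement character per Greek letter, which corresponds to the question marks in the ls output, while recovered equals the original name. The bytes on disk are intact; only the choice of decoder is wrong.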


So I believe I understand how your file name has become garbage. To fix
it, make sure that your terminal is set to use UTF-8, and then rename it.
Do the same with every file in the directory until the problem goes away.

(If one file has garbage bytes in the file name, chances are that more
than one do.)
 
