Unicode Chars in Windows Path


Steve

Hi All,

I'm in need of some encoding/decoding help for a situation for a Windows Path that contains Unicode characters in it.

---- CODE ----

import os.path
import codecs
import sys

All_Tests = [u"c:\automation_common\Python\TestCases\list_dir_script.txt"]


for curr_test in All_Tests:
    print("\n raw : " + repr(curr_test) + "\n")
    print("\n encode : %s \n\n" ) % os.path.normpath(codecs.encode(curr_test, "ascii"))
    print("\n decode : %s \n\n" ) % curr_test.decode('string_escape')

---- CODE ----


Screen Output :

raw : u'c:\x07utomation_common\\Python\\TestCases\\list_dir_script.txt'

encode : c:utomation_common\Python\TestCases\list_dir_script.txt

decode : c:utomation_common\Python\TestCases\list_dir_script.txt


My goal is to have the properly formatted path in the output:


c:\automation_common\Python\TestCases\list_dir_script.txt


What is the "magic" encode/decode sequence here??

Thanks!

Steve
 

Steven D'Aprano

Steve said:
Hi All,

I'm in need of some encoding/decoding help for a situation for a Windows
Path that contains Unicode characters in it.

---- CODE ----

import os.path
import codecs
import sys

All_Tests = [u"c:\automation_common\Python\TestCases\list_dir_script.txt"]

I don't think this has anything to do with Unicode encoding or decoding.
In Python string literals, the backslash makes the next character
special. So \n makes a newline, \t makes a tab, and so forth. Only if the
character being backslashed has no special meaning does Python give you a
literal backslash:

py> print("x\tx")
x x
py> print("x\Tx")
x\Tx


In this case, \a has special meaning, and is converted to the ASCII BEL
control character:

py> u"...\automation"
u'...\x07utomation'


When working with Windows paths, you should make a habit of either
escaping every backslash:

u"c:\\automation_common\\Python\\TestCases\\list_dir_script.txt"

using a raw-string:

ur"c:\automation_common\Python\TestCases\list_dir_script.txt"

or just use forward slashes:

u"c:/automation_common/Python/TestCases/list_dir_script.txt"


Windows accepts both forward and backslashes in file names.


If you fix that issue, I expect your problem will go away.
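
A minimal sketch of the three spellings above, assuming Python 3 (where the ur prefix no longer exists, so a plain r prefix is used instead). It checks that the forms name the same path and shows what \a does in a non-raw literal. The variable names are only illustrative; ntpath is the standard library's Windows path module and can be imported on any platform.

```python
import ntpath

# Escaped backslashes and a raw string produce the same text.
escaped = "c:\\automation_common\\Python\\TestCases\\list_dir_script.txt"
raw = r"c:\automation_common\Python\TestCases\list_dir_script.txt"
forward = "c:/automation_common/Python/TestCases/list_dir_script.txt"

# In a normal (non-raw) literal, \a is the ASCII BEL control character (0x07).
broken = "c:\automation"

print(escaped == raw)                   # True
print(ntpath.normpath(forward) == raw)  # True: normpath turns / into \
print(repr(broken))                     # 'c:\x07utomation'
```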
 

Chris Angelico

Steven D'Aprano said:
Windows accepts both forward and backslashes in file names.

Small clarification: The Windows *API* accepts both types of slash
(you can open a file using forward slashes, for instance), but not all
Windows *applications* are aware of this (generally only
cross-platform ones take notice of this), and most Windows *users*
prefer backslashes. So when you come to display a Windows path, you
may want to convert to backslashes. But that's for display.
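
For the display step, here is a small sketch assuming Python 3's pathlib is available. display_path is a made-up helper name, not something from the standard library. PureWindowsPath only manipulates the text of the path, so this runs on any platform.

```python
from pathlib import PureWindowsPath


def display_path(path):
    """Return a Windows path spelled with backslashes, for showing to users."""
    return str(PureWindowsPath(path))


print(display_path("c:/automation_common/Python/TestCases/list_dir_script.txt"))
# c:\automation_common\Python\TestCases\list_dir_script.txt
```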

ChrisA
 

Marko Rauhamaa

Chris Angelico said:
Small clarification: The Windows *API* accepts both types of slash
(you can open a file using forward slashes, for instance), but not all
Windows *applications* are aware of this (generally only
cross-platform ones take notice of this), and most Windows *users*
prefer backslashes. So when you come to display a Windows path, you
may want to convert to backslashes. But that's for display.

Didn't know that. More importantly, I had thought forward slashes were
valid file basename characters, but Windows is surprisingly strict about
that:

< > : " / \ | ? * NUL

are not allowed in basenames. Unix/linux disallows only:

/ NUL

In fact, proper dealing with punctuation in pathnames is one of the main
reasons to migrate to Python from bash. Even if it is often possible to
write bash scripts that handle arbitrary pathnames correctly, few script
writers are pedantic enough to do it properly. For example, newlines in
filenames are bound to confuse 99.9% of bash scripts.
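
To make the two character lists above concrete, here is a rough sketch. The names WINDOWS_FORBIDDEN, POSIX_FORBIDDEN and is_valid_basename are invented for illustration, and the check covers only the characters listed in this post; real Windows naming rules have further restrictions that it does not model.

```python
# Characters that may not appear in a basename, per the lists above.
WINDOWS_FORBIDDEN = set('<>:"/\\|?*\0')
POSIX_FORBIDDEN = set("/\0")


def is_valid_basename(name, forbidden):
    """True if name is non-empty and uses none of the forbidden characters."""
    return bool(name) and not (set(name) & forbidden)


print(is_valid_basename("a\nb.txt", POSIX_FORBIDDEN))     # True: a newline is legal
print(is_valid_basename("what?.txt", POSIX_FORBIDDEN))    # True
print(is_valid_basename("what?.txt", WINDOWS_FORBIDDEN))  # False
```

Python passes such names around as ordinary strings, so the newline case needs no special quoting.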


Marko
 

Peter Otten

Marko said:
Didn't know that. More importantly, I had thought forward slashes were
valid file basename characters, but Windows is surprisingly strict about
that:

< > : " / \ | ? * NUL

are not allowed in basenames. Unix/linux disallows only:

/ NUL

In fact, proper dealing with punctuation in pathnames is one of the main
reasons to migrate to Python from bash. Even if it is often possible to
write bash scripts that handle arbitrary pathnames correctly, few script
writers are pedantic enough to do it properly. For example, newlines in
filenames are bound to confuse 99.9% of bash scripts.

That doesn't bother me much as 99.8% of all bash scripts are already
confused by ordinary space chars ;)
 

random832

Marko Rauhamaa said:
In fact, proper dealing with punctuation in pathnames is one of the main
reasons to migrate to Python from bash. Even if it is often possible to
write bash scripts that handle arbitrary pathnames correctly, few script
writers are pedantic enough to do it properly. For example, newlines in
filenames are bound to confuse 99.9% of bash scripts.

Incidentally, these rules mean there are different norms about how
command line arguments are parsed on windows. Since * and ? are not
allowed in filenames, you don't have to care whether they were quoted.
An argument [in a position where a list of filenames is expected] with *
or ? in it _always_ gets globbed, so "C:\dir with spaces\*.txt" can be
used. This is part of the reason the program is responsible for globbing
rather than the shell - because only the program knows if it expects a
list of filenames in that position vs a text string for some other
purpose.

This is unfortunate, because it means that most python programs do not
handle filename patterns at all (expecting the shell to do it for them)
- it would be nice if there was a cross-platform way to do this.
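
One way a program could do its own expansion, sketched with the standard glob module. expand_args is a hypothetical helper; keeping a pattern that matches nothing as a literal argument is a choice made here (it mirrors the usual Unix shell default), not something the glob module decides for you.

```python
import glob
import sys


def expand_args(args):
    """Expand wildcard patterns in args; keep arguments that match nothing."""
    expanded = []
    for arg in args:
        matches = sorted(glob.glob(arg))
        # No match: pass the argument through unchanged, as a literal.
        expanded.extend(matches or [arg])
    return expanded


if __name__ == "__main__":
    # Print one expanded argument per line.
    for name in expand_args(sys.argv[1:]):
        print(name)
```

On a Unix shell the patterns usually arrive already expanded, in which case the function simply passes the names through.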

Native windows wildcards are also weird in a number of ways not emulated
by the glob module. Most of these are not expected by users, but some
users may expect, for example, *.htm to match files ending in .html; *.*
to match files with no dot in them, and *. to match _only_ files with no
dot in them. The latter two are guaranteed by the windows API, the first
is merely common due to default shortname settings. Also, native windows
wildcards do not support [character classes].
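
The difference is easy to check from Python. This sketch uses fnmatch.fnmatchcase, which applies the same pattern rules as the glob module without any platform-specific case folding; the file names are made up.

```python
from fnmatch import fnmatchcase

# Python's pattern rules are literal about dots and extensions.
print(fnmatchcase("index.html", "*.htm"))  # False: no shortname-style match
print(fnmatchcase("README", "*.*"))        # False: the pattern requires a dot
print(fnmatchcase("README", "*."))         # False: the name must end with a dot
print(fnmatchcase("a.txt", "[ab].txt"))    # True: character classes are supported
```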
 

Chris Angelico

random832 said:
An argument [in a position where a list of filenames is expected] with *
or ? in it _always_ gets globbed, so "C:\dir with spaces\*.txt" can be
used. This is part of the reason the program is responsible for globbing
rather than the shell - because only the program knows if it expects a
list of filenames in that position vs a text string for some other
purpose.

Which, I might mention, is part of why the old DOS way (still
applicable under Windows, but I first met it with MS-DOS) of searching
for files was more convenient than it can be with Unix tools. Compare:

-- Get info on all .pyc files in a directory --
C:\>dir some_directory\*.pyc
$ ls -l some_directory/*.pyc

So far, so good.

-- Get info on all .pyc files in a directory and all its subdirectories --
C:\>dir some_directory\*.pyc /s
$ ls -l `find some_directory -name \*.pyc`

Except that the ls version there can't handle names with spaces in
them, so you need to faff around with null termination and stuff. With
bash, you can use 'shopt -s globstar; ls -l **/*.pyc', but that's not a
default-active option (at least, it's not active on any of the systems
I use, but they're all Debians and Ubuntus; it might be active by
default on others), and I suspect a lot of people don't even know it
exists; I know of it, but don't always think of it, and often end up
doing the above flawed version.

On the flip side, having the shell handle it does mean you
automatically get this on *any* command. You can go and delete all
those .pyc files by just changing "ls -l" into "rm" or "dir" into
"del", but that's only because del happens to support /s; other DOS
programs may well not.
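
For comparison, the same recursive search from Python, as a sketch assuming Python 3's pathlib. find_pyc is an invented name, and the demo builds its own throwaway tree so it does not depend on any real directory. Names with spaces need no special handling because nothing passes through a shell.

```python
import tempfile
from pathlib import Path


def find_pyc(root):
    """Return every .pyc file under root, at any depth, sorted by path."""
    return sorted(Path(root).rglob("*.pyc"))


with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "dir with spaces" / "sub"
    nested.mkdir(parents=True)
    (nested / "mod.pyc").touch()
    (Path(tmp) / "top.pyc").touch()
    (Path(tmp) / "notes.txt").touch()
    found = [p.relative_to(tmp).as_posix() for p in find_pyc(tmp)]

print(found)  # ['dir with spaces/sub/mod.pyc', 'top.pyc']
```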

ChrisA
 

Terry Reedy

Chris Angelico said:
Small clarification: The Windows *API* accepts both types of slash

To me, that is what Steven said.

Chris Angelico said:
(you can open a file using forward slashes, for instance), but not all
Windows *applications* are aware of this (generally only
cross-platform ones take notice of this), and most Windows *users*
prefer backslashes.

Do you have a source for that?
 

Chris Angelico

Terry Reedy said:
To me, that is what Steven said.

Yes, which is why I said "clarification" not "correction".

Terry Reedy said:
Do you have a source for that?

Hardly need one for the first point - it's proven by a single Windows
application that parses a path name by dividing it on backslashes.
Even if there isn't one today, I could simply write one, and prove my
own point trivially (albeit not usefully). Anything that simply passes
its arguments to an API (eg it just opens the file) won't need to take
notice of slash type, but otherwise, it's very *VERY* common for a
Windows program to assume that it can split paths manually.

The second point would be better sourced, yes, but all I can say is
that I've written programs that use and display slashes, and had
non-programmers express surprise at it; similarly when you see certain
programs that take one part of a path literally, and then build on it
with either type of slash, like zip and unzip - if you say "zip -r
C:\Foo\Bar", it'll tell you that it's archiving
C:\Foo\Bar/Quux/Asdf.txt and so on. Definitely inspires surprise in
non-programmers.
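
The manual-splitting assumption is easy to demonstrate. A short sketch using the mixed-slash path from the zip example; PureWindowsPath comes from Python 3's pathlib and treats both separators alike, and the variable names are only illustrative.

```python
from pathlib import PureWindowsPath

mixed = r"C:\Foo\Bar/Quux/Asdf.txt"

# Splitting by hand on backslashes misses the forward-slash separators.
naive = mixed.split("\\")

# A path-aware split accepts either separator.
parts = PureWindowsPath(mixed).parts

print(naive)  # ['C:', 'Foo', 'Bar/Quux/Asdf.txt']
print(parts)  # ('C:\\', 'Foo', 'Bar', 'Quux', 'Asdf.txt')
```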

ChrisA
 

David

Chris Angelico said:
-- Get info on all .pyc files in a directory and all its subdirectories --
C:\>dir some_directory\*.pyc /s
$ ls -l `find some_directory -name \*.pyc`

Except that the ls version there can't handle names with spaces in
them, so you need to faff around with null termination and stuff.

Nooo, that stinks! There's no need to abuse 'find' like that, unless
the version you have is truly ancient. Null termination is only
necessary to pass 'find' results *via the shell*. Instead, ask 'find'
to invoke the task itself.

The simplest way is:

find some_directory -name '*.pyc' -ls

'find' is the tool to use for *finding* things, not 'ls', which is
intended for terminal display of directory information.

If you require a particular feature of 'ls', or any other command, you
can ask 'find' to invoke it directly (not via a shell):

find some_directory -name '*.pyc' -exec ls -l {} \;

'find' is widely underutilised and poorly understood because its
command line syntax is extremely confusing compared to other tools,
plus its documentation compounds the confusion. For anyone interested,
I offer these key insights:

Most important to understand is that the -name, -exec and -ls that I
used above (for example) are *not* command-line "options". Even though
they look like command-line options, they aren't. They are part of an
*expression* in 'find' syntax. And the crucial difference is that the
expression is order-dependent. So unlike most other commands, it is a
mistake to put them in arbitrary order.

Also annoyingly, the -exec syntax utilises characters that must be
escaped from shell processing. This is more arcane knowledge that just
frustrates people when they are in a rush to get something done.

In fact, the only command-line *options* that 'find' takes are -H -L
-P -D and -O, but these are rarely used. They come *before* the
directory name(s). Everything that comes after the directory name is
part of a 'find' expression.

But, the most confusing thing of all, in the 'find' documentation,
expressions are composed of tests, actions, and ... options! These
so-called options are expression-options, not command-line-options. No
wonder everyone's confused, when one word describes two
similar-looking but behaviourally different things!

So 'info find' must be read very carefully indeed. But it is
worthwhile, because in the model of "do one thing and do it well",
'find' is the tool intended for such tasks, rather than expecting
these capabilities to be built into all other command line utilities.

I know this is off-topic but because I learn so much from the
countless terrific contributions to this list from Chris (and others)
with wide expertise, I am motivated to give something back when I can.
And given that in the past I spent a little time and effort and
eventually understood this, I summarise it here hoping it helps
someone else. The unix-style tools are far more capable than the
Microsoft shell when used as intended.

There is good documentation on find at: http://mywiki.wooledge.org/UsingFind
 

Chris Angelico

David said:
Nooo, that stinks! There's no need to abuse 'find' like that, unless
the version you have is truly ancient. Null termination is only
necessary to pass 'find' results *via the shell*. Instead, ask 'find'
to invoke the task itself.

The simplest way is:

find some_directory -name '*.pyc' -ls

'find' is the tool to use for *finding* things, not 'ls', which is
intended for terminal display of directory information.

I used ls only as a first example, and then picked up an extremely
common next example (deleting files). It so happens that find can
'-delete' its found files, but my point is that on DOS/Windows, every
command has to explicitly support subdirectories. If, instead, the
'find' command has to explicitly support everything you might want to
do to files, that's even worse! So we need an execution form...

David said:
If you require a particular feature of 'ls', or any other command, you
can ask 'find' to invoke it directly (not via a shell):

find some_directory -name '*.pyc' -exec ls -l {} \;

... which this looks like, but it's not equivalent. That will execute
'ls -l' once for each file. You can tell, because the columns aren't
aligned; for anything more complicated than simply 'ls -l', you
potentially destroy any chance at bulk operations. No, to be
equivalent it *must* pass all the args to a single invocation of the
program. You need to instead use xargs if you want it to be
equivalent, and it's now getting to be quite an incantation:

find some_directory -name \*.pyc -print0|xargs -0 ls -l

And *that* is equivalent to the original, but it's way *way* longer
and less convenient, which was my point. Plus, it's extremely tempting
to shorten that, because this will almost always work:

find some_directory -name \*.pyc|xargs ls -l

But it'll fail if you have newlines in file names. It'd probably work
every time you try it, and then you'll put that in a script and boom,
it stops working. (That's what I meant by "faffing about with null
termination". You have to go through an extra level of indirection,
making the command fairly unwieldy.)
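
Why the terminator matters can be shown without running find at all. This sketch builds the two kinds of output stream by hand from made-up file names and parses each back. It models only line-based versus NUL-based splitting (plain xargs also splits on blanks, which is not modelled here).

```python
names = ["plain.pyc", "with space.pyc", "with\nnewline.pyc"]

# Newline-terminated output (like -print) vs NUL-terminated (like -print0).
newline_stream = "".join(name + "\n" for name in names)
nul_stream = "".join(name + "\0" for name in names)

from_lines = newline_stream.splitlines()
from_nuls = nul_stream.split("\0")[:-1]  # drop the empty piece after the last NUL

print(from_lines == names)  # False: the embedded newline splits one name in two
print(from_nuls == names)   # True: NUL cannot occur inside a file name
```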

David said:
I know this is off-topic but because I learn so much from the
countless terrific contributions to this list from Chris (and others)
with wide expertise, I am motivated to give something back when I can.

Definitely! This is how we all learn :) And thank you, glad to hear that.

David said:
And given that in the past I spent a little time and effort and
eventually understood this, I summarise it here hoping it helps
someone else. The unix-style tools are far more capable than the
Microsoft shell when used as intended.

More specifically, the Unix model ("each tool should do one thing and
do it well") tends to make for more combinable tools. The DOS style
requires every program to reimplement the same directory-search
functionality, and then requires the user to figure out how it's been
written in this form ("zip -r" (or is it "zip -R"...), "dir /s", "del
/s", etc, etc). The Unix style requires applications to accept
arbitrary numbers of arguments (which they probably would anyway), and
then requires the user to learn some incantations that will then work
anywhere. If you're writing a script, you should probably use the
-print0|xargs -0 method (unless you already require bash for some
other reason); interactively, you more likely want to enable globstar
and use the much shorter double-star notation. Either way, it works
for any program, and that is indeed "far more capable".

ChrisA
 

David

Chris Angelico said:
I used ls only as a first example, and then picked up an extremely
common next example (deleting files). It so happens that find can
'-delete' its found files, but my point is that on DOS/Windows, every
command has to explicitly support subdirectories. If, instead, the
'find' command has to explicitly support everything you might want to
do to files, that's even worse! So we need an execution form...


Chris Angelico said:
... which this looks like, but it's not equivalent. That will execute
'ls -l' once for each file. You can tell, because the columns aren't
aligned; for anything more complicated than simply 'ls -l', you
potentially destroy any chance at bulk operations.

Thanks for elaborating that point. But still ...

Chris Angelico said:
No, to be equivalent it *must* pass all the args to a single invocation of the
program. You need to instead use xargs if you want it to be
equivalent, and it's now getting to be quite an incantation:

find some_directory -name \*.pyc -print0|xargs -0 ls -l

And *that* is equivalent to the original, but it's way *way* longer
and less convenient, which was my point.

If you are not already aware, it might interest you that 'find' in
GNU findutils 4.4.2 has

-- Action: -execdir command {} +
This works as for `-execdir command ;', except that the `{}' at
the end of the command is expanded to a list of names of matching
files. This expansion is done in such a way as to avoid exceeding
the maximum command line length available on the system. Only one
`{}' is allowed within the command, and it must appear at the end,
immediately before the `+'. A `+' appearing in any position other
than immediately after `{}' is not considered to be special (that
is, it does not terminate the command).

I believe that achieves the goal, without involving the shell.

It also has an -exec equivalent that works the same way, but it has an
unrelated security issue and is not recommended.

But if that '+' instead of ';' feature is not available on the
target system, then as far as I am aware it would be necessary
to use xargs as you say.
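
The batching idea behind the '+' form (and behind xargs) can be sketched in a few lines of Python. This is only an illustration of the concept under a simplified length rule, not how 'find' actually implements it; batch_args and max_chars are invented names.

```python
def batch_args(names, max_chars):
    """Group names into batches whose combined length stays within max_chars.

    A single name longer than max_chars still gets a batch of its own.
    """
    batches, current, used = [], [], 0
    for name in names:
        cost = len(name) + 1  # +1 for the separator between arguments
        if current and used + cost > max_chars:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        batches.append(current)
    return batches


print(batch_args(["a.pyc", "bb.pyc", "ccc.pyc", "d.pyc"], 14))
# [['a.pyc', 'bb.pyc'], ['ccc.pyc', 'd.pyc']]
```

Each batch would then become the argument list of one invocation of the command, so the command runs a handful of times instead of once per file.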

Anyway, the two points I wished to contribute are:

1) It is preferable to avoid shell command substitutions (the
backticks in the first example) and expansions where possible.

2) My observations on 'find' syntax, for anyone interested.

Cheers,
David
 

Lele Gaifax

Steven D'Aprano said:
When working with Windows paths, you should make a habit of either
escaping every backslash:

u"c:\\automation_common\\Python\\TestCases\\list_dir_script.txt"

using a raw-string:

ur"c:\automation_common\Python\TestCases\list_dir_script.txt"

or just use forward slashes:

u"c:/automation_common/Python/TestCases/list_dir_script.txt"

The latter should be preferred if Python 3 compatibility is a goal, since
the ur"" prefix is a syntax error in Python 3.

ciao, lele.
 
