os.path and Path


Ethan Furman

In my continuing quest for Python Mastery (and because I felt like it ;)
I decided to code a Path object so I could dispense with all the
os.path.join and os.path.split and os.path.splitext, etc., etc., and so
forth.

While so endeavoring, a couple of threads came back and had a friendly
little chat in my head:

Thread 1: "objects of different types compare unequal"
self: "nonsense! we have the power to say what happens in __eq__!"

Thread 2: "objects that __hash__ the same *must* compare __eq__!"
self: "um, what? ... wait, only immutable objects hash..."

Thread 2: "your Path object is immutable..."
self: "argh!"

Here's the rub: I'm on Windows (yes, pity me...) but I prefer the
unices, so I'd like to have / separate my paths. But I'm on Windows...

So I thought, "Hey! I'll just do some conversions in __eq__ and life
will be great!"

--> some_path = Path('/source/python/some_project')
--> some_path == '/source/python/some_project'
True
--> some_path == r'\source\python\some_project'
True
--> # if on a Mac
--> some_path == ':source:python:some_project'
True
--> # oh, and because I'm on Windows with case-insensitive file names...
--> some_path == '/source/Python/some_PROJECT'
True

And then, of course, the ghosts of threads past came and visited. For
those who don't know, the __hash__ must be the same if __eq__ is the
same because __hash__ is primarily a shortcut for __eq__ -- this is
important when you have containers that are relying on this behavior,
such as set() and dict().
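
A minimal sketch of why that matters. LoosePath and ConsistentPath are hypothetical names made up for illustration, not the Path class under discussion; the only assumption is a str subclass whose __eq__ ignores case.

```python
class LoosePath(str):
    """Hypothetical str subclass: case-insensitive __eq__, unchanged hash."""

    def __eq__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        return str.lower(self) == str.lower(other)

    # str supplies its own __ne__, so it has to be overridden as well.
    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    # Overriding __eq__ sets __hash__ to None, so str's hash is restored
    # explicitly here. It hashes the exact characters, case included.
    __hash__ = str.__hash__


class ConsistentPath(LoosePath):
    """Same comparison, but the hash is taken from the lowercased text."""

    def __hash__(self):
        return hash(str.lower(self))


a = LoosePath('/source/python/some_project')
b = LoosePath('/source/Python/some_PROJECT')
loose_equal = (a == b)                   # True: __eq__ ignores case
loose_same_hash = (hash(a) == hash(b))   # different text, so almost surely not
loose_found = (b in {a})                 # the set consults the hash first

c = ConsistentPath(a)
d = ConsistentPath(b)
consistent_found = (d in {c})            # True: equal objects hash the same
```

With LoosePath the two objects compare equal, yet the set lookup only succeeds if their hashes happen to coincide. ConsistentPath restores the invariant by hashing the same normalized text that __eq__ compares.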

So, I suppose I shall have to let go of my dreams of

--> Path('/some/path/and/file') == '\\some\\path\\and\\file'

and settle for

--> Path('...') == Path('...')

but I don't have to like it. :(

</whine>

~Ethan~

What, you didn't see the opening 'whine' tag? Oh, well, my xml isn't
very good... ;)
 
Laurent Claessens

So, I suppose I shall have to let go of my dreams of

--> Path('/some/path/and/file') == '\\some\\path\\and\\file'

and settle for

--> Path('...') == Path('...')

but I don't have to like it. :(


Why not define the hash method to first convert to '/some/path/and/file'
and then hash ?

By the way, some problems remain with

/some/another/../path/and/file

which should also be the same.
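
As a rough sketch of that idea, both points can be handled by reducing every path to one canonical string before comparing or hashing. The canonical() helper is hypothetical, and it assumes that a backslash is always a separator, which is not true on POSIX systems. posixpath.normpath is used so the result is the same on every platform.

```python
import posixpath


def canonical(pathstring):
    """Hypothetical helper: one textual form for comparing and hashing.

    Assumes backslashes are separators, then collapses '.' and '..'.
    """
    return posixpath.normpath(pathstring.replace('\\', '/'))


plain = canonical('/some/path/and/file')
windows_style = canonical('\\some\\path\\and\\file')
with_dotdot = canonical('/some/another/../path/and/file')

all_equal = (plain == windows_style == with_dotdot)
same_hash = (hash(plain) == hash(windows_style) == hash(with_dotdot))
```

normpath works purely on the text, so if "another" were a symbolic link the collapsed path could name a different file than the original.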

Laurent
 
Steven D'Aprano

Thread 1: "objects of different types compare unequal" self:
"nonsense! we have the power to say what happens in __eq__!"

Thread 2: "objects that __hash__ the same *must* compare __eq__!" self:
"um, what? ... wait, only immutable objects hash..."

Incorrect. And impossible. There are only a fixed number of hash values
(2**31 I believe...) and a potentially infinite number of unique, unequal
objects that can be hashed. So by the pigeon-hole principle, there must
be at least one pigeon-hole (the hash value) containing two or more
pigeons (unequal objects).

For example:
4


What you mean to say is that if objects compare equal, they must hash the
same. Not the other way around.
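
A small illustration of both directions, assuming CPython (other implementations may hash integers differently):

```python
import sys

# Unequal objects may share a hash. CPython reserves -1 as an internal
# error value, so hash(-1) is remapped to -2.
collision = (hash(-1) == hash(-2))
still_unequal = (-1 != -2)

# Integers are hashed modulo sys.hash_info.modulus (available since 3.2),
# so the modulus itself collides with zero.
modulus = sys.hash_info.modulus
wraps_to_zero = (hash(modulus) == hash(0))

# Equal objects must share a hash, even across types.
equal_across_types = (1 == 1.0 == True)
hashes_agree = (hash(1) == hash(1.0) == hash(True))
```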

Thread 2: "your Path object is immutable..." self: "argh!"

Here's the rub: I'm on Windows (yes, pity me...) but I prefer the
unices, so I'd like to have / separate my paths. But I'm on Windows...

Any sensible Path object should accept path components in a form
independent of the path separator, and only care about the separator when
converting to and from strings.
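
One way to read that suggestion, as a sketch only: ComponentPath is a hypothetical class that stores a tuple of components and touches a separator only in from_string() and to_string(). To stay short it handles relative paths only and drops empty components.

```python
class ComponentPath:
    """Hypothetical sketch: components inside, separators at the edges."""

    def __init__(self, *parts):
        self.parts = tuple(parts)

    @classmethod
    def from_string(cls, text, sep='/'):
        # Empty components (doubled or leading separators) are dropped.
        return cls(*[part for part in text.split(sep) if part])

    def to_string(self, sep='/'):
        return sep.join(self.parts)

    def __eq__(self, other):
        if not isinstance(other, ComponentPath):
            return NotImplemented
        return self.parts == other.parts

    def __hash__(self):
        # Hash exactly what __eq__ compares.
        return hash(self.parts)


unix_style = ComponentPath.from_string('source/python/some_project')
windows_style = ComponentPath.from_string('source\\python\\some_project',
                                          sep='\\')
same = (unix_style == windows_style)
as_classic_mac = unix_style.to_string(':')
```

Because equality and hashing both use the component tuple, the separator never enters the comparison.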


[...]
So, I suppose I shall have to let go of my dreams of

--> Path('/some/path/and/file') == '\\some\\path\\and\\file'

To say nothing of:

Path('a/b/c/../d') == './a/b/d'


Why do you think there's no Path object in the standard library? *wink*
 
Steven D'Aprano

Why not define the hash method to first convert to '/some/path/and/file'
and then hash ?

It's not so simple. If Path is intended to be platform independent, then
these two paths could represent the same location:

'a/b/c:d/e' # on Linux or OS X
'a:b:c/d:e' # on classic Mac pre OS X

and be impossible on Windows. So what's the canonical path it should be
converted to?
 
Ethan Furman

Steven said:
Incorrect.
What you mean to say is that if objects compare equal, they must hash the
same. Not the other way around.

Ack. I keep saying that backwards. Thanks for the correction.

Any sensible Path object should accept path components in a form
independent of the path separator, and only care about the separator when
converting to and from strings.

Our ideas of 'sensible' apparently differ.

One of my goals with my Path objects was to be a drop-in replacement for
the strings currently used as paths; consequently, they are a sub-class
of string, and can still be passed to, for example, os.path.splitext().
Another was to be able to use '/' across all platforms, but still have
the appropriate separator used when the Path object was passed to, for
example, open().

To me, a path is an ambiguous item: /temp/here/xyz.abc
where does the directory structure stop and the filename begin? xyz.abc
could be either the last subdirectory, or the filename, and the only way
to know for sure is to look at the disk. However, the Path may not be
complete yet, or the final item may not exist yet -- so what then? I'm
refusing the temptation to guess. ;) The programmer can explicitly look,
or create, appropriately.

[...]
So, I suppose I shall have to let go of my dreams of

--> Path('/some/path/and/file') == '\\some\\path\\and\\file'

To say nothing of:

Path('a/b/c/../d') == './a/b/d'

I think I'll make my case-insensitive Paths compare, and hash, as
all-lowercase, so direct string comparison can still work. I'll add an
.eq() method to handle the other fun stuff.
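
A sketch of what that plan might look like. CIPath and its eq() method are hypothetical stand-ins, and the canonical form used by eq() (backslashes treated as separators, then posixpath.normpath) is an assumption.

```python
import posixpath


class CIPath(str):
    """Hypothetical case-insensitive path that is still a str."""

    def __eq__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        return str.lower(self) == str.lower(other)

    # str supplies its own __ne__, so it has to be overridden as well.
    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        # Hash the lowercased text, matching what __eq__ compares.
        return hash(str.lower(self))

    def eq(self, other):
        """Looser comparison: separators and '..' are normalized first."""
        def canon(text):
            return posixpath.normpath(str.lower(text).replace('\\', '/'))
        return canon(self) == canon(other)


path = CIPath('/Source/Python/some_PROJECT')
matches_lower = (path == '/source/python/some_project')
hash_matches_lower = (hash(path) == hash('/source/python/some_project'))
loose_match = CIPath('a/b/c/../d').eq('./A/B/d')
```

The hash agrees with a plain string only when that string is already all-lowercase. A mixed-case plain str still compares equal to the CIPath but hashes on its exact characters, so the two can miss each other in a dict or set.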

Why do you think there's no Path object in the standard library? *wink*

Because I can't find one in either 2.7 or 3.2, and every reference I've
found has indicated that the other Path contenders were too
all-encompassing.

~Ethan~
 
Ethan Furman

Steven said:
If Path is intended to be platform independent, then
these two paths could represent the same location:

'a/b/c:d/e' # on Linux or OS X
'a:b:c/d:e' # on classic Mac pre OS X

and be impossible on Windows. So what's the canonical path it should be
converted to?

Are these actual valid paths? I thought Linux used '/' and Mac used ':'.

~Ethan~
 
Steven D'Aprano

Are these actual valid paths? I thought Linux used '/' and Mac used
':'.

Er, perhaps I wasn't as clear as I intended... sorry about that.

On a Linux or OS X box, you could have a file e inside a directory c:d
inside b inside a. It can't be treated as platform independent, because
c:d is not a legal path component under classic Mac or Windows.

On a classic Mac (does anyone still use them?), you could have a file e
inside a directory c/d inside b inside a. Likewise c/d isn't legal under
POSIX or Windows.

So there are paths that are legal under one file system, but not others,
and hence there is no single normalization that can represent all legal
paths under arbitrary file systems.
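
The standard library's own path modules show the disagreement. posixpath and ntpath can both be imported on any platform, so this small comparison runs anywhere:

```python
import ntpath
import posixpath

component = 'c:d'

# Under POSIX rules the colon is an ordinary character in a name.
posix_drive, posix_rest = posixpath.splitdrive(component)

# Under Windows rules the same text is a drive letter plus a name.
nt_drive, nt_rest = ntpath.splitdrive(component)

# A backslash is part of the file name on POSIX and a separator on Windows.
posix_base = posixpath.basename('a\\b')
nt_base = ntpath.basename('a\\b')
```

The same string parses into different pieces depending on which rules apply, so any canonical form has to know which file system the text came from.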
 
Ethan Furman

Steven said:
Er, perhaps I wasn't as clear as I intended... sorry about that.

On a Linux or OS X box, you could have a file e inside a directory c:d
inside b inside a. It can't be treated as platform independent, because
c:d is not a legal path component under classic Mac or Windows.

On a classic Mac (does anyone still use them?), you could have a file e
inside a directory c/d inside b inside a. Likewise c/d isn't legal under
POSIX or Windows.

So there are paths that are legal under one file system, but not others,
and hence there is no single normalization that can represent all legal
paths under arbitrary file systems.

Yeah, I was just realizing that about two minutes before I read this
reply. Drat. This also makes your comment about sensible path objects
more sensible. ;)

~Ethan~
 
Eric Snow

On a Linux or OS X box, you could have a file e inside a directory c:d
inside b inside a. It can't be treated as platform independent, because
c:d is not a legal path component under classic Mac or Windows.

On a classic Mac (does anyone still use them?), you could have a file e
inside a directory c/d inside b inside a. Likewise c/d isn't legal under
POSIX or Windows.

So there are paths that are legal under one file system, but not others,
and hence there is no single normalization that can represent all legal
paths under arbitrary file systems.

Perhaps one solution is to have the Path class accept registrations of
valid path formats:

from abc import ABC, abstractmethod

class PathFormat(ABC):
    @abstractmethod
    def map_path(self, pathstring):
        """Map the pathstring to the canonical path.

        This could take the form of some regex or an even more
        explicit conversion.

        If there is no match, return None.

        """

    @abstractmethod
    def unmap_path(self, pathstring):
        """Map the pathstring from a canonical path to this format.

        If there is no match, return None.

        """

class Path:
    ...
    _formats = []  # shared by all instances; filled via register_format

    @classmethod
    def register_format(cls, format):
        cls._formats.append(format)

    def map_path(self, pathstring):
        # Try each registered format in order; the first match wins.
        for format in self._formats:
            result = format.map_path(pathstring)
            if result is None:
                continue
            # remember which format matched?
            return result
        raise TypeError("No formatters could map the pathstring.")

    def unmap_path(self, pathstring):
        ...

With something like that, you have a PathFormat class for each
platform that matters. Anyone would be able to add more, as they
like, through register_format. This module could also include a few
lines to register a particular PathFormat depending on the platform
determined through sys.platform or whatever.

This way your path class doesn't have to try to worry about the
conversion to and from the canonical path format.
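
A runnable sketch of how the registration might be used. SeparatorFormat is a hypothetical, deliberately naive format that recognizes a path only by the presence of its separator, and '/' is assumed to be the canonical separator.

```python
from abc import ABC, abstractmethod


class PathFormat(ABC):
    @abstractmethod
    def map_path(self, pathstring):
        """Return the canonical form, or None if this format does not match."""

    @abstractmethod
    def unmap_path(self, pathstring):
        """Return this format's form of a canonical path."""


class SeparatorFormat(PathFormat):
    """Hypothetical format identified only by its separator character."""

    def __init__(self, sep):
        self.sep = sep

    def map_path(self, pathstring):
        if self.sep not in pathstring:
            return None
        return pathstring.replace(self.sep, '/')

    def unmap_path(self, pathstring):
        return pathstring.replace('/', self.sep)


class Path:
    _formats = []

    @classmethod
    def register_format(cls, fmt):
        cls._formats.append(fmt)

    def map_path(self, pathstring):
        # The first registered format that matches wins.
        for fmt in self._formats:
            result = fmt.map_path(pathstring)
            if result is not None:
                return result
        raise TypeError("No formatters could map the pathstring.")


Path.register_format(SeparatorFormat('/'))
Path.register_format(SeparatorFormat('\\'))
Path.register_format(SeparatorFormat(':'))

mapper = Path()
from_posix = mapper.map_path('/source/python/some_project')
from_windows = mapper.map_path('\\source\\python\\some_project')
from_classic_mac = mapper.map_path(':source:python:some_project')
```

Registration order decides ambiguous cases: 'a/b/c:d/e' is claimed by the '/' format here, so the classic Mac reading of that string is never tried.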

-eric
 
Chris Torek

Because I can't find one in either 2.7 nor 3.2, and every reference I've
found has indicated that the other Path contenders were too
all-encompassing.

What I think Steven D'Aprano is suggesting here is that the general
problem is too hard, and specific solutions too incomplete, to
bother with.

Your own specific solution might work fine for your case(s), but it
is unlikely to work in general.

I am not aware of any Python implementations for VMS, CMS, VM,
EXEC-8, or other dinosaurs, but it would be ... interesting.
Consider a typical VMS "full pathname":

DRA0:[SYS0.SYSCOMMON]FILE.TXT;3

The first part is the (literal) disk drive (a la MS-DOS A: or C:
but slightly more general). The part in [square brackets] is the
directory path. The extension (.txt) is limited to three characters,
and the part after the semicolon is the file version number, so
you can refer to a backup version. (Typically one would use a
"logical name" like SYS$SYSROOT in place of the disk and/or
directory-sequence, so as to paper over the overly-rigid syntax.)
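
For a sense of how different the parsing is, here is a sketch that splits only the full form shown above. VMS_PATTERN and parse_vms() are hypothetical, and the pattern is an assumption that ignores logical names, optional parts, and everything else in real VMS file syntax.

```python
import re

# Matches only the shape DEVICE:[DIR.DIR]NAME.EXT;VERSION
VMS_PATTERN = re.compile(
    r'^(?P<device>[^:]+):'
    r'\[(?P<directories>[^\]]*)\]'
    r'(?P<name>[^.;]+)\.(?P<extension>[^;]*)'
    r';(?P<version>\d+)$'
)


def parse_vms(pathstring):
    """Return the pieces of a full VMS-style pathname, or None."""
    match = VMS_PATTERN.match(pathstring)
    if match is None:
        return None
    parts = match.groupdict()
    # Directories are separated by dots inside the square brackets.
    parts['directories'] = parts['directories'].split('.')
    parts['version'] = int(parts['version'])
    return parts


pieces = parse_vms('DRA0:[SYS0.SYSCOMMON]FILE.TXT;3')
```

Here the dot is a directory separator inside the brackets and an extension marker outside them, and the version number has no counterpart in the POSIX or Windows forms.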

Compare with an EXEC-8 (now, apparently, OS 2200 -- I guess it IS
still out there somewhere) "file" name:

QUAL*FILE(cyclenumber)

where cycle-numbers are relative, i.e., +0 means "use the current
file" while "+1" means "create a new one" and "-1" means "use the
first backup". (However, one normally tied external file names to
"internal names" before running a program, via the "@USE" statement.)
The vile details are still available here:

http://www.bitsavers.org/pdf/univac/1100/UE-637_1108execUG_1970.pdf

(Those of you who have never had to deal with these machines, as I
did in the early 1980s, should consider yourselves lucky. :) )
 
Ethan Furman

Chris said:
What I think Steven D'Aprano is suggesting here is that the general
problem is too hard, and specific solutions too incomplete, to
bother with.

Ah. In that case I completely misunderstood. Thanks for the insight!

~Ethan~
 
Ned Deily

Because I can't find one in either 2.7 nor 3.2, and every reference I've
found has indicated that the other Path contenders were too
all-encompassing.

What I think Steven D'Aprano is suggesting here is that the general
problem is too hard, and specific solutions too incomplete, to
bother with.

Your own specific solution might work fine for your case(s), but it
is unlikely to work in general.

Note there was quite a bit of discussion some years back about adding
Jason Orendorff's Path module to the standard library, a module which
had and still has its fans. Ultimately, though, it was vetoed by Guido.

http://bugs.python.org/issue1226256
http://wiki.python.org/moin/PathModule
 
rusi

What I think Steven D'Aprano is suggesting here is that the general
problem is too hard, and specific solutions too incomplete, to
bother with.
Your own specific solution might work fine for your case(s), but it
is unlikely to work in general.

Note there was quite a bit of discussion some years back about adding
Jason Orendorff's Path module to the standard library, a module which
had and still has its fans. Ultimately, though, it was vetoed by Guido.

http://bugs.python.org/issue1226256
http://wiki.python.org/moin/PathModule

A glance at these links (only cursory I admit) suggests that this was
vetoed because of cross OS compatibility issues.

This is unfortunate.

As an analogy, I note that emacs tries to run compatibly on all major
OSes and as a result is running increasingly badly on all of them (a mere
print functionality, which runs easily in apps one hundredth the size
of emacs, won't run on Windows without wrestling).

More OT, but... when a question is asked on this list about which
environment/IDE people use, many people seem to say "Emacs".

But when a question is raised about python-emacs issues, e.g.
http://groups.google.com/group/comp.lang.python/browse_thread/thread/acb0f2a01fe50151#
there are usually no answers...
 
