Flexible Collating (feedback please)

Ron Adam

I put together the following module today and would like some feedback on any
obvious problems. Or even opinions of whether or not it is a good approach.

While collating is not a difficult thing to do for experienced programmers, I
have seen quite a lot of poorly sorted lists in commercial applications, so it
seems it would be good to have an easy to use ready made API for collating.

I tried to make this both easy to use and flexible. My first thought was to
try and target actual uses such as Phone directory sorting, or Library sorting,
etc., but it seemed using keywords to alter the behavior is both easier and more
flexible.

I think the regular expressions I used to parse leading and trailing numerals
could be improved. They work, but you will probably get inconsistent results if
the strings are not well formed. Any suggestions on this would be appreciated.

Should I try to extend it to cover dates and currency sorting? Probably those
types should be converted before sorting, but maybe sometimes it's useful
not to?

Another variation is collating Dewey Decimal strings. It should be easy to add
if someone thinks that might be useful.

I haven't tested this in *anything* yet, so don't plug it into production code
of any type. I also haven't done any performance testing.

See the doc tests below for examples of how it's used.

Cheers,
Ron Adam



"""
Collate.py

A general purpose configurable collate module.

Collation can be modified with the following keywords:

CAPS_FIRST -> Aaa, aaa, Bbb, bbb
HYPHEN_AS_SPACE -> Don't ignore hyphens
UNDERSCORE_AS_SPACE -> Underscores as white space
IGNORE_LEADING_WS -> Disregard leading white space
NUMERICAL -> Digit sequences as numerals
COMMA_IN_NUMERALS -> Allow commas in numerals

* See doctests for examples.

Author: Ron Adam, (e-mail address removed), 10/18/2006

"""
import re
import locale


locale.setlocale(locale.LC_ALL, '') # use current locale settings

# The above line may change the string constants from the string
# module. This may have unintended effects if your program
# assumes they are always the ascii defaults.


CAPS_FIRST = 1
NUMERICAL = 2
HYPHEN_AS_SPACE = 4
UNDERSCORE_AS_SPACE = 8
IGNORE_LEADING_WS = 16
COMMA_IN_NUMERALS = 32

class Collate(object):
    """ A general purpose and configurable collator class.
    """
    def __init__(self, flag):
        self.flag = flag

    def transform(self, s):
        """ Transform a string for collating.
        """
        if self.flag & CAPS_FIRST:
            s = s.swapcase()
        if self.flag & HYPHEN_AS_SPACE:
            s = s.replace('-', ' ')
        if self.flag & UNDERSCORE_AS_SPACE:
            s = s.replace('_', ' ')
        if self.flag & IGNORE_LEADING_WS:
            s = s.strip()
        if self.flag & NUMERICAL:
            if self.flag & COMMA_IN_NUMERALS:
                rex = re.compile('^(\d*\,?\d*\.?\d*)(\D*)(\d*\,?\d*\.?\d*)',
                                 re.LOCALE)
            else:
                rex = re.compile('^(\d*\.?\d*)(\D*)(\d*\.?\d*)', re.LOCALE)
            slist = rex.split(s)
            for i, x in enumerate(slist):
                if self.flag & COMMA_IN_NUMERALS:
                    x = x.replace(',', '')
                try:
                    slist[i] = float(x)
                except:
                    slist[i] = locale.strxfrm(x)
            return slist
        return locale.strxfrm(s)

    def __call__(self, a, b):
        """ This allows the Collate class work as a sort key.

            USE: list.sort(key=Collate(flags))
        """
        return cmp(self.transform(a), self.transform(b))

def collate(slist, flags=0):
    """ Collate list of strings in place.
    """
    return slist.sort(Collate(flags))

def collated(slist, flags=0):
    """ Return a collated list of strings.

        This is a decorate-undecorate collate.
    """
    collator = Collate(flags)
    dd = [(collator.transform(x), x) for x in slist]
    dd.sort()
    return list([B for (A, B) in dd])

def _test():
    """
    DOC TESTS AND EXAMPLES:

    Sort (and sorted) normally order all words beginning with caps
    before all words beginning with lower case.
    >>> t = ['tuesday', 'Tuesday', 'Monday', 'monday']
    >>> sorted(t) # regular sort
    ['Monday', 'Tuesday', 'monday', 'tuesday']

    Locale collation puts words beginning with caps after words
    beginning with lower case of the same letter.
    >>> collated(t)
    ['monday', 'Monday', 'tuesday', 'Tuesday']

    The CAPS_FIRST option can be used to put all words beginning
    with caps after words beginning in lowercase of the same letter.
    >>> collated(t, CAPS_FIRST)
    ['Monday', 'monday', 'Tuesday', 'tuesday']


    The HYPHEN_AS_SPACE option causes hyphens to be equal to space.
    >>> t = ['a-b', 'b-a', 'aa-b', 'bb-a']
    >>> collated(t)
    ['aa-b', 'a-b', 'b-a', 'bb-a']
    >>> collated(t, HYPHEN_AS_SPACE)
    ['a-b', 'aa-b', 'b-a', 'bb-a']


    The IGNORE_LEADING_WS and UNDERSCORE_AS_SPACE options can be
    used together to improve ordering in some situations.
    >>> t = ['sum', '__str__', 'about', ' round']
    >>> collated(t)
    [' round', '__str__', 'about', 'sum']
    >>> collated(t, IGNORE_LEADING_WS)
    ['__str__', 'about', ' round', 'sum']
    >>> collated(t, UNDERSCORE_AS_SPACE)
    [' round', '__str__', 'about', 'sum']
    >>> collated(t, IGNORE_LEADING_WS|UNDERSCORE_AS_SPACE)
    ['about', ' round', '__str__', 'sum']


    The NUMERICAL option orders leading and trailing digits as numerals.
    >>> t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
    >>> collated(t, NUMERICAL)
    ['4abc', '13.5b', '20abc', 'a5', 'a10.2', 'a40', 'b2']


    The COMMA_IN_NUMERALS option ignores commas instead of using them to
    separate numerals.
    >>> t = ['a5', 'a4,000', '500b', '100,000b']
    >>> collated(t, NUMERICAL|COMMA_IN_NUMERALS)
    ['500b', '100,000b', 'a5', 'a4,000']


    Collating also can be done in place using collate() instead of collated().
    >>> t = ['Fred', 'Ron', 'Carol', 'Bob']
    >>> collate(t)
    >>> t
    ['Bob', 'Carol', 'Fred', 'Ron']

    """
    import doctest
    doctest.testmod()


if __name__ == '__main__':
    _test()
 
Ron Adam

Fixed...


Changed the collate() function to return None the same as sort() since it is an
in place collate.

A comment in _test() doctests was reversed. CAPS_FIRST option puts words
beginning with capitals before, not after, words beginning with lower case of
the same letter.


It seems I always find a few obvious glitches right after I post something. ;-)

Cheers,
Ron
 
georgeryoung

I put together the following module today and would like some feedback on any
obvious problems. Or even opinions of whether or not it is a good approach.
,,,
    def __call__(self, a, b):
        """ This allows the Collate class work as a sort key.

            USE: list.sort(key=Collate(flags))
        """
        return cmp(self.transform(a), self.transform(b))

You document __call__ as useful for the "key" keyword to sort, but you
implement it for the "cmp" keyword. The "key" allows much better
performance, since it's called only once per value. Maybe just:
        return self.transform(a)
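
To make the difference concrete, here is a small sketch in present-day Python 3 (where the "cmp" argument no longer exists and functools.cmp_to_key stands in for it). The helper names and the word list are made up for illustration; the only point is to count how often each style of function gets called.

```python
import functools

def make_counting_key():
    """Return a key function plus the list that records each call to it."""
    calls = []
    def key(s):
        calls.append(s)
        return s.lower()
    return key, calls

def make_counting_cmp():
    """Return a cmp-style function plus the list that records each call."""
    calls = []
    def compare(a, b):
        calls.append((a, b))
        a, b = a.lower(), b.lower()
        return (a > b) - (a < b)
    return compare, calls

words = ['tuesday', 'Monday', 'friday', 'Wednesday', 'sunday', 'Thursday']

key, key_calls = make_counting_key()
by_key = sorted(words, key=key)

compare, cmp_calls = make_counting_cmp()
by_cmp = sorted(words, key=functools.cmp_to_key(compare))

print(len(key_calls), "key calls;", len(cmp_calls), "comparison calls")
```

The key function runs once per element, while the comparison function runs once per comparison and does the transform twice each time, so the gap widens as the list grows.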

-- George
 
Ron Adam


Thanks, I changed it to the following...



    def __call__(self, a):
        """ This allows the Collate class work as a sort key.

            USE: list.sort(key=Collate(flags))
        """
        return self.transform(a)



And also changed the sort call here ...


def collate(slist, flags=0):
    """ Collate list of strings in place.
    """
    slist.sort(key=Collate(flags))    # <<< changed


Today I'll do some performance tests to see how much faster it is for moderate
sized lists.


Cheers,
Ron
 
Ron Adam

I made a number of changes ... (the new version is listed below)


These changes also resulted in improving the speed by about 3 times when all
flags are specified.

Collating now takes about 1/3 (or less) time. Although it is still quite a bit
slower than a bare list.sort(), that is to be expected as collate is locale
aware and does additional transformations on the data which you would need to do
anyway. The tests were done with Unicode strings as well.

Changed the flag types from integer values to a list of named strings. The
reason for this is it makes finding errors easier and you can examine the flags
attribute and get a readable list of flags.

A better regular expression for separating numerals. It now separates numerals
in the middle of the string.
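
For readers on Python 3, here is a rough sketch of the same idea: split the string into numeric and non-numeric runs and compare the numeric runs by value. It is not the module's code. The names natural_key and _NUM_RE are invented, the regex differs from the one in the listing, and each piece is wrapped in a tagged tuple because Python 3 refuses to compare a float with a str. It also skips locale.strxfrm, so text runs compare by code point.

```python
import re

# One capturing group, so re.split() keeps the numeric runs in the result.
_NUM_RE = re.compile(r'(\d+(?:\.\d+)?)')

def natural_key(s):
    """Build a sort key where digit runs compare as numbers."""
    key = []
    for i, part in enumerate(_NUM_RE.split(s)):
        if not part:
            continue                      # split() yields '' at the edges
        if i % 2:                         # odd positions are the captured numbers
            key.append((0, float(part), ''))
        else:                             # even positions are plain text
            key.append((1, 0.0, part))
    return key

t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
print(sorted(t, key=natural_key))
```

Because re.split() with one capturing group always alternates text and number, the position in the list already says which kind each piece is, so no try/except around float() is needed.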

Changed flag COMMA_IN_NUMERALS to IGNORE_COMMAS, since that is how it was implemented.

Added flag PERIOD_AS_COMMAS

This lets you collate decimal separated numbers correctly such as version
numbers and internet addresses. It also prevents numerals from being
interpreted as floating point or decimal.
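
As a stand-alone illustration of why the period has to be a separator rather than a decimal point, here is a minimal sketch. The function name version_key is hypothetical, and it assumes every dot-separated field consists of digits only.

```python
def version_key(s):
    """Compare dotted strings field by field as integers."""
    return [int(field) for field in s.split('.')]

versions = ['5.1.1', '5.10.12', '5.2.2', '5.2.19']
print(sorted(versions, key=version_key))
# → ['5.1.1', '5.2.2', '5.2.19', '5.10.12']
```

A plain string sort would put '5.10.12' before '5.2.2', and float() cannot parse a string with two periods at all.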

It might make more sense to implement it as PERIOD_IS_SEPARATOR. Needed?

Other minor changes to doc strings and tests were made.


Any feedback is welcome.

Cheers,
Ron



"""
Collate.py

A general purpose configurable collate module.

Collation can be modified with the following keywords:

CAPS_FIRST -> Aaa, aaa, Bbb, bbb
HYPHEN_AS_SPACE -> Don't ignore hyphens
UNDERSCORE_AS_SPACE -> Underscores as white space
IGNORE_LEADING_WS -> Disregard leading white space
NUMERICAL -> Digit sequences as numerals
IGNORE_COMMAS -> Allow commas in numerals
PERIOD_AS_COMMAS -> Periods can separate numerals.

* See doctests for examples.

Author: Ron Adam, (e-mail address removed)

"""
__version__ = '0.02 (pre-alpha) 10/18/2006'

import re
import locale
import string

locale.setlocale(locale.LC_ALL, '') # use current locale settings

# The above line may change the string constants from the string
# module. This may have unintended effects if your program
# assumes they are always the ascii defaults.

CAPS_FIRST = 'CAPS_FIRST'
HYPHEN_AS_SPACE = 'HYPHEN_AS_SPACE'
UNDERSCORE_AS_SPACE = 'UNDERSCORE_AS_SPACE'
IGNORE_LEADING_WS = 'IGNORE_LEADING_WS'
NUMERICAL = 'NUMERICAL'
IGNORE_COMMAS = 'IGNORE_COMMAS'
PERIOD_AS_COMMAS = 'PERIOD_AS_COMMAS'

class Collate(object):
    """ A general purpose and configurable collator class.
    """
    def __init__(self, flags=[]):
        self.flags = flags
        self.numrex = re.compile(r'([\d\.]*|\D*)', re.LOCALE)
        self.txtable = []
        if HYPHEN_AS_SPACE in flags:
            self.txtable.append(('-', ' '))
        if UNDERSCORE_AS_SPACE in flags:
            self.txtable.append(('_', ' '))
        if PERIOD_AS_COMMAS in flags:
            self.txtable.append(('.', ','))
        if IGNORE_COMMAS in flags:
            self.txtable.append((',', ''))
        self.flags = flags

    def transform(self, s):
        """ Transform a string for collating.
        """
        if not self.flags:
            return locale.strxfrm(s)
        for a, b in self.txtable:
            s = s.replace(a, b)
        if IGNORE_LEADING_WS in self.flags:
            s = s.strip()
        if CAPS_FIRST in self.flags:
            s = s.swapcase()
        if NUMERICAL in self.flags:
            slist = self.numrex.split(s)
            for i, x in enumerate(slist):
                try:
                    slist[i] = float(x)
                except:
                    slist[i] = locale.strxfrm(x)
            return slist
        return locale.strxfrm(s)

    def __call__(self, a):
        """ This allows the Collate class work as a sort key.

            USE: list.sort(key=Collate(flags))
        """
        return self.transform(a)

def collate(slist, flags=[]):
    """ Collate list of strings in place.
    """
    slist.sort(key=Collate(flags).transform)

def collated(slist, flags=[]):
    """ Return a collated list of strings.
    """
    return sorted(slist, key=Collate(flags).transform)

def _test():
    """
    DOC TESTS AND EXAMPLES:

    Sort (and sorted) normally order all words beginning with caps
    before all words beginning with lower case.
    >>> t = ['tuesday', 'Tuesday', 'Monday', 'monday']
    >>> sorted(t) # regular sort
    ['Monday', 'Tuesday', 'monday', 'tuesday']

    Locale collation puts words beginning with caps after words
    beginning with lower case of the same letter.
    >>> collated(t)
    ['monday', 'Monday', 'tuesday', 'Tuesday']

    The CAPS_FIRST option can be used to put all words beginning
    with caps before words beginning in lowercase of the same letter.
    >>> collated(t, [CAPS_FIRST])
    ['Monday', 'monday', 'Tuesday', 'tuesday']


    The HYPHEN_AS_SPACE option causes hyphens to be equal to space.
    >>> t = ['a-b', 'b-a', 'aa-b', 'bb-a']
    >>> collated(t)
    ['aa-b', 'a-b', 'b-a', 'bb-a']
    >>> collated(t, [HYPHEN_AS_SPACE])
    ['a-b', 'aa-b', 'b-a', 'bb-a']


    The IGNORE_LEADING_WS and UNDERSCORE_AS_SPACE options can be
    used together to improve ordering in some situations.
    >>> t = ['sum', '__str__', 'about', ' round']
    >>> collated(t)
    [' round', '__str__', 'about', 'sum']
    >>> collated(t, [IGNORE_LEADING_WS])
    ['__str__', 'about', ' round', 'sum']
    >>> collated(t, [UNDERSCORE_AS_SPACE])
    [' round', '__str__', 'about', 'sum']
    >>> collated(t, [IGNORE_LEADING_WS, UNDERSCORE_AS_SPACE])
    ['about', ' round', '__str__', 'sum']


    The NUMERICAL option orders sequences of digits as numerals.
    >>> t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
    >>> collated(t, [NUMERICAL])
    ['4abc', '13.5b', '20abc', 'a5', 'a10.2', 'a40', 'b2']


    The IGNORE_COMMAS option prevents commas from separating numerals.
    >>> t = ['a5', 'a4,000', '500b', '100,000b']
    >>> collated(t, [NUMERICAL, IGNORE_COMMAS])
    ['500b', '100,000b', 'a5', 'a4,000']


    The PERIOD_AS_COMMAS option can be used to sort version numbers
    and other decimal separated numbers correctly.
    >>> t = ['5.1.1', '5.10.12','5.2.2', '5.2.19' ]
    >>> collated(t, [NUMERICAL, PERIOD_AS_COMMAS])
    ['5.1.1', '5.2.2', '5.2.19', '5.10.12']

    Collate also can be done in place by using collate() instead of
    collated().
    >>> t = ['Fred', 'Ron', 'Carol', 'Bob']
    >>> collate(t)
    >>> t
    ['Bob', 'Carol', 'Fred', 'Ron']

    """
    import doctest
    doctest.testmod()


if __name__ == '__main__':
    _test()
 
bearophileHUGS

This part of code uses integer "constants" to be or-ed (or added):

CAPS_FIRST = 1
NUMERICAL = 2
HYPHEN_AS_SPACE = 4
UNDERSCORE_AS_SPACE = 8
IGNORE_LEADING_WS = 16
COMMA_IN_NUMERALS = 32

....

def __init__(self, flag):
    self.flag = flag
def transform(self, s):
    """ Transform a string for collating.
    """
    if self.flag & CAPS_FIRST:
        s = s.swapcase()
    if self.flag & HYPHEN_AS_SPACE:
        s = s.replace('-', ' ')
    if self.flag & UNDERSCORE_AS_SPACE:
        s = s.replace('_', ' ')
    if self.flag & IGNORE_LEADING_WS:
        s = s.strip()
    if self.flag & NUMERICAL:
        if self.flag & COMMA_IN_NUMERALS:

This is used in C, but maybe for Python other solutions may be better.
I can see some different (untested) solutions:

1)

def selfassign(self, locals):
    # Code from web.py, modified.
    for key, value in locals.iteritems():
        if key != 'self':
            setattr(self, key, value)

def __init__(self,
             caps_first=False,
             hyphen_as_space=False,
             underscore_as_space=False,
             ignore_leading_ws=False,
             numerical=False,
             comma_in_numerals=False):
    selfassign(self, locals())

def transform(self, s):
    if self.caps_first:
        ...

Disadvantages: if a flag is added/modified, the code has to be modified
in two places.


2)

def __init__(self, **kwds):
    self.lflags = [k for k,v in kwds.items() if v]
def transform(self, s):
    if "caps_first" in self.lflags:
        ...

This class can be created with 1 instead of True, to shorten the code.

Disadvantages: the user of this class has to read the list of possible
flags from the class docstring or from the docs (and such
docs can be out of sync with the code).


3)

Tkinter (Tcl) shows that sometimes strings are better than int
constants (like using "left" instead of tkinter.LEFT, etc), so this is
another possible solution:

def __init__(self, flags=""):
    self.lflags = flags.lower().split()
def transform(self, s):
    if "caps_first" in self.lflags:
        ...

An example of calling this class:

.... = Collate("caps_first hyphen_as_space numerical")

I like this third (nonstandard) solution enough.

Bye,
bearophile
 
Ron Adam

Thanks, But I fixed it already. (almost) ;-)

I think I will use strings as you suggest, and verify they are valid so a typo
doesn't go through silently.

I ended up using a string-based option list. I agree a space separated string is
better and easier from a user point of view.

The advantage of the list is it can be iterated without splitting first. But
that's a minor thing. self.options = options.lower().split(' ') fixes that easily.
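
One detail worth checking on that split call: split(' ') and split() are not equivalent once the option string has repeated spaces or newlines in it. The sample string below is made up purely to show the difference.

```python
options = " caps_first   hyphen_as_space\n numerical "

by_space = options.lower().split(' ')       # splits on every single space
by_whitespace = options.lower().split()     # splits on runs of any whitespace

print(by_space)
print(by_whitespace)
```

With split(' ') the result contains empty strings and a name with a trailing newline, which a validity check would then reject, so the argument-less split() is probably the safer choice for a multi-line option string.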


Once I'm sure it's not going to get any major changes I'll post this as a
recipe. I think it's almost there.

Cheers and thanks,
Ron



3)

Tkinter (Tcl) shows that sometimes strings are better than int
constants (like using "left" instead of tkinter.LEFT, etc), so this is
another possible solution:


I think maybe this is better, but I need to verify the flags so typos don't go
through silently.
 
Ron Adam

This is how I changed it...

(I edited out the test and imports for posting here.)



locale.setlocale(locale.LC_ALL, '') # use current locale settings

class Collate(object):
    """ A general purpose and configurable collator class.
    """
    options = [ 'CAPS_FIRST', 'NUMERICAL', 'HYPHEN_AS_SPACE',
                'UNDERSCORE_AS_SPACE', 'IGNORE_LEADING_WS',
                'IGNORE_COMMAS', 'PERIOD_AS_COMMAS' ]

    def __init__(self, flags=""):
        if flags:
            flags = flags.upper().split()
            for value in flags:
                if value not in self.options:
                    raise ValueError('Invalid option: %s' % value)
        self.txtable = []
        if 'HYPHEN_AS_SPACE' in flags:
            self.txtable.append(('-', ' '))
        if 'UNDERSCORE_AS_SPACE' in flags:
            self.txtable.append(('_', ' '))
        if 'PERIOD_AS_COMMAS' in flags:
            self.txtable.append(('.', ','))
        if 'IGNORE_COMMAS' in flags:
            self.txtable.append((',', ''))
        self.flags = flags
        self.numrex = re.compile(r'([\d\.]*|\D*)', re.LOCALE)

    def transform(self, s):
        """ Transform a string for collating.
        """
        if not self.flags:
            return locale.strxfrm(s)
        for a, b in self.txtable:
            s = s.replace(a, b)
        if 'IGNORE_LEADING_WS' in self.flags:
            s = s.strip()
        if 'CAPS_FIRST' in self.flags:
            s = s.swapcase()
        if 'NUMERICAL' in self.flags:
            slist = self.numrex.split(s)
            for i, x in enumerate(slist):
                try:
                    slist[i] = float(x)
                except:
                    slist[i] = locale.strxfrm(x)
            return slist
        return locale.strxfrm(s)

    def __call__(self, a):
        """ This allows the Collate class to be used as a sort key.

            USE: list.sort(key=Collate(flags))
        """
        return self.transform(a)

def collate(slist, flags=""):
    """ Collate list of strings in place.
    """
    slist.sort(key=Collate(flags).transform)

def collated(slist, flags=""):
    """ Return a collated list of strings.
    """
    return sorted(slist, key=Collate(flags).transform)
 
bearophileHUGS

Ron Adam:

Instead of:

def __init__(self, flags=[]):
    self.flags = flags
    self.numrex = re.compile(r'([\d\.]*|\D*)', re.LOCALE)
    self.txtable = []
    if HYPHEN_AS_SPACE in flags:
        self.txtable.append(('-', ' '))
    if UNDERSCORE_AS_SPACE in flags:
        self.txtable.append(('_', ' '))
    if PERIOD_AS_COMMAS in flags:
        self.txtable.append(('.', ','))
    if IGNORE_COMMAS in flags:
        self.txtable.append((',', ''))
    self.flags = flags

I think using a non-mutable flags default is safer; this is an
alternative (NOT tested!):

numrex = re.compile(r'[\d\.]* | \D*', re.LOCALE|re.VERBOSE)
dflags = {"hyphen_as_space": ('-', ' '),
          "underscore_as_space": ('_', ' '),
          "period_as_commas": ('.', ','),
          "ignore_commas": (',', ''),
          ...
         }

def __init__(self, flags=()):
    self.flags = [fl.strip().lower() for fl in flags]
    self.txtable = []
    df = self.__class__.dflags
    for flag in self.flags:
        if flag in df:
            self.txtable.append(df[flag])
    ...

This is just an idea, it surely has some problems that have to be
fixed.
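
For anyone wondering why a mutable default is considered risky, here is a tiny sketch of the pitfall. Both function names are hypothetical and have nothing to do with the module. The default list is created once, when the function is defined, and is then shared by every call that does not pass its own list.

```python
def add_flag_shared(flag, flags=[]):
    """Risky: every call without 'flags' appends to the same list object."""
    flags.append(flag)
    return flags

def add_flag_fresh(flag, flags=()):
    """Safer: an immutable default, copied into a new list on each call."""
    return list(flags) + [flag]
```

Calling add_flag_shared('caps_first') and then add_flag_shared('numerical') returns the very same list both times, now holding both names, while add_flag_fresh gives each call its own list. The Collate.__init__ in the thread never mutates flags, so it is not actually bitten by this, but the tuple default removes the trap.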

Bye,
bearophile
 
Gabriel Genellina

I put together the following module today and would like some feedback on any
obvious problems. Or even opinions of whether or not it is a good approach.
if self.flag & CAPS_FIRST:
s = s.swapcase()

This is just coincidental; it relies on (lowercase)<(uppercase) on
the locale collating sequence, and I don't see why it should be always so.
if self.flag & IGNORE_LEADING_WS:
s = s.strip()

This ignores trailing ws too. (lstrip?)
if self.flag & NUMERICAL:
    if self.flag & COMMA_IN_NUMERALS:
        rex = re.compile('^(\d*\,?\d*\.?\d*)(\D*)(\d*\,?\d*\.?\d*)',
                         re.LOCALE)
    else:
        rex = re.compile('^(\d*\.?\d*)(\D*)(\d*\.?\d*)', re.LOCALE)
    slist = rex.split(s)
    for i, x in enumerate(slist):
        if self.flag & COMMA_IN_NUMERALS:
            x = x.replace(',', '')
        try:
            slist[i] = float(x)
        except:
            slist[i] = locale.strxfrm(x)
    return slist
return locale.strxfrm(s)


You should try to make this part a bit more generic. If you are
concerned about locales, do not use "comma" explicitly. In other
countries 10*100=1.000 - and 1,234 is a fraction between 1 and 2.
The NUMERICAL option orders leading and trailing digits as numerals.
t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
collated(t, NUMERICAL)
['4abc', '13.5b', '20abc', 'a5', 'a10.2', 'a40', 'b2']

From the name "NUMERICAL" I would expect this sorting: b2, 4abc, a5,
a10.2, 13.5b, 20abc, a40 (that is, sorting as numbers only).
Maybe GROUP_NUMBERS... but I don't like that too much either...


--
Gabriel Genellina
Softlab SRL





 
Ron Adam

Gabriel said:
This is just coincidental; it relies on (lowercase)<(uppercase) on the
locale collating sequence, and I don't see why it should be always so.

The LC_COLLATE structure (in the python.exe C code I think) controls the order
of upper and lower case during collating. Unfortunately, I don't know if there
is any way to examine it.

If there was a way to change the LC_COLLATE structure, I wouldn't need to resort
to tricks like s.swapcase(). But without that info, I don't know of another way.

Maybe changing the CAPS_FIRST to REVERSE_CAPS_ORDER would do?


I'm not sure if this would make any visible difference. It might determine order
of two strings where they are the same, but one has white space at the end the
other doesn't.

They run at the same speed either way, so I'll go ahead and change it. Thanks.

This ignores trailing ws too. (lstrip?)
if self.flag & NUMERICAL:
    if self.flag & COMMA_IN_NUMERALS:
        rex = re.compile('^(\d*\,?\d*\.?\d*)(\D*)(\d*\,?\d*\.?\d*)',
                         re.LOCALE)
    else:
        rex = re.compile('^(\d*\.?\d*)(\D*)(\d*\.?\d*)',
                         re.LOCALE)
    slist = rex.split(s)
    for i, x in enumerate(slist):
        if self.flag & COMMA_IN_NUMERALS:
            x = x.replace(',', '')
        try:
            slist[i] = float(x)
        except:
            slist[i] = locale.strxfrm(x)
    return slist
return locale.strxfrm(s)


You should try to make this part a bit more generic. If you are
concerned about locales, do not use "comma" explicitly. In other
countries 10*100=1.000 - and 1,234 is a fraction between 1 and 2.


See the most recent version of this I posted. It is a bit more generic.


Maybe a 'comma_is_decimal' option?

Options are cheap so it's no problem to add them as long as they make sense. ;-)

These options are what I refer to as mid-level options. The programmer does
still need to know something about the data they are collating. They may still
need to do some preprocessing even with this, but maybe not as much.

In a higher level collation routine, I think you would just need to specify a
named sort type, such as 'dictionary', 'directory', 'inventory' and it would set
the options accordingly. The problem with that approach is the higher level
definitions may be different depending on locale or even the field it is used in.

The NUMERICAL option orders leading and trailing digits as numerals.
t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
collated(t, NUMERICAL)
['4abc', '13.5b', '20abc', 'a5', 'a10.2', 'a40', 'b2']

From the name "NUMERICAL" I would expect this sorting: b2, 4abc, a5,
a10.2, 13.5b, 20abc, a40 (that is, sorting as numbers only).
Maybe GROUP_NUMBERS... but I don't like that too much either...


How about 'VALUE_ORDERING' ?

The term I've seen before is natural ordering, but that is more general
and can include dates, roman numerals, as well as other types.


Cheers,
Ron
 
Gabriel Genellina

The LC_COLLATE structure (in the python.exe C code I think) controls
the order
of upper and lower case during collating. I don't know if there is anyway to
examine it unfortunately.

LC_COLLATE is just a #define'd constant. I don't know how to examine
the collating definition, either.
If there was a way to change the LC_COLLATE structure, I wouldn't
need to resort
to tricks like s.swapcase(). But without that info, I don't know of
another way.

Maybe changing the CAPS_FIRST to REVERSE_CAPS_ORDER would do?

At least it's a more accurate name.
There is an indirect way: test locale.strcoll("A","a") and see how
they get sorted. Then define options CAPS_FIRST, LOWER_FIRST
accordingly. But maybe it's too much trouble...
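
A minimal sketch of that indirect test follows. The two function names are invented for illustration, and the result depends entirely on whatever LC_COLLATE locale is active when it runs.

```python
import locale

def caps_sort_first():
    """Return True if the active collation order puts 'A' before 'a'."""
    return locale.strcoll("A", "a") < 0

def default_case_option():
    """Hypothetical helper: name the ordering the locale already gives."""
    # A result of 0 (the two compare equal) falls into LOWER_FIRST here.
    return "CAPS_FIRST" if caps_sort_first() else "LOWER_FIRST"
```

In the plain "C" locale strcoll() behaves like a byte comparison, so caps_sort_first() is True there. In other locales the answer can differ, which is the reason for probing instead of assuming.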
See the most recent version of this I posted. It is a bit more generic.


Maybe a 'comma_is_decimal' option?

I'd prefer to use the 'decimal_point' and 'thousands_sep' from the
locale information. That would be more coherent with the locale usage
along your module.
Options are cheap so it's no problem to add them as long as they
make sense. ;-)

These options are what I refer to as mid-level options. The programmer does
still need to know something about the data they are
collating. They may still
need to do some preprocessing even with this, but maybe not as much.

In a higher level collation routine, I think you would just need to specify a
named sort type, such as 'dictionary', 'directory', 'inventory' and
it would set
the options accordingly. The problem with that approach is the
higher level
definitions may be different depending on locale or even the field
it is used in.

Sure. But your module is a good starting point for building a more
high-level procedure.
The NUMERICAL option orders leading and trailing digits as numerals.

t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
collated(t, NUMERICAL)
['4abc', '13.5b', '20abc', 'a5', 'a10.2', 'a40', 'b2']

From the name "NUMERICAL" I would expect this sorting: b2, 4abc, a5,
a10.2, 13.5b, 20abc, a40 (that is, sorting as numbers only).
Maybe GROUP_NUMBERS... but I don't like that too much either...

How about 'VALUE_ORDERING' ?

The term I've seen before is called natural ordering, but that is
more general
and can include date, roman numerals, as well as other type.

Sometimes that's the hard part, finding a name which is concise,
descriptive, and accurately reflects what the code does. A good name
should make obvious what it is used for (being these option names, or
class names, or method names...) but in this case it may be difficult
to find a good one. So users will have to read the documentation (a
good thing, anyway!)


--
Gabriel Genellina
Softlab SRL





 
Leo Kislov

Ron said:
locale.setlocale(locale.LC_ALL, '') # use current locale settings

It's not the current locale settings, it's the user's locale settings.
The application may actually be using something else, and you will overwrite
that. You can also unexpectedly affect time.strftime() and C extensions.
So you should move this call into the _test() function and explain in
the documentation that the application should call locale.setlocale
itself.
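
One possible way to keep a library from clobbering the application's locale is sketched below. This is only an option, not what the module does: the name collation_locale is hypothetical, it touches LC_COLLATE only, and since setlocale() is process-wide it is not safe to use from several threads at once.

```python
import contextlib
import locale

@contextlib.contextmanager
def collation_locale(name=''):
    """Temporarily switch LC_COLLATE, then restore the previous setting."""
    saved = locale.setlocale(locale.LC_COLLATE)   # no second argument: query only
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)
```

Usage would look like "with collation_locale(''): names.sort(key=locale.strxfrm)", which sorts under the user's locale and leaves the rest of the program as it was.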

self.numrex = re.compile(r'([\d\.]*|\D*)', re.LOCALE)
[snip]

if NUMERICAL in self.flags:
    slist = self.numrex.split(s)
    for i, x in enumerate(slist):
        try:
            slist[i] = float(x)
        except:
            slist[i] = locale.strxfrm(x)


I think you should call locale.atof instead of float, since you call
re.compile with re.LOCALE.

Everything else looks fine. The biggest missing piece is support for
unicode strings.

-- Leo.
 
Ron Adam

Leo said:
It's not current locale settings, it's user's locale settings.
Application can actually use something else and you will overwrite
that. You can also affect (unexpectedly to the application)
time.strftime() and C extensions. So you should move this call into the
_test() function and put explanation into the documentation that
application should call locale.setlocale

I'll experiment with this a bit. I was under the impression that locale.strxfrm
needed the locale set for it to work correctly.

Maybe it would be better to have two (or more) versions? A string, unicode, and
locale version or maybe add an option to __init__ to choose the behavior?
Multiple versions seems to be the approach of pre-py3k. Although I was trying
to avoid that.

Sigh, of course issues like this are why it is better to have a module to do this
with. If it was as simple as just calling sort() I wouldn't have bothered. ;-)

self.numrex = re.compile(r'([\d\.]*|\D*)', re.LOCALE)
[snip]

if NUMERICAL in self.flags:
    slist = self.numrex.split(s)
    for i, x in enumerate(slist):
        try:
            slist[i] = float(x)
        except:
            slist[i] = locale.strxfrm(x)


I think you should call locale.atof instead of float, since you call
re.compile with re.LOCALE.


I think you are correct, but it seems locale.atof() is a *lot* slower than
float(). :(

Here's the locale.atof() code.

def atof(string,func=float):
    "Parses a string as a float according to the locale settings."
    #First, get rid of the grouping
    ts = localeconv()['thousands_sep']

    if ts:
        string = string.replace(ts, '')
    #next, replace the decimal point with a dot
    dd = localeconv()['decimal_point']
    if dd:
        string = string.replace(dd, '.')
    #finally, parse the string
    return func(string)


I could set ts and dd in __init__ and just do the replacements in the try...

if NUMERICAL in self.flags:
    slist = self.numrex.split(s)
    for i, x in enumerate(slist):
        if x:  # slist may contain null strings
            xx = x
            if self.ts:
                xx = xx.replace(self.ts, '')   # remove thousands sep
            if self.dd:
                xx = xx.replace(self.dd, '.')  # replace decimal point
            try:
                slist[i] = float(xx)
            except:
                slist[i] = locale.strxfrm(x)

How does that look?

It needs a fast way to determine if x is a number or a string. Any suggestions?
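
The idea of looking up the separators once can be sketched as a closure. This is a guess at one way to do it, not tested against the module. The name make_number_parser is invented, and the separators are read when the parser is created, so a later setlocale() call would not be noticed.

```python
import locale

def make_number_parser():
    """Return a str -> float function using the current locale's separators."""
    conv = locale.localeconv()          # looked up once, not on every call
    ts = conv['thousands_sep']
    dd = conv['decimal_point']

    def parse(text):
        if ts:
            text = text.replace(ts, '')     # drop digit grouping
        if dd:
            text = text.replace(dd, '.')    # normalise the decimal point
        return float(text)                  # ValueError if it is not a number

    return parse
```

A Collate object could build such a parser in __init__ and call it inside the loop, which avoids the two localeconv() lookups that locale.atof() performs on every call.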


Everything else looks fine. The biggest missing piece is support for
unicode strings.


This was the reason for using locale.strxfrm. It should let it work with unicode
strings from what I could figure out from the documents.

Am I missing something?

Thanks,
Ron
 
Ron Adam

I think the 'if's are ok since there are only a few options that need to be
handled by them.

I'm still trying to determine what options are really needed. I can get the
thousands separator and decimal character from the locale.localeconv() function,
so ignore_commas isn't needed, I think. And maybe change period_as_commas to
period_as_sep and then split on periods before comparing.

I also want it to issue exceptions when the Collate object is created if invalid
options are specified. That makes finding problems much easier. The example
above doesn't do that, it accepts them silently. That was one of the reasons I
went to named constants at first.

How does this look?

numrex = re.compile(r'([\d\.]* | \D*)', re.LOCALE|re.VERBOSE)
options = ( 'CAPS_FIRST', 'NUMERICAL', 'HYPHEN_AS_SPACE',
            'UNDERSCORE_AS_SPACE', 'IGNORE_LEADING_WS',
            'IGNORE_COMMAS', 'PERIOD_AS_COMMAS' )
def __init__(self, flags=""):
    if flags:
        flags = flags.upper().split()
        for value in flags:
            if value not in self.options:
                raise ValueError('Invalid option: %s' % value)
    self.txtable = []
    if 'HYPHEN_AS_SPACE' in flags:
        self.txtable.append(('-', ' '))
    if 'UNDERSCORE_AS_SPACE' in flags:
        self.txtable.append(('_', ' '))
    if 'PERIOD_AS_COMMAS' in flags:
        self.txtable.append(('.', ','))
    if 'IGNORE_COMMAS' in flags:
        self.txtable.append((',', ''))
    self.flags = flags



So you can set an option string as...


import collate as C

collateopts = \
""" caps_first
hyphen_as_space
numerical
ignore_commas
"""
collatedlist = C.collated(somelist, collateopts)


A nice advantage with an option string is you don't have to prepend all your
options with the module name. But you do have to validate it.
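
The validation and the table-driven idea can be combined. The following is only a sketch under assumptions: the names OPTIONS and parse_options are hypothetical, and it is written for current Python. It rejects unknown names up front and builds the replacement table from one dictionary, in the dictionary's order rather than the caller's, since the order of replacements matters when both PERIOD_AS_COMMAS and IGNORE_COMMAS are given.

```python
# Hypothetical option table: name -> (old, new) replacement pair,
# or None for options that are handled elsewhere in the class.
OPTIONS = {
    'CAPS_FIRST': None,
    'NUMERICAL': None,
    'IGNORE_LEADING_WS': None,
    'HYPHEN_AS_SPACE': ('-', ' '),
    'UNDERSCORE_AS_SPACE': ('_', ' '),
    'PERIOD_AS_COMMAS': ('.', ','),
    'IGNORE_COMMAS': (',', ''),
}

def parse_options(optstring=""):
    """Split an option string, validate it, and build the replacement table."""
    flags = optstring.upper().split()
    for name in flags:
        if name not in OPTIONS:
            raise ValueError('Invalid option: %s' % name)
    # Walk the table, not the flags, so replacements apply in a fixed order.
    txtable = [pair for name, pair in OPTIONS.items()
               if pair and name in flags]
    return flags, txtable

flags, txtable = parse_options("caps_first hyphen_as_space")
print(flags, txtable)
```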

Cheers,
Ron
 
R

Ron Adam

Gabriel said:
At least it's a more accurate name.
There is an indirect way: test locale.strcoll("A","a") and see how they
get sorted. Then define options CAPS_FIRST, LOWER_FIRST accordingly. But
maybe it's too much trouble...

I should have thought of that simple test. Thanks. :)

That would work, and it's really not that much trouble. The actual test can be
done during importing. Maybe determining other useful values could be done at
that time as well.
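
A minimal sketch of that import-time test, assuming the hypothetical names caps_sort_first and CAPS_FIRST_DEFAULT (neither is part of the module as posted):

```python
import locale

def caps_sort_first():
    """Return True if the active locale collates 'A' before 'a'."""
    return locale.strcoll("A", "a") < 0

# Evaluated once at import; it reflects whatever locale is active then.
CAPS_FIRST_DEFAULT = caps_sort_first()
```

One caveat with doing this at import: if the application calls locale.setlocale afterwards, the stored value can be stale, so calling the function when the Collate object is created may be the safer choice.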

I'd prefer to use the 'decimal_point' and 'thousands_sep' from the
locale information. That would be more coherent with the locale usage
along your module.

Yes, it looks like I'll need to do this.

Sometimes that's the hard part, finding a name which is concise,
descriptive, and accurately reflects what the code does. A good name
should make obvious what it is used for (being these option names, or
class names, or method names...) but in this case it may be difficult to
find a good one. So users will have to read the documentation (a good
thing, anyway!)

I plan on making the doc strings a bit more informative so that help(collate)
will give meaningful information on its options.

Cheers,
Ron
 
L

Leo Kislov

Ron said:
I'll experiment with this a bit, I was under the impression that locale.strxfrm
needed the locale set for it to work correctly.

Actually locale.strxfrm and all other functions in locale module work
as designed: they work in C locale before the first call to
locale.setlocale. This is by design, call to locale.setlocale should be
done by an application, not by a 3rd party module like your collation
module.

Maybe it would be better to have two (or more) versions? A string, unicode, and
locale version or maybe add an option to __init__ to choose the behavior?

I don't think it should be two separate versions. Unicode support is
only a matter of code like this:

# in the constructor
self.encoding = locale.getpreferredencoding()

# class method
def strxfrm(self, s):
    if type(s) is unicode:
        return locale.strxfrm(s.encode(self.encoding, 'replace'))
    return locale.strxfrm(s)

and then instead of locale.strxfrm call self.strxfrm. And similar code
for locale.atof.

This was the reason for using locale.strxfrm. It should let it work with unicode
strings from what I could figure out from the documents.

Am I missing something?

strxfrm works only with byte strings encoded in the system encoding.

-- Leo
 
R

Ron Adam

Leo said:
Actually locale.strxfrm and all other functions in locale module work
as designed: they work in C locale before the first call to
locale.setlocale. This is by design, call to locale.setlocale should be
done by an application, not by a 3rd party module like your collation
module.

Yes, I've come to that conclusion also. (researching as I go) ;-)

I put an example of that in the class doc string so it could easily be found.
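
The doc string example itself isn't shown in this thread. As a rough sketch of the division of labor being described (the function name main is hypothetical, and this is written for current Python), the application selects the locale once and the collation code only uses whatever is active:

```python
import locale

def main(words):
    # The application, not the collation module, selects the locale.
    try:
        locale.setlocale(locale.LC_ALL, '')
    except locale.Error:
        pass  # keep the default "C" locale if the user's is unavailable
    # Library code would then just rely on the active locale.
    return sorted(words, key=locale.strxfrm)

print(main(['b', 'a', 'c']))
```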


I don't think it should be two separate versions. Unicode support is
only a matter of code like this:

# in the constructor
self.encoding = locale.getpreferredencoding()

# class method
def strxfrm(self, s):
    if type(s) is unicode:
        return locale.strxfrm(s.encode(self.encoding, 'replace'))
    return locale.strxfrm(s)

and then instead of locale.strxfrm call self.strxfrm. And similar code
for locale.atof

Thanks for the example.

strxfrm works only with byte strings encoded in the system encoding.

-- Leo


Windows has an alternative function, wcsxfrm. (wide character transform)

http://msdn.microsoft.com/library/d...en-us/vclib/html/_crt_strxfrm.2c_.wcsxfrm.asp

But it's not exposed in Python. I could use ctypes to call it, but it would then
be windows specific and I doubt it would even work as expected.

Maybe a wcsxfrm patch would be good for Python 2.6? Python 3000 will probably
need it anyway.


I've made a few additional changes and will start a new thread after some more
testing to get some additional feedback.

Cheers,
Ron
 
