Flexible Collating (feedback please)

Ron Adam · Oct 18, 2006

I put together the following module today and would like some feedback on any
obvious problems, or even opinions on whether or not it is a good approach.

While collating is not a difficult thing to do for experienced programmers, I
have seen quite a lot of poorly sorted lists in commercial applications, so it
seems it would be good to have an easy to use ready made API for collating.

I tried to make this both easy to use and flexible. My first thought was to
try to target actual uses such as phone directory sorting or library sorting,
but it seemed that using keywords to alter the behavior is both easier and more
flexible.

I think the regular expressions I used to parse leading and trailing numerals
could be improved. They work, but you will probably get inconsistent results if
the strings are not well formed. Any suggestions on this would be appreciated.

Should I try to extend it to cover dates and currency sorting? Probably those
types should be converted before sorting, but maybe sometimes it's useful
not to?

Another variation is collating Dewey Decimal strings. It should be easy to add
if someone thinks that might be useful.

I haven't tested this in *anything* yet, so don't plug it into production code
of any type. I also haven't done any performance testing.

See the doc tests below for examples of how it's used.

Cheers,
Ron Adam

"""
Collate.py

A general purpose configurable collate module.

Collation can be modified with the following keywords:

CAPS_FIRST -> Aaa, aaa, Bbb, bbb
HYPHEN_AS_SPACE -> Don't ignore hyphens
UNDERSCORE_AS_SPACE -> Underscores as white space
IGNORE_LEADING_WS -> Disregard leading white space
NUMERICAL -> Digit sequences as numerals
COMMA_IN_NUMERALS -> Allow commas in numerals

* See doctests for examples.

Author: Ron Adam, (e-mail address removed), 10/18/2006

"""
import re
import locale

locale.setlocale(locale.LC_ALL, '') # use current locale settings

# The above line may change the string constants from the string
# module. This may have unintended effects if your program
# assumes they are always the ascii defaults.

CAPS_FIRST = 1
NUMERICAL = 2
HYPHEN_AS_SPACE = 4
UNDERSCORE_AS_SPACE = 8
IGNORE_LEADING_WS = 16
COMMA_IN_NUMERALS = 32

class Collate(object):
    """ A general purpose and configurable collator class.
    """
    def __init__(self, flag):
        self.flag = flag

    def transform(self, s):
        """ Transform a string for collating.
        """
        if self.flag & CAPS_FIRST:
            s = s.swapcase()
        if self.flag & HYPHEN_AS_SPACE:
            s = s.replace('-', ' ')
        if self.flag & UNDERSCORE_AS_SPACE:
            s = s.replace('_', ' ')
        if self.flag & IGNORE_LEADING_WS:
            s = s.strip()
        if self.flag & NUMERICAL:
            if self.flag & COMMA_IN_NUMERALS:
                rex = re.compile('^(\d*\,?\d*\.?\d*)(\D*)(\d*\,?\d*\.?\d*)',
                                 re.LOCALE)
            else:
                rex = re.compile('^(\d*\.?\d*)(\D*)(\d*\.?\d*)', re.LOCALE)
            slist = rex.split(s)
            for i, x in enumerate(slist):
                if self.flag & COMMA_IN_NUMERALS:
                    x = x.replace(',', '')
                try:
                    slist[i] = float(x)
                except:
                    slist[i] = locale.strxfrm(x)
            return slist
        return locale.strxfrm(s)

    def __call__(self, a, b):
        """ This allows the Collate class work as a sort key.

        USE: list.sort(key=Collate(flags))
        """
        return cmp(self.transform(a), self.transform(b))

def collate(slist, flags=0):
    """ Collate list of strings in place.
    """
    return slist.sort(Collate(flags))

def collated(slist, flags=0):
    """ Return a collated list of strings.

    This is a decorate-undecorate collate.
    """
    collator = Collate(flags)
    dd = [(collator.transform(x), x) for x in slist]
    dd.sort()
    return list([B for (A, B) in dd])

def _test():
    """
    DOC TESTS AND EXAMPLES:

    Sort (and sorted) normally order all words beginning with caps
    before all words beginning with lower case.

    >>> t = ['tuesday', 'Tuesday', 'Monday', 'monday']
    >>> sorted(t) # regular sort
    ['Monday', 'Tuesday', 'monday', 'tuesday']

    Locale collation puts words beginning with caps after words
    beginning with lower case of the same letter.

    >>> collated(t)
    ['monday', 'Monday', 'tuesday', 'Tuesday']

    The CAPS_FIRST option can be used to put all words beginning
    with caps after words beginning in lowercase of the same letter.

    >>> collated(t, CAPS_FIRST)
    ['Monday', 'monday', 'Tuesday', 'tuesday']

    The HYPHEN_AS_SPACE option causes hyphens to be equal to space.

    >>> t = ['a-b', 'b-a', 'aa-b', 'bb-a']
    >>> collated(t)
    ['aa-b', 'a-b', 'b-a', 'bb-a']

    >>> collated(t, HYPHEN_AS_SPACE)
    ['a-b', 'aa-b', 'b-a', 'bb-a']

    The IGNORE_LEADING_WS and UNDERSCORE_AS_SPACE options can be
    used together to improve ordering in some situations.

    >>> t = ['sum', '__str__', 'about', ' round']
    >>> collated(t)
    [' round', '__str__', 'about', 'sum']

    >>> collated(t, IGNORE_LEADING_WS)
    ['__str__', 'about', ' round', 'sum']

    >>> collated(t, UNDERSCORE_AS_SPACE)
    [' round', '__str__', 'about', 'sum']

    >>> collated(t, IGNORE_LEADING_WS|UNDERSCORE_AS_SPACE)
    ['about', ' round', '__str__', 'sum']

    The NUMERICAL option orders leading and trailing digits as numerals.

    >>> t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
    >>> collated(t, NUMERICAL)
    ['4abc', '13.5b', '20abc', 'a5', 'a10.2', 'a40', 'b2']

    The COMMA_IN_NUMERALS option ignores commas instead of using them to
    separate numerals.

    >>> t = ['a5', 'a4,000', '500b', '100,000b']
    >>> collated(t, NUMERICAL|COMMA_IN_NUMERALS)
    ['500b', '100,000b', 'a5', 'a4,000']

    Collating also can be done in place using collate() instead of collated().

    >>> t = ['Fred', 'Ron', 'Carol', 'Bob']
    >>> collate(t)
    >>> t
    ['Bob', 'Carol', 'Fred', 'Ron']

    """
    import doctest
    doctest.testmod()

if __name__ == '__main__':
    _test()

Ron Adam · Oct 18, 2006

Fixed...

Changed the collate() function to return None the same as sort() since it is an
in place collate.

A comment in _test() doctests was reversed. CAPS_FIRST option puts words
beginning with capitals before, not after, words beginning with lower case of
the same letter.

It seems I always find a few obvious glitches right after I post something. ;-)

Cheers,
Ron

georgeryoung · Oct 18, 2006

I put together the following module today and would like some feedback on any
obvious problems. Or even opinions of whether or not it is a good approach.

...
    def __call__(self, a, b):
        """ This allows the Collate class work as a sort key.

        USE: list.sort(key=Collate(flags))
        """
        return cmp(self.transform(a), self.transform(b))

You document __call__ as useful for the "key" keyword to sort, but you
implement it for the "cmp" keyword. The "key" allows much better
performance, since it's called only once per value. Maybe just:
return self.transform(a)
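
To make the difference concrete, here is a small sketch (not part of the posted module) that counts how often a transform runs under each style. It is written for current Python 3, where sort() no longer takes a cmp argument and functools.cmp_to_key stands in for it. The names transform, as_key and as_cmp are made up for the illustration, and transform is only a lowercase stand-in for Collate.transform.

```python
from functools import cmp_to_key

calls = {"key": 0, "cmp": 0}

def transform(s):
    # Stand-in for Collate.transform; it only lowercases here.
    return s.lower()

def as_key(s):
    # Key style: the sort calls this once for each element.
    calls["key"] += 1
    return transform(s)

def as_cmp(a, b):
    # Comparison style: both arguments are transformed on every comparison.
    calls["cmp"] += 2
    ta, tb = transform(a), transform(b)
    return (ta > tb) - (ta < tb)

words = ["pear", "Apple", "fig", "Banana", "cherry", "date"]
by_key = sorted(words, key=as_key)
by_cmp = sorted(words, key=cmp_to_key(as_cmp))
```

Both calls give the same order, but calls["key"] ends up equal to len(words) while calls["cmp"] is larger, and the gap widens as the list grows.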

-- George

Ron Adam · Oct 18, 2006

Thanks, I changed it to the following...

    def __call__(self, a):
        """ This allows the Collate class work as a sort key.

        USE: list.sort(key=Collate(flags))
        """
        return self.transform(a)

And also changed the sort call here ...

def collate(slist, flags=0):
    """ Collate list of strings in place.
    """
    slist.sort(key=Collate(flags))    # <<<

Today I'll do some performance tests to see how much faster it is for moderate
sized lists.

Cheers,
Ron

Ron Adam · Oct 18, 2006

I made a number of changes ... (the new version is listed below)

These changes also resulted in improving the speed by about 3 times when all
flags are specified.

Collating now takes about 1/3 (or less) of the time. Although it is still quite a
bit slower than a bare list.sort(), that is to be expected, as collate is locale
aware and does additional transformations on the data which you would need to do
anyway. The tests were done with Unicode strings as well.

Changed the flag types from integer values to a list of named strings. The
reason for this is it makes finding errors easier and you can examine the flags
attribute and get a readable list of flags.

A better regular expression for separating numerals. It now separates numerals
in the middle of the string.

Changed flag COMMA_IN_NUMERALS to IGNORE_COMMAS. This was how it was implemented.

Added flag PERIOD_AS_COMMAS

This lets you collate decimal separated numbers correctly such as version
numbers and internet addresses. It also prevents numerals from being
interpreted as floating point or decimal.
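
As a rough illustration of why the periods must not be read as decimal points, here is a sketch that treats every period separated field as its own integer. It is not how the module does it (the listing swaps periods for commas and leaves the split to the numeral regex); dotted_key is a hypothetical helper and it assumes every field is a plain integer.

```python
def dotted_key(s):
    # Hypothetical helper: compare each period-separated field as an integer.
    return [int(field) for field in s.split('.')]

versions = ['5.1.1', '5.10.12', '5.2.2', '5.2.19']
as_text = sorted(versions)                    # plain string comparison
as_fields = sorted(versions, key=dotted_key)  # field-by-field comparison
```

The plain string sort puts '5.10.12' ahead of '5.2.2', while the field-by-field key gives the order shown in the PERIOD_AS_COMMAS doctest.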

It might make more sense to implement it as PERIOD_IS_SEPARATOR. Needed?

Other minor changes to doc strings and tests were made.

Any feedback is welcome.

Cheers,
Ron

"""
Collate.py

A general purpose configurable collate module.

Collation can be modified with the following keywords:

CAPS_FIRST -> Aaa, aaa, Bbb, bbb
HYPHEN_AS_SPACE -> Don't ignore hyphens
UNDERSCORE_AS_SPACE -> Underscores as white space
IGNORE_LEADING_WS -> Disregard leading white space
NUMERICAL -> Digit sequences as numerals
IGNORE_COMMAS -> Allow commas in numerals
PERIOD_AS_COMMAS -> Periods can separate numerals.

* See doctests for examples.

Author: Ron Adam, (e-mail address removed)

"""
__version__ = '0.02 (pre-alpha) 10/18/2006'

import re
import locale
import string

locale.setlocale(locale.LC_ALL, '') # use current locale settings

# The above line may change the string constants from the string
# module. This may have unintended effects if your program
# assumes they are always the ascii defaults.

CAPS_FIRST = 'CAPS_FIRST'
HYPHEN_AS_SPACE = 'HYPHEN_AS_SPACE'
UNDERSCORE_AS_SPACE = 'UNDERSCORE_AS_SPACE'
IGNORE_LEADING_WS = 'IGNORE_LEADING_WS'
NUMERICAL = 'NUMERICAL'
IGNORE_COMMAS = 'IGNORE_COMMAS'
PERIOD_AS_COMMAS = 'PERIOD_AS_COMMAS'

class Collate(object):
    """ A general purpose and configurable collator class.
    """
    def __init__(self, flags=[]):
        self.flags = flags
        self.numrex = re.compile(r'([\d\.]*|\D*)', re.LOCALE)
        self.txtable = []
        if HYPHEN_AS_SPACE in flags:
            self.txtable.append(('-', ' '))
        if UNDERSCORE_AS_SPACE in flags:
            self.txtable.append(('_', ' '))
        if PERIOD_AS_COMMAS in flags:
            self.txtable.append(('.', ','))
        if IGNORE_COMMAS in flags:
            self.txtable.append((',', ''))
        self.flags = flags

    def transform(self, s):
        """ Transform a string for collating.
        """
        if not self.flags:
            return locale.strxfrm(s)
        for a, b in self.txtable:
            s = s.replace(a, b)
        if IGNORE_LEADING_WS in self.flags:
            s = s.strip()
        if CAPS_FIRST in self.flags:
            s = s.swapcase()
        if NUMERICAL in self.flags:
            slist = self.numrex.split(s)
            for i, x in enumerate(slist):
                try:
                    slist[i] = float(x)
                except:
                    slist[i] = locale.strxfrm(x)
            return slist
        return locale.strxfrm(s)

    def __call__(self, a):
        """ This allows the Collate class work as a sort key.

        USE: list.sort(key=Collate(flags))
        """
        return self.transform(a)

def collate(slist, flags=[]):
    """ Collate list of strings in place.
    """
    slist.sort(key=Collate(flags).transform)

def collated(slist, flags=[]):
    """ Return a collated list of strings.
    """
    return sorted(slist, key=Collate(flags).transform)

def _test():
    """
    DOC TESTS AND EXAMPLES:

    Sort (and sorted) normally order all words beginning with caps
    before all words beginning with lower case.

    >>> t = ['tuesday', 'Tuesday', 'Monday', 'monday']
    >>> sorted(t) # regular sort
    ['Monday', 'Tuesday', 'monday', 'tuesday']

    Locale collation puts words beginning with caps after words
    beginning with lower case of the same letter.

    >>> collated(t)
    ['monday', 'Monday', 'tuesday', 'Tuesday']

    The CAPS_FIRST option can be used to put all words beginning
    with caps before words beginning in lowercase of the same letter.

    >>> collated(t, [CAPS_FIRST])
    ['Monday', 'monday', 'Tuesday', 'tuesday']

    The HYPHEN_AS_SPACE option causes hyphens to be equal to space.

    >>> t = ['a-b', 'b-a', 'aa-b', 'bb-a']
    >>> collated(t)
    ['aa-b', 'a-b', 'b-a', 'bb-a']

    >>> collated(t, [HYPHEN_AS_SPACE])
    ['a-b', 'aa-b', 'b-a', 'bb-a']

    The IGNORE_LEADING_WS and UNDERSCORE_AS_SPACE options can be
    used together to improve ordering in some situations.

    >>> t = ['sum', '__str__', 'about', ' round']
    >>> collated(t)
    [' round', '__str__', 'about', 'sum']

    >>> collated(t, [IGNORE_LEADING_WS])
    ['__str__', 'about', ' round', 'sum']

    >>> collated(t, [UNDERSCORE_AS_SPACE])
    [' round', '__str__', 'about', 'sum']

    >>> collated(t, [IGNORE_LEADING_WS, UNDERSCORE_AS_SPACE])
    ['about', ' round', '__str__', 'sum']

    The NUMERICAL option orders sequences of digits as numerals.

    >>> t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
    >>> collated(t, [NUMERICAL])
    ['4abc', '13.5b', '20abc', 'a5', 'a10.2', 'a40', 'b2']

    The IGNORE_COMMAS option prevents commas from separating numerals.

    >>> t = ['a5', 'a4,000', '500b', '100,000b']
    >>> collated(t, [NUMERICAL, IGNORE_COMMAS])
    ['500b', '100,000b', 'a5', 'a4,000']

    The PERIOD_AS_COMMAS option can be used to sort version numbers
    and other decimal separated numbers correctly.

    >>> t = ['5.1.1', '5.10.12','5.2.2', '5.2.19' ]
    >>> collated(t, [NUMERICAL, PERIOD_AS_COMMAS])
    ['5.1.1', '5.2.2', '5.2.19', '5.10.12']

    Collate also can be done in place by using collate() instead of
    collated().

    >>> t = ['Fred', 'Ron', 'Carol', 'Bob']
    >>> collate(t)
    >>> t
    ['Bob', 'Carol', 'Fred', 'Ron']

    """
    import doctest
    doctest.testmod()

if __name__ == '__main__':
    _test()

bearophileHUGS · Oct 18, 2006

This part of code uses integer "constants" to be or-ed (or added):

CAPS_FIRST = 1
NUMERICAL = 2
HYPHEN_AS_SPACE = 4
UNDERSCORE_AS_SPACE = 8
IGNORE_LEADING_WS = 16
COMMA_IN_NUMERALS = 32

....

    def __init__(self, flag):
        self.flag = flag

    def transform(self, s):
        """ Transform a string for collating.
        """
        if self.flag & CAPS_FIRST:
            s = s.swapcase()
        if self.flag & HYPHEN_AS_SPACE:
            s = s.replace('-', ' ')
        if self.flag & UNDERSCORE_AS_SPACE:
            s = s.replace('_', ' ')
        if self.flag & IGNORE_LEADING_WS:
            s = s.strip()
        if self.flag & NUMERICAL:
            if self.flag & COMMA_IN_NUMERALS:

This is used in C, but maybe for Python other solutions may be better.
I can see some different (untested) solutions:

1)

def selfassign(self, locals):
    # Code from web.py, modified.
    for key, value in locals.items():
        if key != 'self':
            setattr(self, key, value)

def __init__(self,
             caps_first=False,
             hyphen_as_space=False,
             underscore_as_space=False,
             ignore_leading_ws=False,
             numerical=False,
             comma_in_numerals=False):
    selfassign(self, locals())

def transform(self, s):
    if self.caps_first:
        ...

Disadvantages: if a flag is added/modified, the code has to be modified
in two places.

2)

def __init__(self, **kwds):
    self.lflags = [k for k,v in kwds.items() if v]

def transform(self, s):
    if "caps_first" in self.lflags:
        ...

This class can be created with 1 instead of Trues, to shorten the code.

Disadvantages: the user of this class has to read from the class
docstring or from the docs the list of possible flags (and such
docs can be out of sync from the code).

3)

Tkinter (Tcl) shows that sometimes strings are better than int
constants (like using "left" instead of tkinter.LEFT, etc), so this is
another possible solution:

def __init__(self, flags=""):
    self.lflags = flags.lower().split()

def transform(self, s):
    if "caps_first" in self.lflags:
        ...

An example of calling this class:

    .... = Collate("caps_first hyphen_as_space numerical")

I like this third (nonstandard) solution enough.

Bye,
bearophile

Ron Adam · Oct 18, 2006

Thanks, but I fixed it already. (almost) ;-)

I think I will use strings as you suggest, and verify they are valid so a typo
doesn't go through silently.

I ended up using a string-based option list. I agree a space separated string is
better and easier from a user point of view.

The advantage of the list is it can be iterated without splitting first. But
that's a minor thing. self.options = options.lower().split(' ') fixes that easily.
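
A minimal sketch of the split-and-verify step described above, under a few assumptions: the option names are the ones from the 0.02 listing, parse_options and VALID_OPTIONS are hypothetical names, and it uses split() with no argument instead of split(' ') so that doubled spaces do not produce empty option names.

```python
VALID_OPTIONS = frozenset([
    'CAPS_FIRST', 'NUMERICAL', 'HYPHEN_AS_SPACE', 'UNDERSCORE_AS_SPACE',
    'IGNORE_LEADING_WS', 'IGNORE_COMMAS', 'PERIOD_AS_COMMAS',
])

def parse_options(options=''):
    """Split a space separated option string and reject unknown names."""
    # split() with no argument ignores runs of whitespace, whereas
    # 'a  b'.split(' ') would give ['a', '', 'b'].
    flags = options.upper().split()
    unknown = [f for f in flags if f not in VALID_OPTIONS]
    if unknown:
        raise ValueError('Invalid option(s): %s' % ', '.join(unknown))
    return flags
```

Calling parse_options('caps_first numerical') returns the upper-cased names, and a misspelling such as 'caps_frist' raises ValueError when the collator is built rather than being ignored during the sort.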

Once I'm sure it's not going to get any major changes I'll post this as a
recipe. I think it's almost there.

Cheers and thanks,
Ron

3)

Tkinter (Tcl) shows that sometimes strings are better than int
constants (like using "left" instead of tkinter.LEFT, etc), so this is
another possible solution:

I think maybe this is better, but I need to verify the flags so typos don't go
through silently.

Ron Adam · Oct 18, 2006

This is how I changed it...

(I edited out the test and imports for posting here.)

locale.setlocale(locale.LC_ALL, '') # use current locale settings

class Collate(object):
    """ A general purpose and configurable collator class.
    """
    options = [ 'CAPS_FIRST', 'NUMERICAL', 'HYPHEN_AS_SPACE',
                'UNDERSCORE_AS_SPACE', 'IGNORE_LEADING_WS',
                'IGNORE_COMMAS', 'PERIOD_AS_COMMAS' ]

    def __init__(self, flags=""):
        if flags:
            flags = flags.upper().split()
            for value in flags:
                if value not in self.options:
                    raise ValueError('Invalid option: %s' % value)
        self.txtable = []
        if 'HYPHEN_AS_SPACE' in flags:
            self.txtable.append(('-', ' '))
        if 'UNDERSCORE_AS_SPACE' in flags:
            self.txtable.append(('_', ' '))
        if 'PERIOD_AS_COMMAS' in flags:
            self.txtable.append(('.', ','))
        if 'IGNORE_COMMAS' in flags:
            self.txtable.append((',', ''))
        self.flags = flags
        self.numrex = re.compile(r'([\d\.]*|\D*)', re.LOCALE)

    def transform(self, s):
        """ Transform a string for collating.
        """
        if not self.flags:
            return locale.strxfrm(s)
        for a, b in self.txtable:
            s = s.replace(a, b)
        if 'IGNORE_LEADING_WS' in self.flags:
            s = s.strip()
        if 'CAPS_FIRST' in self.flags:
            s = s.swapcase()
        if 'NUMERICAL' in self.flags:
            slist = self.numrex.split(s)
            for i, x in enumerate(slist):
                try:
                    slist[i] = float(x)
                except:
                    slist[i] = locale.strxfrm(x)
            return slist
        return locale.strxfrm(s)

    def __call__(self, a):
        """ This allows the Collate class to be used as a sort key.

        USE: list.sort(key=Collate(flags))
        """
        return self.transform(a)

def collate(slist, flags=""):
    """ Collate list of strings in place.
    """
    slist.sort(key=Collate(flags).transform)

def collated(slist, flags=""):
    """ Return a collated list of strings.
    """
    return sorted(slist, key=Collate(flags).transform)

bearophileHUGS · Oct 18, 2006

Ron Adam:

Instead of:

    def __init__(self, flags=[]):
        self.flags = flags
        self.numrex = re.compile(r'([\d\.]*|\D*)', re.LOCALE)
        self.txtable = []
        if HYPHEN_AS_SPACE in flags:
            self.txtable.append(('-', ' '))
        if UNDERSCORE_AS_SPACE in flags:
            self.txtable.append(('_', ' '))
        if PERIOD_AS_COMMAS in flags:
            self.txtable.append(('.', ','))
        if IGNORE_COMMAS in flags:
            self.txtable.append((',', ''))
        self.flags = flags

I think using an immutable flags default is safer, this is an
alternative (NOT tested!):

    numrex = re.compile(r'[\d\.]* | \D*', re.LOCALE|re.VERBOSE)
    dflags = {"hyphen_as_space": ('-', ' '),
              "underscore_as_space": ('_', ' '),
              "period_as_commas": ('.', ','),
              "ignore_commas": (',', ''),
              ...
             }

    def __init__(self, flags=()):
        self.flags = [fl.strip().lower() for fl in flags]
        self.txtable = []
        df = self.__class__.dflags
        for flag in self.flags:
            if flag in df:
                self.txtable.append(df[flag])
        ...

This is just an idea, it surely has some problems that have to be
fixed.
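
For anyone who has not run into it, the hazard with a mutable default looks like this. The two functions are made up for the illustration. Note that the posted __init__ never modifies its flags argument, so the module is not actually bitten by this; the immutable default only guards against a later change that does.

```python
def add_flag_shared(flag, flags=[]):
    # The default list is created once, when the def statement runs,
    # and the same list object is reused on every call.
    flags.append(flag)
    return flags

def add_flag_safe(flag, flags=()):
    # An immutable default; a fresh list is built on every call.
    return list(flags) + [flag]

first = add_flag_shared('NUMERICAL')
second = add_flag_shared('CAPS_FIRST')
```

Here first and second are the same list and it holds both flags, while two calls to add_flag_safe each return a list with only the flag that was passed in.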

Bye,
bearophile

Gabriel Genellina · Oct 18, 2006

I put together the following module today and would like some feedback on any
obvious problems. Or even opinions of whether or not it is a good approach.

    if self.flag & CAPS_FIRST:
        s = s.swapcase()

This is just coincidental; it relies on (lowercase)<(uppercase) on
the locale collating sequence, and I don't see why it should be always so.
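
One way to get a caps-first order that does not depend on how the locale ranks upper and lower case is a two part key. This is a different technique from the swapcase() trick in the module, it is only a sketch, and it gives up locale awareness entirely (plain code point comparison, so it is only really sensible for ASCII text). caps_first_key is a made-up name.

```python
def caps_first_key(s):
    # Primary: the word with case folded away.
    # Secondary: the original string, where code point order
    # puts 'A'..'Z' before 'a'..'z'.
    return (s.lower(), s)

t = ['tuesday', 'Tuesday', 'Monday', 'monday']
result = sorted(t, key=caps_first_key)
```

For the list from the doctests this gives Monday, monday, Tuesday, tuesday.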

    if self.flag & IGNORE_LEADING_WS:
        s = s.strip()

This ignores trailing ws too. (lstrip?)
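
A quick sketch of where the two calls differ in practice. The word list is invented for the example, and plain sorted() is used instead of the module so the effect is not mixed up with locale collation.

```python
words = ['ab ', 'ab', ' aa']

by_strip = sorted(words, key=str.strip)     # leading and trailing removed
by_lstrip = sorted(words, key=str.lstrip)   # only leading removed
```

With strip() the keys for 'ab ' and 'ab' are equal, so the two keep their input order; with lstrip() the trailing space still counts and 'ab' sorts ahead of 'ab '.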

    if self.flag & NUMERICAL:
        if self.flag & COMMA_IN_NUMERALS:
            rex = re.compile('^(\d*\,?\d*\.?\d*)(\D*)(\d*\,?\d*\.?\d*)',
                             re.LOCALE)
        else:
            rex = re.compile('^(\d*\.?\d*)(\D*)(\d*\.?\d*)', re.LOCALE)
        slist = rex.split(s)
        for i, x in enumerate(slist):
            if self.flag & COMMA_IN_NUMERALS:
                x = x.replace(',', '')
            try:
                slist[i] = float(x)
            except:
                slist[i] = locale.strxfrm(x)
        return slist
    return locale.strxfrm(s)

You should try to make this part a bit more generic. If you are
concerned about locales, do not use "comma" explicitly. In other
countries 10*100=1.000 - and 1,234 is a fraction between 1 and 2.

The NUMERICAL option orders leading and trailing digits as numerals.

t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
collated(t, NUMERICAL)
['4abc', '13.5b', '20abc', 'a5', 'a10.2', 'a40', 'b2']

From the name "NUMERICAL" I would expect this sorting: b2, 4abc, a5,
a10.2, 13.5b, 20abc, a40 (that is, sorting as numbers only).
Maybe GROUP_NUMBERS... but I don't like that too much either...

--
Gabriel Genellina
Softlab SRL

Ron Adam · Oct 18, 2006

Gabriel said:
This is just coincidental; it relies on (lowercase)<(uppercase) on the
locale collating sequence, and I don't see why it should be always so.

The LC_COLLATE structure (in the python.exe C code I think) controls the order
of upper and lower case during collating. I don't know if there is any way to
examine it unfortunately.

If there was a way to change the LC_COLLATE structure, I wouldn't need to resort
to tricks like s.swapcase(). But without that info, I don't know of another way.

Maybe changing the CAPS_FIRST to REVERSE_CAPS_ORDER would do?

This ignores trailing ws too. (lstrip?)

I'm not sure if this would make any visible difference. It might determine order
of two strings where they are the same, but one has white space at the end the
other doesn't.

They run at the same speed either way, so I'll go ahead and change it. Thanks.

You should try to make this part a bit more generic. If you are
concerned about locales, do not use "comma" explicitly. In other
countries 10*100=1.000 - and 1,234 is a fraction between 1 and 2.

See the most recent version of this I posted. It is a bit more generic.

Maybe a 'comma_is_decimal' option?

Options are cheap so it's no problem to add them as long as they make sense. ;-)

These options are what I refer to as mid-level options. The programmer does
still need to know something about the data they are collating. They may still
need to do some preprocessing even with this, but maybe not as much.

In a higher level collation routine, I think you would just need to specify a
named sort type, such as 'dictionary', 'directory', 'inventory' and it would set
the options accordingly. The problem with that approach is the higher level
definitions may be different depending on locale or even the field it is used in.

The NUMERICAL option orders leading and trailing digits as numerals.

t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
collated(t, NUMERICAL)
['4abc', '13.5b', '20abc', 'a5', 'a10.2', 'a40', 'b2']

From the name "NUMERICAL" I would expect this sorting: b2, 4abc, a5,
a10.2, 13.5b, 20abc, a40 (that is, sorting as numbers only).
Maybe GROUP_NUMBERS... but I don't like that too much either...

How about 'VALUE_ORDERING' ?

The term I've seen before is natural ordering, but that is more general
and can include dates, roman numerals, as well as other types.

Cheers,
Ron

Gabriel Genellina · Oct 18, 2006

The LC_COLLATE structure (in the python.exe C code I think) controls the order
of upper and lower case during collating. I don't know if there is any way to
examine it unfortunately.

LC_COLLATE is just a #define'd constant. I don't know how to examine
the collating definition, either.

If there was a way to change the LC_COLLATE structure, I wouldn't need to resort
to tricks like s.swapcase(). But without that info, I don't know of another way.

Maybe changing the CAPS_FIRST to REVERSE_CAPS_ORDER would do?

At least it's a more accurate name.
There is an indirect way: test locale.strcoll("A","a") and see how
they get sorted. Then define options CAPS_FIRST, LOWER_FIRST
accordingly. But maybe it's too much trouble...
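
The indirect test could look roughly like this. caps_collate_first is a hypothetical helper name; the demonstration switches LC_COLLATE to the portable "C" locale only so the result does not depend on the machine it runs on, and then restores the previous setting.

```python
import locale

def caps_collate_first():
    """Return True if 'A' collates before 'a' under the active LC_COLLATE."""
    return locale.strcoll('A', 'a') < 0

# Demonstrate with the "C" locale, then put the old setting back.
saved = locale.setlocale(locale.LC_COLLATE)    # query only, changes nothing
locale.setlocale(locale.LC_COLLATE, 'C')
in_c_locale = caps_collate_first()
locale.setlocale(locale.LC_COLLATE, saved)
```

In the "C" locale the answer is True, because collation there is plain code point order. A collator could run the same check once in __init__ and decide from it whether swapcase() is needed for the requested option.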

See the most recent version of this I posted. It is a bit more generic.

Maybe a 'comma_is_decimal' option?

I'd prefer to use the 'decimal_point' and 'thousands_sep' from the
locale information. That would be more coherent with the locale usage
along your module.
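
A sketch of what that might look like: read the two characters from locale.localeconv() once and build the numeral parser from them. make_number_parser and the parse_* names are hypothetical; the two explicit parsers are there only to show both conventions without depending on which locales are installed.

```python
import locale

def make_number_parser(decimal_point='.', thousands_sep=''):
    """Build a str -> float parser for the given separator characters."""
    def parse(text):
        if thousands_sep:
            text = text.replace(thousands_sep, '')   # drop the grouping first
        if decimal_point != '.':
            text = text.replace(decimal_point, '.')  # then normalise the point
        return float(text)
    return parse

# Separators taken from whatever locale is currently active.
conv = locale.localeconv()
parse_local = make_number_parser(conv['decimal_point'], conv['thousands_sep'])

# Explicit separators, for illustration.
parse_point_decimal = make_number_parser('.', ',')
parse_comma_decimal = make_number_parser(',', '.')
```

With these, parse_point_decimal('1,234.5') and parse_comma_decimal('1.234,5') both give 1234.5, so no explicit comma option is needed once the separators come from the locale.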

Options are cheap so it's no problem to add them as long as they make sense. ;-)

These options are what I refer to as mid-level options. The programmer does
still need to know something about the data they are collating. They may still
need to do some preprocessing even with this, but maybe not as much.

In a higher level collation routine, I think you would just need to specify a
named sort type, such as 'dictionary', 'directory', 'inventory' and it would set
the options accordingly. The problem with that approach is the higher level
definitions may be different depending on locale or even the field it is used in.

Sure. But your module is a good starting point for building a more
high-level procedure.

The NUMERICAL option orders leading and trailing digits as numerals.

t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
collated(t, NUMERICAL)
['4abc', '13.5b', '20abc', 'a5', 'a10.2', 'a40', 'b2']

From the name "NUMERICAL" I would expect this sorting: b2, 4abc, a5,
a10.2, 13.5b, 20abc, a40 (that is, sorting as numbers only).
Maybe GROUP_NUMBERS... but I don't like that too much either...

How about 'VALUE_ORDERING' ?

The term I've seen before is natural ordering, but that is more general
and can include dates, roman numerals, as well as other types.

Sometimes that's the hard part, finding a name which is concise,
descriptive, and accurately reflects what the code does. A good name
should make obvious what it is used for (be it an option name, a
class name, or a method name...) but in this case it may be difficult
to find a good one. So users will have to read the documentation (a
good thing, anyway!)

--
Gabriel Genellina
Softlab SRL

Leo Kislov · Oct 19, 2006

Ron said:
locale.setlocale(locale.LC_ALL, '') # use current locale settings

It's not current locale settings, it's user's locale settings.
Application can actually use something else and you will overwrite
that. You can also affect (unexpectedly to the application)
time.strftime() and C extensions. So you should move this call into the
_test() function and put explanation into the documentation that
application should call locale.setlocale
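
If the tests (or a caller) need a particular locale only for a while, one option is to set it temporarily and restore whatever was there before. This is a sketch, not part of the module; temporary_locale is a made-up name, and it is shown with a single category because the string returned for LC_ALL can be a composite value.

```python
import locale
from contextlib import contextmanager

@contextmanager
def temporary_locale(category, name):
    """Switch one locale category for the duration of a with block."""
    saved = locale.setlocale(category)       # query only, changes nothing
    try:
        # setlocale() returns the name of the locale it actually selected.
        yield locale.setlocale(category, name)
    finally:
        locale.setlocale(category, saved)    # always restore the old setting

before = locale.setlocale(locale.LC_COLLATE)
with temporary_locale(locale.LC_COLLATE, 'C') as active:
    inside = locale.setlocale(locale.LC_COLLATE)
after = locale.setlocale(locale.LC_COLLATE)
```

The application's own setting is untouched once the block exits. Keep in mind that setlocale() is process wide, so this is still not safe to use while other threads are running.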

self.numrex = re.compile(r'([\d\.]*|\D*)', re.LOCALE)
[snip]

    if NUMERICAL in self.flags:
        slist = self.numrex.split(s)
        for i, x in enumerate(slist):
            try:
                slist[i] = float(x)
            except:
                slist[i] = locale.strxfrm(x)

I think you should call locale.atof instead of float, since you call
re.compile with re.LOCALE.

Everything else looks fine. The biggest missing piece is support for
unicode strings.

-- Leo.

Ron Adam · Oct 19, 2006

Leo said:
It's not current locale settings, it's user's locale settings.
Application can actually use something else and you will overwrite
that. You can also affect (unexpectedly to the application)
time.strftime() and C extensions. So you should move this call into the
_test() function and put explanation into the documentation that
application should call locale.setlocale

I'll experiment with this a bit. I was under the impression that locale.strxfrm
needed the locale set for it to work correctly.

Maybe it would be better to have two (or more) versions? A string, unicode, and
locale version or maybe add an option to __init__ to choose the behavior?
Multiple versions seems to be the approach of pre-py3k. Although I was trying
to avoid that.

Sigh, of course issues like this are why it is better to have a module to do this
with. If it was as simple as just calling sort() I wouldn't have bothered. ;-)

self.numrex = re.compile(r'([\d\.]*|\D*)', re.LOCALE)
[snip]

    if NUMERICAL in self.flags:
        slist = self.numrex.split(s)
        for i, x in enumerate(slist):
            try:
                slist[i] = float(x)
            except:
                slist[i] = locale.strxfrm(x)

I think you should call locale.atof instead of float, since you call
re.compile with re.LOCALE.

I think you are correct, but it seems locale.atof() is a *lot* slower than
float().

Here's the locale.atof() code.

def atof(string,func=float):
    "Parses a string as a float according to the locale settings."
    #First, get rid of the grouping
    ts = localeconv()['thousands_sep']

    if ts:
        string = string.replace(ts, '')
    #next, replace the decimal point with a dot
    dd = localeconv()['decimal_point']
    if dd:
        string = string.replace(dd, '.')
    #finally, parse the string
    return func(string)

I could set ts and dd in __init__ and just do the replacements in the try...

    if NUMERICAL in self.flags:
        slist = self.numrex.split(s)
        for i, x in enumerate(slist):
            if x:  # slist may contain null strings
                xx = x
                if self.ts:
                    xx = xx.replace(self.ts, '')   # remove thousands sep
                if self.dd:
                    xx = xx.replace(self.dd, '.')  # replace decimal point
                try:
                    slist[i] = float(xx)
                except:
                    slist[i] = locale.strxfrm(x)

How does that look?

It needs a fast way to determine if x is a number or a string. Any suggestions?
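
One possible answer to the number-or-string question, sketched in present-day Python 3 rather than the 2006 code above: when the pattern has exactly one capturing group, re.split() alternates text, match, text, so the index alone tells you which pieces are numerals. The names NUMREX and split_key are made up for this illustration, and the simplified pattern ignores locale separators.

```python
import re

# Hypothetical pattern: one capturing group around a plain decimal numeral.
NUMREX = re.compile(r'(\d+(?:\.\d+)?)')

def split_key(s):
    """Return a sort key: text pieces stay strings, numerals become floats."""
    parts = NUMREX.split(s)
    # re.split() with one capturing group alternates text, match, text, ...
    # so every odd index holds a captured numeral; no try/except is needed.
    return [float(p) if i % 2 else p for i, p in enumerate(parts)]

names = ["file10.txt", "file9.txt", "file2.txt"]
print(sorted(names, key=split_key))
```

Because the key always starts with a (possibly empty) text piece, keys compare string to string and float to float, which matters in Python 3 where mixed comparisons raise TypeError.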

Everything else looks fine. The biggest missing piece is support for
unicode strings.

This was the reason for using locale.strxfrm. It should let it work with unicode
strings from what I could figure out from the documents.

Am I missing something?

Thanks,
Ron

Ron Adam · Oct 19, 2006

Ron Adam:

Instead of:

def __init__(self, flags=[]):
    self.flags = flags
    self.numrex = re.compile(r'([\d\.]*|\D*)', re.LOCALE)
    self.txtable = []
    if HYPHEN_AS_SPACE in flags:
        self.txtable.append(('-', ' '))
    if UNDERSCORE_AS_SPACE in flags:
        self.txtable.append(('_', ' '))
    if PERIOD_AS_COMMAS in flags:
        self.txtable.append(('.', ','))
    if IGNORE_COMMAS in flags:
        self.txtable.append((',', ''))
    self.flags = flags

I think using an immutable default for flags is safer; this is an
alternative (NOT tested!):

numrex = re.compile(r'[\d\.]* | \D*', re.LOCALE|re.VERBOSE)
dflags = {"hyphen_as_space": ('-', ' '),
          "underscore_as_space": ('_', ' '),
          "period_as_commas": ('.', ','),
          "ignore_commas": (',', ''),
          ...
         }

def __init__(self, flags=()):
    self.flags = [fl.strip().lower() for fl in flags]
    self.txtable = []
    df = self.__class__.dflags
    for flag in self.flags:
        if flag in df:
            self.txtable.append(df[flag])
    ...

This is just an idea, it surely has some problems that have to be
fixed.
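
For anyone following along, here is a small sketch (Python 3, with made-up function names) of why a mutable default can bite. The original __init__ never mutates flags, so it is not actually broken; the tuple default just removes the risk if the code changes later.

```python
def add_flag_shared(flag, flags=[]):
    """Risky: the default list is created once and shared by every call."""
    flags.append(flag)
    return flags

def add_flag_safe(flag, flags=()):
    """Safer: the tuple default cannot be mutated; a new list is built."""
    return list(flags) + [flag]
```

Calling add_flag_shared twice without a flags argument returns the same list object both times, with both flags in it, while add_flag_safe returns a fresh one-item list on each call.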

I think the 'if's are ok since there are only a few options that need to be
handled by them.

I'm still trying to determine what options are really needed. I can get the
thousands separator and decimal character from the locale.localeconv() function.
So ignore_commas isn't needed, I think. And maybe change period_as_commas to
period_as_sep and then split on periods before comparing.

I also want it to issue exceptions when the Collate object is created if invalid
options are specified. That makes finding problems much easier. The example
above doesn't do that, it accepts them silently. That was one of the reasons I
went to named constants at first.

How does this look?

numrex = re.compile(r'([\d\.]* | \D*)', re.LOCALE|re.VERBOSE)
options = ( 'CAPS_FIRST', 'NUMERICAL', 'HYPHEN_AS_SPACE',
            'UNDERSCORE_AS_SPACE', 'IGNORE_LEADING_WS',
            'IGNORE_COMMAS', 'PERIOD_AS_COMMAS' )
def __init__(self, flags=""):
    if flags:
        flags = flags.upper().split()
        for value in flags:
            if value not in self.options:
                raise ValueError('Invalid option: %s' % value)
    self.txtable = []
    if 'HYPHEN_AS_SPACE' in flags:
        self.txtable.append(('-', ' '))
    if 'UNDERSCORE_AS_SPACE' in flags:
        self.txtable.append(('_', ' '))
    if 'PERIOD_AS_COMMAS' in flags:
        self.txtable.append(('.', ','))
    if 'IGNORE_COMMAS' in flags:
        self.txtable.append((',', ''))
    self.flags = flags

So you can set an option string as...

import collate as C

collateopts = \
""" caps_first
hyphen_as_space
numerical
ignore_commas
"""
colatedlist = C.collated(somelist, collateopts)

A nice advantage with an option string is you don't have to prepend all your
options with the module name. But you do have to validate it.

Cheers,
Ron

Ron Adam · Oct 19, 2006

Gabriel said:
At least it's a more accurate name.
There is an indirect way: test locale.strcoll("A","a") and see how they
get sorted. Then define options CAPS_FIRST, LOWER_FIRST accordingly. But
maybe it's too much trouble...

I should have thought of that simple test. Thanks.

That would work, and it's really not that much trouble. The actual test can be
done during importing. Maybe determining other useful values could be done at
that time as well.
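
A minimal sketch of that import-time test, in Python 3. The function name and the module-level constant are invented for illustration; the strcoll parameter is only there so the check can be exercised without changing the locale.

```python
import locale

def caps_sort_first(strcoll=locale.strcoll):
    """Return True if the collation function orders "A" before "a"."""
    return strcoll("A", "a") < 0

# Evaluated once at import time; the result depends on the LC_COLLATE
# setting that is active at that moment.
CAPS_FIRST_BY_DEFAULT = caps_sort_first()
```

One caveat with doing this at import: if the application calls locale.setlocale after importing the module, the cached value may no longer match, so it may be better to run the check when the Collate object is created.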

I'd prefer to use the 'decimal_point' and 'thousands_sep' from the
locale information. That would be more coherent with the locale usage
along your module.

Yes, it looks like I'll need to do this.
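
Roughly what I have in mind, as a Python 3 sketch: read the two separators once and reuse them, instead of calling localeconv() for every string the way locale.atof does. make_atof is a made-up helper name, and the conv argument exists only so the function can be tried with explicit values.

```python
import locale

def make_atof(conv=None):
    """Build a string-to-float converter from locale conventions."""
    if conv is None:
        conv = locale.localeconv()  # looked up once, not once per string
    ts = conv['thousands_sep']
    dd = conv['decimal_point']

    def atof(text):
        if ts:
            text = text.replace(ts, '')   # drop the grouping character
        if dd:
            text = text.replace(dd, '.')  # normalise the decimal point
        return float(text)

    return atof

# Example with explicit, German-style conventions:
german_atof = make_atof({'thousands_sep': '.', 'decimal_point': ','})
print(german_atof("1.234,5"))
```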

Sometimes that's the hard part, finding a name which is concise,
descriptive, and accurately reflects what the code does. A good name
should make obvious what it is used for (being these option names, or
class names, or method names...) but in this case it may be difficult to
find a good one. So users will have to read the documentation (a good
thing, anyway!)

I plan on making the doc strings a bit more informative so that help(collate)
will give meaningful information on its options.

Cheers,
Ron

Leo Kislov · Oct 20, 2006

Ron said:
I'll experiment with this a bit, I was under the impression that locale.strxfrm
needed the locale set for it to work correctly.

Actually locale.strxfrm and all other functions in locale module work
as designed: they work in C locale before the first call to
locale.setlocale. This is by design, call to locale.setlocale should be
done by an application, not by a 3rd party module like your collation
module.
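
A sketch of that division of labour, written for Python 3. The names collated and main are illustrative only and are not the module's actual API: the library function sorts with whatever locale is active, and only the application decides which locale that is.

```python
import locale

def collated(items):
    """Library side: sort using whatever locale is currently active."""
    return sorted(items, key=locale.strxfrm)

def main():
    """Application side: choose the locale once, at start-up."""
    try:
        locale.setlocale(locale.LC_ALL, '')  # adopt the user's settings
    except locale.Error:
        pass  # unsupported locale name; stay in the default "C" locale
    print(collated(["pear", "apple", "fig"]))

if __name__ == "__main__":
    main()
```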

Maybe it would be better to have two (or more) versions? A string, unicode, and
locale version or maybe add an option to __init__ to choose the behavior?

I don't think it should be two separate versions. Unicode support is
only a matter of code like this:

# in the constructor
self.encoding = locale.getpreferredencoding()

# class method
def strxfrm(self, s):
    if type(s) is unicode:
        return locale.strxfrm(s.encode(self.encoding,'replace'))
    return locale.strxfrm(s)

and then instead of locale.strxfrm call self.strxfrm. And similar code
for locale.atof
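
For comparison, a hedged sketch of the same idea in Python 3, where the direction is reversed: locale.strxfrm takes str, so it is byte strings that need converting. The name strxfrm_any and the encoding parameter are made up for this example, and it assumes byte strings are in the locale's preferred encoding unless told otherwise.

```python
import locale

def strxfrm_any(s, encoding=None):
    """Return a collation key for either str or bytes input."""
    if isinstance(s, bytes):
        # Assumption: bytes are in the locale's preferred encoding
        # unless the caller passes one explicitly.
        enc = encoding or locale.getpreferredencoding(False)
        s = s.decode(enc, 'replace')
    return locale.strxfrm(s)

mixed = [b"pear", "apple", b"fig"]
print(sorted(mixed, key=strxfrm_any))
```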

This was the reason for using locale.strxfrm. It should let it work with unicode
strings from what I could figure out from the documents.

Am I missing something?

strxfrm works only with byte strings encoded in the system encoding.

-- Leo

Ron Adam · Oct 20, 2006

Leo said:
Actually locale.strxfrm and all other functions in locale module work
as designed: they work in C locale before the first call to
locale.setlocale. This is by design, call to locale.setlocale should be
done by an application, not by a 3rd party module like your collation
module.

Yes, I've come to that conclusion also. (researching as I go) ;-)

I put an example of that in the class doc string so it could easily be found.

I don't think it should be two separate versions. Unicode support is
only a matter of code like this:

# in the constructor
self.encoding = locale.getpreferredencoding()

# class method
def strxfrm(self, s):
    if type(s) is unicode:
        return locale.strxfrm(s.encode(self.encoding,'replace'))
    return locale.strxfrm(s)

and then instead of locale.strxfrm call self.strxfrm. And similar code
for locale.atof

Thanks for the example.

strxfrm works only with byte strings encoded in the system encoding.

-- Leo

Windows has an alternative function, wcsxfrm. (wide character transform)

http://msdn.microsoft.com/library/d...en-us/vclib/html/_crt_strxfrm.2c_.wcsxfrm.asp

But it's not exposed in Python. I could use ctypes to call it, but it would then
be Windows-specific and I doubt it would even work as expected.

Maybe a wcsxfrm patch would be good for Python 2.6? Python 3000 will probably
need it anyway.

I've made a few additional changes and will start a new thread after some more
testing to get some additional feedback.

Cheers,
Ron
