Pure python implementation of string-like class

A

Akihiro KAYAMA

Hi all.

I would like to ask how I can implement string-like class using tuple
or list. Does anyone know about some example codes of pure python
implementation of string-like class?

Because I am trying to use Python for a text processing which is
composed of a large character set. As the character set is wider than
UTF-16(U+10FFFF), I can't use Python's native unicode string class.

So I want to prepare my own string class, which provides convenience
string methods such as split, join, find and others like usual string
class, but it uses a sequence of integer as a internal representation
instead of a native string. Obviously, subclassing of str doesn't
help.

The implementation of each string methods in the Python source
tree(stringobject.c) is far from python code, so I have started from
scratch, like below:

def startswith(self, prefix, start=-1, end=-1):
assert start < 0, "not implemented"
assert end < 0, "not implemented"
if isinstance(prefix, (str, unicode)):
prefix = MyString(prefix)
n = len(prefix)
return self[0:n] == prefix

but I found it's not a trivial task for myself to achive correctness
and completeness. It smells "reinventing the wheel" also, though I
can't find any hints in google and/or Python cookbook.

I don't care efficiency as a starting point. Any comments are welcome.
Thanks.

-- kayama
 
B

bearophileHUGS

Maybe you can create your class using an array of 'L' with the array
standard module.

Bye,
bearophile
 
A

and-google

Akihiro said:
As the character set is wider than UTF-16(U+10FFFF), I can't use
Python's native unicode string class.

Have you tried using Python compiled in Wide Unicode mode
(--enable-unicode=ucs4)? You get native UTF-32/UCS-4 strings then,
which should be enough for most purposes.
 
A

Akihiro KAYAMA

Hi bearophile.

bearophileHUGS> Maybe you can create your class using an array of 'L' with the array
bearophileHUGS> standard module.

Thanks for your suggestion. I'm currently using an usual list as a
internal representation. According to my understanding, as compared to
list, array module offers efficiency but no convenient function to
implement various string methods. As Python's list is already enough
fast, I want to speed up my coding work first.

-- kayama
 
A

Akihiro KAYAMA

Hi And.

and-google> Akihiro KAYAMA wrote:
and-google> > As the character set is wider than UTF-16(U+10FFFF), I can't use
and-google> > Python's native unicode string class.
and-google>
and-google> Have you tried using Python compiled in Wide Unicode mode
and-google> (--enable-unicode=ucs4)? You get native UTF-32/UCS-4 strings then,
and-google> which should be enough for most purposes.
From my quick survey, Python's Unicode support is restricted to
UTF-16 range(U+0000...U+10FFFF) intentionally, regardless of
--enable-unicode=ucs4 option.
Python 2.4.1 (#2, Sep 3 2005, 22:35:47)
[GCC 2.95.4 20020320 [FreeBSD]] on freebsd4
Type "help", "copyright", "credits" or "license" for more information.UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character

Simple patch to unicodeobject.c which disables unicode range checking
could solve this, but I don't want to maintenance specialized Python
binary for my project.

-- kayama
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,534
Members
45,007
Latest member
obedient dusk

Latest Threads

Top