split string into multi-character "letters"

Jed · Aug 25, 2010

Hi, I'm seeking help with a fairly simple string processing task.
I've simplified what I'm actually doing into a hypothetical
equivalent.
Suppose I want to take a word in Spanish, and divide it into
individual letters. The problem is that there are a few 2-character
combinations that are considered single letters in Spanish - for
example 'ch', 'll', 'rr'.
Suppose I have:

# this would include the whole alphabet, but I shortened it here
alphabet = ['a','b','c','ch','d','u','r','rr','o']
theword = 'churro'

I would like to split the string 'churro' into a list containing:

'ch','u','rr','o'

So at each letter I want to look ahead and see if it can be combined
with the next letter to make a single 'letter' of the Spanish
alphabet. I think this could be done with a regular expression
passing the list called "alphabet" to re.match() for example, but I'm
not sure how to use the contents of a whole list as a search string in
a regular expression, or if it's even possible. My real application
is a bit more complex than the Spanish alphabet so I'm looking for a
fairly general solution.
Thanks,
Jed

Jussi Piitulainen · Aug 25, 2010

Jed said:
alphabet = ['a','b','c','ch','d','u','r','rr','o'] #this would
include the whole alphabet but I shortened it here
theword = 'churro'

I would like to split the string 'churro' into a list containing:

'ch','u','rr','o'

All non-overlapping matches, each as long as can be, and '.' catches
single characters by default:
['ch', 'u', 'rr', 'o']
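
The code that produced this output did not survive in the archive. A minimal sketch that fits the description, assuming the pattern lists the two-character letters first and ends with a catch-all '.', might look like this (the exact pattern is an assumption, not the original):

```python
import re

# Assumed pattern: two-character letters first, then '.' for any
# remaining single character (the original pattern was lost).
pattern = re.compile(r"ch|ll|rr|.")

print(pattern.findall("churro"))  # ['ch', 'u', 'rr', 'o']
```

Note that '.' does not match a newline unless re.DOTALL is set, which should not matter for single words.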

Vlastimil Brom · Aug 25, 2010

Hi,
I am not sure, whether it can be generalised enough for your needs,
but you can try something like

re.findall(r"rr|ll|ch|[a-z]", "asdasdallasdrrcvb")

['a', 's', 'd', 'a', 's', 'd', 'a', 'll', 'a', 's', 'd', 'rr', 'c', 'v', 'b']

of course, the pattern should be adjusted precisely so that no
characters get lost (anything outside [a-z] is silently skipped)...
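
One way to notice lost characters, sketched here under the assumption that every character of the word should end up in some token: join the tokens back together and compare with the input. The helper name split_checked is hypothetical.

```python
import re

def split_checked(word, pattern=r"rr|ll|ch|[a-z]"):
    """Split word with re.findall and fail loudly if anything was skipped."""
    tokens = re.findall(pattern, word)
    # findall silently skips text the pattern cannot match, so verify
    # that the tokens reassemble into the original word.
    if "".join(tokens) != word:
        raise ValueError("pattern skipped some characters in %r" % word)
    return tokens

print(split_checked("caballo"))  # ['c', 'a', 'b', 'a', 'll', 'o']
```

With the default pattern, an input such as "Churro" raises ValueError, because the uppercase 'C' is not covered by [a-z].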

hth,
vbr

MRAB · Aug 25, 2010

You can build a regex with:
'a|b|c|ch|d|u|r|rr|o'

You want to try to match, say, 'ch' before 'c', so you want the longest
first:
'ch|rr|a|b|c|d|u|r|o'
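
A minimal sketch of building that regex from the list, assuming the list holds plain strings. The helper name build_letter_pattern is hypothetical, and re.escape is added as a precaution in case a "letter" contains a regex metacharacter.

```python
import re

def build_letter_pattern(alphabet):
    """Compile an alternation of the letters, longest first."""
    # sorted() is stable, so letters of equal length keep their order.
    ordered = sorted(alphabet, key=len, reverse=True)
    return re.compile("|".join(re.escape(letter) for letter in ordered))

alphabet = ['a', 'b', 'c', 'ch', 'd', 'u', 'r', 'rr', 'o']
pattern = build_letter_pattern(alphabet)

print(pattern.pattern)            # ch|rr|a|b|c|d|u|r|o
print(pattern.findall('churro'))  # ['ch', 'u', 'rr', 'o']
```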

If you were going to match the Spanish alphabet then I would recommend
that you do it in Unicode. Well, any text that's not pure ASCII should
be done in Unicode!
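
As a rough illustration of the Unicode point, assuming Python 3 (where str is already Unicode and \w matches accented letters): the sketch below normalizes the input to NFC first, so that a letter typed as a base character plus a combining mark becomes a single code point. The helper name split_letters and the normalization step are assumptions for illustration.

```python
import re
import unicodedata

def split_letters(word, digraphs=("ch", "ll", "rr")):
    """Split word into letters, treating the given digraphs as single letters."""
    # NFC composes 'n' + COMBINING TILDE into the single code point 'ñ'.
    word = unicodedata.normalize("NFC", word)
    pattern = "|".join(re.escape(d) for d in digraphs) + r"|\w"
    return re.findall(pattern, word)

print(split_letters("nin\u0303o"))  # ['n', 'i', 'ñ', 'o']
```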

Thomas Jollans · Aug 25, 2010

A very simple solution that might be general enough:
>>> def tokensplit(string, bits):
...     while string:
...         for b in bits:
...             if string.startswith(b):
...                 yield b
...                 string = string[len(b):]
...                 break
...         else:
...             raise ValueError("string not composed of the right bits.")
...
>>> alphabet = ['a','b','c','ch','d','u','r','rr','o']
>>> # move longer letters to the front
>>> alphabet.sort(key=len, reverse=True)
>>>
>>> list(tokensplit("churro", alphabet))
['ch', 'u', 'rr', 'o']

Tim Chase · Aug 25, 2010

My first attempt at the problem:

>>> import re
>>> special = ['ch', 'rr', 'll']
>>> r = re.compile(r'(?:%s)|[a-z]' % ('|'.join(re.escape(c) for c in special)), re.I)
>>> r.findall('churro')
['ch', 'u', 'rr', 'o']
>>> [r.findall(word) for word in 'churro lorenzo caballo'.split()]
[['ch', 'u', 'rr', 'o'], ['l', 'o', 'r', 'e', 'n', 'z', 'o'],
 ['c', 'a', 'b', 'a', 'll', 'o']]

This joins escaped versions of all your special letters. Because
Python's re module tries the alternatives of "|" in order, from left
to right, the paired versions get tested (and found) before it falls
back to the single letters.
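
A small demonstration of why the order of the alternatives matters; the two patterns below are illustrative and differ only in where the single-letter class sits:

```python
import re

word = "churro"

# Pairs listed first: 'ch' and 'rr' are tried before the single letters.
pairs_first = re.findall(r"ch|rr|[a-z]", word)

# Single letters listed first: [a-z] always matches, so the pairs
# are never reached.
singles_first = re.findall(r"[a-z]|ch|rr", word)

print(pairs_first)    # ['ch', 'u', 'rr', 'o']
print(singles_first)  # ['c', 'h', 'u', 'r', 'r', 'o']
```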

-tkc

Alexander Kapps · Aug 25, 2010

I don't know the Spanish alphabet, and you didn't say in what way
your real application is more complex, but maybe something like this
could be a starter:

In [13]: import re

In [14]: theword = 'churro'

In [15]: two_chars=["ch", "rr"]

In [16]: re.findall('|'.join(two_chars)+"|[a-z]", theword)
Out[16]: ['ch', 'u', 'rr', 'o']

Terry Reedy · Aug 26, 2010

On 08/25/10 14:46, Jed wrote:

Dirt simple, straightforward, easily generalized solution:

def sp_split(s):
    n, i, ret = len(s), 0, []
    while i < n:
        s2 = s[i:i+2]
        if s2 in ('ch', 'll', 'rr'):
            # two-character letter: consume both characters
            ret.append(s2)
            i += 2
        else:
            # ordinary letter: consume a single character
            ret.append(s[i])
            i += 1
    return ret

print(sp_split('churro'))

# ['ch', 'u', 'rr', 'o']
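
One possible generalization of this loop, sketched under the assumption that the alphabet is a non-empty collection of strings of arbitrary length: at each position, try the longest candidate first. The name letter_split is hypothetical.

```python
def letter_split(s, letters):
    """Split s into items from letters, preferring the longest match."""
    letters = set(letters)
    max_len = max(len(letter) for letter in letters)
    i, ret = 0, []
    while i < len(s):
        # Try the longest possible chunk first, then shorter ones.
        for size in range(max_len, 0, -1):
            chunk = s[i:i + size]
            if chunk in letters:
                ret.append(chunk)
                i += len(chunk)
                break
        else:
            raise ValueError("no letter matches at position %d" % i)
    return ret

alphabet = ['a', 'b', 'c', 'ch', 'd', 'u', 'r', 'rr', 'o']
print(letter_split('churro', alphabet))  # ['ch', 'u', 'rr', 'o']
```

Unlike the regex versions, this raises an error on characters outside the alphabet instead of skipping them.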
