Simple string parsing?

TAG

Hi,

I am new to Python and would like to parse a string (well, actually a
formula) and get the pieces grouped together,
e.g.:

if I have:

=+GC142*(GC94+0.5*sum(GC96:GC101))

and I want to get:

['=', '+', 'GC142', '*', '(', 'GC94', '+', '0.5', '*', 'sum', '(',
'GC96', ':', 'GC101', ')', ')']

how can I get this?

Thanks :)
 
Peter Kleiweg

TAG wrote:
if I have:

=+GC142*(GC94+0.5*sum(GC96:GC101))

and I want to get:

['=', '+', 'GC142', '*', '(', 'GC94', '+', '0.5', '*', 'sum', '(',
'GC96', ':', 'GC101', ')', ')']

how can I get this?

import re
R = re.compile(r'[=+*:()]|[a-z]+|[A-Z]+[0-9]+|[0-9]*\.[0-9]+|[0-9]+|[^ \t\r\n]')
s = '=+GC142*(GC94+0.5*sum(GC96:GC101))'
R.findall(s)

The last alternative in the regex is there to catch any tokens not
otherwise defined (except for whitespace).
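
For instance, the catch-all picks up stray characters that none of the
other alternatives cover (the '@' below is just an illustration):

>>> R.findall('=+GC142 @ 0.5')
['=', '+', 'GC142', '@', '0.5']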
 
Peter Kleiweg

Peter Kleiweg wrote:
TAG wrote:
if I have:

=+GC142*(GC94+0.5*sum(GC96:GC101))

and I want to get:

['=', '+', 'GC142', '*', '(', 'GC94', '+', '0.5', '*', 'sum', '(',
'GC96', ':', 'GC101', ')', ')']

how can I get this?

import re
R = re.compile(r'[=+*:()]|[a-z]+|[A-Z]+[0-9]+|[0-9]*\.[0-9]+|[0-9]+|[^ \t\r\n]')
s = '=+GC142*(GC94+0.5*sum(GC96:GC101))'
R.findall(s)

Even simpler:

import re
R = re.compile('[=+*:()]|[^=+*:() \t\r\n]+')
s = '=+GC142*(GC94+0.5*sum(GC96:GC101))'
R.findall(s)
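
Both patterns should give the same list for this input:

>>> R.findall(s)
['=', '+', 'GC142', '*', '(', 'GC94', '+', '0.5', '*', 'sum', '(',
'GC96', ':', 'GC101', ')', ')']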
 
Istvan Albert

Peter said:
R = re.compile(r'[=+*:()]|[a-z]+|[A-Z]+[0-9]+|[0-9]*\.[0-9]+|[0-9]+|[^ \t\r\n]')

Let's also mention the title of the chapter under which a
newbie can get more info on such solutions:

Regexes: Bad Idea or Big Mistake?

Istvan.
 
Alex Martelli

TAG said:
Hi,

I am new to Python and would like to parse a string (well, actually a
formula) and get the pieces grouped together,
e.g.:

if I have:

=+GC142*(GC94+0.5*sum(GC96:GC101))

and I want to get:

['=', '+', 'GC142', '*', '(', 'GC94', '+', '0.5', '*', 'sum', '(',
'GC96', ':', 'GC101', ')', ')']
>>> import tokenize
>>> import cStringIO
>>> x = '=+GC142*(GC94+0.5*sum(GC96:GC101))'
>>> [t[1] for t in
...     tokenize.generate_tokens(cStringIO.StringIO(x).readline)]
['=', '+', 'GC142', '*', '(', 'GC94', '+', '0.5', '*', 'sum', '(',
'GC96', ':', 'GC101', ')', ')', '']
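
If the trailing empty string (the end-marker) is unwanted, one simple
variant is to filter it out:

>>> [t[1] for t in
...     tokenize.generate_tokens(cStringIO.StringIO(x).readline)
...     if t[1]]
['=', '+', 'GC142', '*', '(', 'GC94', '+', '0.5', '*', 'sum', '(',
'GC96', ':', 'GC101', ')', ')']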
close enough for you...?


Alex
 
Alex Martelli

TAG said:
WOW - I never thought tokenize was that simple :)

It didn't use to be all that simple when it was callback-based, but
since the generate_tokens function was put into it I think it's become
so. You do need a list comprehension or something over the iterator
which generate_tokens returns, and to wrap a readline function around
the string you're tokenizing to pass it (it must return the string the
first time it's called, and '' the second time), but that's tolerable
IMHO.
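
A minimal sketch of such a wrapper, if you'd rather not go through
cStringIO (make_readline is a made-up name, not a library function):

def make_readline(s):
    # first call returns the whole string; every later call returns ''
    lines = iter([s])
    def readline():
        try: return lines.next()
        except StopIteration: return ''
    return readline

tokens = [t[1] for t in tokenize.generate_tokens(make_readline(x))]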

((Of course, you ARE restricted to what Python considers 'tokens' so you
may need some postprocessing if you need a slightly different notion of
tokens))

The new iterator protocol has allowed interface simplifications such as
this one, and the equally empowering os.walk (iterator-based) vs
os.path.walk (callback-based), which I think is quite a good sign that
said protocol is good!-)
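
To see the difference, compare walking a directory tree both ways
(using the current directory is just an example):

import os, os.path

# iterator-based: you just loop over what os.walk yields
for dirpath, dirnames, filenames in os.walk('.'):
    print dirpath, len(filenames)

# callback-based: you must package the work into a visitor function
def visit(arg, dirname, names):
    print dirname, len(names)
os.path.walk('.', visit, None)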


Alex
 
TAG

((Of course, you ARE restricted to what Python considers 'tokens' so you
may need some postprocessing if you need a slightly different notion of
tokens))

luckily they should all be - but in case they are not - how
can I check it?


thanks again :)
 
Alex Martelli

TAG said:
luckily they should all be - but in case they are not - how
can I check it?

With a little post-processing. Say for example that you need := and :+
to be seen as single tokens; here's a Python 2.4 approach...:

mergers = {':': set('=+')}

def tokens_of(x):
    it = peekahead_iterator(toktuple[1] for toktuple in
        tokenize.generate_tokens(cStringIO.StringIO(x).readline))
    for tok in it:
        if it.preview in mergers.get(tok, ()):
            # the next token extends this one: merge them, then
            # skip the token we already consumed via the preview
            yield tok + it.preview
            it.next()
        else:
            yield tok

x = 'fup(z:=97, y:+45):zap'
print list(tokens_of(x))

result is:

['fup', '(', 'z', ':=', '97', ',', 'y', ':+', '45', ')', ':', 'zap', '']


Of course, you do need the handy 'peekahead_iterator', say something
like:

class peekahead_iterator(object):
    class nothing: pass
    def __init__(self, it):
        self._nit = iter(it).next
        self.preview = None
        self._step()
    def __iter__(self): return self
    def next(self):
        result = self._step()
        if result == self.nothing: raise StopIteration
        else: return result
    def _step(self):
        result = self.preview
        try: self.preview = self._nit()
        except StopIteration: self.preview = self.nothing
        return result
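
A quick interactive check of how the preview attribute behaves (the
string 'abc' is arbitrary):

>>> it = peekahead_iterator('abc')
>>> it.preview
'a'
>>> it.next()
'a'
>>> it.preview
'b'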


Splitting one token into several is easier (no peeking ahead is needed).
But both splitting and merging are fine, as long as the deviations
between what you want to see as tokens and what Python considers tokens
are minor. If you have BIG divergences -- e.g., you do not want to
support triple-quoted strings as single tokens -- then you may be better
off with a completely different approach, as others have suggested.
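
For the splitting case, a sketch along these lines should do (the '**'
entry is only a made-up example of a token you might want split):

splitters = {'**': ('*', '*')}

def split_tokens(toks):
    for tok in toks:
        # emit either the token itself or its configured parts
        for part in splitters.get(tok, (tok,)):
            yield part

This composes with tokens_of above: list(split_tokens(tokens_of(x))).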


Alex
 
