Simple string parsing?

TAG

Hi,

I am new to Python and would like to parse a string (well, actually a
formula) and get the pieces grouped together,
e.g.:

if I have:

=+GC142*(GC94+0.5*sum(GC96:GC101))

and I want to get:

['=', '+', 'GC142', '*', '(', 'GC94', '+', '0.5', '*', 'sum', '(',
'GC96', ':', 'GC101', ')', ')']

how can I get this?

Thanks :)
 
Peter Kleiweg

TAG wrote:
if I have:

=+GC142*(GC94+0.5*sum(GC96:GC101))

and I want to get:

['=', '+', 'GC142', '*', '(', 'GC94', '+', '0.5', '*', 'sum', '(',
'GC96', ':', 'GC101', ')', ')']

how can I get this?

import re
R = re.compile(r'[=+*:()]|[a-z]+|[A-Z]+[0-9]+|[0-9]*\.[0-9]+|[0-9]+|[^ \t\r\n]')
s = '=+GC142*(GC94+0.5*sum(GC96:GC101))'
R.findall(s)

The last alternative in the regex is there to catch any tokens not
otherwise defined (except for whitespace).
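
For instance, the catch-all picks up stray characters that none of the
other alternatives cover (the '@' below is just an illustration):

>>> R.findall('=+GC142 @ 0.5')
['=', '+', 'GC142', '@', '0.5']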
 
Peter Kleiweg

Peter Kleiweg wrote:
TAG wrote:
if I have:

=+GC142*(GC94+0.5*sum(GC96:GC101))

and I want to get:

['=', '+', 'GC142', '*', '(', 'GC94', '+', '0.5', '*', 'sum', '(',
'GC96', ':', 'GC101', ')', ')']

how can I get this?

import re
R = re.compile(r'[=+*:()]|[a-z]+|[A-Z]+[0-9]+|[0-9]*\.[0-9]+|[0-9]+|[^ \t\r\n]')
s = '=+GC142*(GC94+0.5*sum(GC96:GC101))'
R.findall(s)

Even simpler:

import re
R = re.compile('[=+*:()]|[^=+*:() \t\r\n]+')
s = '=+GC142*(GC94+0.5*sum(GC96:GC101))'
R.findall(s)
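
Both patterns should give the same list for this input:

>>> R.findall(s)
['=', '+', 'GC142', '*', '(', 'GC94', '+', '0.5', '*', 'sum', '(',
'GC96', ':', 'GC101', ')', ')']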
 
Istvan Albert

Peter said:
R = re.compile(r'[=+*:()]|[a-z]+|[A-Z]+[0-9]+|[0-9]*\.[0-9]+|[0-9]+|[^ \t\r\n]')

Let's also mention the title of the chapter under which a
newbie can get more info on such solutions:

Regexes: Bad Idea or Big Mistake?

Istvan.
 
Alex Martelli

TAG said:
Hi,

I am new to Python and would like to parse a string (well, actually a
formula) and get the pieces grouped together,
e.g.:

if I have:

=+GC142*(GC94+0.5*sum(GC96:GC101))

and I want to get:

['=', '+', 'GC142', '*', '(', 'GC94', '+', '0.5', '*', 'sum', '(',
'GC96', ':', 'GC101', ')', ')']
>>> import tokenize
>>> import cStringIO
>>> x = '=+GC142*(GC94+0.5*sum(GC96:GC101))'
>>> [t[1] for t in
...     tokenize.generate_tokens(cStringIO.StringIO(x).readline)]
['=', '+', 'GC142', '*', '(', 'GC94', '+', '0.5', '*', 'sum', '(',
'GC96', ':', 'GC101', ')', ')', '']
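
If the trailing empty string (the end-marker) is unwanted, one simple
variant is to filter it out:

>>> [t[1] for t in
...     tokenize.generate_tokens(cStringIO.StringIO(x).readline)
...     if t[1]]
['=', '+', 'GC142', '*', '(', 'GC94', '+', '0.5', '*', 'sum', '(',
'GC96', ':', 'GC101', ')', ')']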
close enough for you...?


Alex
 
Alex Martelli

TAG said:
WOW - I never thought tokenize was that simple :)

It didn't use to be all that simple when it was callback-based, but
since the generate_tokens function was put into it I think it's become
so. You do need a list comprehension or something over the iterator
which generate_tokens returns, and to wrap a readline function around
the string you're tokenizing to pass it (it must return the string the
first time it's called, and '' the second time), but that's tolerable
IMHO.
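
A minimal sketch of such a wrapper, if you'd rather not go through
cStringIO (make_readline is a made-up name, not a library function):

def make_readline(s):
    # first call returns the whole string; every later call returns ''
    lines = iter([s])
    def readline():
        try: return lines.next()
        except StopIteration: return ''
    return readline

tokens = [t[1] for t in tokenize.generate_tokens(make_readline(x))]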

((Of course, you ARE restricted to what Python considers 'tokens' so you
may need some postprocessing if you need a slightly different notion of
tokens))

The new iterator protocol has allowed interface simplifications such as
this one, and the equally empowering os.walk (iterator-based) vs
os.path.walk (callback-based), which I think is quite a good sign that
said protocol is good!-)
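
To see the difference, compare walking a directory tree both ways
(using the current directory is just an example):

import os, os.path

# iterator-based: you just loop over what os.walk yields
for dirpath, dirnames, filenames in os.walk('.'):
    print dirpath, len(filenames)

# callback-based: you must package the work into a visitor function
def visit(arg, dirname, names):
    print dirname, len(names)
os.path.walk('.', visit, None)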


Alex
 
TAG

((Of course, you ARE restricted to what Python considers 'tokens' so you
may need some postprocessing if you need a slightly different notion of
tokens))

luckily they should all be - but in case they are not - how
can I check it?


thanks again :)
 
Alex Martelli

TAG said:
luckily they should all be - but in case they are not - how
can I check it?

With a little post-processing. Say for example that you need := and :+
to be seen as single tokens; here's a Python 2.4 approach...:

mergers = {':': set('=+')}

def tokens_of(x):
    it = peekahead_iterator(toktuple[1] for toktuple in
        tokenize.generate_tokens(cStringIO.StringIO(x).readline))
    for tok in it:
        if it.preview in mergers.get(tok, ()):
            # the next token extends this one: merge them, then
            # skip the token we already consumed via the preview
            yield tok + it.preview
            it.next()
        else:
            yield tok

x = 'fup(z:=97, y:+45):zap'
print list(tokens_of(x))

result is:

['fup', '(', 'z', ':=', '97', ',', 'y', ':+', '45', ')', ':', 'zap', '']


Of course, you do need the handy 'peekahead_iterator', say something
like:

class peekahead_iterator(object):
    class nothing: pass
    def __init__(self, it):
        self._nit = iter(it).next
        self.preview = None
        self._step()
    def __iter__(self): return self
    def next(self):
        result = self._step()
        if result == self.nothing: raise StopIteration
        else: return result
    def _step(self):
        result = self.preview
        try: self.preview = self._nit()
        except StopIteration: self.preview = self.nothing
        return result
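
A quick interactive check of how the preview attribute behaves (the
string 'abc' is arbitrary):

>>> it = peekahead_iterator('abc')
>>> it.preview
'a'
>>> it.next()
'a'
>>> it.preview
'b'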


Splitting one token into several is easier (no peeking ahead is needed).
But both splitting and merging are fine, as long as the deviations
between what you want to see as tokens and what Python considers tokens
are minor. If you have BIG divergences -- e.g., you do not want to
support triple-quoted strings as single tokens -- then you may be better
off with a completely different approach, as others have suggested.
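
For the splitting case, a sketch along these lines should do (the '**'
entry is only a made-up example of a token you might want split):

splitters = {'**': ('*', '*')}

def split_tokens(toks):
    for tok in toks:
        # emit either the token itself or its configured parts
        for part in splitters.get(tok, (tok,)):
            yield part

This composes with tokens_of above: list(split_tokens(tokens_of(x))).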


Alex
 
