Looking for very simple general purpose tokenizer

Maarten van Reeuwijk

Hi group,

I need to parse various text files in Python. I was wondering if there is a
general-purpose tokenizer available. I know about split(), but this
(otherwise very handy) method does not allow me to specify a list of
splitting characters, only one at a time, and it removes my splitting
operators (fine for spaces and \n's, but not for =, /, etc.). Furthermore, I
tried tokenize, but that is specific to Python source and way too heavy for
me. I am looking for something like this:


splitchars = [' ', '\n', '=', '/', ....]
tokenlist = tokenize(rawfile, splitchars)

Is there something like this available inside Python or did anyone already
make this? Thank you in advance

Maarten
 

Eric Brunel

Maarten said:
I need to parse various text files in Python. I was wondering if there is a
general-purpose tokenizer available. [...]

You may use re.findall for that:
>>> import re
>>> s = "a = b+c; z = 34;"
>>> pat = " |=|;|[^ =;]*"
>>> re.findall(pat, s)
['a', ' ', '=', ' ', 'b+c', ';', ' ', 'z', ' ', '=', ' ', '34', ';', '']

The pattern basically says: match either a space, a '=', a ';', or a sequence
of any characters that are not space, '=' or ';'. You may have to take care
beforehand about special characters like \n or \ (these are very special in
regular expressions).
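
If you want to build the pattern directly from a list of split characters,
re.escape can do the escaping for you. A minimal, untested sketch (the
tokenize function below is just for illustration, not a standard one):

import re

def tokenize(rawfile, splitchars):
    # escape each split character so regex metacharacters
    # like '*' or '\' are matched literally
    escaped = [re.escape(c) for c in splitchars]
    # match any single split character, or a run of
    # characters that are not split characters
    pat = "|".join(escaped) + "|[^" + "".join(escaped) + "]+"
    return re.findall(pat, rawfile)

print tokenize("a = b/c", [' ', '\n', '=', '/'])
# -> ['a', ' ', '=', ' ', 'b', '/', 'c']

Using + instead of * in the trailing character class also avoids the empty
string you see at the end of the findall output above.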

HTH
 

Paul McGuire

Maarten van Reeuwijk said:
I need to parse various text files in Python. I was wondering if there is a
general-purpose tokenizer available. [...]
Maarten -
Please give my pyparsing module a try. You can download it from SourceForge
at http://pyparsing.sourceforge.net. I wrote it for just this purpose: it
allows you to define your own parsing patterns for any text data file, and
the tokenized results are returned in a dictionary or list, as you prefer.
The download also includes several examples - one especially difficult
file-parsing solution is shown in the dictExample.py script. And if you get
stuck, send me a sample of what you are trying to parse, and I can try to
give you some pointers (or even tell you if pyparsing isn't necessarily the
most appropriate tool for your job - it happens sometimes!).
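
To give a flavour, here is a minimal, untested sketch of the kind of
tokenizer asked for above, using pyparsing's basic elements (the particular
grammar is made up for illustration):

from pyparsing import Word, alphanums, oneOf, OneOrMore

# a token is either a "word" (a run of alphanumerics, '_' or '.')
# or one of the splitting operators, which are kept in the output
token = Word(alphanums + "_.") | oneOf("= / + - * ;")
tokenizer = OneOrMore(token)

print tokenizer.parseString("nu = 1.08 / nx")
# -> ['nu', '=', '1.08', '/', 'nx']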

-- Paul McGuire

Austin, Texas, USA
 

Alan Kennedy

Maarten said:
I need to parse various text files in python. I was wondering if
there was a general purpose tokenizer available.

Indeed there is: Python comes with batteries included. Try the shlex
module.

http://www.python.org/doc/lib/module-shlex.html

Try the following code: it seems to do what you want. If it doesn't,
then please be more specific on your tokenisation rules.

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
splitchars = [' ', '\n', '=', '/',]

source = """
thisshouldcome inthree parts
thisshould comeintwo
andso/shouldthis
and=this
"""

import shlex
import StringIO

def prepareToker(toker, splitters):
    for s in splitters: # resists People's Front of Judea joke ;-D
        if toker.whitespace.find(s) == -1:
            toker.whitespace = "%s%s" % (s, toker.whitespace)
    return toker

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker = prepareToker(toker, splitchars)
for num, tok in enumerate(toker):
    print "%s:%s" % (num, tok)
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Note that the use of the iteration-based interface in the above code
requires Python 2.3. If you need it to run on previous versions, specify
which one.
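
On pre-2.3 versions, an explicit get_token loop should give the same
result, since get_token returns an empty string once the stream is
exhausted. An untested sketch:

num = 0
tok = toker.get_token()
while tok:                      # get_token() returns "" at end of stream
    print "%s:%s" % (num, tok)
    num = num + 1
    tok = toker.get_token()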

regards,
 
M

Maarten van Reeuwijk

Thank you all for your very useful comments. Below I have included my
source. Could you comment on whether there's a more elegant way of
implementing the continuation character '&'? (One idea is sketched in a
comment after processTokens below.)

With the RE implementation I have noticed that the position of the '*' in
spclist is very delicate. This order works, but other orders throw
exceptions. Is this correct or is it a bug? Lastly, is there more
documentation and are there more examples for the shlex module? Ideally I
would like to see a full-scale example of how this module should be used
to parse.

Maarten

import re
import shlex
import StringIO

def splitf90(source):
    buf = StringIO.StringIO(source)
    toker = shlex.shlex(buf)
    toker.commenters = "!"
    toker.whitespace = " \t\r"
    return processTokens(toker)

def splitf90_re(source):
    # note the comma after '\)': without it Python silently glues
    # '\)' and '>' into the single alternative '\)>'
    spclist = ['\*', '\+', '-', '/', '=', '\[', '\]', '\(', '\)', \
               '>', '<', '&', ';', ',', ':', '!', ' ', '\n']
    pat = '|'.join(spclist) + '|[^' + ''.join(spclist) + ']+'
    rawtokens = re.findall(pat, source)
    return processTokens(rawtokens)

def processTokens(rawtokens):
    # substitute characters
    subst1 = []
    prevtoken = None
    for token in rawtokens:
        if token == ';': token = '\n'
        if token == ' ': token = ''
        if token == '\n' and prevtoken == '&': token = ''
        if not token == '':
            subst1.append(token)
            prevtoken = token

    # remove continuation chars
    subst2 = []
    for token in subst1:
        if token == '&': token = ''
        if not token == '':
            subst2.append(token)

    # split into lines
    final = []
    curline = []
    for token in subst2:
        if not token == '\n':
            curline.append(token)
        else:
            if not curline == []:
                final.append(curline)
                curline = []

    return final
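
# A possibly more elegant way to handle the continuation character:
# join continued lines in the raw source before tokenizing, e.g.
#
#   source = re.sub(r'&[ \t]*\n', ' ', source)
#
# Untested sketch; note that it will not join a line where a trailing
# '!' comment follows the '&', which the shlex version copes with by
# stripping the comment first.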

# Example session
src = """
MODULE modsize
implicit none

integer, parameter:: &
Nx = 256, &
Ny = 256, &
Nz = 256, &
nt = 1, & ! nr of (passive) scalars
Np = 16 ! nr of processors, should match mpirun -np .. command

END MODULE
"""
print splitf90(src)
print splitf90_re(src)

Output:
[['MODULE', 'modsize'], ['implicit', 'none'], ['integer', ',', 'parameter',
':', ':', 'Nx', '=', '256', ',', 'Ny', '=', '256', ',', 'Nz', '=', '256',
',', 'nt', '=', '1', ',', 'Np', '=', '16'], ['END', 'MODULE']]

[['MODULE', 'modsize'], ['implicit', 'none'], ['integer', ',', 'parameter',
':', ':', 'Nx', '=', '256', ',', 'Ny', '=', '256', ',', 'Nz', '=', '256',
',', 'nt', '=', '1', ',', '!', 'nr', 'of', '(', 'passive', 'scalars'],
['Np', '=', '16', '!', 'nr', 'of', 'processors', ',', 'should', 'match',
'mpirun', '-', 'np', 'command'], ['END', 'MODULE']]
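
As to the delicate ordering of spclist: the '-' entry is unescaped, so
inside the [^...] character class it forms a range with its neighbours,
and an inverted range raises an exception. The order above happens to
produce a valid range, but moving e.g. '\*' next to '-' would not.
Standard re behaviour:

>>> import re
>>> re.findall('[^+-*]', 'x')  # '-' makes this the inverted range '+' to '*'
Traceback (most recent call last):
...
sre_constants.error: bad character range

Escaping every split character (e.g. with re.escape, as in the sketch
earlier in the thread) makes the order irrelevant.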
 

Maarten van Reeuwijk

I found a complication with the shlex module. When I execute the following
fragment you'll notice that floating-point numbers are split up. Is there
any way to avoid this?


source = """
$NAMRUN
Lz = 0.15
nu = 1.08E-6
"""

import shlex
import StringIO

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker.commenters = ""  # note: the shlex attribute is 'commenters', not 'comments'
toker.whitespace = " \t\r"
print [tok for tok in toker]

Output:
['\n', '$', 'NAMRUN', '\n', 'Lz', '=', '0', '.', '15', '\n', 'nu', '=', '1',
'.', '08E', '-', '6', '\n']
 

JanC

Maarten van Reeuwijk said:
I found a complication with the shlex module. When I execute the
following fragment you'll notice that floating-point numbers are split
up. Is there any way to avoid this?

From the docs at <http://www.python.org/doc/current/lib/shlex-objects.html>

wordchars
The string of characters that will accumulate into multi-character
tokens. By default, includes all ASCII alphanumerics and underscore.

source = """
$NAMRUN
Lz = 0.15
nu = 1.08E-6
"""

import shlex
import StringIO

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker.commenters = ""
toker.whitespace = " \t\r"

toker.wordchars = toker.wordchars + ".-$" # etc.
print [tok for tok in toker]


Output:

['\n', '$NAMRUN', '\n', 'Lz', '=', '0.15', '\n', 'nu', '=', '1.08E-6', '\n']

Is this what you want?
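
One caveat: once '-' and '.' are word characters they no longer split
anything, so e.g. 'a-b' comes out as a single token. A quick, untested
illustration:

buf = StringIO.StringIO("a-b = x.y")
toker = shlex.shlex(buf)
toker.wordchars = toker.wordchars + ".-$"
print [tok for tok in toker]
# -> ['a-b', '=', 'x.y']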
 
