Looking for very simple general purpose tokenizer

Maarten van Reeuwijk

Hi group,

I need to parse various text files in Python. I was wondering if there is a
general-purpose tokenizer available. I know about split(), but this
(otherwise very handy) method does not allow me to specify a list of
splitting characters, only one at a time, and it removes my splitting
operators (fine for spaces and \n's, but not for =, /, etc.). Furthermore, I
tried tokenize, but that is specific to Python source and way too heavy for
me. I am looking for something like this:


splitchars = [' ', '\n', '=', '/', ....]
tokenlist = tokenize(rawfile, splitchars)

Is there something like this available inside Python or did anyone already
make this? Thank you in advance

Maarten
 

Eric Brunel

Maarten said:
I need to parse various text files in Python. I was wondering if there is a
general-purpose tokenizer available. [...]

You may use re.findall for that:
>>> import re
>>> s = "a = b+c; z = 34;"
>>> pat = " |=|;|[^ =;]*"
>>> re.findall(pat, s)
['a', ' ', '=', ' ', 'b+c', ';', ' ', 'z', ' ', '=', ' ', '34', ';', '']

The pattern basically says: match either a space, a '=', a ';', or a sequence
of any characters that are not space, '=' or ';'. You may have to take care
beforehand about special characters like \n or \ (these are very special in
regular expressions).
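
If you want to build the pattern directly from a list of split characters,
re.escape can do the escaping for you. A minimal, untested sketch (the
tokenize function below is just for illustration, not a standard one):

import re

def tokenize(rawfile, splitchars):
    # escape each split character so regex metacharacters
    # like '*' or '\' are matched literally
    escaped = [re.escape(c) for c in splitchars]
    # match any single split character, or a run of
    # characters that are not split characters
    pat = "|".join(escaped) + "|[^" + "".join(escaped) + "]+"
    return re.findall(pat, rawfile)

print tokenize("a = b/c", [' ', '\n', '=', '/'])
# -> ['a', ' ', '=', ' ', 'b', '/', 'c']

Using + instead of * in the trailing character class also avoids the empty
string you see at the end of the findall output above.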

HTH
 

Paul McGuire

Maarten van Reeuwijk said:
I need to parse various text files in Python. I was wondering if there is a
general-purpose tokenizer available. [...]
Maarten -
Please give my pyparsing module a try. You can download it from SourceForge
at http://pyparsing.sourceforge.net. I wrote it for just this purpose: it
allows you to define your own parsing patterns for any text data file, and
the tokenized results are returned in a dictionary or list, as you prefer.
The download also includes several examples - one especially difficult
file-parsing solution is shown in the dictExample.py script. And if you get
stuck, send me a sample of what you are trying to parse, and I can try to
give you some pointers (or even tell you if pyparsing isn't necessarily the
most appropriate tool for your job - it happens sometimes!).
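
To give a flavour, here is a minimal, untested sketch of the kind of
tokenizer asked for above, using pyparsing's basic elements (the particular
grammar is made up for illustration):

from pyparsing import Word, alphanums, oneOf, OneOrMore

# a token is either a "word" (a run of alphanumerics, '_' or '.')
# or one of the splitting operators, which are kept in the output
token = Word(alphanums + "_.") | oneOf("= / + - * ;")
tokenizer = OneOrMore(token)

print tokenizer.parseString("nu = 1.08 / nx")
# -> ['nu', '=', '1.08', '/', 'nx']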

-- Paul McGuire

Austin, Texas, USA
 

Alan Kennedy

Maarten said:
I need to parse various text files in python. I was wondering if
there was a general purpose tokenizer available.

Indeed there is: Python comes with batteries included. Try the shlex
module.

http://www.python.org/doc/lib/module-shlex.html

Try the following code: it seems to do what you want. If it doesn't,
then please be more specific on your tokenisation rules.

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
splitchars = [' ', '\n', '=', '/',]

source = """
thisshouldcome inthree parts
thisshould comeintwo
andso/shouldthis
and=this
"""

import shlex
import StringIO

def prepareToker(toker, splitters):
    for s in splitters: # resists People's Front of Judea joke ;-D
        if toker.whitespace.find(s) == -1:
            toker.whitespace = "%s%s" % (s, toker.whitespace)
    return toker

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker = prepareToker(toker, splitchars)
for num, tok in enumerate(toker):
    print "%s:%s" % (num, tok)
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Note that the use of the iteration-based interface in the above code
requires Python 2.3. If you need it to run on previous versions, specify
which one.
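
On pre-2.3 versions, an explicit get_token loop should give the same
result, since get_token returns an empty string once the stream is
exhausted. An untested sketch:

num = 0
tok = toker.get_token()
while tok:                      # get_token() returns "" at end of stream
    print "%s:%s" % (num, tok)
    num = num + 1
    tok = toker.get_token()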

regards,
 
M

Maarten van Reeuwijk

Thank you all for your very useful comments. Below I have included my
source. Could you comment on whether there's a more elegant way of
implementing the continuation character '&'? (One idea is sketched in a
comment after processTokens below.)

With the RE implementation I have noticed that the position of the '*' in
spclist is very delicate. This order works, but other orders throw
exceptions. Is this correct or is it a bug? Lastly, is there more
documentation and are there more examples for the shlex module? Ideally I
would like to see a full-scale example of how this module should be used
to parse.

Maarten

import re
import shlex
import StringIO

def splitf90(source):
    buf = StringIO.StringIO(source)
    toker = shlex.shlex(buf)
    toker.commenters = "!"
    toker.whitespace = " \t\r"
    return processTokens(toker)

def splitf90_re(source):
    # note the comma after '\)': without it Python silently glues
    # '\)' and '>' into the single alternative '\)>'
    spclist = ['\*', '\+', '-', '/', '=', '\[', '\]', '\(', '\)', \
               '>', '<', '&', ';', ',', ':', '!', ' ', '\n']
    pat = '|'.join(spclist) + '|[^' + ''.join(spclist) + ']+'
    rawtokens = re.findall(pat, source)
    return processTokens(rawtokens)

def processTokens(rawtokens):
    # substitute characters
    subst1 = []
    prevtoken = None
    for token in rawtokens:
        if token == ';': token = '\n'
        if token == ' ': token = ''
        if token == '\n' and prevtoken == '&': token = ''
        if not token == '':
            subst1.append(token)
            prevtoken = token

    # remove continuation chars
    subst2 = []
    for token in subst1:
        if token == '&': token = ''
        if not token == '':
            subst2.append(token)

    # split into lines
    final = []
    curline = []
    for token in subst2:
        if not token == '\n':
            curline.append(token)
        else:
            if not curline == []:
                final.append(curline)
                curline = []

    return final
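
# A possibly more elegant way to handle the continuation character:
# join continued lines in the raw source before tokenizing, e.g.
#
#   source = re.sub(r'&[ \t]*\n', ' ', source)
#
# Untested sketch; note that it will not join a line where a trailing
# '!' comment follows the '&', which the shlex version copes with by
# stripping the comment first.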

# Example session
src = """
MODULE modsize
implicit none

integer, parameter:: &
Nx = 256, &
Ny = 256, &
Nz = 256, &
nt = 1, & ! nr of (passive) scalars
Np = 16 ! nr of processors, should match mpirun -np .. command

END MODULE
"""
print splitf90(src)
print splitf90_re(src)

Output:
[['MODULE', 'modsize'], ['implicit', 'none'], ['integer', ',', 'parameter',
':', ':', 'Nx', '=', '256', ',', 'Ny', '=', '256', ',', 'Nz', '=', '256',
',', 'nt', '=', '1', ',', 'Np', '=', '16'], ['END', 'MODULE']]

[['MODULE', 'modsize'], ['implicit', 'none'], ['integer', ',', 'parameter',
':', ':', 'Nx', '=', '256', ',', 'Ny', '=', '256', ',', 'Nz', '=', '256',
',', 'nt', '=', '1', ',', '!', 'nr', 'of', '(', 'passive', 'scalars'],
['Np', '=', '16', '!', 'nr', 'of', 'processors', ',', 'should', 'match',
'mpirun', '-', 'np', 'command'], ['END', 'MODULE']]
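
As to the delicate ordering of spclist: the '-' entry is unescaped, so
inside the [^...] character class it forms a range with its neighbours,
and an inverted range raises an exception. The order above happens to
produce a valid range, but moving e.g. '\*' next to '-' would not.
Standard re behaviour:

>>> import re
>>> re.findall('[^+-*]', 'x')  # '-' makes this the inverted range '+' to '*'
Traceback (most recent call last):
...
sre_constants.error: bad character range

Escaping every split character (e.g. with re.escape, as in the sketch
earlier in the thread) makes the order irrelevant.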
 

Maarten van Reeuwijk

I found a complication with the shlex module. When I execute the following
fragment you'll notice that floating-point numbers are split up. Is there
any way to avoid this?


source = """
$NAMRUN
Lz = 0.15
nu = 1.08E-6
"""

import shlex
import StringIO

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker.commenters = ""  # note: the shlex attribute is 'commenters', not 'comments'
toker.whitespace = " \t\r"
print [tok for tok in toker]

Output:
['\n', '$', 'NAMRUN', '\n', 'Lz', '=', '0', '.', '15', '\n', 'nu', '=', '1',
'.', '08E', '-', '6', '\n']
 

JanC

Maarten van Reeuwijk said:
I found a complication with the shlex module. When I execute the
following fragment you'll notice that floating-point numbers are split
up. Is there any way to avoid this?

From the docs at <http://www.python.org/doc/current/lib/shlex-objects.html>

wordchars
The string of characters that will accumulate into multi-character
tokens. By default, includes all ASCII alphanumerics and underscore.

source = """
$NAMRUN
Lz = 0.15
nu = 1.08E-6
"""

import shlex
import StringIO

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker.commenters = ""
toker.whitespace = " \t\r"

toker.wordchars = toker.wordchars + ".-$" # etc.
print [tok for tok in toker]


Output:

['\n', '$NAMRUN', '\n', 'Lz', '=', '0.15', '\n', 'nu', '=', '1.08E-6', '\n']

Is this what you want?
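
One caveat: once '-' and '.' are word characters they no longer split
anything, so e.g. 'a-b' comes out as a single token. A quick, untested
illustration:

buf = StringIO.StringIO("a-b = x.y")
toker = shlex.shlex(buf)
toker.wordchars = toker.wordchars + ".-$"
print [tok for tok in toker]
# -> ['a-b', '=', 'x.y']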
 
