Looking for very simple general purpose tokenizer

Discussion in 'Python' started by Maarten van Reeuwijk, Jan 19, 2004.

  1. Hi group,

    I need to parse various text files in Python. I was wondering if there was a
    general purpose tokenizer available. I know about split(), but this
    (otherwise very handy) method does not allow me to specify a list of
    splitting characters, only one at a time, and it removes my splitting
    operators (OK for spaces and \n's, but not for =, /, etc.). Furthermore, I
    tried tokenize, but this is specific to Python source and is way too heavy
    for me. I am looking for something like this:


    splitchars = [' ', '\n', '=', '/', ....]
    tokenlist = tokenize(rawfile, splitchars)

    Is there something like this available inside Python or has anyone already
    made this? Thank you in advance

    Maarten
    --
    ===================================================================
    Maarten van Reeuwijk - Heat and Fluid Sciences
    PhD student - dept. of Multiscale Physics
    www.ws.tn.tudelft.nl - Delft University of Technology
     
    Maarten van Reeuwijk, Jan 19, 2004
    #1

  2. Eric Brunel (Guest)

    Maarten van Reeuwijk wrote:
    > Hi group,
    >
    > I need to parse various text files in Python. I was wondering if there was a
    > general purpose tokenizer available. I know about split(), but this
    > (otherwise very handy) method does not allow me to specify a list of
    > splitting characters, only one at a time, and it removes my splitting
    > operators (OK for spaces and \n's, but not for =, /, etc.). Furthermore, I
    > tried tokenize, but this is specific to Python source and is way too heavy
    > for me. I am looking for something like this:
    >
    >
    > splitchars = [' ', '\n', '=', '/', ....]
    > tokenlist = tokenize(rawfile, splitchars)
    >
    > Is there something like this available inside Python or has anyone already
    > made this? Thank you in advance


    You may use re.findall for that:

    >>> import re
    >>> s = "a = b+c; z = 34;"
    >>> pat = " |=|;|[^ =;]*"
    >>> re.findall(pat, s)

    ['a', ' ', '=', ' ', 'b+c', ';', ' ', 'z', ' ', '=', ' ', '34', ';', '']

    The pattern basically says: match either a space, a '=', a ';', or a sequence
    of any characters that are not space, '=' or ';'. You may have to take care
    beforehand of special characters like \n or \ (which are very special in
    regular expressions).
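
    To generalize this to an arbitrary list of splitting characters, re.escape
    can build such a pattern safely. A minimal sketch (the tokenize_simple name
    is illustrative, not a standard function):

    import re

    def tokenize_simple(text, splitchars):
        # Escape each splitter so regex metacharacters like '*' or '[' stay
        # literal, then match either a single splitter or a run of
        # non-splitter characters. Using '+' instead of '*' avoids the
        # empty match at the end of the input.
        escaped = [re.escape(c) for c in splitchars]
        pat = "|".join(escaped) + "|[^" + "".join(escaped) + "]+"
        return re.findall(pat, text)

    print tokenize_simple("a = b/c", [' ', '\n', '=', '/'])
    # -> ['a', ' ', '=', ' ', 'b', '/', 'c']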

    HTH
    --
    - Eric Brunel <eric dot brunel at pragmadev dot com> -
    PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com
     
    Eric Brunel, Jan 19, 2004
    #2

  3. Paul McGuire (Guest)

    "Maarten van Reeuwijk" <maarten@remove_this_ws.tn.tudelft.nl> wrote in
    message news:bug9ij$30k$...
    > Hi group,
    >
    > I need to parse various text files in Python. I was wondering if there was a
    > general purpose tokenizer available. I know about split(), but this
    > (otherwise very handy) method does not allow me to specify a list of
    > splitting characters, only one at a time, and it removes my splitting
    > operators (OK for spaces and \n's, but not for =, /, etc.). Furthermore, I
    > tried tokenize, but this is specific to Python source and is way too heavy
    > for me. I am looking for something like this:
    >
    >
    > splitchars = [' ', '\n', '=', '/', ....]
    > tokenlist = tokenize(rawfile, splitchars)
    >
    > Is there something like this available inside Python or has anyone already
    > made this? Thank you in advance
    >
    > Maarten
    > --
    > ===================================================================
    > Maarten van Reeuwijk - Heat and Fluid Sciences
    > PhD student - dept. of Multiscale Physics
    > www.ws.tn.tudelft.nl - Delft University of Technology

    Maarten -
    Please give my pyparsing module a try. You can download it from SourceForge
    at http://pyparsing.sourceforge.net. I wrote it for just this purpose: it
    allows you to define your own parsing patterns for any text data file, and
    the tokenized results are returned in a dictionary or list, as you prefer.
    The download also includes several examples - one especially difficult
    file-parsing solution is shown in the dictExample.py script. And if you get
    stuck, send me a sample of what you are trying to parse, and I can try to
    give you some pointers (or even tell you if pyparsing isn't necessarily the
    most appropriate tool for your job - it happens sometimes!).
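
    For a taste of the API, a tiny grammar in this spirit might look as follows
    (a sketch using pyparsing's basic building blocks, not one of the bundled
    examples):

    from pyparsing import Word, alphanums, oneOf, OneOrMore

    # A token is either a run of alphanumerics or one of the listed
    # operator characters; pyparsing skips whitespace between tokens.
    token = Word(alphanums) | oneOf("= / + - ; ( )")
    tokenizer = OneOrMore(token)

    print tokenizer.parseString("a = b+c; z = 34;")
    # -> ['a', '=', 'b', '+', 'c', ';', 'z', '=', '34', ';']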

    -- Paul McGuire

    Austin, Texas, USA
     
    Paul McGuire, Jan 19, 2004
    #3
  4. Alan Kennedy (Guest)

    Maarten van Reeuwijk wrote:
    > I need to parse various text files in python. I was wondering if
    > there was a general purpose tokenizer available.


    Indeed there is: Python comes with batteries included. Try the shlex
    module.

    http://www.python.org/doc/lib/module-shlex.html

    Try the following code: it seems to do what you want. If it doesn't,
    then please be more specific about your tokenisation rules.

    #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    splitchars = [' ', '\n', '=', '/',]

    source = """
    thisshouldcome inthree parts
    thisshould comeintwo
    andso/shouldthis
    and=this
    """

    import shlex
    import StringIO

    def prepareToker(toker, splitters):
        for s in splitters: # resists People's Front of Judea joke ;-D
            if toker.whitespace.find(s) == -1:
                toker.whitespace = "%s%s" % (s, toker.whitespace)
        return toker

    buf = StringIO.StringIO(source)
    toker = shlex.shlex(buf)
    toker = prepareToker(toker, splitchars)
    for num, tok in enumerate(toker):
        print "%s:%s" % (num, tok)
    #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

    Note that the use of the iteration-based interface in the above code
    requires Python 2.3. If you need it to run on previous versions,
    specify which one.
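
    On older versions, the explicit get_token() interface can be used instead;
    it returns the tokenizer's eof marker (the empty string by default) when
    the input is exhausted. A minimal sketch:

    # Pre-2.3 replacement for the enumerate() loop above.
    num = 0
    while 1:
        tok = toker.get_token()
        if tok == toker.eof:
            break
        print "%s:%s" % (num, tok)
        num = num + 1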

    regards,

    --
    alan kennedy
    ------------------------------------------------------
    check http headers here: http://xhaus.com/headers
    email alan: http://xhaus.com/contact/alan
     
    Alan Kennedy, Jan 19, 2004
    #4
  5. Thank you all for your very useful comments. Below I have included my
    source. Could you comment on whether there's a more elegant way of
    implementing the continuation character &?

    With the RE implementation I have noticed that the position of the '*' in
    spclist is very delicate. This order works, but other orders throw
    exceptions. Is this correct or is it a bug? Lastly, are there more
    documentation and examples for the shlex module? Ideally I would like to
    see a full-scale example of how this module should be used for parsing.

    Maarten

    import re
    import shlex
    import StringIO

    def splitf90(source):
        buf = StringIO.StringIO(source)
        toker = shlex.shlex(buf)
        toker.commenters = "!"
        toker.whitespace = " \t\r"
        return processTokens(toker)

    def splitf90_re(source):
        spclist = ['\*', '\+', '-', '/', '=', '\[', '\]', '\(', '\)',
                   '>', '<', '&', ';', ',', ':', '!', ' ', '\n']
        pat = '|'.join(spclist) + '|[^' + ''.join(spclist) + ']+'
        rawtokens = re.findall(pat, source)
        return processTokens(rawtokens)

    def processTokens(rawtokens):
        # substitute characters
        subst1 = []
        prevtoken = None
        for token in rawtokens:
            if token == ';': token = '\n'
            if token == ' ': token = ''
            if token == '\n' and prevtoken == '&': token = ''
            if not token == '':
                subst1.append(token)
                prevtoken = token

        # remove continuation chars
        subst2 = []
        for token in subst1:
            if token == '&': token = ''
            if not token == '':
                subst2.append(token)

        # split into lines
        final = []
        curline = []
        for token in subst2:
            if not token == '\n':
                curline.append(token)
            else:
                if not curline == []:
                    final.append(curline)
                    curline = []

        return final

    # Example session
    src = """
    MODULE modsize
    implicit none

    integer, parameter:: &
    Nx = 256, &
    Ny = 256, &
    Nz = 256, &
    nt = 1, & ! nr of (passive) scalars
    Np = 16 ! nr of processors, should match mpirun -np .. command

    END MODULE
    """
    print splitf90(src)
    print splitf90_re(src)

    Output:
    [['MODULE', 'modsize'], ['implicit', 'none'], ['integer', ',', 'parameter',
    ':', ':', 'Nx', '=', '256', ',', 'Ny', '=', '256', ',', 'Nz', '=', '256',
    ',', 'nt', '=', '1', ',', 'Np', '=', '16'], ['END', 'MODULE']]

    [['MODULE', 'modsize'], ['implicit', 'none'], ['integer', ',', 'parameter',
    ':', ':', 'Nx', '=', '256', ',', 'Ny', '=', '256', ',', 'Nz', '=', '256',
    ',', 'nt', '=', '1', ',', '!', 'nr', 'of', '(', 'passive', 'scalars'],
    ['Np', '=', '16', '!', 'nr', 'of', 'processors', ',', 'should', 'match',
    'mpirun', '-', 'np', 'command'], ['END', 'MODULE']]
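
    As an aside, the '&' continuations could also be handled by a regular
    expression pre-pass that joins continued lines before any tokenizing. A
    minimal sketch, assuming '&' is always the last non-blank character on a
    continued line (trailing comments after '&' would need extra care):

    import re

    def joincontinuations(source):
        # Collapse '&' plus the following newline (and surrounding blanks)
        # into a single space, so continued statements form one logical line.
        return re.sub(r"\s*&\s*\n\s*", " ", source)

    print repr(joincontinuations("Nx = 256, &\n Ny = 256\n"))
    # -> 'Nx = 256, Ny = 256\n'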

    --
    ===================================================================
    Maarten van Reeuwijk - Heat and Fluid Sciences
    PhD student - dept. of Multiscale Physics
    www.ws.tn.tudelft.nl - Delft University of Technology
     
    Maarten van Reeuwijk, Jan 20, 2004
    #5
  6. I found a complication with the shlex module. When I execute the following
    fragment, you'll notice that doubles are split. Is there any way to avoid
    splitting numbers like this?


    source = """
    $NAMRUN
    Lz = 0.15
    nu = 1.08E-6
    """

    import shlex
    import StringIO

    buf = StringIO.StringIO(source)
    toker = shlex.shlex(buf)
    toker.commenters = ""
    toker.whitespace = " \t\r"
    print [tok for tok in toker]

    Output:
    ['\n', '$', 'NAMRUN', '\n', 'Lz', '=', '0', '.', '15', '\n', 'nu', '=', '1',
    '.', '08E', '-', '6', '\n']


    --
    ===================================================================
    Maarten van Reeuwijk - Heat and Fluid Sciences
    PhD student - dept. of Multiscale Physics
    www.ws.tn.tudelft.nl - Delft University of Technology
     
    Maarten van Reeuwijk, Jan 20, 2004
    #6
  7. JanC (Guest)

    Maarten van Reeuwijk <maarten@remove_this_ws.tn.tudelft.nl> wrote:

    > I found a complication with the shlex module. When I execute the
    > following fragment, you'll notice that doubles are split. Is there any
    > way to avoid splitting numbers like this?


    From the docs at <http://www.python.org/doc/current/lib/shlex-objects.html>:

    wordchars
    The string of characters that will accumulate into multi-character
    tokens. By default, includes all ASCII alphanumerics and underscore.

    > source = """
    > $NAMRUN
    > Lz = 0.15
    > nu = 1.08E-6
    > """
    >
    > import shlex
    > import StringIO
    >
    > buf = StringIO.StringIO(source)
    > toker = shlex.shlex(buf)
    > toker.commenters = ""
    > toker.whitespace = " \t\r"


    toker.wordchars = toker.wordchars + ".-$" # etc.

    > print [tok for tok in toker]



    Output:

    ['\n', '$NAMRUN', '\n', 'Lz', '=', '0.15', '\n', 'nu', '=', '1.08E-6', '\n']

    Is this what you want?

    --
    JanC

    "Be strict when sending and tolerant when receiving."
    RFC 1958 - Architectural Principles of the Internet - section 3.9
     
    JanC, Jan 21, 2004
    #7