multi split function taking delimiter list

Discussion in 'Python' started by martinskou@gmail.com, Nov 14, 2006.

  1. Guest

    Hi, I'm looking for something like:

    multi_split( 'a:=b+c' , [':=','+'] )

    returning:
    ['a', ':=', 'b', '+', 'c']

    What's the Python way to achieve this, preferably without regexp?

    Thanks.

    Martin
     
    , Nov 14, 2006
    #1

  2. wrote:
    > Hi, I'm looking for something like:
    >
    > multi_split( 'a:=b+c' , [':=','+'] )
    >
    > returning:
    > ['a', ':=', 'b', '+', 'c']
    >
    > whats the python way to achieve this, preferably without regexp?


    I think regexps are likely the right way to do this kind of
    tokenization.

    The string split() method doesn't return the split value, so it is
    less than helpful for your application: 'a=b'.split('=') --> ['a',
    'b']

    The new str.partition() method will return the split value and is
    suitable for successive applications: 'a:=b+c'.partition(':=') -->
    ('a', ':=', 'b+c')
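
    As a rough sketch of what "successive applications" of str.partition()
    could look like without regexps: the helper below (partition_split is a
    made-up name, not something from the standard library) repeatedly
    partitions on whichever splitter occurs earliest. It assumes the
    splitters are non-empty strings, prefers the longer splitter on a tie,
    and drops empty fields.

```python
def partition_split(text, splitters):
    """Split text on any splitter, keeping the splitters in the result."""
    tokens = []
    while text:
        # Earliest occurrence wins; on a tie the longer splitter wins.
        hits = [(text.find(sp), -len(sp), sp) for sp in splitters if sp in text]
        if not hits:
            tokens.append(text)  # no splitter left, keep the remainder
            break
        _, _, splitter = min(hits)
        head, sep, text = text.partition(splitter)
        if head:                 # skip empty fields between adjacent splitters
            tokens.append(head)
        tokens.append(sep)
    return tokens

print(partition_split('a:=b+c', [':=', '+']))  # ['a', ':=', 'b', '+', 'c']
```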

    FWIW, when someone actually does want something that behaves like
    str.split() but with multiple split values, one approach is to replace
    each of the possible splitters with a single splitter:

    def multi_split(s, splitters):
        first = splitters[0]
        for splitter in splitters:
            s = s.replace(splitter, first)
        return s.split(first)

    print(multi_split('a:=b+c', [':=', '+']))


    Raymond
     
    Raymond Hettinger, Nov 14, 2006
    #2

  3. Peter Otten Guest

    wrote:

    > Hi, I'm looking for something like:
    >
    > multi_split( 'a:=b+c' , [':=','+'] )
    >
    > returning:
    > ['a', ':=', 'b', '+', 'c']
    >
    > whats the python way to achieve this, preferably without regexp?


    I think in this case the regexp approach is the simplest, though:

    >>> import re
    >>> def multi_split(text, splitters):
    ...     pattern = "(%s)" % "|".join(re.escape(s) for s in splitters)
    ...     return re.split(pattern, text)
    ...
    >>> multi_split("a:=b+c", [":=", "+"])
    ['a', ':=', 'b', '+', 'c']
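
    One thing to watch with a joined alternation: the regex engine takes the
    first alternative that matches, so if one splitter is a prefix of
    another (say ':' and ':='), the order of the list matters. A possible
    variant, sketched here under the assumption that longer splitters should
    win (multi_split_longest_first is just an illustrative name):

```python
import re

def multi_split_longest_first(text, splitters):
    # Try longer splitters first so ':=' is matched before ':'.
    ordered = sorted(splitters, key=len, reverse=True)
    pattern = "(%s)" % "|".join(re.escape(s) for s in ordered)
    return re.split(pattern, text)

print(multi_split_longest_first("a:=b:c", [":", ":="]))
# ['a', ':=', 'b', ':', 'c']
```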

    Peter
     
    Peter Otten, Nov 14, 2006
    #3
  4. Kent Johnson Guest

    wrote:
    > Hi, I'm looking for something like:
    >
    > multi_split( 'a:=b+c' , [':=','+'] )
    >
    > returning:
    > ['a', ':=', 'b', '+', 'c']
    >
    > whats the python way to achieve this, preferably without regexp?


    What do you have against regexp? re.split() does exactly what you want:

    In [1]: import re

    In [2]: re.split(r'(:=|\+)', 'a:=b+c')
    Out[2]: ['a', ':=', 'b', '+', 'c']
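
    The capturing parentheses are what make re.split() return the
    delimiters. A small comparison, for illustration only (the variable
    names are arbitrary):

```python
import re

# Without a capturing group the delimiters are discarded.
without_group = re.split(r':=|\+', 'a:=b+c')
# With a capturing group the matched delimiters are kept in the result.
with_group = re.split(r'(:=|\+)', 'a:=b+c')

print(without_group)  # ['a', 'b', 'c']
print(with_group)     # ['a', ':=', 'b', '+', 'c']
```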

    Kent
     
    Kent Johnson, Nov 14, 2006
    #4
  5. Paddy Guest

    wrote:

    > Hi, I'm looking for something like:
    >
    > multi_split( 'a:=b+c' , [':=','+'] )
    >
    > returning:
    > ['a', ':=', 'b', '+', 'c']
    >
    > whats the python way to achieve this, preferably without regexp?
    >
    > Thanks.
    >
    > Martin


    I resisted my urge to use a regexp and came up with this:

    >>> from itertools import groupby
    >>> s = 'apple=blue+cart'
    >>> [''.join(g) for k,g in groupby(s, lambda x: x in '=+')]

    ['apple', '=', 'blue', '+', 'cart']
    >>>


    For me, the regexp solution would have been clearer, but I need to
    stretch my itertools skills.

    - Paddy.
     
    Paddy, Nov 14, 2006
    #5
  6. Sam Pointon Guest

    On Nov 14, 7:56 pm, "" <>
    wrote:
    > Hi, I'm looking for something like:
    >
    > multi_split( 'a:=b+c' , [':=','+'] )
    >
    > returning:
    > ['a', ':=', 'b', '+', 'c']
    >
    > whats the python way to achieve this, preferably without regexp?


    pyparsing <http://pyparsing.wikispaces.com/> is quite a cool package
    for doing this sort of thing. Using your example:

    #untested
    from pyparsing import Word, alphas, oneOf, OneOrMore

    splitat = oneOf(":= +")
    lexeme = Word(alphas)
    grammar = OneOrMore(splitat | lexeme)

    grammar.parseString("a:=b+c")
    #returns (the equivalent of) ['a', ':=', 'b', '+', 'c'].

    --Sam
     
    Sam Pointon, Nov 14, 2006
    #6
  7. Paddy Guest

    Paddy wrote:

    > wrote:
    >
    > > Hi, I'm looking for something like:
    > >
    > > multi_split( 'a:=b+c' , [':=','+'] )
    > >
    > > returning:
    > > ['a', ':=', 'b', '+', 'c']
    > >
    > > whats the python way to achieve this, preferably without regexp?
    > >
    > > Thanks.
    > >
    > > Martin

    >
    > I resisted my urge to use a regexp and came up with this:
    >
    > >>> from itertools import groupby
    > >>> s = 'apple=blue+cart'
    > >>> [''.join(g) for k,g in groupby(s, lambda x: x in '=+')]

    > ['apple', '=', 'blue', '+', 'cart']
    > >>>

    >
    > For me, the regexp solution would have been clearer, but I need to
    > stretch my itertools skills.
    >
    > - Paddy.

    Arghhh!
    No colon!
    Forget the above please.

    - Pad.
     
    Paddy, Nov 15, 2006
    #7
  8. Paddy Guest

    Paddy wrote:

    > Paddy wrote:
    >
    > > wrote:
    > >
    > > > Hi, I'm looking for something like:
    > > >
    > > > multi_split( 'a:=b+c' , [':=','+'] )
    > > >
    > > > returning:
    > > > ['a', ':=', 'b', '+', 'c']
    > > >
    > > > whats the python way to achieve this, preferably without regexp?
    > > >
    > > > Thanks.
    > > >
    > > > Martin

    > >
    > > I resisted my urge to use a regexp and came up with this:
    > >
    > > >>> from itertools import groupby
    > > >>> s = 'apple=blue+cart'
    > > >>> [''.join(g) for k,g in groupby(s, lambda x: x in '=+')]

    > > ['apple', '=', 'blue', '+', 'cart']
    > > >>>

    > >
    > > For me, the regexp solution would have been clearer, but I need to
    > > stretch my itertools skills.
    > >
    > > - Paddy.

    > Arghhh!
    > No colon!
    > Forget the above please.
    >
    > - Pad.


    With colon:

    >>> from itertools import groupby
    >>> s = 'apple:=blue+cart'
    >>> [''.join(g) for k,g in groupby(s,lambda x: x in ':=+')]

    ['apple', ':=', 'blue', '+', 'cart']
    >>>


    - Pad.
     
    Paddy, Nov 15, 2006
    #8
  9. Paddy wrote:
    > Paddy wrote:
    >
    >> Paddy wrote:
    >>
    >>> wrote:
    >>>
    >>>> Hi, I'm looking for something like:
    >>>>
    >>>> multi_split( 'a:=b+c' , [':=','+'] )
    >>>>
    >>>> returning:
    >>>> ['a', ':=', 'b', '+', 'c']
    >>>>
    >>>> whats the python way to achieve this, preferably without regexp?
    >>>>
    >>>> Thanks.
    >>>>
    >>>> Martin
    >>> I resisted my urge to use a regexp and came up with this:
    >>>
    >>>>>> from itertools import groupby
    >>>>>> s = 'apple=blue+cart'
    >>>>>> [''.join(g) for k,g in groupby(s, lambda x: x in '=+')]
    >>> ['apple', '=', 'blue', '+', 'cart']
    >>> For me, the regexp solution would have been clearer, but I need to
    >>> stretch my itertools skills.
    >>>
    >>> - Paddy.

    >> Arghhh!
    >> No colon!
    >> Forget the above please.
    >>
    >> - Pad.

    >
    > With colon:
    >
    >>>> from itertools import groupby
    >>>> s = 'apple:=blue+cart'
    >>>> [''.join(g) for k,g in groupby(s,lambda x: x in ':=+')]

    > ['apple', ':=', 'blue', '+', 'cart']
    >
    > - Pad.
    >

    Automatic grouping may or may not work as intended. If some subsets
    should not be split, the solution raises a new problem.
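
    To make that concrete, here is a small sketch of the groupby approach
    wrapped in a function (char_group_split is a hypothetical name). Because
    it groups by character class rather than by whole delimiter, adjacent
    delimiters run together and a lone '=' is treated as a splitter too:

```python
from itertools import groupby

def char_group_split(s, delimiter_chars):
    # Group consecutive characters by whether they are delimiter characters.
    return ["".join(g) for _, g in groupby(s, lambda ch: ch in delimiter_chars)]

print(char_group_split("apple:=blue+cart", ":=+"))  # ['apple', ':=', 'blue', '+', 'cart']
print(char_group_split("a:=+b", ":=+"))             # ['a', ':=+', 'b']
print(char_group_split("a=b", ":=+"))               # ['a', '=', 'b']
```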

    I have been demonstrating solutions based on SE with such frequency of
    late that I have begun to irritate some readers and SE in sarcastic
    exaggeration has been characterized as the 'Solution of Everything'.
    With some trepidation I am going to demonstrate another SE solution,
    because the truth of the exaggeration is that SE is a versatile tool for
    handling a variety of relatively simple problems in a simple,
    straightforward manner.

    >>> test_string = 'a:=b+c: apple:=blue:+cart'
    >>> SE.SE (':\==/:\=/ +=/+/')(test_string).split ('/') # For repeats the SE object would be assigned to a variable
    ['a', ':=', 'b', '+', 'c: apple', ':=', 'blue:', '+', 'cart']

    This is a nuts-and-bolts approach. What you do is what you get. What you
    want is what you do. By itself SE doesn't do anything but search and
    replace, a concept without a learning curve. The simplicity doesn't
    suggest versatility. Versatility comes from application techniques.
    SE is a game of challenge. You know the result you want. You know
    the pieces you have. The game is how to get the result with the pieces
    using search and replace, either per se or as an auxiliary, as in this
    case for splitting. That's all. The example above inserts some
    appropriate split mark ('/'). It takes thirty seconds to write it up and
    see the result. No need to ponder formulas and inner workings. If you
    don't like what you see you also see what needs to be changed. Supposing
    we should split single colons too, adding the corresponding substitution
    and verifying the effect is a matter of another ten seconds:

    >>> SE.SE (':\==/:\=/ +=/+/ :=/:/')(test_string).split ('/')
    ['a', ':=', 'b', '+', 'c', ':', ' apple', ':=', 'blue', ':', '', '+', 'cart']

    Now we see an empty field we don't like towards the end. Why?

    >>> SE.SE (':\==/:\=/ +=/+/ :=/:/')(test_string)

    'a/:=/b/+/c/:/ apple/:=/blue/://+/cart'

    Ah! It's two slashes next to each other. No problem. We de-multiply
    double slashes in a second pass:

    >>> SE.SE (':\==/:\=/ +=/+/ :=/:/ | //=/')(test_string).split ('/')

    ['a', ':=', 'b', '+', 'c', ':', ' apple', ':=', 'blue', ':', '+', 'cart']

    On second thought the colon should not be split if a plus sign follows:

    >>> SE.SE (':\==/:\=/ +=/+/ :=/:/ :+=:/+/ | //=/')(test_string).split ('/')


    ['a', ':=', 'b', '+', 'c', ':', ' apple', ':=', 'blue:', '+', 'cart']

    No, wrong again! 'Colon-plus' should be exempt altogether. And no spaces
    please:

    >>> SE.SE (':\==/:\=/ +=/+/ :=/:/ :+=:+ " =" | //=/')(test_string).split ('/')
    ['a', ':=', 'b', '+', 'c', ':', 'apple', ':=', 'blue:+cart']

    etc.

    It is easy to get carried away and to forget that SE should not be used
    instead of Python's built-ins, or to get carried away doing contextual
    or grammar processing explicitly, which gets messy very fast. SE fills a
    gap somewhere between built-ins and parsers.
    Stream editing is not a mainstream technique. I believe it has the
    potential to make many simple problems trivial and many harder ones
    simpler. This is why I believe the technique deserves more attention,
    which, again, may explain the focus of my posts.
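
    For readers without SE installed, the mark-and-split idea can be
    approximated with the standard library. This is only a sketch of the
    technique, not of SE itself: mark_and_split is a made-up name, it uses
    re.sub for the single search-and-replace pass, it assumes the mark
    ('/') never occurs in the input, and it drops empty fields instead of
    collapsing double marks in a second pass.

```python
import re

def mark_and_split(text, splitters, mark="/"):
    # Longer splitters first, so ':=' is marked as a unit before ':'.
    ordered = sorted(splitters, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in ordered)
    # Single pass: surround every splitter with the split mark.
    marked = re.sub(pattern, lambda m: mark + m.group(0) + mark, text)
    # Split on the mark and drop the empty fields left by adjacent marks.
    return [field for field in marked.split(mark) if field]

test_string = 'a:=b+c: apple:=blue:+cart'
print(mark_and_split(test_string, [':=', '+', ':']))
# ['a', ':=', 'b', '+', 'c', ':', ' apple', ':=', 'blue', ':', '+', 'cart']
```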

    Frederic
     
    Frederic Rentsch, Nov 16, 2006
    #9
  10. Paul McGuire Guest

    On Nov 14, 5:41 pm, "Sam Pointon" <> wrote:
    > On Nov 14, 7:56 pm, "" <>
    > wrote:
    >
    > > Hi, I'm looking for something like:

    >
    > > multi_split( 'a:=b+c' , [':=','+'] )

    >
    > > returning:
    > > ['a', ':=', 'b', '+', 'c']

    >
    > > whats the python way to achieve this, preferably without regexp?

    >
    > pyparsing <http://pyparsing.wikispaces.com/> is quite a cool package
    > for doing this sort of thing.


    Thanks for mentioning pyparsing, Sam!

    This is a good example of using pyparsing for just basic tokenizing,
    and it will do a nice job of splitting up the tokens, whether there is
    whitespace or not.

    For instance, if you were tokenizing using the string split() method,
    you would get nice results from "a := b + c", but not so good from "a:=
    b+ c". Using Sam Pointon's simple pyparsing expression, you can split
    up the arithmetic using the symbol expressions, and the whitespace is
    pretty much ignored.
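
    The whitespace point can be illustrated with the standard library alone.
    The tokenize() helper below is a hypothetical regexp stand-in, not the
    pyparsing grammar: it matches ':=', '+', or a run of word characters and
    simply skips whatever whitespace lies between them.

```python
import re

def tokenize(text):
    # Match ':=' or '+' or a run of word characters; whitespace is skipped.
    return re.findall(r":=|\+|\w+", text)

print("a := b + c".split())  # ['a', ':=', 'b', '+', 'c']
print("a:= b+ c".split())    # ['a:=', 'b+', 'c']
print(tokenize("a:= b+ c"))  # ['a', ':=', 'b', '+', 'c']
```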

    But pyparsing can be used for more than just tokenizing. Here is a
    slightly longer pyparsing example, using a new pyparsing helper method
    called operatorPrecedence, which can shortcut the definition of
    operator-separated expressions with () grouping. Note how this not
    only tokenizes the expression, but also identifies the implicit groups
    based on operator precedence. Finally, pyparsing allows you to label
    the parsed results - in this case, you can reference the LHS and RHS
    sides of your assignment statement using the attribute names "lhs" and
    "rhs". This can really be handy for complicated grammars.

    -- Paul


    from pyparsing import *

    number = Word(nums)
    variable = Word(alphas)
    operand = number | variable

    arithexpr = operatorPrecedence( operand,
        [("!", 1, opAssoc.LEFT), # factorial
         ("^", 2, opAssoc.RIGHT), # exponentiation
         (oneOf('+ -'), 1, opAssoc.RIGHT), # leading sign
         (oneOf('* /'), 2, opAssoc.LEFT), # multiplication
         (oneOf('+ -'), 2, opAssoc.LEFT),] # addition
        )

    assignment = (variable.setResultsName("lhs") +
                  ":=" +
                  arithexpr.setResultsName("rhs"))

    test = ["a:= b+c",
            "a := b + -c",
            "y := M*X + B",
            "e := m * c^2",]

    for t in test:
        tokens = assignment.parseString(t)
        print tokens.asList()
        print tokens.lhs, "<-", tokens.rhs
        print

    Prints:
    ['a', ':=', ['b', '+', 'c']]
    a <- ['b', '+', 'c']

    ['a', ':=', ['b', '+', ['-', 'c']]]
    a <- ['b', '+', ['-', 'c']]

    ['y', ':=', [['M', '*', 'X'], '+', 'B']]
    y <- [['M', '*', 'X'], '+', 'B']

    ['e', ':=', ['m', '*', ['c', '^', '2']]]
    e <- ['m', '*', ['c', '^', '2']]
     
    Paul McGuire, Nov 16, 2006
    #10
