splitting delimited strings


Mark Harrison

What is the best way to process a text file of delimited strings?
I've got a file where strings are quoted with at-signs, @like this@.
At-signs in the string are represented as doubled @@.

What's the most efficient way to process this? Failing all
else I will split the string into characters and use an FSM,
but it seems that's not very pythonesque.

@rv@ 2 @db.locks@ @//depot/hello.txt@ @mh@ @mh@ 1 1 44
@pv@ 0 @db.changex@ 44 44 @mh@ @mh@ 1118875308 0 @ :mad:@: :mad:@@@: @

(this is from a perforce journal file, btw)

Many TIA!
Mark
 

Paul McNett

Mark said:
What is the best way to process a text file of delimited strings?
I've got a file where strings are quoted with at-signs, @like this@.
At-signs in the string are represented as doubled @@.

Have you taken a look at the csv module yet? No guarantees, but it may
just work. You'd have to set delimiter to ' ' and quotechar to '@'. You
may need to manually handle the double-@ thing, but why don't you see
how close you can get with csv?
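
For example, something along these lines (untested; the second sample line
is modified to show the doubled @@, it isn't from your journal):

import csv

sample = [
    '@rv@ 2 @db.locks@ @//depot/hello.txt@ @mh@ @mh@ 1 1 44',
    '@pv@ 0 @db.changex@ 44 44 @mh@ @mh@ 1118875308 0 @a doubled @@ sign@',
]

# delimiter=' ' splits fields on spaces, quotechar='@' strips the @ quoting,
# and the default doublequote=True turns @@ inside a quoted field back into @.
for row in csv.reader(sample, delimiter=' ', quotechar='@'):
    print row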
 

Christoph Rackwitz

You could use regular expressions... it's an FSM of some kind but it's
faster *g*. Check out this snippet:

import re

def mysplit(s):
    # Each field is either a quoted string or a run of non-space characters.
    pattern = r'((?:"[^"]*")|(?:[^ ]+))'
    tmp = re.split(pattern, s)
    # Strip the surrounding quotes from quoted fields; drop the empty pieces.
    return [i[1:-1] if i[0] in ('"', "'") else i for i in tmp if i.strip()]

mysplit('foo bar "baz foo" bar baz')   # reconstructed example call
['foo', 'bar', 'baz foo', 'bar', 'baz']
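
Adapting that idea to the @ convention in your file (where @@ inside a
quoted field stands for a literal @) might look something like the sketch
below; the pattern and the split_fields name are just illustrative:

import re

# A field is either an @-quoted run (with @@ meaning a literal @) or a bare token.
FIELD = re.compile(r'@((?:[^@]|@@)*)@|(\S+)')

def split_fields(line):
    out = []
    for quoted, bare in FIELD.findall(line):
        out.append(bare if bare else quoted.replace('@@', '@'))
    return out

split_fields('@rv@ 2 @db.locks@ @has an @@ inside@ 44')
['rv', '2', 'db.locks', 'has an @ inside', '44']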
 

John Machin

Mark said:
What is the best way to process a text file of delimited strings?
I've got a file where strings are quoted with at-signs, @like this@.
At-signs in the string are represented as doubled @@.

What's the most efficient way to process this? Failing all
else I will split the string into characters and use an FSM,
but it seems that's not very pythonesque.

@rv@ 2 @db.locks@ @//depot/hello.txt@ @mh@ @mh@ 1 1 44
@pv@ 0 @db.changex@ 44 44 @mh@ @mh@ 1118875308 0 @ :mad:@: :mad:@@@: @
import csv
# lines = the two journal lines quoted above
list(csv.reader(lines, delimiter=' ', quotechar='@'))
[['rv', '2', 'db.locks', '//depot/hello.txt', 'mh', 'mh', '1', '1', '44'],
 ['pv', '0', 'db.changex', '44', '44', 'mh', 'mh', '1118875308', '0', ' :mad:: :mad:@: ']]
 

Mark Harrison

Paul McNett said:
Have you taken a look at the csv module yet? No guarantees, but it may
just work. You'd have to set delimiter to ' ' and quotechar to '@'. You
may need to manually handle the double-@ thing, but why don't you see
how close you can get with csv?

This is great! Everything works perfectly. Even the double-@ thing
is handled by the default quotechar handling.

Thanks again,
Mark
 

Leif K-Brooks

Mark said:
What is the best way to process a text file of delimited strings?
I've got a file where strings are quoted with at-signs, @like this@.
At-signs in the string are represented as doubled @@.
import re
_at_re = re.compile(r'(?<!@)@(?!@)')   # split on a lone @, but not on a doubled @@
def split_line(line):
    return [field.replace('@@', '@') for field in _at_re.split(line)]
split_line('foo@bar@@baz@qux')   # hypothetical input line
['foo', 'bar@baz', 'qux']
 

Paul McGuire

Mark -

Let me weigh in with a pyparsing entry to your puzzle. It won't be
blazingly fast, but at least it will give you another data point in
your comparison of approaches. Note that the parser can do the
string-to-int conversion for you during the parsing pass.

If @rv@ and @pv@ are record type markers, then you can use pyparsing to
create more of a parser than just a simple tokenizer, and parse out the
individual record fields into result attributes.

Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul

test1 = "@hello@@world@@foo@bar"
test2 = """@rv@ 2 @db.locks@ @//depot/hello.txt@ @mh@ @mh@ 1 1 44
@pv@ 0 @db.changex@ 44 44 @mh@ @mh@ 1118875308 0 @ :mad:@: :mad:@@@: @"""

from pyparsing import *

AT = Literal("@")
atQuotedString = AT.suppress() + Combine(OneOrMore((~AT + SkipTo(AT)) |
                     (AT + AT).setParseAction(replaceWith("@")))) + AT.suppress()

# extract any @-quoted strings
for test in (test1, test2):
    for toks, s, e in atQuotedString.scanString(test):
        print toks
    print

# parse all tokens (assume either a positive integer or @-quoted string)
def makeInt(s, l, toks):
    return int(toks[0])

entry = OneOrMore(Word(nums).setParseAction(makeInt) | atQuotedString)

for t in test2.split("\n"):
    print entry.parseString(t)

Prints out:

['hello@world@foo']

['rv']
['db.locks']
['//depot/hello.txt']
['mh']
['mh']
['pv']
['db.changex']
['mh']
['mh']
[':mad:: :mad:@: ']

['rv', 2, 'db.locks', '//depot/hello.txt', 'mh', 'mh', 1, 1, 44]
['pv', 0, 'db.changex', 44, 44, 'mh', 'mh', 1118875308, 0, ':mad:: :mad:@: ']
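
Following up on the point about parsing record fields into result
attributes: here is a rough, untested sketch of a record-level grammar for
just the @rv@ lines. The field names (version, table, depotFile, and so on)
are guesses, not anything from the Perforce journal spec:

from pyparsing import (Literal, Suppress, Combine, OneOrMore, SkipTo,
                       Word, nums, replaceWith)

AT = Literal("@")
atQuotedString = AT.suppress() + Combine(OneOrMore((~AT + SkipTo(AT)) |
                     (AT + AT).setParseAction(replaceWith("@")))) + AT.suppress()

intNum = Word(nums).setParseAction(lambda s, l, t: int(t[0]))

# one grammar per record type, keyed on the literal record marker
lockRecord = (Suppress("@rv@")
              + intNum.setResultsName("version")
              + atQuotedString.setResultsName("table")
              + atQuotedString.setResultsName("depotFile")
              + atQuotedString.setResultsName("owner")
              + atQuotedString.setResultsName("client")
              + intNum.setResultsName("flag1")
              + intNum.setResultsName("flag2")
              + intNum.setResultsName("change"))

rec = lockRecord.parseString(
    "@rv@ 2 @db.locks@ @//depot/hello.txt@ @mh@ @mh@ 1 1 44")
print rec.depotFile, rec.change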
 

Nicola Mingotti

Mark said:
What's the most efficient way to process this? Failing all
else I will split the string into characters and use an FSM,
but it seems that's not very pythonesque.

like this?

>>> s = "@hello@world@@foo@bar"
>>> s.split("@")
['', 'hello', 'world', '', 'foo', 'bar']
>>> s2 = "hello@world@@foo@bar"
>>> s2
'hello@world@@foo@bar'
>>> s2.split("@")
['hello', 'world', '', 'foo', 'bar']

bye
 

John Machin

Leif said:
Mark said:
What is the best way to process a text file of delimited strings?
I've got a file where strings are quoted with at-signs, @like this@.
At-signs in the string are represented as doubled @@.

return [field.replace('@@', '@') for field in _at_re.split(line)]

['foo', 'bar@baz', 'qux']

The plot according to the OP was that the @s were quotes, NOT delimiters.
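
For instance, on a hypothetical line where a quoted field contains a space
and a doubled @@, the two readings disagree:

>>> '@a b@ 1 @x@@y@'.split('@')
['', 'a b', ' 1 ', 'x', '', 'y', '']
>>> import csv
>>> list(csv.reader(['@a b@ 1 @x@@y@'], delimiter=' ', quotechar='@'))
[['a b', '1', 'x@y']]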
 
