On text processing

Daniel Nogradi · Mar 23, 2007

Hi list,

I'm in the process of rewriting a bash/awk/sed script -- that grew too
big -- in Python. I can rewrite it in a simple line-by-line way but
that results in ugly Python code and I'm sure there is a simple
pythonic way.

The bash script processed text files of the form:

###############################
key1 value1
key2 value2
key3 value3

key4 value4
spec11 spec12 spec13 spec14
spec21 spec22 spec23 spec24
spec31 spec32 spec33 spec34

key5 value5
key6 value6

key7 value7
more11 more12 more13
more21 more22 more23

key8 value8
###################################

I guess you get the point. If a line has two entries it is a key/value
pair which should end up in a dictionary. If a key/value pair is
followed by consecutive lines with more than two entries, it is a
matrix that should end up in a list of lists (matrix) that can be
identified by the key preceding it. The empty line after the last
line of a matrix signifies that the matrix is finished and we are back
to a key/value situation. Note that a matrix is always preceded by a
key/value pair so that it can really be identified by the key.

Any elegant solution for this?

bearophileHUGS · Mar 23, 2007

Daniel Nogradi:

Any elegant solution for this?

This is my first try:

ddata = {}

inside_matrix = False
for row in file("data.txt"):
    if row.strip():
        fields = row.split()
        if len(fields) == 2:
            inside_matrix = False
            ddata[fields[0]] = [fields[1]]
            lastkey = fields[0]
        else:
            if inside_matrix:
                ddata[lastkey][1].append(fields)
            else:
                ddata[lastkey].append([fields])
                inside_matrix = True

# This gives some output for testing only:
for k in sorted(ddata):
    print k, ddata[k]

Input file data.txt:

key1 value1
key2 value2
key3 value3

key4 value4
spec11 spec12 spec13 spec14
spec21 spec22 spec23 spec24
spec31 spec32 spec33 spec34

key5 value5
key6 value6

key7 value7
more11 more12 more13
more21 more22 more23

key8 value8

The output:

key1 ['value1']
key2 ['value2']
key3 ['value3']
key4 ['value4', [['spec11', 'spec12', 'spec13', 'spec14'], ['spec21', 'spec22', 'spec23', 'spec24'], ['spec31', 'spec32', 'spec33', 'spec34']]]
key5 ['value5']
key6 ['value6']
key7 ['value7', [['more11', 'more12', 'more13'], ['more21', 'more22', 'more23']]]
key8 ['value8']

If there are many simple keys, you can avoid creating a single-element
list for each of them, but then you have to tell the two cases apart
on the basis of the key (whereas now the presence of the second
element is enough to tell the two situations apart). You can also use
two different dicts to keep the two different kinds of data.
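
The first alternative above could look roughly like the following. This is only a sketch in Python 3 syntax, not code from the thread: parse_mixed is a hypothetical name, and the two cases are told apart here by checking the type of the stored value, which is just one possible way to do it. As in the loop above, any non-blank line that does not have exactly two fields is treated as a matrix row.

```python
# Python 3 sketch; parse_mixed is a hypothetical name, not from the thread.
def parse_mixed(lines):
    """Map simple keys to a plain string and matrix keys to [value, rows]."""
    data = {}
    lastkey = None
    for row in lines:
        fields = row.split()
        if not fields:
            continue  # blank lines only separate blocks
        if len(fields) == 2:
            data[fields[0]] = fields[1]  # plain string, no one-element list
            lastkey = fields[0]
        elif lastkey is None:
            raise ValueError("matrix row before any key/value pair")
        elif isinstance(data[lastkey], str):
            # first matrix row: upgrade the plain value to [value, rows]
            data[lastkey] = [data[lastkey], [fields]]
        else:
            data[lastkey][1].append(fields)
    return data


sample = """\
key1 value1

key4 value4
spec11 spec12 spec13
spec21 spec22 spec23

key5 value5
"""
data = parse_mixed(sample.splitlines())
for key in sorted(data):
    kind = "simple" if isinstance(data[key], str) else "matrix"
    print(key, kind, data[key])
```

The price of the more compact storage is the isinstance check at every place where the dict is read, which is the trade-off described above.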

Bye,
bearophile

Daniel Nogradi · Mar 23, 2007

Thanks very much, it's indeed quite simple. I was lost in the
itertools documentation.
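
For comparison, here is roughly what an itertools-based version might look like. It is a sketch in Python 3 syntax and was not posted in the thread; parse_blocks is a hypothetical name. It uses itertools.groupby to split the input into runs of non-blank lines, so a matrix can never continue past a blank line.

```python
from itertools import groupby


# Python 3 sketch; parse_blocks is a hypothetical name, not from the thread.
def parse_blocks(lines):
    """Return (keyvalues, matrices) parsed from an iterable of text lines."""
    keyvalues, matrices = {}, {}
    stripped = (line.strip() for line in lines)
    # groupby yields alternating runs of blank (False) and non-blank (True) lines
    for nonblank, block in groupby(stripped, key=bool):
        if not nonblank:
            continue
        lastkey = None  # a matrix must follow a key inside the same block
        for fields in (line.split() for line in block):
            if len(fields) == 2:
                keyvalues[fields[0]] = fields[1]
                lastkey = fields[0]
            elif lastkey is not None:
                matrices.setdefault(lastkey, []).append(fields)
            else:
                raise ValueError("matrix row without a preceding key: %r" % fields)
    return keyvalues, matrices


text = """\
key1 value1
key2 value2

key4 value4
spec11 spec12 spec13
spec21 spec22 spec23

key8 value8
"""
keyvalues, matrices = parse_blocks(text.splitlines())
print(keyvalues)
print(matrices)
```

For input this small, groupby does not buy much over a plain loop with a flag; its main effect is to make the blank-line rule explicit.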

Paddy · Mar 24, 2007

My solution expects correctly formatted input and parses it into
separate key/value and matrix holding dicts:

from StringIO import StringIO

fileText = '''\
key1 value1
key2 value2
key3 value3

key4 value4
spec11 spec12 spec13 spec14
spec21 spec22 spec23 spec24
spec31 spec32 spec33 spec34

key5 value5
key6 value6

key7 value7
more11 more12 more13
more21 more22 more23

key8 value8
'''
infile = StringIO(fileText)

keyvalues = {}
matrices = {}
for line in infile:
    fields = line.strip().split()
    if len(fields) == 2:
        keyvalues[fields[0]] = fields[1]
        lastkey = fields[0]
    elif fields:
        matrices.setdefault(lastkey, []).append(fields)

==============
Here is the sample output:
{'key1': 'value1',
 'key2': 'value2',
 'key3': 'value3',
 'key4': 'value4',
 'key5': 'value5',
 'key6': 'value6',
 'key7': 'value7',
 'key8': 'value8'}
{'key4': [['spec11', 'spec12', 'spec13', 'spec14'],
          ['spec21', 'spec22', 'spec23', 'spec24'],
          ['spec31', 'spec32', 'spec33', 'spec34']],
 'key7': [['more11', 'more12', 'more13'], ['more21', 'more22', 'more23']]}
- Paddy.
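
Since the loop above expects correctly formatted input, a small check on the resulting matrices dict can catch some malformed files. The sketch below is in Python 3 syntax and is not from the thread: check_matrices is a hypothetical helper, and the requirement that all rows of a matrix have the same length is an assumption about the data.

```python
# Python 3 sketch; check_matrices is a hypothetical helper, not from the thread.
def check_matrices(matrices):
    """Raise ValueError if any matrix has rows of differing length.

    Assumes (this is not stated in the thread) that matrices are rectangular.
    """
    for key, rows in matrices.items():
        widths = {len(row) for row in rows}
        if len(widths) > 1:
            raise ValueError(
                "matrix %r has ragged rows: widths %s" % (key, sorted(widths))
            )


good = {"key7": [["more11", "more12", "more13"],
                 ["more21", "more22", "more23"]]}
check_matrices(good)  # returns without raising

bad = {"key4": [["spec11", "spec12", "spec13", "spec14"],
                ["spec21", "spec22", "spec23"]]}
try:
    check_matrices(bad)
except ValueError as exc:
    print(exc)
```

Note that a matrix row with exactly two fields would never reach this check, because the parsing loop treats every two-field line as a new key/value pair.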

Paul McGuire · Mar 24, 2007

Is a parser overkill? Here's how you might use pyparsing for this
problem.

I just wanted to show that pyparsing's returned results can be
structured as more than just lists of tokens. Using pyparsing's Dict
class (or the dictOf helper that simplifies using Dict), you can
return results that can be accessed like a nested list, like a dict,
or like an instance with named attributes (see the last line of the
example).

You can adjust the syntax definition of keys and values to fit your
actual data, for instance, if the matrices are actually integers, then
define the matrixRow as:

matrixRow = Group( OneOrMore( Word(nums) ) ) + eol

-- Paul

from pyparsing import ParserElement, LineEnd, Word, alphas, alphanums, \
    Group, ZeroOrMore, OneOrMore, Optional, dictOf

data = """key1 value1
key2 value2
key3 value3

key4 value4
spec11 spec12 spec13 spec14
spec21 spec22 spec23 spec24
spec31 spec32 spec33 spec34

key5 value5
key6 value6

key7 value7
more11 more12 more13
more21 more22 more23

key8 value8
"""

# retain significant newlines (pyparsing reads over whitespace by default)
ParserElement.setDefaultWhitespaceChars(" \t")

eol = LineEnd().suppress()
elem = Word(alphas,alphanums)
key = elem
matrixRow = Group( elem + elem + OneOrMore(elem) ) + eol
matrix = Group( OneOrMore( matrixRow ) ) + eol
value = elem + eol + Optional( matrix ) + ZeroOrMore(eol)
parser = dictOf(key, value)

# parse the data
results = parser.parseString(data)

# access the results
# - like a dict
# - like a list
# - like an instance with keys for attributes
print results.keys()
print

for k in sorted(results.keys()):
    print k,
    if isinstance( results[k], basestring ):
        print results[k]
    else:
        print results[k][0]
        for row in results[k][1]:
            print " "," ".join(row)
print

print results.key3

Prints out:
['key8', 'key3', 'key2', 'key1', 'key7', 'key6', 'key5', 'key4']

key1 value1
key2 value2
key3 value3
key4 value4
  spec11 spec12 spec13 spec14
  spec21 spec22 spec23 spec24
  spec31 spec32 spec33 spec34
key5 value5
key6 value6
key7 value7
  more11 more12 more13
  more21 more22 more23
key8 value8

value3

Daniel Nogradi · Mar 24, 2007

Paddy, thanks, this looks even better.
Paul, pyparsing looks like overkill; even the config parser module
is too complex for me for such a simple task. The text files are
actually input files to a program and will never be longer than
20-30 lines, so Paddy's solution is perfectly fine. In any case it's
good to know that there is a module called pyparsing.
