On text processing


Daniel Nogradi

Hi list,

I'm in the process of rewriting a bash/awk/sed script -- that grew too
big -- in python. I can rewrite it in a simple line-by-line way but
that results in ugly python code and I'm sure there is a simple
pythonic way.

The bash script processed text files of the form:

###############################
key1 value1
key2 value2
key3 value3

key4 value4
spec11 spec12 spec13 spec14
spec21 spec22 spec23 spec24
spec31 spec32 spec33 spec34

key5 value5
key6 value6

key7 value7
more11 more12 more13
more21 more22 more23

key8 value8
###################################

I guess you get the point. If a line has two entries it is a key/value
pair which should end up in a dictionary. If a key/value pair is
followed by consecutive lines with more than two entries, it is a
matrix that should end up in a list of lists (matrix) that can be
identified by the key preceding it. The empty line after the last
line of a matrix signifies that the matrix is finished and we are back
to a key/value situation. Note that a matrix is always preceded by a
key/value pair so that it can really be identified by the key.

Any elegant solution for this?
 

bearophileHUGS

Daniel Nogradi:
Any elegant solution for this?

This is my first try:

ddata = {}

inside_matrix = False
for row in open("data.txt"):
    if row.strip():
        fields = row.split()
        if len(fields) == 2:
            # key/value line: start a new entry
            inside_matrix = False
            ddata[fields[0]] = [fields[1]]
            lastkey = fields[0]
        else:
            if inside_matrix:
                # further matrix row for the last key
                ddata[lastkey][1].append(fields)
            else:
                # first matrix row: add the matrix as a second element
                ddata[lastkey].append([fields])
                inside_matrix = True

# This gives some output for testing only:
for k in sorted(ddata):
    print k, ddata[k]


Input file data.txt:

key1 value1
key2 value2
key3 value3

key4 value4
spec11 spec12 spec13 spec14
spec21 spec22 spec23 spec24
spec31 spec32 spec33 spec34

key5 value5
key6 value6

key7 value7
more11 more12 more13
more21 more22 more23

key8 value8


The output:

key1 ['value1']
key2 ['value2']
key3 ['value3']
key4 ['value4', [['spec11', 'spec12', 'spec13', 'spec14'], ['spec21',
'spec22', 'spec23', 'spec24'], ['spec31', 'spec32', 'spec33',
'spec34']]]
key5 ['value5']
key6 ['value6']
key7 ['value7', [['more11', 'more12', 'more13'], ['more21', 'more22',
'more23']]]
key8 ['value8']


If there are many simple keys, you can avoid creating a single-element
list for each of them, but then you have to tell the two cases apart
on the basis of the key (whereas now the presence of a second element
distinguishes the two situations). You can also use two different
dicts to keep the two kinds of data.
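
A minimal sketch of the first variant, for illustration only. It is written for Python 3, unlike the code above, and the function name parse_pairs is made up here. Simple keys map to a bare string, and a key is promoted to [value, matrix] only when matrix rows follow it. Here the two cases are told apart by the type of the stored value rather than by the key. It also assumes that any non-blank line without exactly two fields is a matrix row.

```python
def parse_pairs(lines):
    """Map each key to a plain string, or to [value, matrix] when rows follow."""
    data = {}
    lastkey = None
    for row in lines:
        fields = row.split()
        if not fields:
            continue  # blank lines only separate blocks
        if len(fields) == 2:
            # key/value line: store the bare value
            lastkey = fields[0]
            data[lastkey] = fields[1]
        elif lastkey is None:
            raise ValueError("matrix row before any key/value pair")
        elif isinstance(data[lastkey], str):
            # first matrix row: promote the bare value to [value, matrix]
            data[lastkey] = [data[lastkey], [fields]]
        else:
            # further matrix row for the same key
            data[lastkey][1].append(fields)
    return data


sample = """\
key1 value1

key4 value4
spec11 spec12 spec13
spec21 spec22 spec23

key5 value5
"""
result = parse_pairs(sample.splitlines())
```

A caller can then check isinstance(result[key], str) to see whether a key carries a matrix.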

Bye,
bearophile
 

Daniel Nogradi

Thanks very much, it's indeed quite simple. I was lost in the
itertools documentation :)
 

Paddy

My solution expects correctly formatted input and parses it into
separate key/value and matrix holding dicts:


from StringIO import StringIO

fileText = '''\
key1 value1
key2 value2
key3 value3

key4 value4
spec11 spec12 spec13 spec14
spec21 spec22 spec23 spec24
spec31 spec32 spec33 spec34

key5 value5
key6 value6

key7 value7
more11 more12 more13
more21 more22 more23

key8 value8
'''
infile = StringIO(fileText)

keyvalues = {}
matrices = {}
for line in infile:
    fields = line.strip().split()
    if len(fields) == 2:
        keyvalues[fields[0]] = fields[1]
        lastkey = fields[0]
    elif fields:
        matrices.setdefault(lastkey, []).append(fields)

==============
Here is the sample output:
{'key1': 'value1',
'key2': 'value2',
'key3': 'value3',
'key4': 'value4',
'key5': 'value5',
'key6': 'value6',
'key7': 'value7',
'key8': 'value8'}
{'key4': [['spec11', 'spec12', 'spec13', 'spec14'],
['spec21', 'spec22', 'spec23', 'spec24'],
['spec31', 'spec32', 'spec33', 'spec34']],
'key7': [['more11', 'more12', 'more13'], ['more21', 'more22',
'more23']]}
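
Since the loop above expects correctly formatted input, here is a hedged sketch of how the same two-dict approach might be wrapped in a function that complains about malformed lines. It targets Python 3 (so StringIO comes from io), and the name parse_config and the specific checks are assumptions for illustration, not part of the original solution. Unlike the loop above, it treats a blank line as ending the current matrix, as the format description says.

```python
from io import StringIO


def parse_config(infile):
    """Return (keyvalues, matrices) parsed from an iterable of lines."""
    keyvalues = {}
    matrices = {}
    lastkey = None  # key that may own the matrix rows that follow
    for lineno, line in enumerate(infile, 1):
        fields = line.split()
        if not fields:
            lastkey = None  # a blank line ends any matrix in progress
        elif len(fields) == 1:
            raise ValueError("line %d: expected at least two entries" % lineno)
        elif len(fields) == 2:
            keyvalues[fields[0]] = fields[1]
            lastkey = fields[0]
        elif lastkey is None:
            raise ValueError("line %d: matrix row without a preceding key" % lineno)
        else:
            matrices.setdefault(lastkey, []).append(fields)
    return keyvalues, matrices


text = "key1 value1\n\nkey4 value4\na b c\nd e f\n\nkey5 value5\n"
keyvalues, matrices = parse_config(StringIO(text))
```

Because the function takes any iterable of lines, an open file object can be passed in place of the StringIO.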
- Paddy.
 

Paul McGuire

Is a parser overkill? Here's how you might use pyparsing for this
problem.

I just wanted to show that pyparsing's returned results can be
structured as more than just lists of tokens. Using pyparsing's Dict
class (or the dictOf helper that simplifies using Dict), you can
return results that can be accessed like a nested list, like a dict,
or like an instance with named attributes (see the last line of the
example).

You can adjust the syntax definition of keys and values to fit your
actual data, for instance, if the matrices are actually integers, then
define the matrixRow as:

matrixRow = Group( OneOrMore( Word(nums) ) ) + eol


-- Paul


from pyparsing import ParserElement, LineEnd, Word, alphas, alphanums, \
    Group, ZeroOrMore, OneOrMore, Optional, dictOf

data = """key1 value1
key2 value2
key3 value3


key4 value4
spec11 spec12 spec13 spec14
spec21 spec22 spec23 spec24
spec31 spec32 spec33 spec34


key5 value5
key6 value6


key7 value7
more11 more12 more13
more21 more22 more23


key8 value8
"""

# retain significant newlines (pyparsing reads over whitespace by default)
ParserElement.setDefaultWhitespaceChars(" \t")

eol = LineEnd().suppress()
elem = Word(alphas,alphanums)
key = elem
matrixRow = Group( elem + elem + OneOrMore(elem) ) + eol
matrix = Group( OneOrMore( matrixRow ) ) + eol
value = elem + eol + Optional( matrix ) + ZeroOrMore(eol)
parser = dictOf(key, value)

# parse the data
results = parser.parseString(data)

# access the results
# - like a dict
# - like a list
# - like an instance with keys for attributes
print results.keys()
print

for k in sorted(results.keys()):
    print k,
    if isinstance( results[k], basestring ):
        print results[k]
    else:
        print results[k][0]
        for row in results[k][1]:
            print " "," ".join(row)
print

print results.key3


Prints out:
['key8', 'key3', 'key2', 'key1', 'key7', 'key6', 'key5', 'key4']

key1 value1
key2 value2
key3 value3
key4 value4
spec11 spec12 spec13 spec14
spec21 spec22 spec23 spec24
spec31 spec32 spec33 spec34
key5 value5
key6 value6
key7 value7
more11 more12 more13
more21 more22 more23
key8 value8

value3
 

Daniel Nogradi

Paddy, thanks, this looks even better.
Paul, pyparsing looks like overkill; even the config parser module
is too complex for me for such a simple task. The
text files are actually input files to a program and will never be
longer than 20-30 lines so Paddy's solution is perfectly fine. In any
case it's good to know that there exists a module called pyparsing :)
 
