On text processing

Discussion in 'Python' started by Daniel Nogradi, Mar 23, 2007.

  1. Hi list,

    I'm in the process of rewriting a bash/awk/sed script -- that grew too
    big -- in python. I can rewrite it in a simple line-by-line way but
    that results in ugly python code and I'm sure there is a simple
    pythonic way.

    The bash script processed text files of the form:

    ###############################
    key1 value1
    key2 value2
    key3 value3

    key4 value4
    spec11 spec12 spec13 spec14
    spec21 spec22 spec23 spec24
    spec31 spec32 spec33 spec34

    key5 value5
    key6 value6

    key7 value7
    more11 more12 more13
    more21 more22 more23

    key8 value8
    ###################################

    I guess you get the point. If a line has two entries it is a key/value
    pair which should end up in a dictionary. If a key/value pair is
    followed by consecutive lines with more than two entries, it is a
    matrix that should end up in a list of lists (matrix) that can be
    identified by the key preceding it. The empty line after the last
    line of a matrix signifies that the matrix is finished and we are back
    to a key/value situation. Note that a matrix is always preceded by a
    key/value pair so that it can really be identified by the key.

    Any elegant solution for this?
     
    Daniel Nogradi, Mar 23, 2007
    #1

  2. bearophile

    Daniel Nogradi:
    > Any elegant solution for this?


    This is my first try:

    ddata = {}

    inside_matrix = False
    for row in file("data.txt"):
        if row.strip():
            fields = row.split()
            if len(fields) == 2:
                inside_matrix = False
                ddata[fields[0]] = [fields[1]]
                lastkey = fields[0]
            else:
                if inside_matrix:
                    ddata[lastkey][1].append(fields)
                else:
                    ddata[lastkey].append([fields])
                    inside_matrix = True

    # This gives some output for testing only:
    for k in sorted(ddata):
        print k, ddata[k]


    Input file data.txt:

    key1 value1
    key2 value2
    key3 value3

    key4 value4
    spec11 spec12 spec13 spec14
    spec21 spec22 spec23 spec24
    spec31 spec32 spec33 spec34

    key5 value5
    key6 value6

    key7 value7
    more11 more12 more13
    more21 more22 more23

    key8 value8


    The output:

    key1 ['value1']
    key2 ['value2']
    key3 ['value3']
    key4 ['value4', [['spec11', 'spec12', 'spec13', 'spec14'], ['spec21',
    'spec22', 'spec23', 'spec24'], ['spec31', 'spec32', 'spec33',
    'spec34']]]
    key5 ['value5']
    key6 ['value6']
    key7 ['value7', [['more11', 'more12', 'more13'], ['more21', 'more22',
    'more23']]]
    key8 ['value8']


    If there are many simple keys, then you can avoid creating a
    single-element list for them, but then you have to tell the two cases
    apart on the basis of the key (whereas now the presence of a second
    element tells the two situations apart). You can also use two
    different dicts to keep the two different kinds of data.

    Bye,
    bearophile
     
    bearophile, Mar 23, 2007
    #2
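bearophile's closing suggestion (plain values for simple keys, with the matrix attached only where one exists) might look roughly like the Python 3 sketch below. The function name `parse` and the `(value, rows)` tuple layout are assumptions made for illustration, not part of the original post, and the input is assumed to be well formed (a matrix row never appears before the first key).

```python
def parse(lines):
    """Map simple keys to a string and matrix keys to a (value, rows) tuple."""
    data = {}
    lastkey = None
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            # Plain key/value pair: store the bare string, no one-element list.
            lastkey = fields[0]
            data[lastkey] = fields[1]
        elif fields:
            # First matrix row for this key: promote the string to (value, rows).
            if isinstance(data[lastkey], str):
                data[lastkey] = (data[lastkey], [])
            data[lastkey][1].append(fields)
    return data


sample = """\
key1 value1

key4 value4
spec11 spec12 spec13
spec21 spec22 spec23

key5 value5
"""

data = parse(sample.splitlines())
for key in sorted(data):
    print(key, data[key])
```

The two cases are then told apart by type (`str` versus `tuple`) instead of by the length of a list, which is the trade-off the post describes.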

  3. > This is my first try:
    > [...]


    Thanks very much, it's indeed quite simple. I was lost in the
    itertools documentation :)
     
    Daniel Nogradi, Mar 23, 2007
    #3
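Since itertools came up: one way it could be applied here is `itertools.groupby`, which splits the file into runs of blank and non-blank lines so that each blank-line-separated block is handled on its own. This is only a Python 3 sketch; `parse_blocks` and the two-dict return value are names chosen for illustration.

```python
from itertools import groupby


def parse_blocks(text):
    """Parse blank-line-separated blocks into (keyvalues, matrices)."""
    keyvalues, matrices = {}, {}
    # groupby yields consecutive runs of lines that are all blank or all non-blank.
    for is_blank, run in groupby(text.splitlines(), key=lambda s: not s.strip()):
        if is_blank:
            continue
        lastkey = None  # a matrix can only belong to a key in the same block
        for fields in (line.split() for line in run):
            if len(fields) == 2:
                lastkey = fields[0]
                keyvalues[lastkey] = fields[1]
            else:
                # Rows in a block without any key/value line end up under None.
                matrices.setdefault(lastkey, []).append(fields)
    return keyvalues, matrices


sample = "a 1\nb 2\nr1 r2 r3\nr4 r5 r6\n\nc 3\n"
keyvalues, matrices = parse_blocks(sample)
print(keyvalues)
print(matrices)
```

For files this small the plain loop is arguably easier to read; the grouping mainly makes the "blank line ends the matrix" rule explicit.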
  4. Paddy

    On Mar 23, 10:30 pm, "Daniel Nogradi" <> wrote:
    > Hi list,
    > [...]
    > Any elegant solution for this?



    My solution expects correctly formatted input and parses it into
    separate key/value and matrix holding dicts:


    from StringIO import StringIO

    fileText = '''\
    key1 value1
    key2 value2
    key3 value3

    key4 value4
    spec11 spec12 spec13 spec14
    spec21 spec22 spec23 spec24
    spec31 spec32 spec33 spec34

    key5 value5
    key6 value6

    key7 value7
    more11 more12 more13
    more21 more22 more23

    key8 value8
    '''
    infile = StringIO(fileText)

    keyvalues = {}
    matrices = {}
    for line in infile:
        fields = line.strip().split()
        if len(fields) == 2:
            keyvalues[fields[0]] = fields[1]
            lastkey = fields[0]
        elif fields:
            matrices.setdefault(lastkey, []).append(fields)

    ==============
    Here is the sample output:

    >>> from pprint import pprint as pp
    >>> pp(keyvalues)
    {'key1': 'value1',
     'key2': 'value2',
     'key3': 'value3',
     'key4': 'value4',
     'key5': 'value5',
     'key6': 'value6',
     'key7': 'value7',
     'key8': 'value8'}
    >>> pp(matrices)
    {'key4': [['spec11', 'spec12', 'spec13', 'spec14'],
              ['spec21', 'spec22', 'spec23', 'spec24'],
              ['spec31', 'spec32', 'spec33', 'spec34']],
     'key7': [['more11', 'more12', 'more13'], ['more21', 'more22', 'more23']]}
    >>>


    - Paddy.
     
    Paddy, Mar 24, 2007
    #4
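Paddy's loop expects correctly formatted input. If that cannot be guaranteed, a stricter variant could reject rows that do not fit the format described in the first post. The Python 3 sketch below is one possible reading of those rules; `parse_strict` and the error message are made up for illustration.

```python
def parse_strict(lines):
    """Like the two-dict approach, but raise ValueError on malformed input."""
    keyvalues, matrices = {}, {}
    lastkey = None  # key that may own the matrix rows that follow it
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            lastkey = None  # a blank line ends any matrix in progress
        elif len(fields) == 2:
            lastkey = fields[0]
            keyvalues[lastkey] = fields[1]
        elif len(fields) == 1 or lastkey is None:
            # A one-field line, or a matrix row with no key directly before it.
            raise ValueError("line %d: unexpected row %r" % (lineno, line))
        else:
            matrices.setdefault(lastkey, []).append(fields)
    return keyvalues, matrices


sample = "key1 value1\nkey2 value2\n\nkey4 value4\ns11 s12 s13\ns21 s22 s23\n"
keyvalues, matrices = parse_strict(sample.splitlines())
print(keyvalues)
print(matrices)
```

Resetting `lastkey` on a blank line means a matrix separated from its key by an empty line is reported as an error instead of being silently attached to the previous key.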
  5. Paul McGuire

    On Mar 23, 5:30 pm, "Daniel Nogradi" <> wrote:
    > Hi list,
    >
    > I'm in the process of rewriting a bash/awk/sed script -- that grew too
    > big -- in python. I can rewrite it in a simple line-by-line way but
    > that results in ugly python code and I'm sure there is a simple
    > pythonic way.
    >
    > The bash script processed text files of the form...
    >
    > Any elegant solution for this?


    Is a parser overkill? Here's how you might use pyparsing for this
    problem.

    I just wanted to show that pyparsing's returned results can be
    structured as more than just lists of tokens. Using pyparsing's Dict
    class (or the dictOf helper that simplifies using Dict), you can
    return results that can be accessed like a nested list, like a dict,
    or like an instance with named attributes (see the last line of the
    example).

    You can adjust the syntax definition of keys and values to fit your
    actual data, for instance, if the matrices are actually integers, then
    define the matrixRow as:

    matrixRow = Group( OneOrMore( Word(nums) ) ) + eol


    -- Paul


    from pyparsing import ParserElement, LineEnd, Word, alphas, alphanums, \
        Group, ZeroOrMore, OneOrMore, Optional, dictOf

    data = """key1 value1
    key2 value2
    key3 value3


    key4 value4
    spec11 spec12 spec13 spec14
    spec21 spec22 spec23 spec24
    spec31 spec32 spec33 spec34


    key5 value5
    key6 value6


    key7 value7
    more11 more12 more13
    more21 more22 more23


    key8 value8
    """

    # retain significant newlines (pyparsing reads over whitespace by
    # default)
    ParserElement.setDefaultWhitespaceChars(" \t")

    eol = LineEnd().suppress()
    elem = Word(alphas,alphanums)
    key = elem
    matrixRow = Group( elem + elem + OneOrMore(elem) ) + eol
    matrix = Group( OneOrMore( matrixRow ) ) + eol
    value = elem + eol + Optional( matrix ) + ZeroOrMore(eol)
    parser = dictOf(key, value)

    # parse the data
    results = parser.parseString(data)

    # access the results
    # - like a dict
    # - like a list
    # - like an instance with keys for attributes
    print results.keys()
    print

    for k in sorted(results.keys()):
        print k,
        if isinstance( results[k], basestring ):
            print results[k]
        else:
            print results[k][0]
            for row in results[k][1]:
                print " "," ".join(row)
    print

    print results.key3


    Prints out:
    ['key8', 'key3', 'key2', 'key1', 'key7', 'key6', 'key5', 'key4']

    key1 value1
    key2 value2
    key3 value3
    key4 value4
      spec11 spec12 spec13 spec14
      spec21 spec22 spec23 spec24
      spec31 spec32 spec33 spec34
    key5 value5
    key6 value6
    key7 value7
      more11 more12 more13
      more21 more22 more23
    key8 value8

    value3
     
    Paul McGuire, Mar 24, 2007
    #5
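On the integer case mentioned in Paul's post: `Word(nums)` restricts what a matrix cell may contain, but the matched tokens are still strings unless a conversion is added. Whichever parser produced the rows, the conversion itself can be done afterwards in plain Python 3, as in this sketch (the helper name `to_int_matrix` and the sample data are invented for illustration).

```python
def to_int_matrix(rows):
    """Convert a list of lists of digit strings into a list of lists of ints."""
    return [[int(cell) for cell in row] for row in rows]


# Hypothetical parsed result: one matrix whose cells are all digit strings.
matrices = {"key4": [["1", "2", "3"], ["4", "5", "6"]]}
numeric = {key: to_int_matrix(rows) for key, rows in matrices.items()}
print(numeric)
```

`int()` raises `ValueError` on a non-numeric cell, so this also acts as a check that the matrix really is all integers.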
  6. > My solution expects correctly formatted input and parses it into
    > separate key/value and matrix holding dicts:
    > [...]


    Paddy, thanks, this looks even better.
    Paul, pyparsing looks like overkill; even the config parser module
    is too complex for me for such a simple task. The
    text files are actually input files to a program and will never be
    longer than 20-30 lines so Paddy's solution is perfectly fine. In any
    case it's good to know that there exists a module called pyparsing :)
     
    Daniel Nogradi, Mar 24, 2007
    #6