Help needed with nested parsing of file into objects

Discussion in 'Python' started by richard, Jun 4, 2012.

  1. richard

    richard Guest

    Hi guys, I am having a bit of difficulty finding the best approach /
    solution to parsing a file into a list of objects / nested objects. Any
    help would be greatly appreciated.

    #file format to parse .txt
    Code (Text):

    An instance of TestArray
    a=a
    b=b
    c=c
    List of 2 A elements:
    Instance of A element
    a=1
    b=2
    c=3
    Instance of A element
    d=1
    e=2
    f=3
    List of 1 B elements
    Instance of B element
    a=1
    b=2
    c=3
    List of 2 C elements
    Instance of C element
    a=1
    b=2
    c=3
    Instance of C element
    a=1
    b=2
    c=3

    An instance of TestArray
    a=1
    b=2
    c=3
     
    Expected output:
    A list of 2 TestArray objects as the parents. The first has an
    attribute holding a list of its 2 child "Instance of A" objects, and
    another attribute holding a list of the single child "Instance of B"
    object; that B object in turn has an attribute holding a list of the
    2 "Instance of C" objects.
    The nesting could be deeper; this is just an example. An instance of
    TestArray may or may not have any nesting at all, as the second
    TestArray illustrates. Basically I just want to create a list of
    objects which may or may not contain more nested objects as
    attributes, but I need a generic way to do it that works for any depth.

    #end list of objects with objects printed as dicts

    Code (Text):

    parsed = [
        {
            "a":"a",
            "b":"b",
            "c":"c",
            "A_elements":[
                {
                    "a":1,
                    "b":2,
                    "c":3
                },
                {
                    "a":1,
                    "b":2,
                    "c":3
                }
            ],
            "B_elements":[
                {
                    "a":1,
                    "b":2,
                    "c":3,
                    "C_elements":[
                        {
                            "a":1,
                            "b":2,
                            "c":3
                        },
                        {
                            "a":1,
                            "b":2,
                            "c":3
                        }
                    ]
                }
            ]
        },

        {
            "a":"1",
            "b":"2",
            "c":"3",
        }

    ]

     
    #This is what I have so far. It works with the 2nd instance, but I
    can't figure out the best way to handle the multi-nested objects.

    Code (Text):

    import re
    def test_parser(filename):
        parent_stanza = None
        stanzas = []

        class parentStanza:
            pass

        fo = open(filename)

        for line in fo:
            line = line.strip()
            if re.search("An instance of TestArray", line):
                if parent_stanza:
                    stanzas.append(parent_stanza)
                parent_stanza = parentStanza()
            if parent_stanza and "=" in line:
                attr, val = line.split("=")
                setattr(parent_stanza, attr, val)
        else:
            # for/else: runs once the file is exhausted, stores the last stanza
            stanzas.append(parent_stanza)
        return stanzas

    stanzas = test_parser("test.txt")

    import pprint
    for stanza in stanzas:
        pprint.pprint(stanza.__dict__)
        n=raw_input("paused")
     
     
    richard, Jun 4, 2012
    #1

  2. Roy Smith

    Roy Smith Guest

    The first question is "Why do you want to do this?" Is this some
    pre-existing file format imposed by an external system that you can't
    change? Or are you just looking for a generic way to store nested
    structures in a file?

    If the latter, then I would strongly suggest not rolling your own. Take
    a look at json or pickle (or even xml) and adopt one of those.
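
    To illustrate the json suggestion, here is a minimal sketch of a round
    trip through the standard library's json module. The data is made up
    for the example and only mimics the shape of the expected output in
    the first post.

```python
import json

# Hypothetical nested data, shaped like the expected output in the first post.
records = [
    {
        "a": "a",
        "A_elements": [{"a": 1, "b": 2}, {"d": 1}],
        "B_elements": [{"a": 1, "C_elements": [{"a": 1}, {"a": 2}]}],
    },
    {"a": "1", "b": "2", "c": "3"},
]

text = json.dumps(records, indent=2)  # nested lists/dicts serialize at any depth
restored = json.loads(text)           # and load back as plain lists and dicts
```

    Nothing here depends on the depth of the nesting, which is the main
    attraction over a hand-written format.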
     
    Roy Smith, Jun 4, 2012
    #2

  3.  
    Alain Ketterlin, Jun 4, 2012
    #3
  4. richard

    richard Guest

     
    richard, Jun 4, 2012
    #4
  5. Eelco

    Eelco Guest

    thank you both for your replies. Unfortunately it is a pre-existing
    Hi Richard,

    Despite the fact that it is a preexisting format, it is very close
    indeed to valid YAML code.

    Writing your own whitespace-aware parser can be a bit of a pain, but
    since YAML does this for you, I would argue the cleanest solution
    would be to bootstrap that functionality, rather than roll your own
    solution, or to resort to hard to maintain regex voodoo.

    Here is my solution. As a bonus, it directly constructs a custom
    object hierarchy (obviously you would want to expand on this, but the
    essentials are there). One caveat: at the moment, the conversion to
    YAML relies on the apparent convention that instances never directly
    contain other instances, and lists never directly contain lists. This
    means all instances are list entries and get a '-' prepended, and this
    just works. If this is not a general rule, you'd have to keep track of
    an enclosing scope stack and emit dashes based on that. Anyway, the
    idea is there, and I believe it to be one worth looking at.

    <code>
    import yaml

    class A(yaml.YAMLObject):
        yaml_tag = u'!A'
        def __init__(self, **kwargs):
            self.__dict__.update(kwargs)
        def __repr__(self):
            return 'A' + str(self.__dict__)

    class B(yaml.YAMLObject):
        yaml_tag = u'!B'
        def __init__(self, **kwargs):
            self.__dict__.update(kwargs)
        def __repr__(self):
            return 'B' + str(self.__dict__)

    class C(yaml.YAMLObject):
        yaml_tag = u'!C'
        def __init__(self, **kwargs):
            self.__dict__.update(kwargs)
        def __repr__(self):
            return 'C' + str(self.__dict__)

    class TestArray(yaml.YAMLObject):
        yaml_tag = u'!TestArray'
        def __init__(self, **kwargs):
            self.__dict__.update(kwargs)
        def __repr__(self):
            return 'TestArray' + str(self.__dict__)

    class myList(yaml.YAMLObject):
        yaml_tag = u'!myList'
        def __init__(self, **kwargs):
            self.__dict__.update(kwargs)
        def __repr__(self):
            return 'myList' + str(self.__dict__)


    data = \
    """
    An instance of TestArray
    a=a
    b=b
    c=c
    List of 2 A elements:
    Instance of A element
    a=1
    b=2
    c=3
    Instance of A element
    d=1
    e=2
    f=3
    List of 1 B elements
    Instance of B element
    a=1
    b=2
    c=3
    List of 2 C elements
    Instance of C element
    a=1
    b=2
    c=3
    Instance of C element
    a=1
    b=2
    c=3
    An instance of TestArray
    a=1
    b=2
    c=3
    """.strip()

    #remove trailing whitespace and seemingly erroneous colon in line 5
    lines = [' '+line.rstrip().rstrip(':') for line in data.split('\n')]


    def transform(lines):
        """transform text line by line"""
        for line in lines:
            #regular mapping lines
            if line.find('=') > 0:
                yield line.replace('=', ': ')
            #instance lines
            p = line.find('nstance of')
            if p > 0:
                s = p + 11
                e = line[s:].find(' ')
                if e == -1: e = len(line[s:])
                tag = line[s:s+e]
                whitespace= line.partition(line.lstrip())[0]
                yield whitespace[:-2]+' -'+ ' !'+tag
            #list lines
            p = line.find('List of')
            if p > 0:
                whitespace= line.partition(line.lstrip())[0]
                yield whitespace[:-2]+' '+ 'myList:'

    ##transformed = (transform( lines))
    ##for i,t in enumerate(transformed):
    ##    print('{:>3}{}'.format(i,t))

    transformed = '\n'.join(transform( lines))
    print(transformed)

    #recent PyYAML releases require an explicit Loader argument
    res = yaml.load(transformed, Loader=yaml.Loader)
    print(res)
    print(yaml.dump(res))
    </code>
     
    Eelco, Jun 5, 2012
    #5
  6. richard

    richard Guest

     
    richard, Jun 5, 2012
    #6
  7. richard

    richard Guest

    Hi Eelco, many thanks for the reply / solution. It definitely looks like
    a clean way to go about it. However, installing 3rd party libs like
    yaml on the server is, I think, not on the cards at the moment.
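
    For a standard-library-only route, the following is a minimal sketch
    of an indentation-driven parser. It assumes that the real file marks
    nesting with leading whitespace (deeper means more indented); the
    indentation of the sample in this thread has not survived, so the
    sample below is a guess at the layout. The names parse_nested,
    INSTANCE and LIST are made up for the example, and instances are
    returned as plain dicts rather than objects.

```python
import re

INSTANCE = re.compile(r"(?:An i|I)nstance of (\w+)")
LIST = re.compile(r"List of \d+ (\w+) elements")

def parse_nested(text):
    """Parse indented 'Instance of' / 'List of' / key=value lines into nested dicts."""
    result = []  # top-level instances
    stack = []   # (indent, container) pairs for the lists/instances still open
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        indent = len(raw) - len(raw.lstrip())
        # Close every container that is not an ancestor of this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1] if stack else result
        list_match = LIST.match(line)
        if INSTANCE.match(line) and isinstance(parent, list):
            node = {}
            parent.append(node)
            stack.append((indent, node))
        elif list_match and isinstance(parent, dict):
            node = []
            parent[list_match.group(1) + "_elements"] = node
            stack.append((indent, node))
        elif "=" in line and isinstance(parent, dict):
            key, _, value = line.partition("=")
            parent[key] = value
        else:
            raise ValueError("unexpected line: %r" % line)
    return result

# Assumed layout: one extra space of indentation per nesting level.
sample = """\
An instance of TestArray
 a=a
 List of 2 A elements:
  Instance of A element
   a=1
  Instance of A element
   d=1
 List of 1 B elements
  Instance of B element
   a=1
   List of 2 C elements
    Instance of C element
     a=1
    Instance of C element
     a=2
An instance of TestArray
 a=1
"""
parsed = parse_nested(sample)
```

    The stack only ever holds the chain of containers enclosing the
    current line, so the same loop handles any depth. Values stay
    strings; converting them to numbers would be a separate step.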
     
    richard, Jun 5, 2012
    #7
  8. [I'm leaving the data in the message in case anybody has troubles going
    up-thread.]
    You forgot one case:

    def build(couple):
        if "=" in couple[0]:
            attr, val = couple[0].split("=")
            return attr,val
        elif "Instance of" in couple[0]:
            #match = re.search("Instance of (.+) element", couple[0])
            #return ("attr_%s" % match.group(1),Stanza(couple[1]))
            return dict(couple[1])
        elif "An instance of" in couple[0]: # you forgot that case
            return dict(couple[1])
        elif "List of" in couple[0]:
            # \d+ so that counts of 10 or more also match
            match = re.search(r"List of \d+ (.+) elements", couple[0])
            return ("%s_elements" % match.group(1),couple[1])
        else:
            pass # put a test here
    Change this to:

    stack[-2][1].append(build(stack[-1])) # call build() here also
    Actually the first and only element of stack is a container: all you
    need is the second element of the only tuple in stack, so:

    return stack[0][1]

    and this is your list. If you need it pretty printed, you'll have to
    work the hierarchy.
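
    One way to "work the hierarchy" for pretty printing is a small
    recursive conversion to plain lists and dicts, sketched below. The
    helper name to_plain is made up, and types.SimpleNamespace merely
    stands in for whatever classes the parser really instantiates.

```python
import pprint
from types import SimpleNamespace

def to_plain(value):
    """Recursively replace objects by their attribute dicts so pprint shows every level."""
    if isinstance(value, list):
        return [to_plain(item) for item in value]
    if hasattr(value, "__dict__"):
        return {name: to_plain(attr) for name, attr in vars(value).items()}
    return value  # strings, numbers, etc. are left untouched

# Hypothetical parsed result: objects nested inside list attributes.
tree = [
    SimpleNamespace(
        a="a",
        B_elements=[SimpleNamespace(a="1", C_elements=[SimpleNamespace(a="1")])],
    )
]
plain = to_plain(tree)
pprint.pprint(plain)
```

    Printing obj.__dict__ directly only expands the top level; nested
    objects would show up as their default repr.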

    -- Alain.
     
    Alain Ketterlin, Jun 5, 2012
    #8
  9. richard

    richard Guest

    Hi Alain, thanks for the reply. With regard to the missing case "An
    Instance of", I'm not sure where / how that is working, as the case I
    put in originally, "Instance of", is in the file and is being handled
    in the previous case. Also, when running the final solution I'm getting
    a list of [None, None] as the final stack. I'm just busy debugging it
    to see what's going wrong. Sorry, I should have been clearer with
    regard to the format mentioned above: the objects are being printed out
    as dicts, so where you put in

    elif "An Instance of" in couple[0]:
        return dict(couple[1])

    should still be ?
    elif "Instance of" in couple[0]:
        match = re.search("Instance of (.+) element", couple[0])
        # instantiating new stanza object and setting attributes.
        return ("attr_%s" % match.group(1),Stanza(couple[1]))
     
    richard, Jun 5, 2012
    #9
  10. richard

    richard Guest

    Hi Alain, thanks for the reply. I amended the code and am just busy
    debugging, but the stack I get back just returns [None, None]. Also, I
    should have been clearer when I mentioned the format above: the dicts
    are actually objects instantiated from classes and just printed out as
    obj.__dict__ for representation purposes. So where you have
    replaced the following, I presume this was because of my format
    confusion. Thanks
     
    richard, Jun 5, 2012
    #10
  11. richard

    richard Guest

    Sorry, silly mistake made with "An instance" and "Instance of". Code
    amended below for the fix:

    if "=" in couple[0]:
        attr, val = couple[0].split("=")
        return attr,val
    elif re.search("Instance of .+",couple[0]):
        #match = re.search("Instance of (.+) element", couple[0])
        #return ("attr_%s" % match.group(1),Stanza(couple[1]))
        return dict(couple[1])
    elif re.search("An instance of .+", couple[0]):
        return dict(couple[1])
    elif "List of" in couple[0]:
        # \d+ so that counts of 10 or more also match
        match = re.search(r"List of \d+ (.+) elements", couple[0])
        return ("%s_elements" % match.group(1),couple[1])
    else:
        pass
     
    richard, Jun 5, 2012
    #11
  12. Both cases are different in your example above. Top level elements are
    labeled "An instance ...", whereas "inner" instances are labeled
    "Instance of ...".
    There's only one way this can happen: by falling through to the last
    case of build(). Check the regexps etc. again.
    Your last "Instance of..." case is correct, but "An instance..." is
    different, because there's no containing object, so it's probably more
    like: return Stanza(couple[1]).
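
    The fall-through is easier to spot if the last case fails loudly
    instead of returning None. The sketch below shows that idea; it is
    not the code from this thread. The Stanza class is a hypothetical
    stand-in (the real one is not shown here), and treating both kinds of
    instance as a bare Stanza is a simplification made for the example.

```python
import re

class Stanza:
    """Hypothetical container: turns (name, value) pairs into attributes."""
    def __init__(self, pairs):
        for name, value in pairs:
            setattr(self, name, value)

def build(couple):
    """Turn a (header line, children) couple into a pair or an object."""
    header, children = couple
    if "=" in header:
        attr, _, val = header.partition("=")
        return attr, val
    if re.match(r"(?:An i|I)nstance of ", header):  # matching is case-sensitive
        return Stanza(children)
    match = re.match(r"List of \d+ (.+) elements", header)
    if match:
        return "%s_elements" % match.group(1), children
    # Without this line the function would silently return None.
    raise ValueError("unrecognised line: %r" % header)

inner = build(("Instance of A element", [("a", "1")]))
a_list = build(("List of 1 A elements", [inner]))
top = build(("An instance of TestArray", [("a", "a"), a_list]))
```

    With this version a header such as "An Instance of TestArray"
    (capital I) raises ValueError at the offending line rather than
    leaving a None in the result.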

    -- Alain.
     
    Alain Ketterlin, Jun 5, 2012
    #12
  13. richard

    richard Guest

    A big thank you to everyone who has helped me tackle / shed light on
    this problem it is working great. Much appreciated.
     
    richard, Jun 5, 2012
    #13
