Which is the better way to parse this file?

Discussion in 'Python' started by Roberto A. F. De Almeida, Sep 2, 2003.

  1. Hi,

    I'm interested in parsing a file containing this "structure":

    """dataset {
    int catalog_number;
    sequence {
    string experimenter;
    int32 time;
    structure {
    float64 latitude;
    float64 longitude;
    } location;
    sequence {
    float depth;
    float temperature;
    } xbt;
    } casts;
    } data;"""

    I want to obtain a dictionary like this:

    >>> pprint.pprint(data)

    {'casts': {'experimenter': None,
    'location': {'latitude': None, 'longitude': None},
    'time': None,
    'xbt': {'depth': None, 'temperature': None}},
    'catalog_number': None}

    The values ('None') will be filled later. I tried to do the parsing
    using regular expressions, but things became too complicated. I had
    more success using SimpleParse, but I'm interested in more insights on
    different ways of parsing this file.

    TIA,

    Roberto
    Roberto A. F. De Almeida, Sep 2, 2003
    #1

  2. Terry Reedy

    "Roberto A. F. De Almeida" <> wrote in message
    news:...
    > I'm interested in parsing a file containing this "structure":
    >
    > """dataset {
    > int catalog_number;
    > sequence {
    > string experimenter;
    > int32 time;
    > structure {
    > float64 latitude;
    > float64 longitude;
    > } location;
    > sequence {
    > float depth;
    > float temperature;
    > } xbt;
    > } casts;
    > } data;"""


    I suspect that what you actually want to do is parse structures 'like'
    the above, as defined by a grammar not shown ;-)

    You did not specify whether you will get such files from an
    uncontrollable external source or whether you control the input format.
    If the latter, there is no obvious reason for separate dataset,
    sequence, and structure productions, since all three result in
    dictionaries with no functional difference.

    > I want to obtain a dictionary like this:
    >
    > >>> pprint.pprint(data)

    > {'casts': {'experimenter': None,
    > 'location': {'latitude': None, 'longitude': None},
    > 'time': None,
    > 'xbt': {'depth': None, 'temperature': None}},
    > 'catalog_number': None}
    > The values ('None') will be filled later.


    Using None as placeholders either tosses the type information or
    requires that it be recorded elsewhere. Use the int and float type
    objects instead. Note that standard Python cannot differentiate
    between float and float64.
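    As a rough sketch of that suggestion, the target dictionary could hold
    type objects in place of None. The mapping of int32 to int and of
    float64 to float is an assumption made for illustration, and the
    convert helper is hypothetical.

```python
# Placeholder dictionary that keeps the declared type of every field.
# Assumption: int32 maps to int, and float/float64 both map to float.
data = {
    "catalog_number": int,
    "casts": {
        "experimenter": str,
        "time": int,
        "location": {"latitude": float, "longitude": float},
        "xbt": {"depth": float, "temperature": float},
    },
}


def convert(raw, typeobj):
    """Hypothetical helper: turn a raw string into the declared type."""
    return typeobj(raw)


latitude = convert("47.5", data["casts"]["location"]["latitude"])
```

    The placeholder doubles as the converter, so no separate type table is
    needed when the values arrive.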

    > I tried to do the parsing
    > using regular expressions, but things became too complicated.


    REs are great for linear repetition but not for indefinite nesting.
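    A small sketch of the difference (the FIELD pattern and the two sample
    strings are illustrative assumptions): one regular expression handles a
    flat run of declarations, but on a nested block it only finds the leaf
    fields and loses the container they belong to.

```python
import re

# Matches one flat declaration such as "float depth;".
FIELD = re.compile(r"(\w+)\s+(\w+)\s*;")

flat = "float depth; float temperature;"
nested = "structure { float64 latitude; float64 longitude; } location;"

# Linear repetition: every declaration is recovered.
flat_fields = FIELD.findall(flat)

# Nesting: the leaves are found, but nothing records that they sit
# inside a structure named "location".
nested_fields = FIELD.findall(nested)
```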

    > I had
    > more success using SimpleParse, but I'm interested in more insights on
    > different ways of parsing this file.


    I know nothing of SimpleParse (and therefore, of what would be
    different). If the grammar is as simple as I infer from the sample --
    dataset and sequences containing sequences, structures, and types -- I
    would reread about recursive-descent parsing and maybe try that. The
    type_entry function would return a (name, typeobject) pair and the
    structure, sequence, and dataset functions a (name, dict) pair.
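    A minimal recursive-descent sketch along those lines, using only the
    standard library. It assumes the grammar is no richer than the sample
    (no arrays, grids, or comments). The names TYPES, CONTAINERS, tokenize,
    parse_declaration, and parse are made up for this example, and the
    three container keywords are folded into one function because they are
    parsed identically.

```python
import re

# Assumed mapping from declared type names to Python type objects.
TYPES = {"int": int, "int32": int, "float": float, "float64": float,
         "string": str}
CONTAINERS = {"dataset", "sequence", "structure"}


def tokenize(text):
    # Identifiers, braces, and semicolons; anything else is silently skipped.
    return re.findall(r"[A-Za-z_]\w*|[{};]", text)


def take(tokens, pos):
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    return tokens[pos], pos + 1


def expect(tokens, pos, wanted):
    token, pos = take(tokens, pos)
    if token != wanted:
        raise SyntaxError(f"expected {wanted!r}, got {token!r}")
    return pos


def parse_declaration(tokens, pos):
    """Parse one declaration; return ((name, value), next_position)."""
    keyword, pos = take(tokens, pos)
    if keyword in CONTAINERS:
        pos = expect(tokens, pos, "{")
        members = {}
        while pos < len(tokens) and tokens[pos] != "}":
            (name, value), pos = parse_declaration(tokens, pos)  # recurse
            members[name] = value
        pos = expect(tokens, pos, "}")
        name, pos = take(tokens, pos)
        pos = expect(tokens, pos, ";")
        return (name, members), pos
    if keyword in TYPES:
        name, pos = take(tokens, pos)
        pos = expect(tokens, pos, ";")
        return (name, TYPES[keyword]), pos
    raise SyntaxError(f"unknown keyword {keyword!r}")


def parse(text):
    tokens = tokenize(text)
    (name, value), pos = parse_declaration(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("trailing tokens after top-level declaration")
    return {name: value}


SAMPLE = """dataset {
    int catalog_number;
    sequence {
        string experimenter;
        int32 time;
        structure { float64 latitude; float64 longitude; } location;
        sequence { float depth; float temperature; } xbt;
    } casts;
} data;"""

result = parse(SAMPLE)
```

    Here result["data"] has the shape of the dictionary requested in the
    original post, with type objects where the None placeholders were.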

    But as hinted above, I would think about simplifying the grammar
    before worrying about parsing. If you only have sequences of
    sequences and type entries, parsing is trivial.

    Terry J. Reedy
    Terry Reedy, Sep 2, 2003
    #2

  3. "Terry Reedy" <> wrote in message news:<>...
    > I suspect that what you actually want to do is parse structures 'like'
    > the above, as defined by a grammar not shown ;-)


    Yes, you're right. :)

    The grammar is not complex, but I'm still struggling to process the
    result tree.

    > You did not specify whether you will get such files from an
    > uncontrollable external source or whether you control the input format.
    > If the latter, there is no obvious reason for separate dataset,
    > sequence, and structure productions, since all three result in
    > dictionaries with no functional difference.


    This is a Dataset Descriptor for the Data Access Protocol
    (http://www.unidata.ucar.edu/packages/dods/design/dap-rfc-html/), an
    API to access remote datasets. DAP servers describe their datasets
    using this grammar, and I'm developing a module to access DAP servers.

    > > I want to obtain a dictionary like this:
    > >
    > > >>> pprint.pprint(data)

    > > {'casts': {'experimenter': None,
    > > 'location': {'latitude': None, 'longitude': None},
    > > 'time': None,
    > > 'xbt': {'depth': None, 'temperature': None}},
    > > 'catalog_number': None}
    > > The values ('None') will be filled later.

    >
    > Using None as placeholders either tosses the type information or
    > requires that it be recorded elsewhere. Use the int and float type
    > objects instead. Note that standard Python cannot differentiate
    > between float and float64.


    Ok. One of the strong points of DAP is that data is retrieved only for
    your region/period of interest. I created a class and redefined
    __getitem__ so that data is only retrieved from the server when the
    object is sliced.

    >>> data = file("http://dods.gso.uri.edu/cgi-bin/nph-nc/data/fnoc1.nc")
    >>> print data.variables['lat'].shape
    (17,)
    >>> print data.variables['lat'][1:4] # only this subset is retrieved
    [ 47.5 45. 42.5 40. ]
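    The lazy-retrieval idea can be sketched as follows. This is not the
    actual module: LazyVariable, fake_fetch, and the sample values are
    hypothetical, and a local list stands in for the remote server. Plain
    Python slice semantics are used here, which may differ from how a DAP
    server interprets index ranges.

```python
class LazyVariable:
    """Hypothetical remote variable: no data is fetched until it is indexed."""

    def __init__(self, name, shape, fetch):
        self.name = name
        self.shape = shape      # known from the descriptor, no data needed
        self._fetch = fetch     # callable(name, key); stands in for a request

    def __getitem__(self, key):
        # Only the requested index or slice is passed on to the fetcher.
        return self._fetch(self.name, key)


# Made-up data and a fake "server" that logs each request it receives.
_REMOTE = [0.0, 10.0, 20.0, 30.0, 40.0]
requests = []


def fake_fetch(name, key):
    requests.append((name, key))
    return _REMOTE[key]


lat = LazyVariable("lat", (len(_REMOTE),), fake_fetch)
shape = lat.shape       # no request is made for this
subset = lat[1:4]       # exactly one request, for this slice only
```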

    > I know nothing of SimpleParse (and therefore, of what would be
    > different). If the grammar is as simple as I infer from the sample --
    > dataset and sequences containing sequences, structures, and types -- I
    > would reread about recursive-descent parsing and maybe try that. The
    > type_entry function would return a (name, typeobject) pair and the
    > structure, sequence, and dataset functions a (name, dict) pair.


    Yes, it's very simple. As you see, even a structure is identical to a
    sequence. The declarations are basically "types" or declarations
    containing "types". Do you think it can be done without 3rd party
    modules?

    > But as hinted above, I would think about simplifying the grammar
    > before worrying about parsing. If you only have sequences of
    > sequences and type entries, parsing is trivial.


    I'll take a look at that. Thanks very much for the insights.

    Regards,

    Roberto
    Roberto A. F. De Almeida, Sep 2, 2003
    #3