Which is the better way to parse this file?

Roberto A. F. De Almeida · Sep 2, 2003

Hi,

I'm interested in parsing a file containing this "structure":

"""dataset {
    int catalog_number;
    sequence {
        string experimenter;
        int32 time;
        structure {
            float64 latitude;
            float64 longitude;
        } location;
        sequence {
            float depth;
            float temperature;
        } xbt;
    } casts;
} data;"""

I want to obtain a dictionary like this:
{'casts': {'experimenter': None,
           'location': {'latitude': None, 'longitude': None},
           'time': None,
           'xbt': {'depth': None, 'temperature': None}},
 'catalog_number': None}

The values ('None') will be filled later. I tried to do the parsing
using regular expressions, but things became too complicated. I had
more success using SimpleParse, but I'm interested in more insights on
different ways of parsing this file.

TIA,

Roberto

Terry Reedy · Sep 2, 2003

Roberto A. F. De Almeida said:
I'm interested in parsing a file containing this "structure":

"""dataset {
    int catalog_number;
    sequence {
        string experimenter;
        int32 time;
        structure {
            float64 latitude;
            float64 longitude;
        } location;
        sequence {
            float depth;
            float temperature;
        } xbt;
    } casts;
} data;"""

I suspect that what you actually want to do is parse structures 'like'
the above, as defined by a grammar not shown ;-)

You did not specify whether you will get such files from an
uncontrollable external source or whether you control the input format.
If the latter, there is no obvious reason for separate dataset,
sequence, and structure productions since all three result in
dictionaries with no functional difference.

I want to obtain a dictionary like this:

{'casts': {'experimenter': None,
           'location': {'latitude': None, 'longitude': None},
           'time': None,
           'xbt': {'depth': None, 'temperature': None}},
 'catalog_number': None}
The values ('None') will be filled later.

Using None as placeholders either tosses the type information or
requires that it be recorded elsewhere. Use the int and float type
objects instead. Note that standard Python cannot differentiate
between float and float64.
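
As a rough illustration of this suggestion, the target dictionary could hold type objects where the original had None. The TYPE_MAP table below is an assumption made for this sketch (it is not from the post), and it is written for current Python 3:

```python
# Hypothetical mapping from declared type names to Python type objects.
# float and float64 both collapse to Python's float, as noted above.
TYPE_MAP = {
    'int': int,
    'int32': int,
    'float': float,
    'float64': float,
    'string': str,
}

# The desired dictionary, with type objects instead of None placeholders.
schema = {
    'catalog_number': TYPE_MAP['int'],
    'casts': {
        'experimenter': TYPE_MAP['string'],
        'time': TYPE_MAP['int32'],
        'location': {'latitude': TYPE_MAP['float64'],
                     'longitude': TYPE_MAP['float64']},
        'xbt': {'depth': TYPE_MAP['float'],
                'temperature': TYPE_MAP['float']},
    },
}
```

If the distinction between float and float64 matters later, the declared name would have to be kept alongside the type object, for example as a (name, type) tuple.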

I tried to do the parsing
using regular expressions, but things became too complicated.

REs are great for linear repetition but not for indefinite nesting.
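
A small sketch of that split, assuming a made-up sample string: a regular expression does the flat work of cutting the text into tokens, while matching each opening brace with its closing brace is left to a plain counter (or a stack, or recursion).

```python
import re

# Made-up sample with two levels of nesting.
text = "sequence { float depth; sequence { int32 time; } inner; } xbt;"

# The linear part: split into braces, semicolons, and words.
tokens = re.findall(r'[{};]|[^\s{};]+', text)

# The nested part: track how deep we are with a counter, since the
# pattern above knows nothing about which "}" closes which "{".
depth = 0
max_depth = 0
for tok in tokens:
    if tok == '{':
        depth += 1
        max_depth = max(max_depth, depth)
    elif tok == '}':
        depth -= 1
```

After the loop, depth is back at zero when the braces are balanced, and max_depth records the deepest nesting seen.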

I had
more success using SimpleParse, but I'm interested in more insights on
different ways of parsing this file.

I know nothing of SimpleParse (and therefore, of what would be
different). If the grammar is as simple as I infer from the sample --
dataset and sequences containing sequences, structures, and types -- I
would reread about recursive-descent parsing and maybe try that. The
type_entry function would return a (name, typeobject) pair and the
structure, sequence, and dataset functions would return a (name, dict) pair.
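
A minimal sketch of that recursive-descent idea, using only the standard library and written for current Python 3. The grammar is inferred from the sample alone, so the TYPES table, the function names, and the decision to treat dataset, sequence, and structure identically are all assumptions of this sketch rather than the real grammar. Error handling is deliberately thin: truncated input or an unknown type name will raise IndexError or KeyError.

```python
import re

# Assumed mapping of declared type names to Python type objects.
TYPES = {'int': int, 'int32': int, 'float': float,
         'float64': float, 'string': str}
CONTAINERS = ('dataset', 'sequence', 'structure')

SAMPLE = """dataset {
    int catalog_number;
    sequence {
        string experimenter;
        int32 time;
        structure {
            float64 latitude;
            float64 longitude;
        } location;
        sequence {
            float depth;
            float temperature;
        } xbt;
    } casts;
} data;"""


def tokenize(text):
    """Split the text into braces, semicolons, and words."""
    return re.findall(r'[{};]|[^\s{};]+', text)


def parse_declaration(tokens, pos):
    """Parse one declaration starting at tokens[pos].

    Returns ((name, value), new_pos). The value is a type object for a
    simple entry and a dict of members for a container.
    """
    keyword = tokens[pos]
    if keyword in CONTAINERS:
        if tokens[pos + 1] != '{':
            raise SyntaxError("expected '{' after %r" % keyword)
        pos += 2
        value = {}
        while tokens[pos] != '}':
            # Recurse: a container holds further declarations.
            (name, member), pos = parse_declaration(tokens, pos)
            value[name] = member
    else:
        # type_entry: "<type> <name>;"
        value = TYPES[keyword]
    name = tokens[pos + 1]  # the name follows the "}" or the type keyword
    if tokens[pos + 2] != ';':
        raise SyntaxError("expected ';' after %r" % name)
    return (name, value), pos + 3


def parse(text):
    """Parse a whole descriptor and return its (name, dict) pair."""
    tokens = tokenize(text)
    pair, pos = parse_declaration(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return pair


name, tree = parse(SAMPLE)
```

Here name is the outer label ('data') and tree is the nested dictionary, with type objects where the requested output had None.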

But as hinted above, I would think about simplifying the grammar
before worrying about parsing. If you only have sequences of
sequences and type entries, parsing is trivial.

Terry J. Reedy

Roberto A. F. De Almeida · Sep 2, 2003

Terry Reedy said:
I suspect that what you actually want to do is parse structures 'like'
the above, as defined by a grammar not shown ;-)

Yes, you're right.

The grammar is not complex, but I'm still struggling to process the
result tree.

You did not specify whether you will get such files from an
uncontrollable external source or whether you control the input format.
If the latter, there is no obvious reason for separate dataset,
sequence, and structure productions since all three result in
dictionaries with no functional difference.

This is a Dataset Descriptor for the Data Access Protocol
(http://www.unidata.ucar.edu/packages/dods/design/dap-rfc-html/), an
API to access remote datasets. DAP servers describe their datasets
using this grammar, and I'm developing a module to access DAP servers.

Using None as placeholders either tosses the type information or
requires that it be recorded elsewhere. Use the int and float type
objects instead. Note that standard Python cannot differentiate
between float and float64.

Ok. One of the strong points of DAP is that data is retrieved only for
your region/period of interest. I created a class and redefined
__getitem__ so that data is only retrieved from the server when the
object is sliced.

>>> data = file("http://dods.gso.uri.edu/cgi-bin/nph-nc/data/fnoc1.nc")
>>> print data.variables['lat'].shape
(17,)
>>> print data.variables['lat'][1:4] # only this subset is retrieved
[ 47.5 45. 42.5 40. ]
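
The module itself is not shown in the thread, so the following is only a sketch of how such lazy retrieval might look in current Python 3. The class name LazyVariable, the fetch callable, and the in-memory list standing in for a server (with made-up values) are all hypothetical.

```python
class LazyVariable:
    """Holds only metadata; values are requested when the object is indexed."""

    def __init__(self, name, shape, fetch):
        self.name = name
        self.shape = shape
        # Hypothetical callable: fetch(name, start, stop) -> list of values.
        self._fetch = fetch

    def __getitem__(self, key):
        if isinstance(key, int):
            if key < 0:
                key += self.shape[0]
            return self._fetch(self.name, key, key + 1)[0]
        # Normalize the slice against the known length before asking
        # the server, so only the requested range is transferred.
        start, stop, step = key.indices(self.shape[0])
        if step != 1:
            raise NotImplementedError("this sketch handles only step 1")
        return self._fetch(self.name, start, stop)


# Stand-in for a remote server: made-up values and a log of requests.
SERVER_DATA = [50.0, 47.5, 45.0, 42.5, 40.0]
requests = []


def fake_fetch(name, start, stop):
    requests.append((name, start, stop))
    return SERVER_DATA[start:stop]


lat = LazyVariable('lat', (len(SERVER_DATA),), fake_fetch)
subset = lat[1:4]  # triggers exactly one request, for indices 1 to 3
```

Creating the object touches no data; the requests log shows that the server is contacted only when the variable is sliced.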

I know nothing of SimpleParse (and therefore, of what would be
different). If the grammar is as simple as I infer from the sample --
dataset and sequences containing sequences, structures, and types -- I
would reread about recursive-descent parsing and maybe try that. The
type_entry function would return a (name, typeobject) pair and the
structure, sequence, and dataset functions would return a (name, dict) pair.

Yes, it's very simple. As you see, even a structure is identical to a
sequence. The declarations are basically "types" or declarations
containing "types". Do you think it can be done without 3rd party
modules?

But as hinted above, I would think about simplifying the grammar
before worrying about parsing. If you only have sequences of
sequences and type entries, parsing is trivial.

I'll take a look at that. Thanks very much for the insights.

Regards,

Roberto
