Help parsing a text file

William Gill

I haven't done much with Python for a couple years, bouncing around
between other languages and scripts as needs suggest, so I have some
minor difficulty keeping Python functionality in my
head, but I can overcome that as the cobwebs clear. Though I do seem to
keep tripping over the same Py2 -> Py3 syntax changes (old habits die hard).

I have a text file with XML-like records that I need to parse. By
XML-like I mean records have proper opening and closing tags, but fields
don't have closing tags (they rely on line ends). Not all fields appear
in all records, but they do adhere to a defined sequence.

My initial passes into Python have been very unfocused (a scatter gun of
too many possible directions, yielding very messy results), so I'm
asking for some suggestions, or algorithms (possibly even examples) that
may help me focus.

I'm not asking anyone to write my code, just to nudge me toward a more
disciplined approach to a common task, and I promise to put in the
effort to understand the underlying fundamentals.
 
Philip Semanchuk

If the syntax really is close to XML, would it be all that difficult to convert it to proper XML? Then you have nice libraries like ElementTree to use for parsing.


Cheers
Philip
 
William Gill

Possibly, but I would still need the same search algorithms to find the
opening tag for the field, then find and replace the next line end with
a matching closing tag. So it seems to me that the starting point is
the same, and then it's my choice to either process the substrings
myself or employ something like ElementTree.
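
A minimal sketch of the conversion route discussed above, assuming a made-up layout (the actual file format was never shown in this thread): each record is wrapped in <record> ... </record> on lines of their own, and each field is an opening tag followed by its value up to the line end. The tag names, SAMPLE, and to_xml are all hypothetical. The regular expression finds the field's opening tag, the closing tag is appended at the line end, and ElementTree does the rest.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical sample data; the real layout may well differ.
SAMPLE = """\
<record>
<name>Alice
<city>Paris
</record>
<record>
<name>Bob
</record>
"""

# A field line: an opening tag followed by a non-empty value up to the line end.
FIELD = re.compile(r"^<(\w+)>(.+)$")

def to_xml(text):
    """Add the missing closing tag to each field and wrap everything in a root."""
    out = ["<records>"]
    for line in text.splitlines():
        line = line.strip()
        m = FIELD.match(line)
        if m:
            name, value = m.groups()
            # Escape &, < and > in the value so the result is well-formed XML.
            out.append(f"<{name}>{escape(value)}</{name}>")
        elif line:
            # Record opening/closing tags pass through unchanged.
            out.append(line)
    out.append("</records>")
    return "\n".join(out)

root = ET.fromstring(to_xml(SAMPLE))
# One dict per record; fields missing from a record are simply absent.
records = [{field.tag: field.text for field in rec} for rec in root]
```

Whether this is less work than processing the substrings directly depends on how regular the real files are; the search step is the same either way.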
 
Thomas Jollans

A name that is often thrown around on this list for this kind of
question is pyparsing. Now, I don't know anything about it myself, but
it may be worth looking into.

Otherwise, if you say it's similar to XML, you might want to take a cue
from XML processing when it comes to dealing with the file. You could
emulate the stream-based approach taken by SAX or Expat: have methods
that handle the different events that can occur. For XML these are "start
tag", "end tag", "text node", "processing instruction", etc.; in your
case, they might be "start/end record", "field data", etc. That way, you
could separate the code that keeps track of the current record (and how
the data fits together to make an object structure) from the parsing
code, which knows how to convert a line of data into something meaningful.
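
As a rough sketch of that separation, assuming a made-up format where records are delimited by <record> and </record> lines and each field is "<tag>value" on one line (the real format was not posted, so the tag names, RecordBuilder, and parse are all hypothetical):

```python
import re

class RecordBuilder:
    """Handler side: tracks the current record and collects finished ones."""

    def __init__(self):
        self.records = []
        self._current = None

    def start_record(self):
        self._current = {}

    def field(self, name, value):
        self._current[name] = value

    def end_record(self):
        self.records.append(self._current)
        self._current = None

# A field line: an opening tag, then the value up to the line end.
FIELD = re.compile(r"<(\w+)>(.*)")

def parse(lines, handler, record_tag="record"):
    """Parsing side: turns each line into exactly one event on the handler."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line == f"<{record_tag}>":
            handler.start_record()
        elif line == f"</{record_tag}>":
            handler.end_record()
        else:
            m = FIELD.fullmatch(line)
            if m is None:
                raise ValueError(f"unrecognised line: {line!r}")
            handler.field(m.group(1), m.group(2))

# Hypothetical sample data.
SAMPLE = """\
<record>
<name>Alice
<city>Paris
</record>
<record>
<name>Bob
</record>
"""

builder = RecordBuilder()
parse(SAMPLE.splitlines(), builder)
```

The parser knows nothing about what a record means, and the handler knows nothing about the line syntax, so either side can be swapped out (for example, a handler that checks the defined field sequence) without touching the other.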

Thomas
 
Waldek M.

Definitely. I have used pyparsing, and even though it's not perfect, it's
very useful indeed. Due to its nature, it is no speed demon when parsing
large, complex structures, so you might want to keep that in mind.
But I wholeheartedly recommend it.

Br.
Waldek
 
JT

lxml can parse XML and broken HTML (see http://lxml.de/parsing.html).

- James
 
William Gill

Thanks to everyone.

Though I didn't get what I expected, it made me think more about the
reason I need to parse these files to begin with. So I'm going to do
some more homework on the overall business application and work backward
from there. Once I know how the data fits in the scheme of things, I
will create an appropriate abstraction layer, either from scratch, or
using one of the existing parsers mentioned, but I won't really know
that until I have finished modeling.
 
