Need help on File parsing


Rui Maciel

John said:
Not to mention it's code that *you* don't have to write or test.

Not necessarily. Depending on what the programmer intends to do, when he
adopts a 3rd party parser instead of writing his own, what he is doing is
delegating only a portion of the work he must do in order to extract
information from a given format, while still being forced to do the
legwork for the remainder of that work.

More specifically, when a programmer employs a 3rd party parser, he is
implicitly dividing the simple task of parsing a given format into two
different tasks:
- parsing the information described in a base format in order to build a
data structure
- parsing the data structure in order to extract the information he
intended to extract

While the first task may be delegated to a parser developed by a 3rd
party, which ends up being handled by a small generic code snippet, the
second task ends up being cumbersome, error-prone and needlessly wasteful
of resources which, in some cases, the programmer may not have. Yet, it
still requires code which *you* have to write and, more importantly,
*you* must test, with the added difficulty of spanning a couple of layers
of abstraction.
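
To make the second task concrete, here is a minimal sketch, in C, of
what "parsing the data structure" tends to look like; the node structure
is hypothetical, standing in for whatever tree a given 3rd party library
actually returns:

#include <string.h>

/* Hypothetical generic tree node, standing in for the library's own. */
struct node {
    const char *name;       /* element name           */
    const char *text;       /* character data, if any */
    struct node *child;     /* first child            */
    struct node *next;      /* next sibling           */
};

/* The "second parse": a depth-first search through the generic tree
   for the first element carrying a given name. The sanity checks and
   the actual extraction logic still pile up on top of this. */
static const struct node *find_element(const struct node *n,
                                       const char *name)
{
    for (; n != NULL; n = n->next) {
        const struct node *hit;

        if (strcmp(n->name, name) == 0)
            return n;
        if ((hit = find_element(n->child, name)) != NULL)
            return hit;
    }
    return NULL;
}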


Rui Maciel
 

Rui Maciel

Nobody said:
If you restrict the application to reading a subset of XML, that defeats
the purpose of using XML in the first place.

Every XML application defines a language which is a subset of XML; every
application of XML is nothing more than the definition of such a subset.
The main advantages of XML are that it's human-readable, the languages
based on it tend to be self-descriptive and it's a common base format for
a whole series of languages. This means that it becomes easier to add
support for other languages, even if you don't have the entire
specification.

Therefore, claiming that restricting the application to reading a subset
of XML defeats the purpose of adopting an XML-based language doesn't make
sense. It doesn't make sense because the whole point of XML is to be
reduced to a subset.

You can find a wide range of tools which can process XML, but the range
of tools which can process a particular custom subset of XML is likely
to be much smaller (i.e. those tools which you write yourself).

An image editor is a tool that can only process a particular custom subset
of XML (for example, SVG). The same applies to office applications, RSS
readers, web browsers and other applications. Therefore, there is no harm
in that. That's what programs are designed to do.

If you think that you only need to support files written by a particular
program, you're likely to end up only supporting files which were
directly written by that program and not post-processed in any way. This
often makes your program less useful than you had originally assumed.

That problem has absolutely nothing to do with XML and everything to do
with adopting/creating open standards to exchange information.


Rui Maciel
 

Rui Maciel

David said:
Note that it always starts this way. It is easy to hand parse the XML
if it is in a truly fixed format, so why use a real parser? But then
there are modifications/extensions/etc. People hand edit the file and
add white space, which won't confuse a parser but messes up your less
flexible hand parse.

Adding white space can only mess up a parser if the parser wasn't
developed to handle that language. Therefore, you can't claim that
writing parsers by hand is a bad thing to do if the only problem that you
can point out is that your parser fails to parse the language it was
intended to parse.

People write a mixture of <element></element>
and <element/>, which should parse as equivalent and somehow
don't when hand parsing.

Only if you failed to add support for that in your parser.

People suddenly want validation, etc.

The beautiful thing about parsers is that they automatically and
implicitly validate a given language. Therefore, it's a non-issue.

Going with a real parser is very much the way to go in a real
application, much more future friendly even if not apparently needed
up front...

This idea that a parser developed by a programmer is somehow not a "real
parser" is silly. Either you mispoke or you don't know what you are
talking about.


Rui Maciel
 

Rui Maciel

Malcolm said:
Think of the XML as a tree, and build what is known as a recursive
descent parser.

Basically it's the same problem as a mathematical expression with
deeply nested parentheses, in a slightly different form. You need one
token of lookahead.

Once you've converted the XML to a tree, you'll usually want to walk
the tree to convert to a set of nested arrays, but sometimes it will
be better to keep the data in tree form.

If someone goes to the trouble of writing a dedicated parser for a
particular language then there is no need to parse it into an
intermediate form. That just forces you to parse essentially the same
information twice in order to access it. Just parse the document and
handle the information in an appropriate way once it is parsed.
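
For reference, a minimal sketch in C of the recursive descent idea
Malcolm describes, with one token of lookahead; the token stream is
assumed to come from a hand-written lexer, and matching the names of
start and end tags is left out:

#include <stddef.h>

enum token_type { TOK_OPEN, TOK_CLOSE, TOK_TEXT, TOK_EOF };

struct token {
    enum token_type type;
    const char *name;           /* tag name or text content */
};

/* element := OPEN { element | TEXT } CLOSE
   toks[*i] is the single token of lookahead. Returns 0 on success. */
static int parse_element(const struct token *toks, size_t *i)
{
    if (toks[*i].type != TOK_OPEN)
        return -1;                  /* expected a start tag */
    ++*i;

    for (;;) {
        switch (toks[*i].type) {
        case TOK_OPEN:              /* nested element: recurse */
            if (parse_element(toks, i) != 0)
                return -1;
            break;
        case TOK_TEXT:              /* character data: consume */
            ++*i;
            break;
        case TOK_CLOSE:             /* matching end tag: done */
            ++*i;
            return 0;
        default:                    /* unexpected end of input */
            return -1;
        }
    }
}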


Rui Maciel
 

Rui Maciel

Nobody said:
I wasn't "replying" to your comments. I elaborated on your reply,
providing more reasons why it's a bad idea to assume that you only need
to handle a subset.

Let's say that we developed a new XML-based language intended to replace
all documents encoded in the INI document format. The language would be
something like:


<?xml version="1.0" encoding="UTF-8" ?>
<document version="1.0">
  <section>
    <name> section name </name>
    <entry>
      <label> label name </label> <value> this label's value </value>
    </entry>
    ...
  </section>
  ...
</document>


In this XML-based language, the only accepted element name for the root
element is the string "document". The root element must have an attribute
to declare the format's version number and may have zero or more "section"
elements. Each "section" element must have a "name" element, followed by
zero or more "entry" elements. Each "entry" element consists of a "label"
element followed by a "value" element, whose content can only be character
data. Every other XML construct is either ignored or declared as an
error.

Considering this, why do you believe it is a bad idea to write a parser
that only accepts this subset of XML?


Rui Maciel
 

Nobody

I wasn't "replying" to your comments. I elaborated on your reply,
providing more reasons why it's a bad idea to assume that you only need
to handle a subset.

Let's say that we developed a new XML-based language intended to replace
all documents encoded in the INI document format. The language would be
something like:
[snip]

In this XML-based language, the only accepted element name for the root
element is the string "document". The root element must have an
attribute to declare the format's version number and may have zero or
more "section" elements. Each "section" element must have a "name"
element, followed by zero or more "entry" elements. Each "entry"
element consists of a "label" element followed by a "value" element,
whose content can only be character data. Every other XML construct is
either ignored or declared as an error.

Considering this, why do you believe it is a bad idea to write a parser
that only accepts this subset of XML?

That isn't what we're talking about. Any validating parser rejects
invalid documents; that doesn't mean that such a parser only accepts a
subset of the language.

A subset of the /language/ implies that, for any given data, only a subset
of the valid representations are accepted, e.g. requiring <tag></tag>
rather than <tag/>, imposing constraints upon whitespace within
tags, requiring attributes to be specified in a particular order, etc.

Having said that, the main reasons why writing such a parser would be a
bad idea are:

1. It doesn't help. The parser wouldn't be significantly simpler than one
which parsed arbitrary XML; in fact, it would probably be more
complicated, as the parser would be performing checks which most
(non-validating) parsers leave to the application.

2. If you want to extend the format, you have to change the code for the
parser. With a generic non-validating parser, you don't have to change
anything; with a generic validating parser, you only have to change
the DTD. In either case, the application would only need to be changed if
it didn't just ignore unrecognised elements.
 

Nobody

Adding white space can only mess up a parser if the parser wasn't
developed to handle that language. Therefore, you can't claim that
writing parsers by hand is a bad thing to do if the only problem that you
can point out is that your parser fails to parse the language it was
intended to parse.

Right. Which is exactly what we mean by a "subset" of XML.

I'm not sure whether you're playing devil's advocate or you actually
aren't aware of just how common a problem this is. I've lost track of the
number of times I've seen stuff like "sed 's!<title>\(.*\)</title>!\1!' ...".
 

Rui Maciel

Nobody said:
Right. Which is exactly what we mean by a "subset" of XML.

This particular issue has nothing to do with a language being or not being
a subset of XML. It's a problem caused by adopting a poorly thought out
language which fails to cover the intended use case.

I'm not sure whether you're playing devil's advocate or you actually
aren't aware of just how common a problem this is. I've lost track of
the number of times I've seen stuff like "sed
's!<title>\(.*\)</title>!\1!' ...".

I've written a few parsers, including a couple of generic parsers for a
markup language, and supporting white space between elements (or any
equivalent nesting construct) is one of the most trivial things that one
can add to a parser, particularly because it either represents a single
terminal in the production or it doesn't even need to be supported in the
language's grammar.

In the case of XML, as an element may have character data between its
start tag and end tag, it would probably be better to add support for it
in the production and, depending on how the language was designed, ignore
it or raise some kind of error.


Rui Maciel
 

Rui Maciel

Nobody said:
That isn't what we're talking about. Any validating parser rejects
invalid documents; that doesn't mean that such a parser only accepts a
subset of the language.

A subset of the /language/ implies that, for any given data, only a
subset of the valid representations are accepted, e.g. requiring
<tag></tag> rather than <tag/>, imposing constraints upon whitespace
within tags, requiring attributes to be specified in a particular order,
etc.

A subset of a language is still a language on its own, which means that a
parser designed to handle it either accepts a document as valid or rejects
it.

Knowing this, a subset of XML will only impose the constraints which it
was designed to impose; no more, no less. If you write a parser that
rejects certain language constructs then you either failed to design your
language or you failed to write your parser. Your failure to do any of
these things does not mean that it is a bad idea to develop parsers. It
only means that you failed to develop the language and/or parser that you
needed.

Having said that, the main reasons why writing such a parser would be a
bad idea are:

1. It doesn't help. The parser wouldn't be significantly simpler than
one which parsed arbitrary XML; in fact, it would probably be more
complicated, as the parser would be performing checks which most
(non-validating) parsers leave to the application.

Keep in mind that a generic parser only transforms the information
between two formats (i.e., parses a document and builds up a data
structure), and that you are still forced to parse the end format to
validate your format and extract the information (i.e., traverse the data
structure, perform sanity checks according to the information found in
the data structure, extract information, etc.).

This means that once you adopt a generic parser to parse a document then,
unless you intend to parse a home-brew format that will not be exchanged
by anyone and will only be used by a specific version of a specific
program, you are only fooling yourself into believing that you are simplifying
things. You aren't. You are adding a new abstraction layer to your
program that does nothing more than convert the information between
formats, both of which you still have to parse.

2. If you want to extend the format, you have to change the code for the
parser. With a generic non-validating parser, you don't have to change
anything; with a generic validating parser, you only have to change
the DTD. In either case, the application would only need to be changed
if it didn't just ignore unrecognised elements.

Not quite. The DTD only lets you configure the generic parser to perform
a set of sanity checks. Meanwhile, you are still forced to rely on two
separate parsers to parse a single piece of information.

Adding to this, relying on generic parsers and DTDs won't help you with
basic tasks such as adding support for multiple versions of the same
language. That means that if you rely on a generic parser and are
suddenly forced to tweak your document format, and therefore to support
multiple versions of the same format, then you are either screwed or
forced to employ a scheme to convert all instances of the old format into
the new one, something which in some cases is impossible.


Rui Maciel
 

Ian Collins

A subset of a language is still a language on its own, which means that a
parser designed to handle it either accepts a document as valid or rejects
it.

Knowing this, a subset of XML will only impose the constraints which it
was designed to impose; no more, no less. If you write a parser that
rejects certain language constructs then you either failed to design your
language or you failed to write your parser. Your failure to do any of
these things does not mean that it is a bad idea to develop parsers. It
only means that you failed to develop the language and/or parser that you
needed.

The problem of what to do with the data in an XML document (or any other
structured document) is one of the reasons why there are two types of
XML parser. One can either use a SAX (stream) parser to process
elements as they are encountered, or parse the complete document into a
DOM (Document Object Model) tree.

I use both, depending on the problem at hand. If the data has to be
manipulated as a complete set, I use my (heavy) DOM parser. If not
(loading a configuration for example), I use my light SAX parser.

A SAX parser uses callback functions to handle various events triggered
by the document, which makes it easy to translate elements of interest
into application data structures or actions, which would be ideal for
the OP's requirement.
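
As a minimal sketch of that configuration case, using expat (a common C
SAX-style parser; link with -lexpat). The <config> and <timeout> element
names are made up for the example, and expat may deliver character data
in several chunks, which a real loader would accumulate:

#include <expat.h>
#include <stdio.h>
#include <string.h>

static int in_timeout;
static char timeout_value[64];

static void XMLCALL on_start(void *ud, const XML_Char *name,
                             const XML_Char **atts)
{
    (void)ud; (void)atts;
    in_timeout = strcmp(name, "timeout") == 0;
}

static void XMLCALL on_end(void *ud, const XML_Char *name)
{
    (void)ud; (void)name;
    in_timeout = 0;
}

static void XMLCALL on_text(void *ud, const XML_Char *s, int len)
{
    (void)ud;
    if (in_timeout && (size_t)len < sizeof timeout_value) {
        memcpy(timeout_value, s, (size_t)len);  /* last chunk wins */
        timeout_value[len] = '\0';
    }
}

int main(void)
{
    static const char doc[] = "<config><timeout>30</timeout></config>";
    XML_Parser p = XML_ParserCreate(NULL);

    XML_SetElementHandler(p, on_start, on_end);
    XML_SetCharacterDataHandler(p, on_text);
    if (XML_Parse(p, doc, (int)strlen(doc), 1) == XML_STATUS_ERROR)
        fprintf(stderr, "parse error: %s\n",
                XML_ErrorString(XML_GetErrorCode(p)));
    printf("timeout = %s\n", timeout_value);
    XML_ParserFree(p);
    return 0;
}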
 

Rui Maciel

Ian said:
The problem of what to do with the data in an XML document (or any other
structured document) is one of the reasons why there are two types of
XML parser. One can either use a SAX (stream) parser to process
elements as they are encountered, or parse the complete document into a
DOM (Document Object Model) tree.

I use both, depending on the problem at hand. If the data has to be
manipulated as a complete set, I use my (heavy) DOM parser. If not
(loading a configuration for example), I use my light SAX parser.

A SAX parser uses callback functions to handle various events triggered
by the document, which makes it easy to translate elements of interest
into application data structures or actions, which would be ideal for
the OP's requirement.

The SAX approach is basically a partially developed parser. In essence, a
SAX API provides a stream of terminal tokens while performing sanity
checks on the base format. To put it in other words, a SAX parser is
basically a lexer that converts a set of terminal tokens from a base
language (say, XML) to a single terminal token from a different language
(say, SVG). In this process, it also implicitly performs a set of sanity
checks on the base language.

This means that when a programmer opts to parse a given document following
the SAX approach, what he is doing is essentially picking up a specialized
lexer and writing his own parser around that particular lexer. So, although
the programmer avoids parsing a much larger language (i.e., what the SAX
lexer returns as "open element A" may be "terminal token '<', followed by
a text-string terminal token holding 'A', followed by terminal token '>'"),
he still has to define a production for his language and develop a parser
to parse it.
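
To illustrate, a fragment of such a parser written over SAX-style
events, reusing the entry/label/value format from earlier in the thread;
the handler signatures are simplified relative to any real SAX API:

#include <string.h>

enum state { OUTSIDE, IN_ENTRY, HAVE_LABEL, HAVE_VALUE };

static enum state st = OUTSIDE;
static int grammar_error;

/* The SAX events are the terminal tokens; this pair of handlers is the
   parser proper, and accepts only
   <entry><label>...</label><value>...</value></entry>. */
static void on_start_element(const char *name)
{
    if (st == OUTSIDE && strcmp(name, "entry") == 0)
        st = IN_ENTRY;
    else if (st == IN_ENTRY && strcmp(name, "label") == 0)
        ;                       /* the label's text follows */
    else if (st == HAVE_LABEL && strcmp(name, "value") == 0)
        ;                       /* the value's text follows */
    else
        grammar_error = 1;      /* token out of place: reject */
}

static void on_end_element(const char *name)
{
    if (strcmp(name, "label") == 0)
        st = HAVE_LABEL;
    else if (strcmp(name, "value") == 0)
        st = HAVE_VALUE;
    else if (strcmp(name, "entry") == 0) {
        if (st != HAVE_VALUE)
            grammar_error = 1;  /* entry ended without label + value */
        st = OUTSIDE;
    }
}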


Rui Maciel
 

Ian Collins

The SAX approach is basically a partially developed parser. In essence, a
SAX API provides a stream of terminal tokens while performing sanity
checks on the base format. To put it in other words, a SAX parser is
basically a lexer that converts a set of terminal tokens from a base
language (say, XML) to a single terminal token from a different language
(say, SVG). In this process, it also implicitly performs a set of sanity
checks on the base language.

This means that when a programmer opts to parse a given document following
the SAX approach, what he is doing is essentially picking up a specialized
lexer and writing his own parser around that particular lexer. So, although
the programmer avoids parsing a much larger language (i.e., what the SAX
lexer returns as "open element A" may be "terminal token '<', followed by
a text-string terminal token holding 'A', followed by terminal token '>'"),
he still has to define a production for his language and develop a parser
to parse it.

Which he will end up doing no matter what approach is used to parse the
source document.
 

Nobody

I've written a few parsers, including a couple of generic parsers for a
markup language, and supporting white space between elements (or any
equivalent nesting construct) is one of the most trivial things that one
can add to a parser,

Dealing with whitespace may be trivial (unless the underlying I/O code is
line-oriented, as XML allows linefeeds within tags), but it's frequently
omitted.

It's less trivial to deal with the fact that attributes may appear in any
order.
 

Rui Maciel

Nobody said:
Dealing with whitespace may be trivial (unless the underlying I/O code
is line-oriented, as XML allows linefeeds within tags), but it's
frequently omitted.

The implementation details of the IO part of a parser are irrelevant.
Whether the IO is line-oriented or not, the IO code should never insert or
omit information, which means that a parser only handles the information
provided by a stream.

It's less trivial to deal with the fact that attributes may appear in
any order.

I don't believe that constitutes a real problem. For example, consider an
XML-based file format which consists of a single element "element" which
may have a set of attributes labelled "alpha", "beta" and "gamma". For
that language, a valid document could be something like:

<element alpha="true" />


If the language accepts repeated attributes then a possible (and crude)
production[1] would be something like:

<example>

document = "<" "element" *tag "/" ">"

tag = "alpha" "=" text_string
    / "beta"  "=" text_string
    / "gamma" "=" text_string

</example>

The support for the tags specified in the above production in an LL
parser, ignoring error handling, may amount to around 3 states (6, if we
count a "ghost" state to push the attribute values into a data structure).

If, instead, the attributes must follow a specific order (alpha, beta,
gamma) where:
- each attribute can either be present or not
- an attribute appearing out of its rightful place is considered an error

then, the following production applies:

<example>

document = "<" "element" *1alpha_tag *1beta_tag *1gamma_tag "/" ">"

alpha_tag = "alpha" "=" text_string
beta_tag  = "beta"  "=" text_string
gamma_tag = "gamma" "=" text_string

</example>

The support for the tags specified in the above production in an LL
parser, ignoring error handling, is yet again achieved by adding 3 states
(6, with the "ghost" states).

If your language accepts any possible attribute combination then the
production starts to become a bit more demanding. Yet, you only need to
deal with this if you specifically want your grammar to accept attributes
in any random order, which means that you are creating your own problem.
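
For what it's worth, a sketch in C of the any-order case; the attribute
list is assumed to arrive as name/value pairs, the way a SAX-style
callback would deliver it:

#include <stddef.h>
#include <string.h>

static const char *const attr_names[] = { "alpha", "beta", "gamma" };

/* Accept the three attributes in any order, rejecting repeats and
   unknown names. Returns 0 if the attribute list is valid. */
static int accept_attributes(const char *const attrs[][2], size_t count)
{
    int seen[3] = { 0, 0, 0 };
    size_t i, j;

    for (i = 0; i < count; i++) {
        for (j = 0; j < 3; j++)
            if (strcmp(attrs[i][0], attr_names[j]) == 0)
                break;
        if (j == 3)
            return -1;          /* unknown attribute */
        if (seen[j]++)
            return -1;          /* repeated attribute */
        /* attrs[i][1] holds the value; store it as needed */
    }
    return 0;
}

The fixed-order variant is just as small: replace the seen[] flags with
an index that is only allowed to move forward through attr_names.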

Nonetheless, notice that you will be faced with the exact same problem if
you rely on a generic parser instead of one which you develop yourself.
In that case, the problem is actually more demanding, as you are forced
to deal with nodes in a tree structure instead of a simple stream of
terminal tokens.


Rui Maciel

[1] http://tools.ietf.org/html/rfc5234
 

Rui Maciel

Ian said:
Which he will end up doing no matter what approach is used to parse the
source document.

If a programmer opts for a DOM-type approach then he will be faced with a
problem which is considerably (and needlessly) more complicated.

But suppose the programmer opts for a SAX-type approach instead. Knowing
that the only thing he gets is a tricked-out lexer and that he is still
forced to develop his own parser, by adopting an XML library which
provides SAX the programmer is essentially being forced to adopt a
particular language which more often than not does not even fit the
intended purpose.

So, if a generic XML API doesn't eliminate the need to develop a parser
to extract information, then what's the point of adopting a generic parser
to begin with, let alone basing the document format on XML?


Rui Maciel
 

Ian Collins

If a programmer opts for a DOM-type approach then he will be faced with a
problem which is considerably (and needlessly) more complicated.

But suppose the programmer opts for a SAX-type approach instead. Knowing
that the only thing he gets is a tricked-out lexer and that he is still
forced to develop his own parser, by adopting an XML library which
provides SAX the programmer is essentially being forced to adopt a
particular language which more often than not does not even fit the
intended purpose.

So, if a generic XML API doesn't eliminate the need to develop a parser
to extract information, then what's the point of adopting a generic parser
to begin with, let alone basing the document format on XML?

Indeed, that's one reason I prefer JSON.

But the choice of representation isn't always one the developer can
make. I have written a lot of code (in a variety of languages) to
extract data from OpenOffice documents. The client does not care that I
have to work with an XML document, they just want the data from the
document.
 

Nobody

But suppose the programmer opts for a SAX-type approach instead. Knowing
that the only thing he gets is a tricked-out lexer and that he is still
forced to develop his own parser,

You make it sound as if it's a significant issue. Once you have the lexer,
XML is trivial to parse. There are no shift-reduce or reduce-reduce
conflicts, because every construct begins with a token which is unique to
that construct.

So, if a generic XML API doesn't eliminate the need to develop a parser
to extract information, then what's the point of adopting a generic parser
to begin with, let alone basing the document format on XML?

The point is that you don't have to code dedicated utilities for common
tasks, as you can just use XSLT, XQuery, etc. You don't have to write
bindings for a variety of languages, as every common language already has
XML parsers (and more, e.g. tools which will generate class definitions
from a DTD or vice-versa).

In many cases, the only valid reason for /not/ using XML is efficiency (I
don't consider the vendor lock-in which proprietary formats offer to be a
"valid" reason).
 

Nobody

The implementation details of the IO part of a parser are irrelevant.

Not if it constrains the data flow, i.e. when you don't get to carry
state over between lines, i.e. what happens when people try to parse XML
with grep/sed/perl/etc.
 

Rui Maciel

Nobody said:
Not if it constrains the data flow, i.e. when you don't get to carry
state over between lines, i.e. what happens when people try to parse XML
with grep/sed/perl/etc.

If people rely on grep to parse XML then they are intentionally creating
their own problems. No one decides to dig a ditch with a screwdriver and
then complains that the job is simply too complicated to perform.

The same applies to Perl, if people try to employ it to parse XML the way
they use grep. That would, obviously, be stupid, as it is quite possible
to write proper parsers in Perl.


Rui Maciel
 

David Resnick

If a programmer opts for a DOM-type approach then he will be faced with a
problem which is considerably (and needlessly) more complicated.

But suppose the programmer opts for a SAX-type approach instead. Knowing
that the only thing he gets is a tricked-out lexer and that he is still
forced to develop his own parser, by adopting an XML library which
provides SAX the programmer is essentially being forced to adopt a
particular language which more often than not does not even fit the
intended purpose.

So, if a generic XML API doesn't eliminate the need to develop a parser
to extract information, then what's the point of adopting a generic parser
to begin with, let alone basing the document format on XML?

Rui Maciel

You can use a DOM parser and a query language like XPath, which makes
getting information pretty simple. Parse, then ask for what you need.
Of course, it's not appropriate for all uses, but it's nice for getting
what you want out of the doc.
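
A minimal sketch with libxml2, assuming a made-up file name and query;
error checking is omitted, and you'd build with the flags reported by
xml2-config --cflags --libs:

#include <libxml/parser.h>
#include <libxml/xpath.h>
#include <stdio.h>

int main(void)
{
    xmlDocPtr doc = xmlReadFile("config.xml", NULL, 0);
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    xmlXPathObjectPtr res =
        xmlXPathEvalExpression(BAD_CAST "//section/name", ctx);
    int i;

    /* Print the text content of every node the query matched. */
    for (i = 0; res->nodesetval && i < res->nodesetval->nodeNr; i++) {
        xmlChar *text = xmlNodeGetContent(res->nodesetval->nodeTab[i]);
        printf("%s\n", (char *)text);
        xmlFree(text);
    }

    xmlXPathFreeObject(res);
    xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
    return 0;
}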

-David
 
