xml file parsing in C

Marc Dubois · Dec 12, 2006

hi,
is it possible to parse an XML file in C so that i can fulfill these
requirements :
1) replace all "<" and ">" signs inside the body of tag by a space, e.g. :
Example 1:
<foo> blabla < bla </foo>

becomes

<foo> blabla bla </foo>

Example 2:

<foo>> blablabla </foo>

becomes

<foo> blablabla </foo>

2) Remove all extra spaces at the end of every line of the XML file
3) Replace all special characters ( Unicode or Hexadecimal characters) by a
space

I mean the XML file is not well formed if there are "<" and ">" signs a
little bit everywhere,
it is not a valid file in that case, so i do not think the use of a parser
would be appropriate in that case. (How would the parser react when it
encounters a < that does not correspond to the beginning of a tag ???)

Do you have an idea on how i can write a program to deal with these
requirements ?
Technical environment is : Unix, KSH, and C (gcc)

I am thinking of using the "sed" command instead, i can get rid of the extra
spaces and replace the special characters but i still do not know how to
deal with the extra ">" and "<" signs.

Thanks for your help.

Clever Monkey · Dec 12, 2006

Marc said:
hi,
is it possible to parse an XML file in C so that i can fulfill these
requirements :
1) replace all "<" and ">" signs inside the body of tag by a space, e.g. : [...]
2) Remove all extra spaces at the end of every line of the XML file
3) Replace all special characters ( Unicode or Hexadecimal characters) by a
space

I mean the XML file is not well formed if there are "<" and ">" signs a
little bit everywhere,
it is not a valid file in that case, so i do not think the use of a parser
would be appropriate in that case. (How would the parser react when it
encounters a < that does not correspond to the beginning of a tag ???)

Do you have an idea on how i can write a program to deal with these
requirements ?
Technical environment is : Unix, KSH, and C (gcc)

I am thinking of using the "sed" command instead, i can get rid of the extra
spaces and replace the special characters but i still do not know how to
deal with the extra ">" and "<" signs.

Pretty much OT for this group. Try a newsgroup that deals with POSIX
tools, or try "man sed".

<off-topic>
Also, look at XML tidy/validation tools. HTML tidy has limited XML support.
</>

james of tucson · Dec 12, 2006

Marc said:
hi,
is it possible to parse an XML file in C

Of course it is "possible." Is it easy?
Depends on your experience writing parsers.
The XML grammar is not especially complicated
-- that's sort of the point of it.

If you are willing to take a canned solution, there is expat for C
http://www.jclark.com/xml/expat.html

However, your problem seems to be formatting and error correction, not
XML parsing. For example,

<foo> blabla < bla </foo>

Is not XML.

<foo>> blablabla </foo>

Is not XML

2) Remove all extra spaces at the end of every line of the XML file

You don't need anything but an address of a char array and '\0' to do
that
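
A minimal sketch of that idea, assuming each line has already been read into a writable buffer (for example with fgets). The name rtrim_line is made up for illustration, and treating tabs and carriage returns as "extra spaces" is an assumption:

```c
#include <string.h>

/* Strip trailing spaces, tabs and carriage returns from one line, in place.
 * A trailing newline, if present, is preserved. Returns the same pointer. */
char *rtrim_line(char *line)
{
    size_t len = strlen(line);
    int had_newline = 0;

    /* Set the newline aside so it can be put back after trimming. */
    if (len > 0 && line[len - 1] == '\n') {
        had_newline = 1;
        len--;
    }
    /* Walk backwards over the trailing blanks. */
    while (len > 0 && (line[len - 1] == ' ' ||
                       line[len - 1] == '\t' ||
                       line[len - 1] == '\r')) {
        len--;
    }
    if (had_newline) {
        line[len++] = '\n';
    }
    line[len] = '\0';   /* the '\0' is what actually cuts the line short */
    return line;
}
```

Calling it on every line read from the input and writing the result out covers requirement 2 without touching the rest of the file.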

3) Replace all special characters ( Unicode or Hexadecimal characters) by a
space

This part might be an interesting problem.

I mean the XML file is not well formed if there are "<" and ">" signs a
little bit everywhere,

Right, so you realize this, and you realize that an XML parser will
simply choke on it and (maybe) tell you where the errors are

it is not a valid file in that case, so i do not think the use of a parser
would be appropriate in that case. (How would the parser react when it
encounters a < that does not correspond to the beginning of a tag ???)

Hopefully, it will emit a diagnostic message ...

Do you have an idea on how i can write a program to deal with these
requirements ?
Technical environment is : Unix, KSH, and C (gcc)

I am thinking of using the "sed" command instead, i can get rid of the extra
spaces and replace the special characters but i still do not know how to
deal with the extra ">" and "<" signs.

You could use a lookahead technique since you always know what you want
to match. The naive approach I'd start with, would be to work the
tokens from the outer extremes to inner, maybe making a pass first just
to validate that the angle brackets all match up.
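
The validation pass could be sketched roughly like this. It only checks that '<' and '>' strictly alternate, and it deliberately ignores comments, CDATA sections and quoted attribute values, so it is an assumption-laden first cut rather than a real well-formedness check. The function name is hypothetical:

```c
#include <stddef.h>

/* Returns -1 if every '<' is closed by a '>' before the next '<' and no
 * '>' appears outside a tag. Otherwise returns the offset of the first
 * bracket that breaks the pattern. */
long first_unmatched_bracket(const char *s)
{
    long open_at = -1;   /* offset of the currently open '<', or -1 */
    long i;

    for (i = 0; s[i] != '\0'; i++) {
        if (s[i] == '<') {
            if (open_at >= 0) {
                return open_at;      /* previous '<' was never closed */
            }
            open_at = i;
        } else if (s[i] == '>') {
            if (open_at < 0) {
                return i;            /* '>' with no '<' before it */
            }
            open_at = -1;
        }
    }
    return open_at;                  /* -1 when balanced, else a dangling '<' */
}
```

On the two examples from the original post this reports the stray '<' in the first and the extra '>' in the second, which gives the fix-up pass a position to work from.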

Thanks for your help.

I replied to your post because I work in a Java environment, and I
realize I am spoiled. Doing XML in java is too simple to warrant much
discussion. Doing an XML parser in C, on the other hand, from scratch,
would be a very interesting problem.

After considering it for about half a second, I'd look into the
difficulty level of using the Xerces-C++ library in a C app. Or the
XML::Parser perl module.

I realize you want to feed it invalid XML and correct errors; I know
from experience that you can use Xerces to a certain extent to locate
errors, so it might not be terribly hard to take that approach - make
passes through the xerces validator to find errors, fix them, and end up
with the ability to do SAX or DOM on the document for free.

I have never, ever, even considered touching Xerces-C++, so I don't know
if it has anything in common with Xerces-Java. The docs on the xerces
site make it look easy enough to use.

Somebody out there has done this, right?

Rob Hoelz · Dec 12, 2006

Just curious, why do you want to use C for this? I'm not bashing C,
(I love it), but this seems like the kind of task Perl was created
for.

Marc Dubois · Dec 12, 2006

I don't know Perl.

Rob Hoelz said:
Just curious, why do you want to use C for this? I'm not bashing C,
(I love it), but this seems like the kind of task Perl was created
for.

Rob Hoelz · Dec 13, 2006

It's a good language; I'd consider learning it if I were you.

Default User · Dec 13, 2006

Rob said:
Just curious, why do you want to use C for this?

Please don't top-post. Your replies belong following or interspersed
with properly trimmed quotes. See the majority of other posts in the
newsgroup, or:
<http://www.caliburn.nl/topposting.html>

Richard · Dec 13, 2006

Default User said:
Please don't top-post. Your replies belong following or interspersed
with properly trimmed quotes. See the majority of other posts in the
newsgroup, or:
<http://www.caliburn.nl/topposting.html>

Lecturing on top posting is OT.

John F · Dec 13, 2006

Richard said:
Lecturing on top posting is OT.

It's somehow ironic but: so is lecturing on OT

Keith Thompson · Dec 13, 2006

John F said:
It's somehow ironic but: so is lecturing on OT

By convention, meta-discussions about topicality are considered
topical.

In my opinion, discussions about how to post properly should also be
considered topical. If nobody ever complained about top-posting, we'd
end up with an ugly mixture of top-posting, bottom-posting,
mid-posting, and whatever other forms of posting some random person
decides Looks Really Cool. The newsgroup will become more difficult
to read, and those who spend the most time here will lose patience and
give up on the newsgroup. Since spending a lot of time here
correlates fairly strongly (but not perfectly) with expertise, I
suggest that this would be to the great detriment of the newsgroup.

Personally, I *usually* don't complain about top-posting unless I
happen to be replying to the article anyway.

Perhaps we should agree on a de facto standard tag, like "[TP]", for
articles that complain about top-posting without adding new content.
Or perhaps there should be a more generic tag for criticisms of
posting style. (In my opinion, articles that complain about posting
style *and* discuss C need no such tag.)

Default User · Dec 13, 2006

Richard wrote:

Lecturing on top posting is OT.

Where did you get that idea?

If you don't want to see the messages, I try to put "- TPA" in the
subject line (as in this case). You can easily create a filter for it.

Brian

Default User · Dec 13, 2006

John said:
It's somehow ironic but: so is lecturing on OT

Actually, no. Topicality is always on-topic.

Brian

Default User · Dec 13, 2006

Keith Thompson wrote:

Perhaps we should agree on a de facto standard tag, like "[TP]", for
articles that complain about top-posting without adding new content.
Or perhaps there should be a more generic tag for criticisms of
posting style. (In my opinion, articles that complain about posting
style and discuss C need no such tag.)

I shy away from stuff in [] these days, because Google has decided for
some unknown reason to strip them in messages that originate there,
including replies. I've gone with "- TPA" in the subject.

We discussed this some time back. I encourage anyone who can to create
a filter for that. I try to put that in (modulo my memory) whenever it's
strictly a quick note about top-posting.

Brian

John F · Dec 13, 2006

Default said:
Actually, no. Topicality is always on-topic.

I know at least one group where off-topic posts are on topic as per
definition:

borland.public.off-topic

John F · Dec 13, 2006

Keith said:
By convention, meta-discussions about topicality are considered
topical.

Are meta-discussions about meta-discussions about topicality still on
topic?

In my opinion, discussions about how to post properly should also be
considered topical.

I'd second that if there was only a single post to tell the poster not
to top-post, but unfortunately c.l.c. is one of the few groups where real
OT and posting flame threads (like this one) can be very amusing and
usually end up in a cat fight with a spectacular showdown.

If nobody ever complained about top-posting, we'd
end up with an ugly mixture of top-posting, bottom-posting,
mid-posting, and whatever other forms of posting some random person
decides Looks Really Cool.

I second that. I don't like it either. Meanwhile I even told various
sales guys not to "top-mail" and to use plain text instead of
HTML. It works (sometimes)!

The newsgroup will become more difficult
to read, and those who spend the most time here will lose patience
and
give up on the newsgroup. Since spending a lot of time here
correlates fairly strongly (but not perfectly) with expertise, I
suggest that this would be to the great detriment of the newsgroup.

I gained a lot of knowledge on the C language here! e.g. coding
styles, interhuman communication (and the problems arising from
that)...

Personally, I *usually* don't complain about top-posting unless I
happen to be replying to the article anyway.

I don't complain. I correct it and I found that on the third reply the
poster will usually get the clue... Or I send a short e-mail
containing links to some articles (this is better than trashing the
newsgroup with top-post flames).

Perhaps we should agree on a de facto standard tag, like "[TP]", for
articles that complain about top-posting without adding new content.
Or perhaps there should be a more generic tag for criticisms of
posting style. (In my opinion, articles that complain about posting
style *and* discuss C need no such tag.)

How about [COPS] as in "criticism of posting style"?

Or [OT-COPS] for criticism-only replies?

Or use dashes instead of brackets, since (as Brian noted correctly)
Google strips these phrases in replies.

CBFalconer · Dec 13, 2006

Richard said:
Lecturing on top posting is OT.

On the contrary, such lectures are essential to maintaining proper
usenet protocol. Without correction such newbies will never
learn. It's something like training puppies.

CBFalconer · Dec 13, 2006

John said:
.... snip ...

I second that. I don't like it either. Meanwhile I even told
various sales guys not to "top-mail" and to use plain text
instead of HTML. It works (sometimes)!

What gets me is banks and credit cards that insist on sending me
html mail. I keep telling them that it is a security risk, and
they keep insisting that they can't do anything else. Idiots.
Bank of America is one.

Random832 · Dec 13, 2006

CBFalconer said:
What gets me is banks and credit cards that insist on sending me
html mail. I keep telling them that it is a security risk, and
they keep insisting that they can't do anything else. Idiots.
Bank of America is one.

Sending HTML email is not a security risk. _Receiving_ it from unknown
recipients can be, if your HTML email viewer mishandles the code, but
that's A) not their problem, and B) if you don't use Outlook and have
external images turned off, not your problem.

CBFalconer · Dec 13, 2006

Random832 said:
Sending HTML email is not a security risk. _Receiving_ it from
unknown recipients can be, if your HTML email viewer mishandles
the code, but that's A) not their problem, and B) if you don't
use Outlook and have external images turned off, not your problem.

I am secure. Their customers aren't. It's stupid to endanger
people for no conceivable reason. Even stupider to deny that they
can send pure text mail.

goose · Dec 13, 2006

Marc said:
hi,
is it possible to parse an XML file in C so that i can fulfill these
requirements :

I'd consider looking up "state machine" and implementing a
state-machine.

If it's only a few simple rules like below, you could simply,
assuming that you have the string in memory to work on,
work across the string 1 character at a time in a loop and
use "if ... else ..." clauses for each of your rules and
maintain state with a few variables.

Something like this (I *think* it's all correct, but I've not
tested it on a large dataset, only on the examples you gave,
so you may need to fix the errors that are sure to crop up):

/* The main loop */
size_t i, len = strlen (src);
int in_tag = 0,
    in_data = 0,
    in_spaces = 0;
for (i=0; i<len; i++) {
    /* Process src[i] and output a single char:
     * 1. output a single char dependent on the rule invoked.
     * 2. output src[i] if none of the rules are invoked.
     */
    /* RULES GO HERE (see all the rules below) */
    printf ("%c", src[i]);
}

All that goes in the loop should be the few rules
that you are interested in.

1) replace all "<" and ">" signs inside the body of tag by a space, e.g. :
Example 1:
<foo> blabla < bla </foo>

/* RULE 1. A '<' is stray if another '<' shows up before the next '>'. */
if (src[i]=='<') {
    char *ending_tag = strchr (&src[i], '>'),
         *next_opening_tag = strchr (&src[i+1], '<');
    if ((ending_tag && next_opening_tag) &&
        (ending_tag > next_opening_tag)) {
        printf (" ");
        continue;
    } else {
        in_tag = 1;
    }
}

becomes

<foo> blabla bla </foo>

Example 2:

<foo>> blablabla </foo>

/* RULE 2. A '>' seen while not inside a tag is stray. */
if (src[i]=='>') {
    if (!in_tag) {
        printf (" ");
        continue;
    } else {
        in_tag = 0;
    }
}

becomes

<foo> blablabla </foo>

2) Remove all extra spaces at the end of every line of the XML file

/* RULE 3. This will replace ALL spaces that are not part
 * of a tag or a tag's data with just a single space.
 * (Needs <ctype.h> for isspace.)
 */
if (src[i]=='<' && src[i+1]=='/') {
    in_data = 0;
}
if (src[i]=='<' && src[i+1]!='/') {
    in_data = 1;
}
if (!in_data && !in_tag &&
    (isspace ((unsigned char) src[i]) || src[i]=='\n') && !in_spaces) {
    in_spaces = 1;
    printf (" ");   /* emit the single replacement space */
    continue;
}
if (in_spaces && isspace ((unsigned char) src[i])) {
    continue;
} else {
    in_spaces = 0;
}

3) Replace all special characters ( Unicode or Hexadecimal characters) by a
space

/* RULE 4. I've no idea what you mean by "hexadecimal characters".
 * You will have to write the function (possibly maintaining state)
 * that will determine whether the character is unicode. This is
 * beyond my area of expertise.
 */
if (is_unicode (&src[i])) {
    printf (" ");
    i += 3;
    continue;
}
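
One possible reading of requirement 3, sketched under loud assumptions: the input is UTF-8, and "special character" means anything outside 7-bit ASCII. Neither assumption comes from the original post, and the function name is invented for illustration. Each multi-byte character is replaced by a single space by blanking its lead byte and dropping its continuation bytes (a stray continuation byte in malformed input is simply dropped):

```c
#include <stddef.h>

/* Rewrite s in place so that every non-ASCII UTF-8 character becomes one
 * space. Returns the new length of the string. */
size_t blank_non_ascii(char *s)
{
    const unsigned char *in = (const unsigned char *)s;
    char *out = s;

    for (; *in != '\0'; in++) {
        if (*in < 0x80) {
            *out++ = (char)*in;          /* plain ASCII: copy through */
        } else if ((*in & 0xC0) != 0x80) {
            *out++ = ' ';                /* lead byte of a multi-byte char */
        }
        /* continuation bytes (10xxxxxx) are skipped */
    }
    *out = '\0';
    return (size_t)(out - s);
}
```

Because the output never grows, the rewrite can safely share the input buffer. If the file is in a single-byte encoding such as Latin-1 instead, the continuation-byte test would have to go.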

I mean the XML file is not well formed if there are "<" and ">" signs a
little bit everywhere,
it is not a valid file in that case, so i do not think the use of a parser
would be appropriate in that case. (How would the parser react when it
encounters a < that does not correspond to the beginning of a tag ???)

It will barf and hopefully give you a line number/character number
of the error/last good token parsed.

Do you have an idea on how i can write a program to deal with these
requirements ?

Like I said above, use a state-machine[1]. Alternatively you
could try my attempt and hack it till it works on your dataset.
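
For reference, the bracket rules can be folded into an explicit two-state machine. This is only a sketch under the same heuristic as the fragments in this post: a '<' opens a tag if a '>' follows before any other '<'. The names are hypothetical, and text such as "a < b > c" would still be mistaken for a tag:

```c
#include <string.h>

enum state { IN_TEXT, IN_TAG };

/* Replace stray '<' and '>' characters with spaces, in place.
 * Returns the same pointer. */
char *blank_stray_brackets(char *s)
{
    enum state st = IN_TEXT;
    size_t i;

    for (i = 0; s[i] != '\0'; i++) {
        if (st == IN_TEXT) {
            if (s[i] == '<') {
                /* Look ahead: which comes first, '>' or another '<'? */
                const char *close = strchr(&s[i + 1], '>');
                const char *open = strchr(&s[i + 1], '<');
                if (close != NULL && (open == NULL || close < open)) {
                    st = IN_TAG;         /* looks like a real tag */
                } else {
                    s[i] = ' ';          /* stray '<' */
                }
            } else if (s[i] == '>') {
                s[i] = ' ';              /* '>' outside any tag */
            }
        } else if (s[i] == '>') {
            st = IN_TEXT;                /* end of the tag */
        }
    }
    return s;
}
```

Each stray bracket becomes exactly one space, so the two examples come out with doubled or tripled spaces around the removed bracket; collapsing those is a separate pass.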

Technical environment is : Unix, KSH, and C (gcc)

I am thinking of using the "sed" command instead, i can get rid of the extra

<OT>
sed is OffTopic! If you really want to, try comp.unix.programmer. If
you are not limited to an "only-C" solution do what I do and use
flex for this type of thing (i.e. for homework the instructor wants
to test you, not just get a solution, and so may insist that flex
cannot be used).

spaces and replace the special characters but i still do not know how to
deal with the extra ">" and "<" signs.

Thanks for your help.

[1] I'm constantly amazed that so few people look at a "parsing"
type problem and go "Aha! StateMachine to the rescue". Is this
type of thing not taught anymore?

goose,
