XML file parsing in C

Marc Dubois

Hi,
Is it possible to parse an XML file in C so that I can fulfill these
requirements:
1) Replace all "<" and ">" signs inside the body of a tag with a space, e.g.:
Example 1:
<foo> blabla < bla </foo>

becomes

<foo> blabla bla </foo>

Example 2:

<foo>> blablabla </foo>

becomes


<foo> blablabla </foo>


2) Remove all extra spaces at the end of every line of the XML file
3) Replace all special characters (Unicode or hexadecimal characters) with a
space


I mean the XML file is not well formed if there are "<" and ">" signs
scattered everywhere. It is not a valid file in that case, so I do not think
a parser would be appropriate. (How would the parser react when it
encounters a "<" that does not correspond to the beginning of a tag?)

Do you have an idea of how I can write a program to meet these
requirements?
The technical environment is Unix, ksh, and C (gcc).

I am thinking of using the "sed" command instead. I can get rid of the extra
spaces and replace the special characters, but I still do not know how to
deal with the extra ">" and "<" signs.

Thanks for your help.
 
Clever Monkey

Marc said:
hi,
is it possible to parse an XML file in C so that i can fulfill these
requirements :
1) replace all "<" and ">" signs inside the body of tag by a space, e.g. : [...]
2) Remove all extra spaces at the end of every line of the XML file
3) Replace all special characters ( Unicode or Hexadecimal characters) by a
space

I mean the XML file is not well formed if there are "<" and ">" signs a
little bit everywhere,
it is not a valid file in that case, so i do not think the use of a parser
would be appropriate in that case. (How would the parser react when it
encounters a < that does not correspond to the beginning of a tag ???)

Do you have an idea on how i can write a program to deal with these
requirements ?
Technical environment is : Unix, KSH, and C (gcc)

I am thinking of using the "sed" command instead, i can get rid of the extra
spaces and replace the special characters but i still do not know how to
deal with the extra ">" and "<" signs.
Pretty much OT for this group. Try a newsgroup that deals with POSIX
tools, or try "man sed".

<state:off-topic>
Also, look at XML tidy/validation tools. HTML tidy has limited XML support.
</>
 
james of tucson

Marc said:
hi,
is it possible to parse an XML file in C

Of course it is "possible." Is it easy?
Depends on your experience writing parsers.
The XML grammar is not especially complicated
-- that's sort of the point of it.

If you are willing to take a canned solution, there is expat for C
http://www.jclark.com/xml/expat.html

However, your problem seems to be formatting and error correction, not
XML parsing. For example,
<foo> blabla < bla </foo>

Is not XML.
<foo>> blablabla </foo>

Is not XML
2) Remove all extra spaces at the end of every line of the XML file

You don't need anything but an address of a char array and '\0' to do
that :)
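
A minimal sketch of that idea, assuming each line has already been read into a writable char buffer (for example with fgets) and that "extra spaces" means trailing spaces and tabs. The name rtrim_line is made up for illustration; it walks back from the end of the string and writes a new '\0' terminator:

```c
#include <assert.h>
#include <string.h>

/* Remove trailing spaces and tabs from one line, in place.
 * A final '\n', if present, is preserved.
 * Returns the new length of the line. */
size_t rtrim_line(char *line)
{
    assert(line != NULL);

    size_t len = strlen(line);
    int had_newline = (len > 0 && line[len - 1] == '\n');

    if (had_newline)
        len--;                      /* step over the newline for now */
    while (len > 0 && (line[len - 1] == ' ' || line[len - 1] == '\t'))
        len--;                      /* walk back over trailing blanks */
    if (had_newline)
        line[len++] = '\n';         /* put the newline back */
    line[len] = '\0';               /* new end of string */
    return len;
}
```

Called on each line before it is written out again, this would cover requirement 2. A '\r' before the newline is not treated as whitespace here; that would need one more comparison.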
3) Replace all special characters ( Unicode or Hexadecimal characters) by a
space

This part might be an interesting problem.
I mean the XML file is not well formed if there are "<" and ">" signs a
little bit everywhere,

Right, so you realize this, and you realize that an XML parser will
simply choke on it and (maybe) tell you where the errors are :)
it is not a valid file in that case, so i do not think the use of a parser
would be appropriate in that case. (How would the parser react when it
encounters a < that does not correspond to the beginning of a tag ???)

Hopefully, it will emit a diagnostic message ...

Do you have an idea on how i can write a program to deal with these
requirements ?
Technical environment is : Unix, KSH, and C (gcc)

I am thinking of using the "sed" command instead, i can get rid of the extra
spaces and replace the special characters but i still do not know how to
deal with the extra ">" and "<" signs.

You could use a lookahead technique since you always know what you want
to match. The naive approach I'd start with, would be to work the
tokens from the outer extremes to inner, maybe making a pass first just
to validate that the angle brackets all match up.
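
As a rough illustration of that validation pass, here is a sketch that only checks whether angle brackets alternate properly. It assumes the document contains no comments, CDATA sections, or attribute values with a literal '>' in them, and the name brackets_match is invented for this example:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every '<' is closed by a '>' before the next '<'
 * and no '>' appears outside a tag. Otherwise returns 0 and, when
 * bad_pos is not NULL, stores the offset of the offending bracket. */
int brackets_match(const char *s, size_t *bad_pos)
{
    assert(s != NULL);

    size_t open_pos = 0;   /* offset of the most recent unclosed '<' */
    int in_tag = 0;

    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] == '<') {
            if (in_tag) {              /* previous '<' was never closed */
                if (bad_pos) *bad_pos = open_pos;
                return 0;
            }
            in_tag = 1;
            open_pos = i;
        } else if (s[i] == '>') {
            if (!in_tag) {             /* '>' with no matching '<' */
                if (bad_pos) *bad_pos = i;
                return 0;
            }
            in_tag = 0;
        }
    }
    if (in_tag) {                      /* input ended inside a tag */
        if (bad_pos) *bad_pos = open_pos;
        return 0;
    }
    return 1;
}
```

On the two examples from the original post this flags the stray '<' and the stray '>' respectively, which is the position a repair pass would then blank out.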

Thanks for your help.

I replied to your post because I work in a Java environment, and I
realize I am spoiled. Doing XML in Java is too simple to warrant much
discussion. Writing an XML parser in C from scratch, on the other hand,
would be a very interesting problem.

After considering it for about half a second, I'd look into the
difficulty level of using the Xerces-C++ library in a C app. Or the
XML::Parser Perl module.

I realize you want to feed it invalid XML and correct errors; I know
from experience that you can use Xerces to a certain extent to locate
errors, so it might not be terribly hard to take that approach - make
passes through the xerces validator to find errors, fix them, and end up
with the ability to do SAX or DOM on the document for free.

I have never, ever, even considered touching Xerces-C++, so I don't know
if it has anything in common with Xerces-Java. The docs on the xerces
site make it look easy enough to use.

Somebody out there has done this, right?
 
Rob Hoelz

Just curious, why do you want to use C for this? I'm not bashing C,
(I love it), but this seems like the kind of task Perl was created
for.
 
Marc Dubois

I don't know Perl.
Rob Hoelz said:
Just curious, why do you want to use C for this? I'm not bashing C,
(I love it), but this seems like the kind of task Perl was created
for.
 
Keith Thompson

John F said:
It's somehow ironic, but so is lecturing on OT :)

By convention, meta-discussions about topicality are considered
topical.

In my opinion, discussions about how to post properly should also be
considered topical. If nobody ever complained about top-posting, we'd
end up with an ugly mixture of top-posting, bottom-posting,
mid-posting, and whatever other forms of posting some random person
decides Looks Really Cool. The newsgroup will become more difficult
to read, and those who spend the most time here will lose patience and
give up on the newsgroup. Since spending a lot of time here
correlates fairly strongly (but not perfectly) with expertise, I
suggest that this would be to the great detriment of the newsgroup.

Personally, I *usually* don't complain about top-posting unless I
happen to be replying to the article anyway.

Perhaps we should agree on a de facto standard tag, like "[TP]", for
articles that complain about top-posting without adding new content.
Or perhaps there should be a more generic tag for criticisms of
posting style. (In my opinion, articles that complain about posting
style *and* discuss C need no such tag.)
 
Default User

Richard wrote:

Lecturing on top posting is OT.


Where did you get that idea?

If you don't want to see the messages, I try to put "- TPA" in the subject
line (as in this case). You can easily create a filter for it.




Brian
 
Default User

Keith Thompson wrote:

Perhaps we should agree on a de facto standard tag, like "[TP]", for
articles that complain about top-posting without adding new content.
Or perhaps there should be a more generic tag for criticisms of
posting style. (In my opinion, articles that complain about posting
style and discuss C need no such tag.)


I shy away from stuff in [] these days, because Google has decided for
some unknown reason to strip them in messages that originate there,
including replies. I've gone with "- TPA" in the subject.

We discussed this some time back. I encourage anyone who can to create
a filter for that. I try to put that in (modulo my memory) whenever it's
strictly a quick note about top-posting.




Brian
 
John F

Default said:
Actually, no. Topicality is always on-topic.

I know at least one group where off-topic posts are on topic as per
definition:

borland.public.off-topic
 
John F

Keith said:
By convention, meta-discussions about topicality are considered
topical.

Are meta-discussions about meta-discussions about topicality still on
topic?
In my opinion, discussions about how to post properly should also be
considered topical.

I'd second that if there were only a single post telling the poster not to
top-post, but unfortunately c.l.c. is one of the few groups where real
OT and posting flame threads (like this one) can be very amusing and
usually end up in a cat fight with a spectacular showdown.
If nobody ever complained about top-posting, we'd
end up with an ugly mixture of top-posting, bottom-posting,
mid-posting, and whatever other forms of posting some random person
decides Looks Really Cool.

I second that. I don't like it either. Meanwhile I even told various
sales guys not to "top-mail" :) and to use plain text instead of
HTML. It works (sometimes)!
The newsgroup will become more difficult
to read, and those who spend the most time here will lose patience and
give up on the newsgroup. Since spending a lot of time here
correlates fairly strongly (but not perfectly) with expertise, I
suggest that this would be to the great detriment of the newsgroup.

I gained a lot of knowledge on the C language here! e.g. coding
styles, interhuman communication (and the problems arising from
that)...
Personally, I *usually* don't complain about top-posting unless I
happen to be replying to the article anyway.

I don't complain. I correct it, and I have found that by the third reply the
poster will usually get the clue... Or I send a short e-mail
containing links to some articles (this is better than trashing the
newsgroup with top-post flames :)
Perhaps we should agree on a de facto standard tag, like "[TP]", for
articles that complain about top-posting without adding new content.
Or perhaps there should be a more generic tag for criticisms of
posting style. (In my opinion, articles that complain about posting
style *and* discuss C need no such tag.)

How about [COPS] as in "criticism of posting style"? :)
Or [OT-COPS] for criticism-only replies?

Or use dashes instead of brackets, since (as Brian noted correctly)
Google strips these phrases in replies.
 
CBFalconer

Richard said:
Lecturing on top posting is OT.

On the contrary, such lectures are essential to maintaining proper
usenet protocol. Without correction such newbies will never
learn. It's something like training puppies.
 
CBFalconer

John said:
.... snip ...

I second that. I don't like it either. Meanwhile I even told
various sales guys not to "top-mail" :) and to use plain text
instead of HTML. It works (sometimes)!

What gets me is banks and credit cards that insist on sending me
html mail. I keep telling them that it is a security risk, and
they keep insisting that they can't do anything else. Idiots.
Bank of America is one.
 
Random832

CBFalconer said:
What gets me is banks and credit cards that insist on sending me
html mail. I keep telling them that it is a security risk, and
they keep insisting that they can't do anything else. Idiots.
Bank of America is one.

Sending HTML email is not a security risk. _Receiving_ it from unknown
recipients can be, if your HTML email viewer mishandles the code, but
that's A) not their problem, and B) if you don't use Outlook and have
external images turned off, not your problem.
 
CBFalconer

Random832 said:
Sending HTML email is not a security risk. _Receiving_ it from
unknown recipients can be, if your HTML email viewer mishandles
the code, but that's A) not their problem, and B) if you don't
use Outlook and have external images turned off, not your problem.

I am secure. Their customers aren't. It's stupid to endanger
people for no conceivable reason. Even stupider to deny that they
can send pure text mail.
 
goose

Marc said:
hi,
is it possible to parse an XML file in C so that i can fulfill these
requirements :

I'd consider looking up "state machine" and implementing a
state-machine.

If it's only a few simple rules like below, you could simply,
assuming that you have the string in memory to work on,
work across the string 1 character at a time in a loop and
use "if ... else ..." clauses for each of your rules and
maintain state with a few variables.

Something like this (I *think* it's all correct, but I've not
tested it on a large dataset, only on the examples you gave,
so you may need to fix the errors that are sure to crop up):

/* The main loop */
size_t i, len = strlen (src);
int in_tag = 0,
    in_data = 0,
    in_spaces = 0;
for (i=0; i<len; i++) {
    /* Process src[i] and output a single char:
     * 1. output a single char dependent on the rule invoked.
     * 2. output src[i] if none of the rules are invoked.
     */
    /* RULES GO HERE (see all the rules below) */
    printf ("%c", src[i]);
}

All that goes in the loop should be the few rules
that you are interested in.
1) replace all "<" and ">" signs inside the body of tag by a space, e.g. :
Example 1:
<foo> blabla < bla </foo>
/* RULE 1. A '<' is stray if another '<' comes before the next '>'. */
if (src[i]=='<') {
    char *ending_tag = strchr (&src[i], '>'),
         *next_opening_tag = strchr (&src[i+1], '<');
    if ((ending_tag && next_opening_tag) &&
        (ending_tag > next_opening_tag)) {
        printf (" ");
        continue;
    } else {
        in_tag = 1;
    }
}
becomes

<foo> blabla bla </foo>

Example 2:

<foo>> blablabla </foo>

/* RULE 2. A '>' outside a tag is stray. */
if (src[i]=='>') {
    if (!in_tag) {
        printf (" ");
        continue;
    } else {
        in_tag = 0;
    }
}
becomes


<foo> blablabla </foo>


2) Remove all extra spaces at the end of every line of the XML file

/* RULE 3. This will drop ALL runs of whitespace that are not
 * part of a tag or a tag's data. Needs <ctype.h> for isspace().
 */
if (src[i]=='<' && src[i+1]=='/') {
    in_data = 0;
}
if (src[i]=='<' && src[i+1]!='/') {
    in_data = 1;
}
if (!in_data && !in_tag &&
    (isspace ((unsigned char)src[i]) || src[i]=='\n') && !in_spaces) {
    in_spaces = 1;
    continue;
}
if (in_spaces && isspace ((unsigned char)src[i])) {
    continue;
} else {
    in_spaces = 0;
}

3) Replace all special characters ( Unicode or Hexadecimal characters) by a
space

/* RULE 4. I've no idea what you mean by "hexadecimal characters".
 * You will have to write the function (possibly maintaining state)
 * that will determine whether the character is unicode. This is
 * beyond my area of expertise.
 */
if (is_unicode(&src[i])) {
    printf (" ");
    i += 3;
    continue;
}
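
One possible way to fill in the is_unicode placeholder, offered only as a sketch: it assumes the file is encoded as UTF-8 and that "special character" means anything outside 7-bit ASCII. Neither assumption comes from the original post, and the names utf8_seq_len and blank_non_ascii are made up here. Instead of always skipping a fixed number of bytes, it reads the sequence length from the lead byte:

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence starting at p: 1 for ASCII,
 * 2 to 4 for a multi-byte lead byte, and 1 for a stray continuation
 * or invalid byte so that the caller always makes progress. */
size_t utf8_seq_len(const unsigned char *p)
{
    assert(p != NULL);
    if (*p < 0x80)           return 1;
    if ((*p & 0xE0) == 0xC0) return 2;
    if ((*p & 0xF0) == 0xE0) return 3;
    if ((*p & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Copy src to dst, replacing each non-ASCII sequence with ONE space.
 * dst must have room for strlen(src) + 1 bytes.
 * Returns the number of bytes written, not counting the '\0'. */
size_t blank_non_ascii(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);

    const unsigned char *p = (const unsigned char *)src;
    char *out = dst;

    while (*p != '\0') {
        if (*p < 0x80) {
            *out++ = (char)*p++;          /* plain ASCII: copy as is */
        } else {
            size_t n = utf8_seq_len(p);
            *out++ = ' ';
            p++;                          /* skip the lead byte */
            /* skip continuation bytes (10xxxxxx), never past '\0' */
            while (--n > 0 && (*p & 0xC0) == 0x80)
                p++;
        }
    }
    *out = '\0';
    return (size_t)(out - dst);
}
```

If the input turns out to be Latin-1 or some other single-byte encoding, every byte of 0x80 or above simply becomes one space, because such bytes are rarely followed by valid continuation bytes.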
I mean the XML file is not well formed if there are "<" and ">" signs a
little bit everywhere,
it is not a valid file in that case, so i do not think the use of a parser
would be appropriate in that case. (How would the parser react when it
encounters a < that does not correspond to the beginning of a tag ???)

It will barf and hopefully give you a line number/character number
of the error/last good token parsed.
Do you have an idea on how i can write a program to deal with these
requirements ?

Like I said above, use a state-machine[1]. Alternatively you
could try my attempt and hack it till it works on your dataset.
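
For readers who want something they can compile, here is a compact sketch of the state-machine idea covering only the two bracket rules. It borrows the lookahead heuristic from the rules above (a '<' opens a tag only if a '>' comes before the next '<') and additionally treats a '<' with no '>' after it as stray. The function name fix_brackets and the two-state enum are invented for this example, and comments, CDATA sections, and attribute values containing '>' are not handled:

```c
#include <assert.h>
#include <string.h>

enum state { IN_TEXT, IN_TAG };

/* Copy src to dst, replacing stray '<' and '>' with a space.
 * The output has the same length as the input, so dst needs
 * strlen(src) + 1 bytes. */
void fix_brackets(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);

    enum state st = IN_TEXT;
    size_t i;

    for (i = 0; src[i] != '\0'; i++) {
        char c = src[i];

        if (st == IN_TEXT && c == '<') {
            /* Look ahead: which bracket comes next? */
            const char *close = strchr(&src[i + 1], '>');
            const char *open  = strchr(&src[i + 1], '<');

            if (close != NULL && (open == NULL || close < open))
                st = IN_TAG;    /* a '>' comes first: real tag */
            else
                c = ' ';        /* another '<' or nothing: stray */
        } else if (c == '>') {
            if (st == IN_TAG)
                st = IN_TEXT;   /* closes the current tag */
            else
                c = ' ';        /* '>' outside any tag: stray */
        }
        dst[i] = c;
    }
    dst[i] = '\0';
}
```

Because each stray bracket is replaced by exactly one space, the text around it keeps its original spacing; "blabla < bla" comes out with three spaces between the words.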
Technical environment is : Unix, KSH, and C (gcc)

I am thinking of using the "sed" command instead, i can get rid of the extra

<OT>
sed is off-topic! If you really want to, try comp.unix.programmer. If
you are not limited to a C-only solution, do what I do and use
flex for this type of thing (if this is homework, the instructor wants
to test you, not just get a solution, and so may insist that flex
not be used).
</OT>
spaces and replace the special characters but i still do not know how to
deal with the extra ">" and "<" signs.

Thanks for your help.

[1] I'm constantly amazed that so few people look at a "parsing"
type problem and go "Aha! StateMachine to the rescue". Is this
type of thing not taught anymore?

goose,
 
