Request for comments on HTML tag removal function

Kieran Simkin

Hi All,

Just writing a quick function to remove HTML tags from a string (array of
chars) and I'd like your comments on my code - anything you'd do differently
or any mistakes etc. I'm still kinda new to C, so I'm not 100% confident
using pointers yet.

Anyway, the algorithm works like this: the loop steps over the string
character by character with two pointers, 's' (read) and 'c' (write). A
toggle variable indicates whether the 's' pointer is currently within an
HTML tag. If it is, 's' is incremented but 'c' isn't. If 's' isn't inside
an HTML tag, the value pointed to by 'c' is set to the value pointed to by
's' and both are incremented. So basically, the string is being rebuilt in
place, skipping over HTML tags and their content.

Here's the code, all comments welcome:

void striphtml (char *s) {
    char *c,t=0;
    c=s;
    while (*s!='\0') {
        if (*s=='<') {
            t=1;
        } else if (*s=='>') {
            t=0;
        } else if (!t) {
            *(c++)=*s;
        }
        s++;
    }
    *c='\0';
}


Cheers.


~Kieran Simkin
Digital Crocus
http://digital-crocus.com/
 

Kieran Simkin

Chris McDonald said:
In comp.lang.c you write:


Have to worry about the nasty ones, too.
What if an HTML tag appears in an HTML comment?

That's a very good point, I've now made the following modification to my
code:

void striphtml (char *s) {
    char *c,t=0;
    c=s;
    while (*s!='\0') {
        if (*s=='<') {
            t++;
        } else if (*s=='>') {
            t--;
        } else if (t<1) {
            *(c++)=*s;
        }
        s++;
    }
    *c='\0';
}

Now, instead of toggling 't' on and off, 't' is a counter for the depth of
nested HTML tags, and a character is only copied if the depth is less than
one, i.e., when we're not inside any HTML tag.

Any other comments on this code?


~Kieran
 

G. S. Hayes

Kieran Simkin said:
Have to worry about the nasty ones, too.
What if an HTML tag appears in an HTML comment?
[SNIP]
Now instead of toggling 't' on and off, t becomes a counter for the depth of
nested HTML tags and the string is only copied if the depth is less than
one, ie, we're inside of less than one HTML tag.

Does that work? e.g.

<!-- Comment <!-- --> <a>

The <a> isn't in a comment. HTML comments (like C /* */ comments) don't nest.

Right?
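
To make the non-nesting point concrete, here is a minimal sketch that treats everything from "<!--" to the first following "-->" as a single unit and otherwise keeps the original on/off toggle. The name striphtml_comments is made up for this illustration, and dropping the rest of the string on an unterminated comment is an assumption, not something anyone in the thread specified.

```c
#include <string.h>

/* Hypothetical variant of striphtml(): a comment runs from "<!--" to the
   first "-->" after it (comments do not nest); everything else uses the
   original on/off toggle. */
void striphtml_comments(char *s)
{
    char *c = s;      /* write position */
    int in_tag = 0;   /* nonzero while s is inside <...> */

    while (*s != '\0') {
        if (!in_tag && strncmp(s, "<!--", 4) == 0) {
            char *end = strstr(s + 4, "-->");
            if (end == NULL)
                break;        /* assumption: unterminated comment drops the rest */
            s = end + 3;      /* resume just past "-->" */
            continue;
        }
        if (*s == '<')
            in_tag = 1;
        else if (*s == '>')
            in_tag = 0;
        else if (!in_tag)
            *c++ = *s;
        s++;
    }
    *c = '\0';
}
```

With this sketch, text following the <a> in the example above is kept, whereas a depth counter would still consider itself inside a tag at that point.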
 

Arthur J. O'Dwyer

Chris McDonald said:
In comp.lang.c you write:


Have to worry about the nasty ones, too.
What if an HTML tag appears in an HTML comment?

That's a very good point, I've now made the following modification to my
code: [...]
Now instead of toggling 't' on and off, t becomes a counter for the depth of
nested HTML tags and the string is only copied if the depth is less than
one, ie, we're inside of less than one HTML tag.

(According to another poster, HTML comments don't nest.)

What about quoted text?

<img src="rewind.png" alt="<<">
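
One way to cope with quoted attribute values, sketched under the assumption that only single and double quotes matter and that a quote only has meaning inside a tag. The name striphtml_quotes is hypothetical, and this is an illustration rather than a complete HTML tokenizer.

```c
/* Hypothetical variant of striphtml(): while inside a quoted attribute
   value, '<' and '>' are ignored, so they cannot open or close a tag. */
void striphtml_quotes(char *s)
{
    char *c = s;         /* write position */
    int in_tag = 0;      /* nonzero while s is inside <...> */
    char quote = '\0';   /* the open quote character, or '\0' if none */

    while (*s != '\0') {
        if (in_tag) {
            if (quote != '\0') {
                if (*s == quote)
                    quote = '\0';          /* closing quote */
            } else if (*s == '"' || *s == '\'') {
                quote = *s;                /* opening quote */
            } else if (*s == '>') {
                in_tag = 0;                /* real end of the tag */
            }
        } else if (*s == '<') {
            in_tag = 1;
        } else {
            *c++ = *s;   /* text outside tags, including a stray '>' */
        }
        s++;
    }
    *c = '\0';
}
```

Unlike the original, this sketch copies a '>' that appears outside any tag instead of dropping it.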

What about <pre> tags?

<pre>if (i>0) break; if (i<0) continue;</pre>

And then once you get your code to parse standard HTML, it's still a
good idea to do something semi-sensible with non-standard HTML.

<table border=1 bgcolor=FFFFFF
<tr><td>Hello, world!
</table>

:) But that's a question for comp.programming or comp.text.html
(if such a group exists; I don't think it does. Oh, well). In fact,
even your original question was kind of OT here---since you weren't
asking "does this code meet the spec," but rather "what kind of spec
should I make up for this code?" ;)

HTH,
-Arthur
 

Smoker

You don't handle '<' and '>' characters that aren't intended to open or
close a tag (as in "x < 7" or "y > 1").

Sure, theoretically there should be a "&lt;" instead of '<', but if
you parse «real world» HTML pages with that function, it will strip
important parts of the text on most pages.


Take another approach: check for '<', then call a function that
checks whether it opens a tag or not.
If the '<' doesn't open a tag, that function can return 0. If it
does open a tag, the function can return the position of the closing
'>' relative to the position of the '<' (call it the string size of
the tag).
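
A rough sketch of that approach follows. The names tag_length and striphtml_checked are made up for illustration, and the test for "looks like a tag" ('<' followed by a letter, '/', '!' or '?') is an assumed heuristic, not a rule taken from this thread. Here the helper returns the full length of the tag including both brackets, or 0.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: p must point at a '<'. Returns the length of the
   tag including both brackets, or 0 if this does not look like a tag. */
size_t tag_length(const char *p)
{
    const char *end;
    unsigned char next = (unsigned char)p[1];

    /* Assumed heuristic for "this '<' opens a tag". */
    if (!isalpha(next) && next != '/' && next != '!' && next != '?')
        return 0;
    end = strchr(p + 1, '>');
    return end ? (size_t)(end - p) + 1 : 0;   /* 0 if there is no closing '>' */
}

/* Hypothetical caller: skips whole tags, copies everything else. */
void striphtml_checked(char *s)
{
    char *c = s;   /* write position */

    while (*s != '\0') {
        size_t n = (*s == '<') ? tag_length(s) : 0;
        if (n > 0)
            s += n;          /* skip the whole tag */
        else
            *c++ = *s++;     /* ordinary character, including a lone '<' or '>' */
    }
    *c = '\0';
}
```

With this sketch, "if x < 7 then <b>bold</b>" becomes "if x < 7 then bold". It does nothing about a '>' inside a quoted attribute value.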


Here is another problem you will have to consider:


With <a href="foo.html" onclick='if(x > 5) return false'>, you will end up
with ---» 5) return false' «--- left over in the stripped text, because the
'>' inside the onclick value is taken as the end of the tag.
 
