Request for comments on HTML tag removal function

Kieran Simkin · Aug 19, 2004

Hi All,

Just writing a quick function to remove HTML tags from a string (array of
chars) and I'd like your comments on my code - anything you'd do differently
or any mistakes etc. I'm still kinda new to C, so I'm not 100% confident
using pointers yet.

Anyway, the algorithm works like this: the loop steps over the string
character by character with two pointers. A toggle variable indicates
whether the 's' pointer is currently within an HTML tag.
If it is, 's' is incremented but 'c' isn't. If 's' isn't inside
an HTML tag, the value pointed to by 'c' is set to the value pointed to by
's' and both are incremented. So basically, the string is being rebuilt in
place, skipping over HTML tags and their contents.

Here's the code, all comments welcome:

void striphtml (char *s) {
    char *c,t=0;
    c=s;
    while (*s!='\0') {
        if (*s=='<') {
            t=1;
        } else if (*s=='>') {
            t=0;
        } else if (!t) {
            *(c++)=*s;
        }
        s++;
    }
    *c='\0';
}
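
A quick way to exercise the function is sketched below. It repeats the code as posted (with comments added) and adds a small check; the name striphtml_demo is made up for illustration and is not part of the original post. The buffer is declared as a char array on purpose: the function rewrites its argument in place, so passing a string literal would be undefined behaviour.

```c
#include <assert.h>
#include <string.h>

/* The function as posted: rebuilds the string in place, dropping
   everything between '<' and '>' (and the brackets themselves). */
void striphtml(char *s) {
    char *c = s;   /* write position */
    char t = 0;    /* 1 while s is inside a tag */
    assert(s != NULL);
    while (*s != '\0') {
        if (*s == '<') {
            t = 1;
        } else if (*s == '>') {
            t = 0;
        } else if (!t) {
            *c++ = *s;
        }
        s++;
    }
    *c = '\0';
}

/* Hypothetical self-check: returns 1 if a sample string is stripped
   to the expected text, 0 otherwise. */
int striphtml_demo(void) {
    char buf[] = "<b>Hello</b>, <i>world</i>!";  /* writable array, not a literal */
    striphtml(buf);
    return strcmp(buf, "Hello, world!") == 0;
}
```

Calling striphtml_demo() from a test program should return 1 for this well-formed input.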

Cheers.

~Kieran Simkin
Digital Crocus
http://digital-crocus.com/

Kieran Simkin · Aug 19, 2004

Chris McDonald said:
In comp.lang.c you write:

Have to worry about the nasty ones, too.
What if an HTML tag appears in an HTML comment?

That's a very good point, I've now made the following modification to my
code:

void striphtml (char *s) {
    char *c,t=0;
    c=s;
    while (*s!='\0') {
        if (*s=='<') {
            t++;
        } else if (*s=='>') {
            t--;
        } else if (t<1) {
            *(c++)=*s;
        }
        s++;
    }
    *c='\0';
}

Now instead of toggling 't' on and off, t becomes a counter for the depth of
nested HTML tags and the string is only copied if the depth is less than
one, ie, we're inside of less than one HTML tag.
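
One thing to watch with the counter: a '>' that appears outside any tag drives 't' below zero, so the next '<' only brings it back to 0 and the tag name is copied into the output. A minimal sketch of one possible guard, assuming the counter is simply never allowed to drop below zero, could look like this. The name striphtml_depth is made up for illustration, and copying the stray '>' through as ordinary text is a design choice of the sketch, not something from the code above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical variant of the depth-counter version: the depth is
   never decremented below zero, and a '>' seen outside any tag is
   treated as ordinary text. */
void striphtml_depth(char *s) {
    char *c = s;     /* write position */
    int depth = 0;   /* number of unclosed '<' seen so far */
    assert(s != NULL);
    while (*s != '\0') {
        if (*s == '<') {
            depth++;
        } else if (*s == '>' && depth > 0) {
            depth--;
        } else if (depth == 0) {
            *c++ = *s;   /* includes a stray '>' outside any tag */
        }
        s++;
    }
    *c = '\0';
}
```

This still treats every '<' as the start of a tag, so it does nothing for a stray '<' in the text.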

Any other comments on this code?

~Kieran

G. S. Hayes · Aug 19, 2004

Kieran Simkin said:
Have to worry about the nasty ones, too.
What if an HTML tag appears in an HTML comment?

[SNIP]
Now instead of toggling 't' on and off, t becomes a counter for the depth of
nested HTML tags and the string is only copied if the depth is less than
one, ie, we're inside of less than one HTML tag.

Does that work? e.g.

 <a>

The <a> isn't in a comment. HTML comments (like C /* */ comments) don't nest.

Right?
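
If comments do not nest, a stripper can treat "<!--" specially and skip to the first "-->" that follows, ignoring any '<' in between. The helper below is only a sketch of that idea: the name skip_comment is made up, and the exact comment syntax is simplified to "starts with <!--, ends at the first -->".

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: if p points at "<!--", return a pointer just
   past the first "-->" that follows (or to the terminating '\0' if
   the comment is never closed). Otherwise return p unchanged. */
const char *skip_comment(const char *p) {
    const char *end;
    assert(p != NULL);
    if (strncmp(p, "<!--", 4) != 0)
        return p;                         /* not a comment start */
    end = strstr(p + 4, "-->");
    return end ? end + 3 : p + strlen(p); /* unterminated: skip to end */
}
```

A stripping loop could call this whenever it sees '<' and continue from the returned pointer, so that tags inside a comment never touch the depth counter.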

Arthur J. O'Dwyer · Aug 19, 2004

Chris McDonald said:
In comp.lang.c you write:

Have to worry about the nasty ones, too.
What if an HTML tag appears in an HTML comment?

That's a very good point, I've now made the following modification to my
code: [...]
Now instead of toggling 't' on and off, t becomes a counter for the depth of
nested HTML tags and the string is only copied if the depth is less than
one, ie, we're inside of less than one HTML tag.

(According to another poster, HTML comments don't nest.)

What about quoted text?

<img src="rewind.png" alt="<<">

What about <pre> tags?

<pre>if (i>0) break; if (i<0) continue;</pre>

And then once you get your code to parse standard HTML, it's still a
good idea to do something semi-sensible with non-standard HTML.

<table border=1 bgcolor=FFFFFF
<tr><td>Hello, world!
</table>

But that's a question for comp.programming or comp.text.html
(if such a group exists; I don't think it does. Oh, well). In fact,
even your original question was kind of OT here---since you weren't
asking "does this code meet the spec," but rather "what kind of spec
should I make up for this code?"

HTH,
-Arthur

Smoker · Aug 19, 2004

You don't handle '<' or '>' characters that are not part of a tag (as in
"x < 7" or "y > 1").

Sure, theoretically there should be a "&lt;" instead of '<', but if
you parse «real world» HTML pages with that function, it will strip
important parts of the text on most pages.

Take another approach: check for '<', then jump into a function that
checks whether it is a tag or not.
If the '<' doesn't open a tag, that function can return 0. If it
does open a tag, the function can return the position of the closing
'>' relative to the position of the '<' (call it the string size of
the tag).
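
A rough sketch of that approach follows. The names tag_length and striphtml_checked are made up for illustration, and the test for "is this a tag?" is an assumption of the sketch: a '<' counts as a tag start only if it is followed by a letter, '/' or '!', and a matching '>' exists. It does not yet deal with '>' inside quoted attribute values.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: p points at a '<'. Returns 0 if it does not
   look like the start of a tag; otherwise returns the offset of the
   closing '>' relative to p. */
size_t tag_length(const char *p) {
    const char *end;
    unsigned char next;
    assert(p != NULL && *p == '<');
    next = (unsigned char)p[1];
    /* Assumption: a tag starts with '<' plus a letter, '/' or '!'. */
    if (!isalpha(next) && next != '/' && next != '!')
        return 0;
    end = strchr(p, '>');
    return end ? (size_t)(end - p) : 0;   /* no '>' at all: not a tag */
}

/* Strip tags in place, copying any '<' that does not open a tag. */
void striphtml_checked(char *s) {
    char *c = s;
    assert(s != NULL);
    while (*s != '\0') {
        size_t n = (*s == '<') ? tag_length(s) : 0;
        if (n > 0) {
            s += n + 1;    /* jump past the closing '>' */
        } else {
            *c++ = *s++;   /* ordinary text, including a lone '<' or '>' */
        }
    }
    *c = '\0';
}
```

With this version, text such as "x < 7" or "y > 1" passes through untouched, because the '<' is followed by a space and a '>' outside a tag is never special.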

Here is another problem you will have to consider. With

<a href="foo.html" onclick='if(x > 5) return false'>

you will end up with the ---» 5) return false' «--- left over in the
stripped text, because the '>' inside the quoted attribute is taken as
the end of the tag.
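
One way to cope with this is to track quotes while scanning for the end of the tag. The helper below is a sketch under simple assumptions: attribute values are delimited by matching single or double quotes, and there are no escapes inside them. The name find_tag_end is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: p points at the '<' that opens a tag.
   Returns a pointer to the '>' that closes it, ignoring any '>'
   inside a quoted attribute value, or NULL if the tag never ends. */
const char *find_tag_end(const char *p) {
    char quote = 0;   /* 0 outside quotes, else the opening quote char */
    assert(p != NULL && *p == '<');
    for (p++; *p != '\0'; p++) {
        if (quote) {
            if (*p == quote)
                quote = 0;        /* closing quote found */
        } else if (*p == '"' || *p == '\'') {
            quote = *p;           /* entering a quoted value */
        } else if (*p == '>') {
            return p;             /* real end of the tag */
        }
    }
    return NULL;
}
```

For the onclick example above, the returned pointer is the final '>' rather than the one after "x", so nothing from the attribute leaks into the output.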
