Counting words in an Html Document

Francois · Oct 12, 2004

Hi,

I'm looking for a way to count the words in an Html Document or String.
Any idea or experience with this?

many thx in advance.
rgds,
Francois

Nathan Zumwalt · Oct 12, 2004

You can use the StringTokenizer class to split Strings into words:

http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html

The HTML document is more problematic in that you'll have to ignore
tags (I'm assuming). I would put the whole thing in the
StringTokenizer, and while looping through the tokens, detect tag
values. You'll have a lot of if statements.

//Nathan
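
For reference, a minimal sketch of what the tokenizer approach described above could look like. It assumes tags are well-formed and that '<' and '>' never appear inside attribute values or text; the class and method names are made up for illustration.

```java
import java.util.StringTokenizer;

class TokenizerWordCount {

    // Counts whitespace-separated tokens that appear outside of <...> tags.
    static int countWords(String html) {
        // returnDelims = true, so '<', '>' and whitespace come back as
        // one-character tokens and we can track whether we are inside a tag.
        StringTokenizer st = new StringTokenizer(html, "<> \t\r\n", true);
        boolean inTag = false;
        int words = 0;
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            if (tok.equals("<")) {
                inTag = true;
            } else if (tok.equals(">")) {
                inTag = false;
            } else if (!inTag && !tok.trim().isEmpty()) {
                words++; // a non-whitespace token outside any tag
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(countWords("<p>Hello <b>big</b> world</p>"));
    }
}
```

Only one flag is needed here, but every extra case (comments, scripts, quoted attributes) would add another branch, which is the "lot of if statements" mentioned above.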

Paul Lutus · Oct 12, 2004

Nathan said:
You can use the StringTokenizer class to split Strings into words:

http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html

The HTML document is more problematic in that you'll have to ignore
tags (I'm assuming). I would put the whole thing in the
StringTokenizer, and while looping through the tokens, detect tag
values. You'll have a lot of if statements.

Please do not recommend the use of StringTokenizer in a case like this.
StringTokenizer is historically a very unreliable approach to ... well ...
tokenizing strings, or anything else.

In any case, this approach is way too complex. Instead, how about:

private static int countWordsInHTML(String s)
{
    s = s.replaceAll("<.+?>","");
    String[] array = s.split("\\s+");
    return array.length;
}

As embodied here:

import java.io.*;

public class CountHTMLWords {

    // Read the whole file into one String, one line at a time.
    private static String readFile(String s) throws IOException
    {
        StringBuilder sb = new StringBuilder();
        String lineEnding = System.getProperty("line.separator");
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader(s))) {
            String line;
            while((line = br.readLine()) != null) {
                sb.append(line).append(lineEnding);
            }
        }
        return sb.toString();
    }

    // Remove anything that looks like a tag, then count the
    // whitespace-separated pieces that remain.
    private static int countWordsInHTML(String s)
    {
        s = s.replaceAll("<.+?>","");
        String[] array = s.split("\\s+");
        return array.length;
    }

    static public void main(String[] args) throws Exception
    {
        if(args.length == 0) {
            System.out.println("usage: file path.");
        }
        else {
            String s = readFile(args[0]);
            int w = countWordsInHTML(s);
            System.out.println(w + " words in file " + args[0]);
        }
    }
}

Lasse Reichstein Nielsen · Oct 12, 2004

Paul Lutus said:
In any case, this approach is way too complex. Instead, how about:

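private static int countWordsInHTML(String s)
{
    s = s.replaceAll("<.+?>","");

To take care of cases such as "Hello<br>world", the replacement string
should be something like " ".

    String[] array = s.split("\\s+");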

I don't know the original poster's definition of a "word", but if,
e.g., "he/she" should be two words, then splitting on whitespace is
not enough. Perhaps something like "[^\\w]+" would be better (but not
great, since \w only matches English letters and digits, not letters
from other alphabets, which are part of words).

So, a clear definition of "word" is necessary before we can evaluate
the adequacy of a solution.
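
To make the point about definitions concrete, here is a small sketch that counts the non-empty pieces produced by different delimiter patterns. The class and method names are hypothetical, and the third pattern (Unicode letters and digits) is just one possible definition, not one anybody in this thread settled on.

```java
class WordDefinitions {

    // Splits on the given delimiter regex and counts the non-empty pieces.
    // Skipping empty pieces avoids counting the empty string that split()
    // returns when the text starts with a delimiter.
    static int countPieces(String text, String delimiterRegex) {
        int n = 0;
        for (String piece : text.split(delimiterRegex)) {
            if (!piece.isEmpty()) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String english = "he/she went";
        String danish = "Bl\u00e5b\u00e6r"; // "Blåbær", written with escapes

        // Whitespace only: "he/she" stays one word.
        System.out.println(countPieces(english, "\\s+"));
        // Non-word characters: "he/she" becomes two words ...
        System.out.println(countPieces(english, "[^\\w]+"));
        // ... but by default \w is ASCII-only, so the Danish word falls apart.
        System.out.println(countPieces(danish, "[^\\w]+"));
        // Unicode letters and digits keep it in one piece.
        System.out.println(countPieces(danish, "[^\\p{L}\\p{N}]+"));
    }
}
```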

Also, since the input is HTML, one should also consider entities.
Which is hard. E.g., "Me&amp;You" is two words, but "Bl&aring;b&aelig;r"
is one. The best would probably be to convert all entities to their
Unicode character first (but after removing tags, so &lt; won't cause
confusion). That requires a list of all the textual entities (&amp;,
&aring;), whereas the numerical ones are easier (&#65;).
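
A rough sketch of the entity step, assuming it runs after tag removal. Numeric entities are decoded generically; named entities need a lookup table, and the one below holds only a handful of entries for illustration (a real table would be much longer). All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityDecoder {

    // Deliberately tiny table; the full HTML list is far longer.
    private static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00a0");
        NAMED.put("aring", "\u00e5");
        NAMED.put("aelig", "\u00e6");
    }

    // Matches &#65; (decimal), &#x41; (hex) and &name; forms.
    private static final Pattern ENTITY =
        Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // Turns a code point written in the given radix into a String, or
    // returns the original entity text if the number is not a valid code point.
    private static String fromCodePoint(String digits, int radix, String original) {
        try {
            return new String(Character.toChars(Integer.parseInt(digits, radix)));
        } catch (IllegalArgumentException e) { // also covers NumberFormatException
            return original;
        }
    }

    static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(body.substring(1), 10, m.group());
            } else {
                String known = NAMED.get(body);
                replacement = (known != null) ? known : m.group(); // unknown: leave as-is
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Whether the decoded "&" then separates two words still depends on the word definition: a whitespace split keeps "Me&You" together, a letter-based match does not.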

Such a simple looking question

/L

Paul Lutus · Oct 12, 2004

Lasse said:
To take care of cases such as "Hello<br>world", the replacement string
should be something like " ".

Yes, thank you, very good suggestion. This does nothing but good, there are
no drawbacks (because I split on "\\s+").

String[] array = s.split("\\s+");

I don't know the original poster's definition of a "word", but if,
e.g., "he/she" should be two words, then splitting on whitespace is
not enough. Perhaps something like "[^\\w]+" would be better (but not
great, since \w only matches English letters and digits, not letters
from other alphabets, which are part of words).

Then there is the issue of what defines a word, and according to whom. Is
"and/or" a word? A lawyer might insist that it is indeed a word, others
might insist it is two words.

So, a clear definition of "word" is necessary before we can evaluate
the adequacy of a solution.

Yes.

Also, since the input is HTML, one should also consider entities.
Which is hard. E.g., "Me&amp;You" is two words, but "Bl&aring;b&aelig;r"
is one. The best would probably be to convert all entities to their
Unicode character first (but after removing tags, so &lt; won't cause
confusion). That requires a list of all the textual entities (&amp;,
&aring;), whereas the numerical ones are easier (&#65;).

Yes, but only because we are counting words, not using the resulting text.
Some additional complex examples:

&nbsp; a space, not a word or part of a word.
&quot; a punctuation mark, maybe part of a word, maybe not.
&copy; a word. Or not?

And so forth.

Such a simple looking question

A conceptual onion, with many layers.

In any case, and disregarding most of that, here is the corrected method:

private static int countWordsInHTML(String s)
{
    s = s.replaceAll("<.+?>"," ");
    String[] array = s.split("\\s+");
    return array.length;
}
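
One edge case worth checking in the method above: String.split keeps a leading empty string when the input starts with a delimiter, and returns a one-element array for an empty input. Since most documents begin with a tag that is now replaced by a space, the count can come out one too high. A small sketch comparing the posted method with a trimmed variant (the class and method names here are made up):

```java
class SplitEdgeCases {

    // The method as posted above.
    static int countAsPosted(String s) {
        s = s.replaceAll("<.+?>", " ");
        String[] array = s.split("\\s+");
        return array.length;
    }

    // Same idea, but trims first and treats an empty result as zero words.
    static int countTrimmed(String s) {
        s = s.replaceAll("<.+?>", " ").trim();
        return s.isEmpty() ? 0 : s.split("\\s+").length;
    }

    public static void main(String[] args) {
        String html = "<p>Hello world</p>";
        System.out.println(countAsPosted(html)); // counts the leading empty piece too
        System.out.println(countTrimmed(html));
    }
}
```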

Lasse Reichstein Nielsen · Oct 12, 2004

Paul Lutus said:
In any case, and disregarding most of that, here is the corrected method: ....
String[] array = s.split("\\s+");
return array.length;

It looks like an awfully expensive operation to create the array when
all you need is the length. So, with that as an excuse to read up on regex,
I *think* this would be more efficient (but definitely not shorter):
---
// match sequences of the Unicode letter category
// To give same result as above, use "[^\\s]+"
Pattern pattern = Pattern.compile("\\p{L}+");
Matcher matcher = pattern.matcher(s);
int words = 0;
while (matcher.find()) {
    words++;
}
return words;
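
For anyone who wants to try the fragment above, a self-contained version might look like this. The wrapper class and method names are assumptions; the two patterns are the ones mentioned in the comments above. Whether it is actually faster than split has not been measured here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherWordCount {

    // Runs of Unicode letters: digits and punctuation never count as words.
    static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Runs of non-whitespace: same notion of "word" as split("\\s+").
    static final Pattern NON_SPACE = Pattern.compile("[^\\s]+");

    // Counts matches without building an array of the pieces.
    static int countMatches(Pattern pattern, String s) {
        Matcher matcher = pattern.matcher(s);
        int words = 0;
        while (matcher.find()) {
            words++;
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "he/she went home";
        System.out.println(countMatches(LETTERS, text));
        System.out.println(countMatches(NON_SPACE, text));
    }
}
```

Counting matches also sidesteps the empty-string results that split can produce for empty input or leading whitespace.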

David Hilsee · Oct 12, 2004

Lasse Reichstein Nielsen said:
Paul Lutus said:
In any case, and disregarding most of that, here is the corrected method:
...
String[] array = s.split("\\s+");
return array.length;

It looks like an awfully expensive operation to create the array when
all you need is the length. So, with that as an excuse to read up on regex,
I *think* this would be more efficient (but definitely not shorter):
---
// match sequences of the Unicode letter category
// To give same result as above, use "[^\\s]+"
Pattern pattern = Pattern.compile("\\p{L}+");
Matcher matcher = pattern.matcher(s);
int words = 0;
while (matcher.find()) {
    words++;
}
return words;

HTML is hard (impossible?) to parse using regular expressions. See
http://www.perldoc.com/perl5.8.4/pod/perlfaq9.html#How-do-I-remove-HTML-from-a-string-
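
A few inputs where the "<.+?>" replacement used earlier in this thread goes wrong may make this concrete. This is only a sketch with made-up names; the "(?s)" variant is one possible tweak and fixes just the first case.

```java
class RegexStripLimits {

    // The tag-stripping step as used earlier in the thread.
    static String strip(String html) {
        return html.replaceAll("<.+?>", " ");
    }

    // Variant with the DOTALL flag so that '.' also matches line breaks.
    static String stripDotAll(String html) {
        return html.replaceAll("(?s)<.+?>", " ");
    }

    public static void main(String[] args) {
        // 1. A tag broken across lines: '.' does not match '\n' by default,
        //    so the opening tag is left in the text.
        System.out.println(strip("<a\nhref='x'>link</a>"));

        // 2. A '>' inside an attribute value ends the match too early,
        //    leaving the tail of the tag behind as text.
        System.out.println(strip("<img alt=\"a > b\">"));

        // 3. Script content is not markup, so it survives and would be
        //    counted as words.
        System.out.println(strip("<script>var x = 1;</script>"));
    }
}
```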

Paul Lutus · Oct 12, 2004

Lasse said:
Paul Lutus said:
In any case, and disregarding most of that, here is the corrected method: ...
String[] array = s.split("\\s+");
return array.length;

It looks like an awfully expensive operation to create the array when
all you need is the length.

Yes, its only virtue is its brevity in the source file, nothing else.

So, with that as an excuse to read up on
regex, I *think* this would be more efficient (but definitely not shorter):
---
// match sequences of the Unicode letter category
// To give same result as above, use "[^\\s]+"
Pattern pattern = Pattern.compile("\\p{L}+");
Matcher matcher = pattern.matcher(s);
int words = 0;
while (matcher.find()) {
    words++;
}
return words;

Did you measure the speedup or assume it? That aside, your method has a lot
to say in its favor, while mine mostly reflects my well-known laziness.

Will Hartung · Oct 13, 2004

David Hilsee said:
HTML is hard (impossible?) to parse using regular expressions. See

http://www.perldoc.com/perl5.8.4/pod/perlfaq9.html#How-do-I-remove-HTML-from-a-string-

But it's downright trivial to parse normally, especially when you simply
don't care about the content of the tags. When I had this problem, that's
basically what I did: write a simple parser. This was quite fortunate as
later I had to tweak it to be conscious of JSP scriptlets....it wasn't
perfect but suitable for the task (which at the time was to let us write
both HTML and plain text emails).
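
A hand-written scanner of the kind described here might look roughly like the sketch below. It is not the code from that project, just an illustration with hypothetical names: a single pass with a flag for "inside a tag" and one for "inside a quoted attribute value". Comments, script blocks and stray '<' characters in text are not handled.

```java
class SimpleHtmlWordCounter {

    // Counts whitespace-separated words outside of tags in one pass.
    static int countWords(String html) {
        boolean inTag = false;   // between '<' and the closing '>'
        char quote = 0;          // quote character while inside an attribute value
        boolean inWord = false;  // currently inside a run of text characters
        int words = 0;

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (quote != 0) {
                    if (c == quote) {
                        quote = 0;          // attribute value ends
                    }
                } else if (c == '"' || c == '\'') {
                    quote = c;              // attribute value starts
                } else if (c == '>') {
                    inTag = false;          // tag ends
                }
                continue;
            }
            if (c == '<') {
                inTag = true;
                inWord = false;             // a tag also ends the current word
            } else if (Character.isWhitespace(c)) {
                inWord = false;
            } else if (!inWord) {
                inWord = true;
                words++;                    // first character of a new word
            }
        }
        return words;
    }
}
```

Because the state lives in plain variables, adding a new case (say, skipping scriptlets) means adding a branch instead of rewriting a pattern.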

Regards,

Will Hartung

David Hilsee · Oct 14, 2004

Will Hartung said:
http://www.perldoc.com/perl5.8.4/pod/perlfaq9.html#How-do-I-remove-HTML-from-a-string-

But it's downright trivial to parse normally, especially when you simply
don't care about the content of the tags. When I had this problem, that's
basically what I did: write a simple parser. This was quite fortunate as
later I had to tweak it to be conscious of JSP scriptlets....it wasn't
perfect but suitable for the task (which at the time was to let us write
both HTML and plain text emails).

Well, sure, regexes work in simpler cases. I just wanted to point out that
HTML can be difficult to parse if you want to be able to handle a wide variety of
inputs, even if you don't care about the content of the tags. At best, a
regex solution is incomplete, and at worst, it is a hack.

Chris Uppal · Oct 14, 2004

David said:
At best, a regex solution is incomplete, and at worst, it is a hack.

I increasingly suspect that this is true of all, or nearly all, use of regexps.
I mean the cases where a regexp is hardwired (more-or-less) into the code;
they're undoubtedly valuable in applications where a user can enter a regexp to
control/configure something.

Certainly I can't think of any time that I've ever used them in "production"
(not throwaway) code, when I haven't eventually either removed them or
regretted that I couldn't. And just about every use of them I've read in this
newsgroup has been suspicious (either as an over-complicated tool for a simple
task, or a hacky approximation to a difficult task).

(BTW, as an old Unix hacker, with a particular fondness for awk and sed, I'm
happy to /use/ quite complicated regexps; it's just that I'm suspicious of
their ultimate value.)

-- chris

Sharp · Oct 14, 2004

Francois said:
Hi,

I'm looking for a way to count the words in an Html Document or String.
Any idea or experience with this?

many thx in advance.
rgds,
Francois

1. Use the Swing HTML parser to extract just the text from an HTML document
(this will ignore tags) to form a string.
2. Use the String split function to split the string at white space - this will
return a String array.
3. Use the length field of the array to get the word count.

Note: StringTokenizer is no longer in vogue, and the String split function is
recommended instead. Regex is another alternative, but it's overkill for
such a simple problem.
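
The three steps could be sketched like this, using ParserDelegator and HTMLEditorKit.ParserCallback from the Swing text packages. The class name is made up, and the trailing space appended after each text chunk is my own assumption to keep adjacent chunks from running together; it also means a word interrupted by an inline tag counts as two.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingHtmlWordCount {

    static int countWords(String html) {
        final StringBuilder text = new StringBuilder();

        // Step 1: collect only the text chunks the parser reports.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };
        try {
            // third argument: ignore any charset declared in the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // not expected when reading from a String
            throw new IllegalStateException(e);
        }

        // Steps 2 and 3: split on white space and take the array length.
        String trimmed = text.toString().trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello world</p><p>again</p></body></html>";
        System.out.println(countWords(html));
    }
}
```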

Regards
Sharp

Paul Lutus · Oct 14, 2004

Chris said:
I increasingly suspect that this is true of all, or nearly all, use of
regexps. I mean the cases where a regexp is hardwired (more-or-less) into
the code; they're undoubtedly valuable in applications where a user can
enter a regexp to control/configure something.

Certainly I can't think of any time that I've ever used them in
"production" (not throwaway) code, when I haven't eventually either
removed them or regretted that I couldn't. And just about every use of
them I've read in this newsgroup has been suspicious (either as an
over-complicated tool for a simple task, or a hacky approximation to a
difficult task).

(BTW, as an old Unix hacker, with a particular fondness for awk and sed,
I'm happy to /use/ quite complicated regexps; it's just that I'm
suspicious of their ultimate value.)

Well, they (regular expressions) do have the property that they can
accomplish a lot using a small initial statement. The problem often comes
up in correctly evaluating that initial statement and seeing its
limitations and side effects.

For example, my regexp in this thread would parse HTML after a fashion, but
it makes some unrealistically simplifying assumptions about the content of
a typical Web page.

Will Hartung · Oct 14, 2004

Chris Uppal said:
I increasingly suspect that this is true of all, or nearly all, use of regexps.
I mean the cases where a regexp is hardwired (more-or-less) into the code;
they're undoubtedly valuable in applications where a user can enter a regexp to
control/configure something.

Certainly I can't think of any time that I've ever used them in "production"
(not throwaway) code, when I haven't eventually either removed them or
regretted that I couldn't. And just about every use of them I've read in this
newsgroup has been suspicious (either as an over-complicated tool for a simple
task, or a hacky approximation to a difficult task).

(BTW, as an old Unix hacker, with a particular fondness for awk and sed, I'm
happy to /use/ quite complicated regexps; it's just that I'm suspicious of
their ultimate value.)

Regex's offer a LOT of Bang for the Buck, but they are like any other good
abstraction -- implicitly limited in the long run. Since regex's aren't
extensible, they can only take a task so far before they run out of
capability. Note, I'm not picking on regex's here; this goes for pretty
much any abstraction: as you go to a higher level, they all have to
compromise something somewhere.

The real issue though is that regex's are considered a silver bullet, and
they also promote a style of development that is not easy to change when it
does break.

Since the regex is not only the grammar, but also the scanner, when your
grammar fails, you lose the scanner as well. If you were writing your own
lexer/parser, they are much more robust and easier to change. If you find a
bug in your lexer/parser, you typically don't have to throw the entire thing
away to fix it. If you rely on regex's, it's very possible that you will run
into some wall that requires you to toss the entire thing out and start from
scratch.

Trivial case: Comma Separated Values.

"1, 2, 3".split(",")

Ok.

Now, what about:
"1, \"This is a test, neet eh?\", 3" ?

Baby, bathwater, out the window.
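
To spell the CSV example out: a plain split(",") cuts the quoted field in two, while a small hand-written scanner can track whether it is inside quotes. The sketch below uses made-up names and assumes the common convention that a doubled quote inside a quoted field stands for a literal quote; it also trims each field, which may not suit every use.

```java
import java.util.ArrayList;
import java.util.List;

class CsvLine {

    // Splits one CSV line into fields, honouring double-quoted fields.
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        cur.append('"');   // doubled quote: literal quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    cur.append(c);         // commas in here are just text
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(cur.toString().trim());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString().trim()); // last field
        return fields;
    }

    public static void main(String[] args) {
        String line = "1, \"This is a test, neet eh?\", 3";
        System.out.println(line.split(",").length); // the naive split
        System.out.println(parse(line));            // the scanner
    }
}
```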

All that said, I use them all the time as well in scripts. But pretty much
almost never in code.

Regards,

Will Hartung
