Counting words in an HTML Document

Francois

Hi,

I'm looking for a way to count the words in an HTML document or string.
Does anyone have an idea or experience with this?

Many thanks in advance.
Regards,
Francois
 
Paul Lutus

Nathan said:
You can use the StringTokenizer class to split Strings into words:

http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html

The HTML document is more problematic in that you'll have to ignore
tags (I'm assuming). I would put the whole thing in the
StringTokenizer, and while looping through the tokens, detect tag
values. You'll have a lot of if statements.

Please do not recommend the use of StringTokenizer in a case like this.
StringTokenizer is historically a very unreliable approach to ... well ...
tokenizing strings, or anything else.

In any case, this approach is way too complex. Instead, how about:

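private static int countWordsInHTML(String s)
{
    // Strip the tags; (?s) lets '.' also match line breaks inside a tag.
    s = s.replaceAll("(?s)<.+?>","").trim();
    // Without this check, splitting an empty string would report one word.
    if(s.isEmpty()) return 0;
    String[] array = s.split("\\s+");
    return array.length;
}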

As embodied here:

import java.io.*;

public class CountHTMLWords {

    // Read the whole file into one string.
    private static String readFile(String s) throws Exception
    {
        StringBuffer sb = new StringBuffer();
        String lineEnding = System.getProperty("line.separator");
        BufferedReader br = new BufferedReader(new FileReader(s));
        try {
            String line;
            while((line = br.readLine()) != null) {
                sb.append(line).append(lineEnding);
            }
        }
        finally {
            br.close(); // release the file even if reading fails
        }
        return sb.toString();
    }

    // Strip the tags, then count whitespace-separated tokens.
    private static int countWordsInHTML(String s)
    {
        // (?s) lets '.' also match line breaks inside a tag.
        s = s.replaceAll("(?s)<.+?>","").trim();
        // Without this check, splitting an empty string would report one word.
        if(s.isEmpty()) return 0;
        String[] array = s.split("\\s+");
        return array.length;
    }

    static public void main(String[] args) throws Exception
    {
        if(args.length == 0) {
            System.out.println("usage: file path.");
        }
        else {
            String s = readFile(args[0]);
            int w = countWordsInHTML(s);
            System.out.println(w + " words in file " + args[0]);
        }
    }
}
 
Lasse Reichstein Nielsen

Paul Lutus said:
In any case, this approach is way too complex. Instead, how about:

private static int countWordsInHTML(String s)
{
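s = s.replaceAll("<.+?>","");

To take care of cases such as "Hello<br>world", the replacement string
should be something like " ".

String[] array = s.split("\\s+");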

I don't know the original poster's definition of a "word", but if,
e.g., "he/she" should be two words, then splitting on whitespace is
not enough. Perhaps something like "[^\\w]+" would be better (but not
great, since \w only matches English letters and digits, not letters
from other alphabets, which are part of words).

So, a clear definition of "word" is necessary before we can evaluate
the adequacy of a solution :)
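
To make the difference concrete, here is a minimal sketch that counts the same text under three separator definitions. The class and method names (WordDefinitionDemo, countTokens) are made up for illustration, and the third pattern, based on Unicode letter and digit categories, is only one possible choice, not something settled in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class WordDefinitionDemo {

    // Count the non-empty pieces left after splitting on the separator regex.
    static int countTokens(String text, String separatorRegex) {
        int count = 0;
        for (String token : Pattern.compile(separatorRegex).split(text)) {
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "he/she said so";
        System.out.println(countTokens(text, "\\s+"));     // 3: "he/she" stays one token
        System.out.println(countTokens(text, "[^\\w]+"));  // 4: '/' now separates words

        String danish = "Bl\u00e5b\u00e6r";                // "Blåbær", written with escapes
        System.out.println(countTokens(danish, "[^\\w]+"));             // 3: non-ASCII letters act as separators
        System.out.println(countTokens(danish, "[^\\p{L}\\p{N}]+"));    // 1: Unicode letters are kept
    }
}
```

Which of these counts is "right" still depends on the definition of a word that the original poster has in mind.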

Also, since the input is HTML, one should also consider entities.
Which is hard. E.g., "Me&amp;You" is two words, but "Bl&aring;b&aelig;r"
is one. The best would probably be to convert all entities to their
Unicode characters first (but after removing tags, so &lt; won't cause
confusion). That requires a list of all the textual entities (&amp;,
&aring;), whereas the numerical ones are easier (e.g. &#65; for "A").
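
A rough sketch of that decoding step might look like the following. It is an illustration under stated assumptions, not a complete decoder: the class name EntityDecoderSketch and its tiny table of named entities are invented here, a real table would need every named entity HTML defines, and mapping &nbsp; to an ordinary space is a shortcut that only makes sense for word counting.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; run it on text from which the tags are already removed.
class EntityDecoderSketch {

    // Deliberately tiny table, just enough for the examples in this thread.
    private static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", " ");        // assumption: treat as a plain space
        NAMED.put("aring", "\u00e5");
        NAMED.put("aelig", "\u00e6");
    }

    // Decimal (&#65;), hexadecimal (&#x41;) or named (&amp;) references.
    private static final Pattern ENTITY =
        Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // Turn a numeric reference into its character, or keep the original text
    // if the number is out of range or not a valid code point.
    private static String fromCodePoint(String digits, int radix, String original) {
        try {
            return new String(Character.toChars(Integer.parseInt(digits, radix)));
        } catch (IllegalArgumentException e) { // also covers NumberFormatException
            return original;
        }
    }

    static String decodeEntities(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(body.substring(1), 10, m.group());
            } else {
                String known = NAMED.get(body);
                replacement = (known != null) ? known : m.group(); // unknown names stay as-is
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this, decodeEntities("Me&amp;You") gives "Me&You", which a letter-based word pattern would then see as two words, while "Bl&aring;b&aelig;r" becomes a single run of letters.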

Such a simple looking question :)
/L
 
Paul Lutus

Lasse said:
To take care of cases such as "Hello<br>world", the replacement string
should be something like " ".

Yes, thank you, very good suggestion. This does nothing but good; there are
no drawbacks (because I split on "\\s+").

String[] array = s.split("\\s+");

I don't know the original poster's definition of a "word", but if,
e.g., "he/she" should be two words, then splitting on whitespace is
not enough. Perhaps something like "[^\\w]+" would be better (but not
great, since \w only matches English letters and digits, not letters
from other alphabets, which are part of words).

Then there is the issue of what defines a word, and according to whom. Is
"and/or" a word? A lawyer might insist that it is indeed a word, others
might insist it is two words.

So, a clear definition of "word" is necessary before we can evaluate
the adequacy of a solution :)

Yes.


Also, since the input is HTML, one should also consider entities.
Which is hard. E.g., "Me&amp;You" is two words, but "Bl&aring;b&aelig;r"
is one. The best would probably be to convert all entities to their
Unicode characters first (but after removing tags, so &lt; won't cause
confusion). That requires a list of all the textual entities (&amp;,
&aring;), whereas the numerical ones are easier (e.g. &#65; for "A").

Yes, but only because we are counting words, not using the resulting text.
Some additional complex examples:

&nbsp; a space, not a word or part of a word.
&quot; a punctuation mark, maybe part of a word, maybe not.
&copy; a word. Or not?

And so forth.

Such a simple looking question :)

A conceptual onion, with many layers. :)

In any case, and disregarding most of that, here is the corrected method:

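private static int countWordsInHTML(String s)
{
    // Replace each tag with a space so "Hello<br>world" stays two words;
    // (?s) lets '.' also match line breaks inside a tag.
    s = s.replaceAll("(?s)<.+?>"," ").trim();
    // trim() avoids an empty leading token; an empty string has no words.
    if(s.isEmpty()) return 0;
    String[] array = s.split("\\s+");
    return array.length;
}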
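
For anyone following along, here is a small self-contained sketch of why the replacement string matters. The class TagReplacementDemo and its countWords method are invented for the demonstration. The trim() call and the empty-string check are additions in this sketch: without the trim, a document that starts with a tag would produce an empty leading token and be counted one word too high.

```java
// Hypothetical demo class; the names are illustrative only.
class TagReplacementDemo {

    // Strip tags using the given replacement string, then count tokens.
    static int countWords(String html, String replacement) {
        // (?s) lets '.' match line breaks, so tags spanning lines are removed too.
        String text = html.replaceAll("(?s)<.+?>", replacement).trim();
        if (text.isEmpty()) {
            return 0; // "".split(...) would otherwise yield one empty token
        }
        return text.split("\\s+").length;
    }

    public static void main(String[] args) {
        String html = "Hello<br>world";
        System.out.println(countWords(html, ""));   // 1: the words are glued together
        System.out.println(countWords(html, " "));  // 2: the tag acts as a separator
        System.out.println(countWords("<p>one two</p>", " ")); // 2
    }
}
```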
 
Lasse Reichstein Nielsen

Paul Lutus said:
In any case, and disregarding most of that, here is the corrected method: ....
String[] array = s.split("\\s+");
return array.length;

It looks like an awfully expensive operation to create the array when
all you need is the length. So, with that as an excuse to read up on regex,
I *think* this would be more efficient (but definitely not shorter):
---
// needs java.util.regex.Pattern and java.util.regex.Matcher
// match sequences of the Unicode letter category
// To give same result as above, use "[^\\s]+"
Pattern pattern = Pattern.compile("\\p{L}+");
Matcher matcher = pattern.matcher(s);
int words = 0;
while (matcher.find()) {
    words++;
}
return words;
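
Wrapped up as something that compiles on its own, the fragment above could look like this. The class name MatcherWordCount and the two constants are invented for the sketch, and whether it really is faster than split() is not something this example measures.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the Matcher loop; names are illustrative only.
class MatcherWordCount {

    // Runs of Unicode letters.
    static final Pattern LETTERS = Pattern.compile("\\p{L}+");
    // Runs of non-whitespace, which matches what split("\\s+") would count.
    static final Pattern NON_SPACE = Pattern.compile("\\S+");

    // Count matches without building an array of the matched strings.
    static int count(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int words = 0;
        while (matcher.find()) {
            words++;
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "Hello, world 42";
        System.out.println(count(LETTERS, text));   // 2: "42" contains no letters
        System.out.println(count(NON_SPACE, text)); // 3
    }
}
```

Note that the letter-based pattern splits "don't" into two matches, so the two patterns embody different definitions of a word.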
 
David Hilsee

Lasse Reichstein Nielsen said:
Paul Lutus said:
In any case, and disregarding most of that, here is the corrected
method:
...
String[] array = s.split("\\s+");
return array.length;

It looks like an awfully expensive operation to create the array when
all you need is the length. So, with that as an excuse to read up on regex,
I *think* this would be more efficient (but definitely not shorter):
---
// match sequences of the Unicode letter category
// To give same result as above, use "[^\\s]+"
Pattern pattern = Pattern.compile("\\p{L}+");
Matcher matcher = pattern.matcher(s);
int words = 0;
while (matcher.find()) {
words ++;
}
return words;

HTML is hard (impossible?) to parse using regular expressions. See
http://www.perldoc.com/perl5.8.4/pod/perlfaq9.html#How-do-I-remove-HTML-from-a-string-
 
Paul Lutus

Lasse said:
Paul Lutus said:
In any case, and disregarding most of that, here is the corrected method: ...
String[] array = s.split("\\s+");
return array.length;

It looks like an awfully expensive operation to create the array when
all you need is the length.

Yes, its only virtue is its brevity in the source file, nothing else.

So, with that as an excuse to read up on
regex, I *think* this would be more efficient (but definitely not shorter):
---
// match sequences of the Unicode letter category
// To give same result as above, use "[^\\s]+"
Pattern pattern = Pattern.compile("\\p{L}+");
Matcher matcher = pattern.matcher(s);
int words = 0;
while (matcher.find()) {
words ++;
}
return words;

Did you measure the speedup or assume it? That aside, your method has a lot
to say in its favor, while mine mostly reflects my well-known laziness. :)
 
Will Hartung

David Hilsee said:
HTML is hard (impossible?) to parse using regular expressions. See
http://www.perldoc.com/perl5.8.4/pod/perlfaq9.html#How-do-I-remove-HTML-from-a-string-

But it's downright trivial to parse normally, especially when you simply
don't care about the content of the tags. When I had this problem, that's
basically what I did: write a simple parser. This was quite fortunate, as
later I had to tweak it to be conscious of JSP scriptlets... it wasn't
perfect but suitable for the task (which at the time was to let us write
both HTML and plain text emails).
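
For illustration only, a scanner along those lines can be sketched in a few lines. This is not the code described above, just a minimal example under simplifying assumptions: the class name TagSkippingCounter is invented, a word is any run of non-whitespace outside a tag, and comments, script/style content, entities and a '>' inside an attribute value are all ignored.

```java
// Hypothetical hand-written scanner; names and rules are illustrative only.
class TagSkippingCounter {

    static int countWords(String html) {
        int words = 0;
        boolean inTag = false;   // currently between '<' and '>'
        boolean inWord = false;  // currently inside a run of non-whitespace text

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                }
            } else if (c == '<') {
                inTag = true;
                inWord = false;  // a tag ends the current word, like replacing it with " "
            } else if (Character.isWhitespace(c)) {
                inWord = false;
            } else if (!inWord) {
                inWord = true;   // first character of a new word
                words++;
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Hello<br>world"));          // 2
        System.out.println(countWords("<a\nhref='x'>link</a> text")); // 2
    }
}
```

The appeal of this shape is that each new requirement (say, skipping scriptlets) becomes another state or branch rather than a rewrite of one pattern.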

Regards,

Will Hartung
 
David Hilsee

Will Hartung said:
http://www.perldoc.com/perl5.8.4/pod/perlfaq9.html#How-do-I-remove-HTML-from-a-string-

But it's downright trivial to parse normally, especially when you simply
don't care about the content of the tags. When I had this problem, that's
basically what I did: write a simple parser. This was quite fortunate, as
later I had to tweak it to be conscious of JSP scriptlets... it wasn't
perfect but suitable for the task (which at the time was to let us write
both HTML and plain text emails).

Well, sure, regexes work in simpler cases. I just wanted to point out that
HTML can be difficult to parse if you want to handle a wide variety of
inputs, even if you don't care about the content of the tags. At best, a
regex solution is incomplete, and at worst, it is a hack.
 
Chris Uppal

David said:
At best, a regex solution is incomplete, and at worst, it is a hack.

I increasingly suspect that this is true of all, or nearly all, use of regexps.
I mean the cases where a regexp is hardwired (more-or-less) into the code;
they're undoubtedly valuable in applications where a user can enter a regexp to
control/configure something.

Certainly I can't think of any time that I've ever used them in "production"
(not throwaway) code, when I haven't eventually either removed them or
regretted that I couldn't. And just about every use of them I've read in this
newsgroup has been suspicious (either as an over-complicated tool for a simple
task, or a hacky approximation to a difficult task).

(BTW, as an old Unix hacker, with a particular fondness for awk and sed, I'm
happy to /use/ quite complicated regexps; it's just that I'm suspicious of
their ultimate value.)

-- chris
 
Sharp

Hi,

I'm looking for a way to count the words in an HTML document or string.
Does anyone have an idea or experience with this?

Many thanks in advance.
Regards,
Francois

1. Use the Swing HTML parser to extract just the text from an HTML document
(this will ignore tags) to form a string.
2. Use the String split function to split the string at white space - this
will return a string array.
3. Use the length field of the array to get the count of words.

Note: StringTokenizer is no longer in vogue; the String split function is
recommended instead. Regex is another alternative, but it's overkill for
such a simple problem.
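
The three steps above can be sketched roughly as follows. The class name SwingParserWordCount is invented for the example. It relies on the standard javax.swing.text.html.parser.ParserDelegator and HTMLEditorKit.ParserCallback classes, and it appends a space after each text chunk the parser reports, which is a simplification: a word interrupted by an inline tag such as "foo<b>bar</b>" would be counted as two.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class; the names are illustrative only.
class SwingParserWordCount {

    static int countWords(String html) {
        final StringBuilder text = new StringBuilder();

        // Step 1: collect only the text the parser reports, ignoring tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };
        try {
            // 'true' tells the parser to ignore charset declarations in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory reader
        }

        // Steps 2 and 3: split at white space and take the array length.
        String trimmed = text.toString().trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("<html><body><p>Hello<br>world</p></body></html>"));
    }
}
```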

Regards
Sharp
 
Paul Lutus

Chris said:
I increasingly suspect that this is true of all, or nearly all, use of
regexps. I mean the cases where a regexp is hardwired (more-or-less) into
the code; they're undoubtedly valuable in applications where a user can
enter a regexp to control/configure something.

Certainly I can't think of any time that I've ever used them in
"production" (not throwaway) code, when I haven't eventually either
removed them or
regretted that I couldn't. And just about every use of them I've read in
this newsgroup has been suspicious (either as an over-complicated tool for
a simple task, or a hacky approximation to a difficult task).

(BTW, as an old Unix hacker, with a particular fondness for awk and sed,
I'm happy to /use/ quite complicated regexps; it's just that I'm
suspicious of their ultimate value.)

Well, they (regular expressions) do have the property that they can
accomplish a lot using a small initial statement. The problem often comes
up in correctly evaluating that initial statement and seeing its
limitations and side effects.

For example, my regexp in this thread would parse HTML after a fashion, but
it makes some unrealistically simplifying assumptions about the content of
a typical Web page.
 
Will Hartung

Chris Uppal said:
I increasingly suspect that this is true of all, or nearly all, use of regexps.
I mean the cases where a regexp is hardwired (more-or-less) into the code;
they're undoubtedly valuable in applications where a user can enter a regexp to
control/configure something.

Certainly I can't think of any time that I've ever used them in "production"
(not throwaway) code, when I haven't eventually either removed them or
regretted that I couldn't. And just about every use of them I've read in this
newsgroup has been suspicious (either as an over-complicated tool for a simple
task, or a hacky approximation to a difficult task).

(BTW, as an old Unix hacker, with a particular fondness for awk and sed, I'm
happy to /use/ quite complicated regexps; it's just that I'm suspicious of
their ultimate value.)

Regexes offer a LOT of Bang for the Buck, but they are like any other good
abstraction -- implicitly limited in the long run. Since regexes aren't
extensible, they can only take the task so far before they run out of
capability. Note, I'm not picking on regexes here; the same goes for
pretty much any abstraction: as you go to a higher level, they all have to
compromise something somewhere.

The real issue though is that regexes are considered a silver bullet, and
they also promote a style of development that is not easy to change when it
does break.

Since the regex is not only the grammar, but also the scanner, when your
grammar fails, you lose the scanner as well. A lexer/parser you write
yourself is much more robust and easier to change. If you find a
bug in your lexer/parser, you typically don't have to throw the entire thing
away to fix it. If you rely on regexes, it's very possible that you will run
into some wall that requires you to toss the entire thing out and start from
scratch.

Trivial case: comma-separated values.

"1, 2, 3".split(",")

Ok.

Now, what about:
"1, \"This is a test, neat eh?\", 3" ?

Baby, bathwater, out the window.
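
To spell the example out, here is a small sketch that runs both inputs through split(",") and through a hand-written scanner. The class CsvSplitDemo and the method splitCsvLine are invented for illustration, and the scanner is intentionally minimal: it toggles on every double quote, drops the quotes, trims each field, and does not handle doubled quotes ("") as an escape.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the names are illustrative only.
class CsvSplitDemo {

    // Split one line at commas that are not inside double quotes.
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;           // the quote character itself is dropped
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString().trim());
                current.setLength(0);           // start the next field
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString().trim()); // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        String line = "1, \"This is a test, neat eh?\", 3";
        System.out.println(line.split(",").length);      // 4: the quoted comma splits the field
        System.out.println(splitCsvLine(line).size());   // 3
        System.out.println(splitCsvLine(line).get(1));   // This is a test, neat eh?
    }
}
```

Adding quote handling to the scanner took one flag and one branch; there is no comparably small change to the argument of split(",").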

All that said, I use them all the time as well in scripts. But almost never
in code.

Regards,

Will Hartung
 
