count occurance of a word/string in the body of an HTML page

Question Boy · Aug 27, 2009

I'm trying to find an easy way to count how many time a given word
appear on a webpage. For instance, I would like to be able to count
the number of occurance of the word 'Accepted', how would I go about
this?

Thank you,

QB

Thomas 'PointedEars' Lahn · Aug 27, 2009

Question Boy said:
I'm trying to find an easy way to count how many time a given word
appear on a webpage. For instance, I would like to be able to count
the number of occurance of the word 'Accepted', how would I go about
this?

You would read the FAQ of this newsgroup and find both the `textContent' or
`innerText' properties, and the properties and methods of String and RegExp
objects, described in the documentation referred to there.

<http://jibbering.com/faq/#posting>

PointedEars

SAM · Aug 27, 2009

On 8/27/09 8:16 PM, Question Boy wrote:

I'm trying to find an easy way to count how many time a given word
appear on a webpage. For instance, I would like to be able to count
the number of occurance of the word 'Accepted', how would I go about
this?

Thank you,

QB

Lasse Reichstein Nielsen · Aug 28, 2009

SAM said:
On 8/27/09 8:16 PM, Question Boy wrote:

I'm trying to find an easy way to count how many time a given word
appear on a webpage. For instance, I would like to be able to count
the number of occurance of the word 'Accepted', how would I go about
this?
Thank you,
QB

<script type="text/javascript">

function counter(w) {
var t = document.body.innerHTML;
var r = new RegExp ( w+'(?=[\\s.,;—)"”\\'-]+)', 'gi');

Using regexps is generally a good idea when working with strings.

I'm not sure exactly what this regexp is trying to match, but it
seems like "the word followed by some non-word character".
It still matches any other word that the word is a suffix of,
e.g., counting the word "to", you would still get a count from
"tomato".

Much more direct to search for RegExp("\\b"+w+"\\b").
Possibly test that "w" contains only word characters.

var count = t.match(r).length;
alert(count + ' strings "'+w+'"');
}

/L
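A minimal sketch of the word-boundary approach suggested above, written against a plain string so that it also runs outside a browser. The names `escapeRegExp` and `countWord` are hypothetical helpers, not part of the code posted in the thread. Escaping the search term is one way to act on the advice to check what `w` contains, and the `null` check matters because `match` with a global regexp returns `null`, not an empty array, when nothing matches.

```javascript
// Hypothetical helper: escape regexp metacharacters in the search term.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Count whole-word, case-insensitive occurrences of `word` in `text`.
function countWord(text, word) {
  const pattern = new RegExp("\\b" + escapeRegExp(word) + "\\b", "gi");
  const matches = text.match(pattern); // null when there is no match
  return matches === null ? 0 : matches.length;
}

const sentence = "I want to eat a tomato, not to cook it.";

// Lookahead-only pattern, simplified from the quoted code:
// it also counts the "to" that ends "tomato".
const lookaheadCount = (sentence.match(/to(?=[\s.,;]+)/gi) || []).length;
const boundaryCount = countWord(sentence, "to");
const missingCount = countWord(sentence, "Accepted");

console.log(lookaheadCount, boundaryCount, missingCount); // 3 2 0
```

In a page, the `text` argument would come from the document, for example the text of `document.body`; that part is left out here because it depends on the browser.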

SAM · Aug 28, 2009

On 8/28/09 7:02 AM, Lasse Reichstein Nielsen wrote:

SAM said:

On 8/27/09 8:16 PM, Question Boy wrote:

I'm trying to find an easy way to count how many time a given word
appear on a webpage. For instance, I would like to be able to count
the number of occurance of the word 'Accepted', how would I go about
this?
Thank you,
QB

<script type="text/javascript">

function counter(w) {
var t = document.body.innerHTML;
var r = new RegExp ( w+'(?=[\\s.,;—)"”\\'-]+)', 'gi');

Using regexps is generally a good idea when working with strings.

I'm not sure exactly what this regexp is trying to match, but it
seems like "the word followed by some non-word character".
It still matches any other word that the word is a suffix of,
e.g., counting the word "to", you would still get a count from
"tomato".

I tested with 'ac' on the previously proposed demo, and it did seem to
count only the word 'ac'.

Much more direct to search for RegExp("\\b"+w+"\\b").
Possibly test that "w" contains only word characters.

No, because \b treats é, è, à, ù, etc. (non-ASCII characters) as
word boundaries.
Even if it is very rare for a French word to end with two 'é's, or for
a word to exist both with and without an 'é' at the end, what about
other languages?

Anyway, your RegExp does not seem to catch the word 'à':
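The objection can be reproduced on a plain string. The sketch below is an illustration only: it relies on Unicode property escapes and lookbehind, which were added to the language in ES2018 and so were not available when this thread was written. `escapeRegExp` and `countWordUnicode` are hypothetical helper names.

```javascript
// Hypothetical helper: escape regexp metacharacters in the search term.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Count occurrences of `word` that are not adjacent to any Unicode
// letter, digit, or underscore (a Unicode-aware stand-in for \b).
function countWordUnicode(text, word) {
  const pattern = new RegExp(
    "(?<![\\p{L}\\p{N}_])" + escapeRegExp(word) + "(?![\\p{L}\\p{N}_])",
    "giu"
  );
  const matches = text.match(pattern);
  return matches === null ? 0 : matches.length;
}

// "Il va à Paris, déjà à midi." with the accented letters written as
// escapes so the example does not depend on the file encoding.
const sample = "Il va \u00e0 Paris, d\u00e9j\u00e0 \u00e0 midi.";

// \b only knows ASCII word characters, so a standalone "à" has no
// boundary on either side and is never matched.
const asciiBoundaryCount = (sample.match(/\b\u00e0\b/g) || []).length;
const unicodeCount = countWordUnicode(sample, "\u00e0");

console.log(asciiBoundaryCount, unicodeCount); // 0 2
```

The "à" that ends "déjà" is correctly left out of the second count because it is preceded by a letter.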

Question Boy · Aug 28, 2009

You would read the FAQ of this newsgroup and find both the `textContent' or
`innerText' properties, and the properties and methods of String and RegExp
objects, described in the documentation referred to there.

<http://jibbering.com/faq/#posting>

PointedEars

Thank you for the link! I will take a serious look at it over the
course of the coming days.

Dr J R Stockton · Aug 28, 2009

In comp.lang.javascript message
<aec1b339-3206-4aa8-b374-7943f02aee3f@c29g2000yqd.googlegroups.com>,
Thu, 27 Aug 2009 11:16:27, Question Boy wrote:

I'm trying to find an easy way to count how many time a given word
appear on a webpage. For instance, I would like to be able to count
the number of occurance of the word 'Accepted', how would I go about
this?

No, occurrences.

If the Web page is not yours, you can take a copy of the source and work
on that, so one can assume source to be available. However,
straightforwardly counting words in the source is not going to give,
reliably, the right answer. The word may appear in a comment, or within
HTML tags, or in JavaScript or VBScript; and code may write it
conditionally or repeatedly. The word may be in an undisplayed or
hidden part of the page. The word may be generated by included script,
and not be in the source at all. The word may be computed - consider
what document.write( ['mk'+'op', '\x44um'].reverse().join("")+"f" )
might give.
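To make the last point concrete, here is a small sketch that evaluates the expression from the paragraph above and checks whether the resulting word can be found in the script's own source text. The variable names are made up for the illustration, and `document.write` is left out so that the snippet runs outside a browser.

```javascript
// The script's source text, as a search of the page source would see it.
const sourceText =
  "document.write( ['mk'+'op', '\\x44um'].reverse().join(\"\")+\"f\" )";

// What the expression evaluates to when the script actually runs.
const written = ["mk" + "op", "\x44um"].reverse().join("") + "f";

// Is the written word present anywhere in the source text?
const foundInSource = sourceText.indexOf(written) !== -1;

console.log(written, foundInSource); // Dummkopf false
```

A count taken from the source would therefore miss a word that the rendered page displays.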

You wrote "appear on a webpage". Display the web page, use Select All
and Copy; then paste it into something which can count words. I think
MS Word can do it; alternatively, you can paste it into a textarea and
match its value property with a well-chosen RegExp. See in my
<URL:http://www.merlyn.demon.co.uk/js-valid.htm>.

You will need to be very careful to see that you implement an
appropriate definition of a word. Will, for example, the word "Accep-
ted" be found? If looking for "paw", should it be found in "cat's-paw"?

Given what you wrote above, should you also be looking for alternative
spellings?

Pherdnut · Aug 29, 2009

I'm trying to find an easy way to count how many time a given word
appear on a webpage. For instance, I would like to be able to count
the number of occurance of the word 'Accepted', how would I go about
this?

Thank you,

QB

RegEx is kind of a big gun for this problem. General rule of thumb: If
you don't need logic or loops, stick to plain-vanilla string methods.
Learn RegEx though. It's very powerful. It's just not typically as
efficient as regular string methods for simple problems. The second
you start hauling out a bunch of conditions and nested for loops
though, is usually when you're better off with RegEx.

The string split function is handy if you just need the number of
occurrences. Probably much faster than a loop or RegEx specific
method. Here would be my approach to your problem.

var splitBySearchWord = (document.body.textContent).split('Accepted');
alert(splitBySearchWord.length - 1);

That splits all the text in the body into the pieces that lie between
occurrences of 'Accepted'. The length of the array will be the number
of occurrences + 1, since there will be one string before every
occurrence and one bonus string in the array after the last occurrence.
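The arithmetic described above can be checked on a plain string. In this sketch `countBySplit` is a hypothetical helper and `bodyText` stands in for the text of the page body; both are assumptions made for the illustration.

```javascript
// Count occurrences of `needle` as the number of pieces minus one.
function countBySplit(text, needle) {
  if (needle === "") {
    return 0; // splitting on "" would yield one piece per character
  }
  return text.split(needle).length - 1;
}

const bodyText = "Accepted: 3. Rejected: 1. Accepted with changes: 2. accepted";

const acceptedCount = countBySplit(bodyText, "Accepted");
const pendingCount = countBySplit(bodyText, "Pending");

console.log(acceptedCount, pendingCount); // 2 0
```

Splitting on a string is case-sensitive and matches substrings, not whole words, so the final lowercase "accepted" is not counted here.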

If you think I just did your homework for you, you might want to test
in IE first. I recommend quirksmode.org if you start to get frustrated
with this or any other Microsoft-being-run-by-a-pack-of-gits-related
problems in the future.

Dr J R Stockton · Aug 30, 2009

In comp.lang.javascript message
<c6cd16fe-1e26-430f-9326-0c95d68ecfee@e27g2000yqm.googlegroups.com>,
Fri, 28 Aug 2009 19:08:46, Pherdnut wrote:

var splitBySearchWord = (document.body.textContent).split('Accepted');
alert(splitBySearchWord.length - 1);

Method .split with a string cannot reliably find words;
"A frantic anteater will eat an infant ant".split("ant").length-1
gives 4 (FF3.0.13).

That apparently (in FF3) does not show words appearing within <input
type=text> or <textarea></textarea>, thereby not answering the question
as asked - "appear on a webpage".

Whether copy'n'paste picks up such words is browser-dependent : IE8 yes,
FF3.0.13 no.

Apparently, document.body.textContent fails in IE8.

Actually, JavaScript cannot do the job as asked completely, since words
can appear in images.
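Since `textContent` is not available in IE8, one common workaround is to fall back to `innerText`, the other property mentioned earlier in the thread. The sketch below uses a hypothetical helper, `getElementText`, and plain objects standing in for `document.body`, so it runs without a browser. The two properties are not equivalent (`innerText` reflects what is rendered, so hidden text may be left out), which means counts can differ between browsers.

```javascript
// Hypothetical helper: read an element's text, preferring textContent
// and falling back to innerText where textContent is missing.
function getElementText(el) {
  if (typeof el.textContent === "string") {
    return el.textContent;
  }
  if (typeof el.innerText === "string") {
    return el.innerText;
  }
  return "";
}

// Plain objects standing in for document.body in two kinds of browser.
const bodyWithTextContent = { textContent: "Accepted, then Accepted", innerText: "unused" };
const bodyWithInnerTextOnly = { innerText: "Accepted once" };

const textA = getElementText(bodyWithTextContent);
const textB = getElementText(bodyWithInnerTextOnly);

console.log(textA.split("Accepted").length - 1); // 2
console.log(textB.split("Accepted").length - 1); // 1
```

In a page the call would be `getElementText(document.body)`.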

Bart Lateur · Aug 31, 2009

Dr J R Stockton said:
Method .split with a string cannot reliably find words;
"A frantic anteater will eat an infant ant".split("ant").length-1
gives 4 (FF3.0.13).

This particular piece of code can be fixed with a regex:

"A frantic anteater will eat an infant ant".split(/\bant\b/).length-1

But the rest of your comments still apply.
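For comparison, the two splits can be run side by side on the same sentence; the variable names are only for the illustration.

```javascript
const sentence = "A frantic anteater will eat an infant ant";

// Splitting on a string counts every substring "ant":
// fr-ant-ic, ant-eater, inf-ant, and the word "ant" itself.
const substringCount = sentence.split("ant").length - 1;

// Splitting on a word-boundary regexp counts only the standalone word.
const wholeWordCount = sentence.split(/\bant\b/).length - 1;

console.log(substringCount, wholeWordCount); // 4 1
```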

Michael Wojcik · Aug 31, 2009

Bart Lateur said:
This particular piece of code can be fixed with a regex:

"A frantic anteater will eat an infant ant".split(/\bant\b/).length-1

Yes, but obviously that also fails in plausible circumstances:

"A frantic ant-eater ...".split(/\bant\b/).length-1

And that's really the point, of course: parsing natural language with
regular expressions will always just be applying heuristics to get an
approximation. You can improve those heuristics by filtering out some
false positives and recognizing unusual cases to reduce false
negatives, and in some cases get results that are good enough for your
purposes; but eventually you reach the point of diminishing returns.
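As one example of such a heuristic, a hyphen can be treated as part of a word, so that "ant" inside "ant-eater" is not counted. This is only a sketch of one possible rule, not a recommendation from the thread; it uses lookbehind, which was added to the language in ES2018.

```javascript
const sentence = "A frantic ant-eater will eat an infant ant";

// Plain \b treats the hyphen as a boundary, so "ant-eater" counts.
const plainBoundaryCount = (sentence.match(/\bant\b/g) || []).length;

// Heuristic: refuse a match that touches a word character or a hyphen.
const hyphenAwareCount =
  (sentence.match(/(?<![\w-])ant(?![\w-])/g) || []).length;

console.log(plainBoundaryCount, hyphenAwareCount); // 2 1
```

The same rule would also skip "ant" in a compound such as "fire-ant", which may or may not be what the application wants; each refinement trades one kind of error for another.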

That said, if you can get good-enough results, however those are
defined for your application, with reasonable effort, then ECMAScript
is pretty nice for doing this kind of heuristic text parsing, because
it's a relatively expressive and convenient language (OO, functional,
dynamic) and has a decent set of string primitives. I built a
prototype extensible text-processing system in ECMAScript a while back
to demonstrate some ideas in computational rhetoric.

Dr J R Stockton · Sep 1, 2009

Michael Wojcik said:
Yes, but obviously that also fails in plausible circumstances:

"A frantic ant-eater ...".split(/\bant\b/).length-1

"A frantic ant-eater ...".split(/\bant\b/)
gives (FF 3.0.13)
A frantic ,-eater ...
which is correct; "ant" and "eater" are two English words, connected
with a [representation of a] hyphen. The member of the myrmecophaga is
the "anteater", a single word. Even Webster gets that right.

Michael Wojcik · Sep 4, 2009

Dr J R Stockton said:
"A frantic ant-eater ...".split(/\bant\b/)
gives (FF 3.0.13)
A frantic ,-eater ...
which is correct;

No, it isn't, by definition. I've defined the problem implicitly by
posing the example, and the code in question fails to produce the
correct solution according to the definition of the problem.

Whether a similar problem *you* define is solved correctly by the code
is immaterial.

"ant" and "eater" are two English words, connected
with a [representation of a] hyphen.

Indeed. And they are thus combined into a single word, which is not
the word I wanted the code to count, and thus the code is wrong.

The member of the myrmecophaga is
the "anteater", a single word.

That is a convention. English lacks any authority to enforce such
conventions. There are, as I originally claimed, plausible
circumstances under which that convention is not maintained. The usage
"ant-eater" appears in practice,[1] and thus may be present in the
text passed to the code snippet.

Even Webster gets that right.

Webster is not a governing authority.

A prescriptivist stance on English usage may be comforting to some,
but it's of little value in this problem area - machine parsing of
real English text.

[1] See for example
http://www.encyclopedia.com/doc/1O8-bandedanteater.html?jse=0
