count occurance of a word/string in the body of an HTML page

Question Boy

I'm trying to find an easy way to count how many time a given word
appear on a webpage. For instance, I would like to be able to count
the number of occurance of the word 'Accepted', how would I go about
this?

Thank you,

QB
 
Thomas 'PointedEars' Lahn

Question Boy said:
I'm trying to find an easy way to count how many time a given word
appear on a webpage. For instance, I would like to be able to count
the number of occurance of the word 'Accepted', how would I go about
this?

You would read the FAQ of this newsgroup and find both the `textContent' or
`innerText' properties, and the properties and methods of String and RegExp
objects, described in the documentation referred to there.

<http://jibbering.com/faq/#posting>
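
A minimal sketch of the approach described above: take the page text from `textContent` or `innerText`, then count matches with String's match method and a global RegExp. The helper name `countOccurrences` and the sample string are illustrative assumptions, not taken from the FAQ; in a browser the text would come from `document.body`.

```javascript
// Hypothetical helper: counts non-overlapping, case-insensitive
// occurrences of `word` in `text`.
function countOccurrences(text, word) {
  // Escape RegExp metacharacters so the word is matched literally.
  var escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var matches = text.match(new RegExp(escaped, 'gi'));
  // match() returns null, not an empty array, when nothing matches.
  return matches ? matches.length : 0;
}

// In a browser, the text could be obtained like this (not run here):
//   var text = document.body.textContent || document.body.innerText;
var sample = 'Accepted. Not accepted? ACCEPTED!';
console.log(countOccurrences(sample, 'Accepted')); // 3
```

Note that this counts substrings, so it would also find 'Accepted' inside 'Unaccepted'.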


PointedEars
 
SAM

On 8/27/09 8:16 PM, Question Boy wrote:
I'm trying to find an easy way to count how many time a given word
appear on a webpage. For instance, I would like to be able to count
the number of occurance of the word 'Accepted', how would I go about
this?

Thank you,

QB

<script type="text/javascript">

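function counter(w) {
  // Note: innerHTML includes the markup, so tag and attribute names are searched too.
  var t = document.body.innerHTML;
  // Match w only where it is followed by whitespace or punctuation.
  var r = new RegExp(w + '(?=[\\s.,;—)"”\'-]+)', 'gi');
  // match() returns null when nothing matches, so fall back to an empty array.
  var count = (t.match(r) || []).length;
  alert(count + ' strings "' + w + '"');
}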

</script>
</head>
<body>
<p>Enter the word to count : <input id="word"> then
<a href="javascript:counter(document.getElementById('word').value)">
click me</a></p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Morbi a
wisi. Mauris vulputate rutrum arcu. Sed varius. Vestibulum ante ipsum
primis in faucibus orci luctus et ultrices posuere cubilia Curae; In
dui. Aenean et turpis. Duis a sapien hendrerit turpis tempor feugiat.
Nulla facilisi. Praesent in mauris et ipsum aliquam commodo. Aenean ac
nunc. In sit amet elit. Morbi diam. Quisque sodales eleifend urna.
Aliquam suscipit velit in nunc. </p>
<p>Vestibulum id magna. Nulla ante pede, sodales non, scelerisque vel,
condimentum at, leo. Vestibulum diam. Pellentesque habitant morbi
tristique senectus et netus et malesuada fames ac turpis egestas. Nam
ullamcorper, wisi vitae aliquet aliquam, dolor arcu cursus magna, non
tincidunt nibh nibh vel sapien. Nulla feugiat elit eget urna. Nullam a
metus. Donec tempus sapien eu orci. Sed pulvinar, nunc in luctus
convallis, lacus ante gravida felis, ac sollicitudin turpis nulla
viverra justo. Fusce nunc dui, porta lacinia, tristique et, suscipit
vestibulum, lectus. Nunc fringilla sapien. Proin sed leo at velit
tincidunt sagittis. Nam mollis tincidunt mauris. Aliquam ipsum nulla,
rutrum id, pulvinar sit amet, pellentesque at, neque. </p>
<p>Curabitur ante. Praesent sit amet nibh facilisis est commodo
pulvinar. Duis auctor. Ut commodo volutpat massa. Aenean nec erat eget
erat adipiscing imperdiet. Curabitur ipsum. Quisque sem lacus, fermentum
ut, suscipit non, pulvinar pretium, wisi. Integer libero mauris,
ultricies vel, mattis at, luctus id, ipsum. Vestibulum porttitor, mi sit
amet vehicula bibendum, wisi sapien egestas purus, sit amet feugiat
dolor diam non diam. Sed quis nisl in nisl nonummy hendrerit. Sed ipsum
lorem, commodo congue, interdum sed, pretium at, nulla. Nulla facilisi.
Curabitur ipsum. Cras aliquam libero vel tellus. </p>
</body>
 
Lasse Reichstein Nielsen

SAM said:
On 8/27/09 8:16 PM, Question Boy wrote:
I'm trying to find an easy way to count how many time a given word
appear on a webpage. For instance, I would like to be able to count
the number of occurance of the word 'Accepted', how would I go about
this?
Thank you,
QB

<script type="text/javascript">

function counter(w) {
var t = document.body.innerHTML;
var r = new RegExp ( w+'(?=[\\s.,;—)"”\'-]+)', 'gi');

Using regexps is generally a good idea when working with strings.

I'm not sure exactly what this regexp is trying to match, but it
seems like "the word followed by some non-word character".
It still matches any other word that the word is a suffix of,
e.g., counting the word "to", you would still get a count from
"tomato".

Much more direct to search for RegExp("\\b"+w+"\\b").
Possibly test that "w" contains only word characters.
var count = t.match(r).length;
alert(count + ' strings "'+w+'"');
}
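
A minimal sketch of the `\b` suggestion above, including the check that "w" contains only word characters. The function name `countWholeWord` and the sample sentence are illustrative assumptions.

```javascript
// Hypothetical helper: counts `w` only where it stands as a whole word.
function countWholeWord(text, w) {
  // Accept only ASCII word characters (letters, digits, underscore),
  // so that \b behaves predictably and nothing in w needs escaping.
  if (!/^\w+$/.test(w)) {
    throw new Error('search term must contain only word characters');
  }
  var r = new RegExp('\\b' + w + '\\b', 'gi');
  // match() returns null when nothing matches.
  return (text.match(r) || []).length;
}

console.log(countWholeWord('I went to buy a tomato, to eat.', 'to')); // 2
```

Neither "to" inside "tomato" is counted, because each has a word character on one side.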


/L
 
SAM

On 8/28/09 7:02 AM, Lasse Reichstein Nielsen wrote:
SAM said:
On 8/27/09 8:16 PM, Question Boy wrote:
I'm trying to find an easy way to count how many time a given word
appear on a webpage. For instance, I would like to be able to count
the number of occurance of the word 'Accepted', how would I go about
this?
Thank you,
QB
<script type="text/javascript">

function counter(w) {
var t = document.body.innerHTML;
var r = new RegExp ( w+'(?=[\\s.,;—)"”\'-]+)', 'gi');

Using regexps is generally a good idea when working with strings.

I'm not sure exactly what this regexp is trying to match, but it
seems like "the word followed by some non-word character".
It still matches any other word that the word is a suffix of,
e.g., counting the word "to", you would still get a count from
"tomato".

I tested with 'ac' on the demo proposed earlier, and it did seem to
count only the word 'ac'.
Much more direct to search for RegExp("\\b"+w+"\\b").
Possibly test that "w" contains only word characters.

No, because \b treats é, è, à, ù, etc. (non-ASCII characters) as
word boundaries.
Even if it is very rare for a French word to end with two 'é's, or for a
word to exist both with and without an 'é' at the end, what about
other languages?

Anyway, your RegExp does not seem to catch the word 'à':
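
The limitation described here can be reproduced directly, since `\b` is defined in terms of `\w`, which covers only ASCII letters, digits and underscore. The sketch below is not the example from the original post. The helper `countWordUnicode` is a hypothetical alternative that assumes an engine with lookbehind and `\p{...}` property escapes (ES2018), neither of which existed when this thread was written.

```javascript
// 'à' is not a \w character, so there is no \b boundary around it.
console.log(/\bà\b/.test('il va à Paris')); // false
// 't' sits between two non-\w characters, so it looks like a whole word.
console.log(/\bt\b/.test('été')); // true

// Hypothetical helper: a word is delimited by anything that is not a
// Unicode letter, digit or underscore.
function countWordUnicode(text, w) {
  // Escape RegExp metacharacters so w is matched literally.
  var escaped = w.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var r = new RegExp(
    '(?<![\\p{L}\\p{N}_])' + escaped + '(?![\\p{L}\\p{N}_])', 'giu');
  return (text.match(r) || []).length;
}

console.log(countWordUnicode('il va à Paris, déjà là', 'à')); // 1
```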
 
Question Boy

You would read the FAQ of this newsgroup and find both the `textContent' or
`innerText' properties, and the properties and methods of String and RegExp
objects, described in the documentation referred to there.

<http://jibbering.com/faq/#posting>

PointedEars




Thank you for the link! I will take a serious look at it over the
course of the coming days.
 
Dr J R Stockton

In comp.lang.javascript message <aec1b339-3206-4aa8-b374-7943f02aee3f@c29g2000yqd.googlegroups.com>,
Thu, 27 Aug 2009 11:16:27, Question Boy wrote:
I'm trying to find an easy way to count how many time a given word
appear on a webpage. For instance, I would like to be able to count
the number of occurance of the word 'Accepted', how would I go about
this?

No, occurrences.

If the Web page is not yours, you can take a copy of the source and work
on that, so one can assume source to be available. However,
straightforwardly counting words in the source is not going to give,
reliably, the right answer. The word may appear in a comment, or within
HTML tags, or in JavaScript or VBScript; and code may write it
conditionally or repeatedly. The word may be in an undisplayed or
hidden part of the page. The word may be generated by included script,
and not be in the source at all. The word may be computed - consider
what document.write( ['mk'+'op', '\x44um'].reverse().join("")+"f" )
might give.
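
Evaluating that expression by hand shows the point: the resulting word is assembled at run time, so searching the source for it finds nothing. A small sketch, with `source` as an illustrative stand-in for the page's script text:

```javascript
// The expression from the post, without the document.write wrapper.
var word = ['mk' + 'op', '\x44um'].reverse().join('') + 'f';
console.log(word); // Dummkopf

// Stand-in for the page source that would contain the expression.
var source = "document.write( ['mk'+'op', '\\x44um'].reverse().join(\"\")+\"f\" )";
console.log(source.indexOf(word)); // -1: the word never appears literally
```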

You wrote "appear on a webpage". Display the web page, use Select All
and Copy; then paste it into something which can count words. I think
MS Word can do it; alternatively, you can paste it into a textarea and
match its value property with a well-chosen RegExp. See in my
<URL:http://www.merlyn.demon.co.uk/js-valid.htm>.

You will need to be very careful to see that you implement an
appropriate definition of a word. Will, for example, the word "Accep-
ted" be found? If looking for "paw", should it be found in "cat's-paw"?
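
One possible treatment of a word split as "Accep-" and "ted" across a line break is to rejoin such pairs before counting. This is only a sketch under the assumption that every end-of-line hyphen is a soft one; a real hyphen that happens to fall at a line end (as could occur in "cat's-paw") would be wrongly removed. The function names are illustrative.

```javascript
// Hypothetical helper: joins "Accep-<newline>ted" back into "Accepted".
function rejoinHyphenated(text) {
  return text.replace(/(\w)-[ \t]*\r?\n[ \t]*(\w)/g, '$1$2');
}

// Hypothetical helper: whole-word count after rejoining.
function countWord(text, w) {
  // Assumes w contains only word characters, so no escaping is needed.
  var r = new RegExp('\\b' + w + '\\b', 'g');
  return (rejoinHyphenated(text).match(r) || []).length;
}

var text = 'Accepted once, then Accep-\nted again.';
console.log(countWord(text, 'Accepted')); // 2
```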

Given what you wrote above, should you also be looking for alternative
spellings?
 
Pherdnut

I'm trying to find an easy way to count how many time a given word
appear on a webpage.  For instance, I would like to be able to count
the number of occurance of the word 'Accepted', how would I go about
this?

Thank you,

QB

RegEx is kind of a big gun for this problem. General rule of thumb: If
you don't need logic or loops, stick to plain-vanilla string methods.
Learn RegEx though. It's very powerful. It's just not typically as
efficient as regular string methods for simple problems. The second
you start hauling out a bunch of conditions and nested for loops
though, is usually when you're better off with RegEx.

The string split function is handy if you just need the number of
occurrences. Probably much faster than a loop or RegEx specific
method. Here would be my approach to your problem.

var splitBySearchWord = document.body.textContent.split('Accepted');
// split() yields one more piece than there are occurrences.
alert(splitBySearchWord.length - 1);

That just splits all the text in the body tags into everything that's
between occurrences of 'Accepted'. Length of the array will be # of
occurrences + 1 since there will be one string before every occurrence
and one bonus string in the array after the last occurrence.
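
The arithmetic described above (number of pieces minus one) can be sketched as a small function. The name `countBySplit` and the sample strings are illustrative assumptions; in a browser the text would come from the body element.

```javascript
// Hypothetical helper wrapping the split() idea.
function countBySplit(text, needle) {
  // An empty needle would split between every character; treat it as no match.
  if (needle === '') return 0;
  // n occurrences cut the text into n + 1 pieces.
  return text.split(needle).length - 1;
}

console.log(countBySplit('Accepted, rejected, Accepted', 'Accepted')); // 2
console.log(countBySplit('nothing here', 'Accepted'));                 // 0
```

The match is case-sensitive and works on substrings, not whole words.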

If you think I just did your homework for you, you might want to test
in IE first. I recommend quirksmode.org if you start to get frustrated
with this or any other Microsoft-being-run-by-a-pack-of-gits-related
problems in the future.
 
Dr J R Stockton

In comp.lang.javascript message <c6cd16fe-1e26-430f-9326-0c95d68ecfee@e27g2000yqm.googlegroups.com>,
Fri, 28 Aug 2009 19:08:46, Pherdnut wrote:
var splitBySearchWord = (document.body.textContent).split('Accepted');
alert(splitBySearchWord.length - 1);

Method .split with a string cannot reliably find words;
"A frantic anteater will eat an infant ant".split("ant").length-1
gives 4 (FF3.0.13).


That apparently (in FF3) does not show words appearing within <input
type=text> or <textarea></textarea>, thereby not answering the question
as asked - "appear on a webpage".

Whether copy'n'paste picks up such words is browser-dependent : IE8 yes,
FF3.0.13 no.

Apparently, document.body.textContent fails in IE8.
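
IE8 does not implement `textContent` but does provide `innerText`, so a fallback is one way around this. The sketch below uses a hypothetical helper `getText` and plain objects standing in for DOM elements; note that the two properties are not equivalent, so counts may differ between browsers.

```javascript
// Hypothetical helper: returns the text of an element-like object,
// preferring textContent and falling back to innerText.
function getText(el) {
  if (typeof el.textContent === 'string') return el.textContent;
  if (typeof el.innerText === 'string') return el.innerText;
  return '';
}

// In a browser: getText(document.body)
// Plain objects stand in for DOM elements here.
console.log(getText({ textContent: 'Accepted' }));
console.log(getText({ innerText: 'Accepted' }));
```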

Actually, JavaScript cannot do the job as asked completely, since words
can appear in images.
 
Bart Lateur

Dr J R Stockton said:
Method .split with a string cannot reliably find words;
"A frantic anteater will eat an infant ant".split("ant").length-1
gives 4 (FF3.0.13).

This particular piece of code can be fixed with a regex:

"A frantic anteater will eat an infant ant".split(/\bant\b/).length-1


But the rest of your comments still apply.
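
For comparison, the two expressions side by side; the variable names are illustrative:

```javascript
var sentence = 'A frantic anteater will eat an infant ant';

// Plain string: every "ant" substring is a split point
// (frANTic, ANTeater, infANT, ANT).
var bySubstring = sentence.split('ant').length - 1;

// Word-boundary regex: only the standalone word is a split point.
var byWholeWord = sentence.split(/\bant\b/).length - 1;

console.log(bySubstring); // 4
console.log(byWholeWord); // 1
```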
 
Michael Wojcik

Bart said:
This particular piece of code can be fixed with a regex:

"A frantic anteater will eat an infant ant".split(/\bant\b/).length-1

Yes, but obviously that also fails in plausible circumstances:

"A frantic ant-eater ...".split(/\bant\b/).length-1

And that's really the point, of course: parsing natural language with
regular expressions will always just be applying heuristics to get an
approximation. You can improve those heuristics by filtering out some
false positives and recognizing unusual cases to reduce false
negatives, and in some cases get results that are good enough for your
purposes; but eventually you reach the point of diminishing returns.
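
As one illustration of such a refinement, a hyphen can be treated as part of a word, so that "ant" is not counted inside "ant-eater". This is a sketch only, under the assumption that hyphenated compounds should never count; it trades one class of error for another, since "ant" directly followed by a dash used as punctuation would now be missed. The helper name is hypothetical, and the lookbehind requires an ES2018 engine.

```javascript
// Hypothetical helper: like \bword\b, but a neighbouring hyphen also
// disqualifies the match.
function countUnhyphenated(text, w) {
  // Assumes w contains only word characters, so no escaping is needed.
  var r = new RegExp('(?<![\\w-])' + w + '(?![\\w-])', 'gi');
  return (text.match(r) || []).length;
}

console.log(countUnhyphenated('A frantic ant-eater ...', 'ant')); // 0
console.log(countUnhyphenated('An ant - a small one.', 'ant'));   // 1
```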

That said, if you can get good-enough results, however those are
defined for your application, with reasonable effort, then ECMAScript
is pretty nice for doing this kind of heuristic text parsing, because
it's a relatively expressive and convenient language (OO, functional,
dynamic) and has a decent set of string primitives. I built a
prototype extensible text-processing system in ECMAScript a while back
to demonstrate some ideas in computational rhetoric.
 
Dr J R Stockton

Michael Wojcik said:
Yes, but obviously that also fails in plausible circumstances:

"A frantic ant-eater ...".split(/\bant\b/).length-1

"A frantic ant-eater ...".split(/\bant\b/)
gives (FF 3.0.13)
A frantic ,-eater ...
which is correct; "ant" and "eater" are two English words, connected
with a [representation of a] hyphen. The member of the myrmecophaga is
the "anteater", a single word. Even Webster gets that right.
 
Michael Wojcik

Dr J R Stockton said:
"A frantic ant-eater ...".split(/\bant\b/)
gives (FF 3.0.13)
A frantic ,-eater ...
which is correct;

No, it isn't, by definition. I've defined the problem implicitly by
posing the example, and the code in question fails to produce the
correct solution according to the definition of the problem.

Whether a similar problem *you* define is solved correctly by the code
is immaterial.
"ant" and "eater" are two English words, connected
with a [representation of a] hyphen.

Indeed. And they are thus combined into a single word, which is not
the word I wanted the code to count, and thus the code is wrong.
The member of the myrmecophaga is
the "anteater", a single word.

That is a convention. English lacks any authority to enforce such
conventions. There are, as I originally claimed, plausible
circumstances under which that convention is not maintained. The usage
"ant-eater" appears in practice,[1] and thus may be present in the
text passed to the code snippet.
Even Webster gets that right.

Webster is not a governing authority.

A prescriptivist stance on English usage may be comforting to some,
but it's of little value in this problem area - machine parsing of
real English text.


[1] See for example
http://www.encyclopedia.com/doc/1O8-bandedanteater.html?jse=0
 
