Count occurrences of a word/string in the body of an HTML page

Discussion in 'Javascript' started by Question Boy, Aug 27, 2009.

  1. Question Boy

    Question Boy Guest

    I'm trying to find an easy way to count how many time a given word
    appear on a webpage. For instance, I would like to be able to count
    the number of occurance of the word 'Accepted', how would I go about
    this?

    Thank you,

    QB
    Question Boy, Aug 27, 2009
    #1

  2. Question Boy wrote:
    > I'm trying to find an easy way to count how many time a given word
    > appear on a webpage. For instance, I would like to be able to count
    > the number of occurance of the word 'Accepted', how would I go about
    > this?


    You would read the FAQ of this newsgroup and find both the `textContent' or
    `innerText' properties, and the properties and methods of String and RegExp
    objects, described in the documentation referred to there.

    <http://jibbering.com/faq/#posting>
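A minimal sketch of that advice, assuming the page text is read from `textContent` where it exists and from `innerText` otherwise. The helper names `getBodyText` and `countMatches` are illustrative and do not come from the FAQ; `doc` is passed in as a parameter so that the functions can be tried outside a browser.

```javascript
// Hypothetical helper: read the text of the body, preferring textContent
// and falling back to innerText where textContent is not a string.
function getBodyText(doc) {
  var body = doc.body;
  if (typeof body.textContent === 'string') {
    return body.textContent;
  }
  return typeof body.innerText === 'string' ? body.innerText : '';
}

// Count non-overlapping matches of a global RegExp in a string.
// match() returns null when nothing matches, hence the guard.
function countMatches(text, re) {
  var found = text.match(re);
  return found ? found.length : 0;
}

// In a browser: countMatches(getBodyText(document), /Accepted/g)
```

Reading the text rather than the markup keeps tag names and attribute values out of the count.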


    PointedEars
    --
    Danny Goodman's books are out of date and teach practices that are
    positively harmful for cross-browser scripting.
    -- Richard Cornford, cljs, <cife6q$253$1$> (2004)
    Thomas 'PointedEars' Lahn, Aug 27, 2009
    #2

  3. SAM

    SAM Guest

    On 8/27/09 8:16 PM, Question Boy wrote:
    > I'm trying to find an easy way to count how many time a given word
    > appear on a webpage. For instance, I would like to be able to count
    > the number of occurance of the word 'Accepted', how would I go about
    > this?
    >
    > Thank you,
    >
    > QB


    <script type="text/javascript">

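    function counter(w) {
    // NB: innerHTML includes the markup, so text inside tags is searched too
    var t = document.body.innerHTML;
    var r = new RegExp ( w+'(?=[\\s.,;—)"”\'-]+)', 'gi');
    // match() returns null when nothing matches, so guard before using .length
    var m = t.match(r);
    var count = m ? m.length : 0;
    alert(count + ' strings "'+w+'"');
    }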

    </script>
    </head>
    <body>
    <p>Enter the word to count : <input id="word"> then
    <a href="javascript:counter(document.getElementById('word').value)">
    click me</a></p>
    <p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Morbi a
    wisi. Mauris vulputate rutrum arcu. Sed varius. Vestibulum ante ipsum
    primis in faucibus orci luctus et ultrices posuere cubilia Curae; In
    dui. Aenean et turpis. Duis a sapien hendrerit turpis tempor feugiat.
    Nulla facilisi. Praesent in mauris et ipsum aliquam commodo. Aenean ac
    nunc. In sit amet elit. Morbi diam. Quisque sodales eleifend urna.
    Aliquam suscipit velit in nunc. </p>
    <p>Vestibulum id magna. Nulla ante pede, sodales non, scelerisque vel,
    condimentum at, leo. Vestibulum diam. Pellentesque habitant morbi
    tristique senectus et netus et malesuada fames ac turpis egestas. Nam
    ullamcorper, wisi vitae aliquet aliquam, dolor arcu cursus magna, non
    tincidunt nibh nibh vel sapien. Nulla feugiat elit eget urna. Nullam a
    metus. Donec tempus sapien eu orci. Sed pulvinar, nunc in luctus
    convallis, lacus ante gravida felis, ac sollicitudin turpis nulla
    viverra justo. Fusce nunc dui, porta lacinia, tristique et, suscipit
    vestibulum, lectus. Nunc fringilla sapien. Proin sed leo at velit
    tincidunt sagittis. Nam mollis tincidunt mauris. Aliquam ipsum nulla,
    rutrum id, pulvinar sit amet, pellentesque at, neque. </p>
    <p>Curabitur ante. Praesent sit amet nibh facilisis est commodo
    pulvinar. Duis auctor. Ut commodo volutpat massa. Aenean nec erat eget
    erat adipiscing imperdiet. Curabitur ipsum. Quisque sem lacus, fermentum
    ut, suscipit non, pulvinar pretium, wisi. Integer libero mauris,
    ultricies vel, mattis at, luctus id, ipsum. Vestibulum porttitor, mi sit
    amet vehicula bibendum, wisi sapien egestas purus, sit amet feugiat
    dolor diam non diam. Sed quis nisl in nisl nonummy hendrerit. Sed ipsum
    lorem, commodo congue, interdum sed, pretium at, nulla. Nulla facilisi.
    Curabitur ipsum. Cras aliquam libero vel tellus. </p>
    </body>


    --
    sm
    SAM, Aug 27, 2009
    #3
  4. SAM <> writes:

    > On 8/27/09 8:16 PM, Question Boy wrote:
    >> I'm trying to find an easy way to count how many time a given word
    >> appear on a webpage. For instance, I would like to be able to count
    >> the number of occurance of the word 'Accepted', how would I go about
    >> this?
    >> Thank you,
    >> QB

    >
    > <script type="text/javascript">
    >
    > function counter(w) {
    > var t = document.body.innerHTML;
    > var r = new RegExp ( w+'(?=[\\s.,;—)"”\'-]+)', 'gi');


    Using regexps is generally a good idea when working with strings.

    I'm not sure exactly what this regexp is trying to match, but it
    seems like "the word followed by some non-word character".
    It still matches any other word that the word is a suffix of,
    e.g., counting the word "to", you would still get a count from
    "tomato".

    Much more direct to search for RegExp("\\b"+w+"\\b").
    Possibly test that "w" contains only word characters.
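A minimal sketch of the `\b` approach, assuming the search term is escaped rather than rejected when it contains RegExp metacharacters. The names `escapeRegExp` and `countWord` are illustrative and are not part of the code posted in this thread.

```javascript
// Escape RegExp metacharacters so the word is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Count case-insensitive whole-word occurrences of `word` in `text`.
// \b is a boundary between a \w character ([A-Za-z0-9_]) and anything else.
function countWord(text, word) {
  var re = new RegExp('\\b' + escapeRegExp(word) + '\\b', 'gi');
  var found = text.match(re);
  return found ? found.length : 0;
}

countWord('Go to the tomato stand, to be safe', 'to'); // 2
```

The two boundaries keep the "to" at either end of "tomato" out of the count.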

    > var count = t.match(r).length;
    > alert(count + ' strings "'+w+'"');
    > }



    /L
    --
    Lasse Reichstein Holst Nielsen
    'Javascript frameworks is a disruptive technology'
    Lasse Reichstein Nielsen, Aug 28, 2009
    #4
  5. SAM

    SAM Guest

    On 8/28/09 7:02 AM, Lasse Reichstein Nielsen wrote:
    > SAM <> writes:
    >
    >> On 8/27/09 8:16 PM, Question Boy wrote:
    >>> I'm trying to find an easy way to count how many time a given word
    >>> appear on a webpage. For instance, I would like to be able to count
    >>> the number of occurance of the word 'Accepted', how would I go about
    >>> this?
    >>> Thank you,
    >>> QB

    >> <script type="text/javascript">
    >>
    >> function counter(w) {
    >> var t = document.body.innerHTML;
    >> var r = new RegExp ( w+'(?=[\\s.,;—)"”\'-]+)', 'gi');

    >
    > Using regexps is generally a good idea when working with strings.
    >
    > I'm not sure exactly what this regexp is trying to match, but it
    > seems like "the word followed by some non-word character".
    > It still matches any other word that the word is a suffix of,
    > e.g., counting the word "to", you would still get a count from
    > "tomato".


    I tested with 'ac' on the demo proposed earlier, and it did seem to
    count only the word 'ac'.

    > Much more direct to search for RegExp("\\b"+w+"\\b").
    > Possibly test that "w" contains only word characters.


    No, because \b treats é, è, à, ù, etc. (non-ASCII characters) as
    non-word characters, so they act as word boundaries.
    Even if it is very rare for a French word to end with two 'é's, or for
    a word to exist both with and without an 'é' at the end, what about
    other languages?

    Anyway, your RegExp does not seem to catch the word 'à':
    <http://cjoint.com/data/iCmshTUkPm_cpte_un_mot_fr.htm>
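One way around this limitation, sketched under the assumption that the engine supports the `u` flag, Unicode property escapes (`\p{L}`, `\p{N}`) and lookbehind, none of which were available in browsers when this thread was written. The names `escapeRegExp` and `countWordUnicode` are illustrative, and the sketch assumes precomposed accented characters.

```javascript
// Escape RegExp metacharacters so the word is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Count occurrences of `word` that are neither preceded nor followed by
// a letter, digit or underscore in any script.
function countWordUnicode(text, word) {
  var edge = '[\\p{L}\\p{N}_]';
  var re = new RegExp(
    '(?<!' + edge + ')' + escapeRegExp(word) + '(?!' + edge + ')',
    'giu'
  );
  var found = text.match(re);
  return found ? found.length : 0;
}

countWordUnicode('Il va à Paris, à pied, déjà.', 'à'); // 2
```

The final 'à' of "déjà" is skipped because a letter precedes it, whereas a plain `\b`-based pattern finds no standalone 'à' at all.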

    >> var count = t.match(r).length;
    >> alert(count + ' strings "'+w+'"');
    >> }



    --
    sm
    SAM, Aug 28, 2009
    #5
  6. Question Boy

    Question Boy Guest

    On Aug 27, 2:59 pm, Thomas 'PointedEars' Lahn <>
    wrote:
    > Question Boy wrote:
    > > I'm trying to find an easy way to count how many time a given word
    > > appear on a webpage.  For instance, I would like to be able to count
    > > the number of occurance of the word 'Accepted', how would I go about
    > > this?

    >
    > You would read the FAQ of this newsgroup and find both the `textContent' or
    > `innerText' properties, and the properties and methods of String and RegExp
    > objects, described in the documentation referred to there.
    >
    > <http://jibbering.com/faq/#posting>
    >
    > PointedEars
    > --
    > Danny Goodman's books are out of date and teach practices that are
    > positively harmful for cross-browser scripting.
    >  -- Richard Cornford, cljs, <cife6q$253$1$> (2004)





    Thank you for the link! I will take a serious look at it over the
    course of the coming days.
    Question Boy, Aug 28, 2009
    #6
  7. In comp.lang.javascript message
    <aec1b339-3206-4aa8-b374-7943f02aee3f@c29g2000yqd.googlegroups.com>,
    Thu, 27 Aug 2009 11:16:27, Question Boy <> posted:
    >I'm trying to find an easy way to count how many time a given word
    >appear on a webpage. For instance, I would like to be able to count
    >the number of occurance of the word 'Accepted', how would I go about
    >this?


    No, occurrences.

    If the Web page is not yours, you can take a copy of the source and work
    on that, so one can assume source to be available. However,
    straightforwardly counting words in the source is not going to give,
    reliably, the right answer. The word may appear in a comment, or within
    HTML tags, or in JavaScript or VBScript; and code may write it
    conditionally or repeatedly. The word may be in an undisplayed or
    hidden part of the page. The word may be generated by included script,
    and not be in the source at all. The word may be computed - consider
    what document.write( ['mk'+'op', '\x44um'].reverse().join("")+"f" )
    might give.

    You wrote "appear on a webpage". Display the web page, use Select All
    and Copy; then paste it into something which can count words. I think
    MS Word can do it; alternatively, you can paste it into a textarea and
    match its value property with a well-chosen RegExp. See in my
    <URL:http://www.merlyn.demon.co.uk/js-valid.htm>.

    You will need to be very careful to see that you implement an
    appropriate definition of a word. Will, for example, the word "Accep-
    ted" be found? If looking for "paw", should it be found in "cat's-paw"?

    Given what you wrote above, should you also be looking for alternative
    spellings?

    --
    (c) John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
    Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
    Proper <= 4-line sig. separator as above, a line exactly "-- " (SonOfRFC1036)
    Do not Mail News to me. Before a reply, quote with ">" or "> " (SonOfRFC1036)
    Dr J R Stockton, Aug 28, 2009
    #7
  8. Pherdnut

    Pherdnut Guest

    On Aug 27, 1:16 pm, Question Boy <> wrote:
    > I'm trying to find an easy way to count how many time a given word
    > appear on a webpage.  For instance, I would like to be able to count
    > the number of occurance of the word 'Accepted', how would I go about
    > this?
    >
    > Thank you,
    >
    > QB


    RegEx is kind of a big gun for this problem. General rule of thumb: If
    you don't need logic or loops, stick to plain-vanilla string methods.
    Learn RegEx though. It's very powerful. It's just not typically as
    efficient as regular string methods for simple problems. The second
    you start hauling out a bunch of conditions and nested for loops
    though, is usually when you're better off with RegEx.

    The string split function is handy if you just need the number of
    occurrences. Probably much faster than a loop or RegEx specific
    method. Here would be my approach to your problem.

    // N occurrences split the text into N + 1 pieces, hence "- 1"
    var splitBySearchWord = (document.body.textContent).split('Accepted');
    alert(splitBySearchWord.length - 1);

    That just splits all the text in the body tags into everything that's
    between occurrences of 'Accepted'. The length of the array will be the
    number of occurrences + 1, since there will be one string before every
    occurrence and one bonus string in the array after the last occurrence.
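The same idea wrapped in a function, as a rough sketch. `countBySplit` is an illustrative name, and the empty-string guard is an addition, needed because `split('')` breaks a string into single characters. It counts substrings, not whole words, and it is case-sensitive.

```javascript
// Count occurrences of the literal string `needle` in `text` by splitting.
// N occurrences cut the text into N + 1 pieces, hence the "- 1".
function countBySplit(text, needle) {
  if (needle === '') {
    return 0; // split('') would return one piece per character
  }
  return text.split(needle).length - 1;
}

countBySplit('Accepted, Rejected, Accepted', 'Accepted'); // 2
```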

    If you think I just did your homework for you, you might want to test
    in IE first. I recommend quirksmode.org if you start to get frustrated
    with this or any other Microsoft-being-run-by-a-pack-of-gits-related
    problems in the future.
    Pherdnut, Aug 29, 2009
    #8
  9. In comp.lang.javascript message
    <c6cd16fe-1e26-430f-9326-0c95d68ecfee@e27g2000yqm.googlegroups.com>,
    Fri, 28 Aug 2009 19:08:46, Pherdnut <> posted:
    >On Aug 27, 1:16 pm, Question Boy <> wrote:
    >> I'm trying to find an easy way to count how many time a given word
    >> appear on a webpage.  For instance, I would like to be able to count
    >> the number of occurance of the word 'Accepted', how would I go about
    >> this?


    >var splitBySearchWord = (document.body.textContent).split('Accepted');
    >alert(splitBySearchWord.length - 1);


    Method .split with a string cannot reliably find words;
    "A frantic anteater will eat an infant ant".split("ant").length-1
    gives 4 (FF3.0.13).


    That apparently (in FF3) does not show words appearing within <input
    type=text> or <textarea></textarea>, thereby not answering the question
    as asked - "appear on a webpage".

    Whether copy'n'paste picks up such words is browser-dependent : IE8 yes,
    FF3.0.13 no.

    Apparently, document.body.textContent fails in IE8.

    Actually, JavaScript cannot do the job as asked completely, since words
    can appear in images.

    --
    (c) John Stockton, nr London UK. ?@merlyn.demon.co.uk BP7, Delphi 3 & 2006.
    <URL:http://www.merlyn.demon.co.uk/> TP/BP/Delphi/&c., FAQqy topics & links;
    <URL:http://www.bancoems.com/CompLangPascalDelphiMisc-MiniFAQ.htm> clpdmFAQ;
    NOT <URL:http://support.codegear.com/newsgroups/>: news:borland.* Guidelines
    Dr J R Stockton, Aug 30, 2009
    #9
  10. Bart Lateur

    Bart Lateur Guest

    Dr J R Stockton wrote:

    >Method .split with a string cannot reliably find words;
    > "A frantic anteater will eat an infant ant".split("ant").length-1
    >gives 4 (FF3.0.13).


    This particular piece of code can be fixed with a regex:

    "A frantic anteater will eat an infant ant".split(/\bant\b/).length-1


    But the rest of your comments still apply.

    --
    Bart.
    Bart Lateur, Aug 31, 2009
    #10
  11. Bart Lateur wrote:
    > Dr J R Stockton wrote:
    >
    >> Method .split with a string cannot reliably find words;
    >> "A frantic anteater will eat an infant ant".split("ant").length-1
    >> gives 4 (FF3.0.13).

    >
    > This particular piece of code can be fixed with a regex:
    >
    > "A frantic anteater will eat an infant ant".split(/\bant\b/).length-1


    Yes, but obviously that also fails in plausible circumstances:

    "A frantic ant-eater ...".split(/\bant\b/).length-1

    And that's really the point, of course: parsing natural language with
    regular expressions will always just be applying heuristics to get an
    approximation. You can improve those heuristics by filtering out some
    false positives and recognizing unusual cases to reduce false
    negatives, and in some cases get results that are good enough for your
    purposes; but eventually you reach the point of diminishing returns.
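As an illustration of refining such a heuristic, here is a sketch that treats a hyphen as part of a word, so that "ant" inside "ant-eater" is not counted. Whether that is the right definition depends on the application. The names `escapeRegExp` and `countUnhyphenated` are illustrative, and the lookbehind assumes a newer engine than the browsers discussed in this thread.

```javascript
// Escape RegExp metacharacters so the word is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Count `word` only where it is not adjacent to a word character or hyphen.
function countUnhyphenated(text, word) {
  var re = new RegExp(
    '(?<![\\w-])' + escapeRegExp(word) + '(?![\\w-])',
    'g'
  );
  var found = text.match(re);
  return found ? found.length : 0;
}

countUnhyphenated('A frantic ant-eater ...', 'ant');                   // 0
countUnhyphenated('A frantic anteater will eat an infant ant', 'ant'); // 1
```

Each such rule removes one class of false positives and may add false negatives elsewhere, which is the trade-off described above.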

    That said, if you can get good-enough results, however those are
    defined for your application, with reasonable effort, then ECMAScript
    is pretty nice for doing this kind of heuristic text parsing, because
    it's a relatively expressive and convenient language (OO, functional,
    dynamic) and has a decent set of string primitives. I built a
    prototype extensible text-processing system in ECMAScript a while back
    to demonstrate some ideas in computational rhetoric.

    --
    Michael Wojcik
    Micro Focus
    Rhetoric & Writing, Michigan State University
    Michael Wojcik, Aug 31, 2009
    #11
  12. In comp.lang.javascript message <>, Mon, 31
    Aug 2009 11:52:57, Michael Wojcik <> posted:
    >Bart Lateur wrote:
    >> Dr J R Stockton wrote:
    >>
    >>> Method .split with a string cannot reliably find words;
    >>> "A frantic anteater will eat an infant ant".split("ant").length-1
    >>> gives 4 (FF3.0.13).

    >>
    >> This particular piece of code can be fixed with a regex:
    >>
    >> "A frantic anteater will eat an infant ant".split(/\bant\b/).length-1

    >
    >Yes, but obviously that also fails in plausible circumstances:
    >
    > "A frantic ant-eater ...".split(/\bant\b/).length-1


    "A frantic ant-eater ...".split(/\bant\b/)
    gives (FF 3.0.13)
    A frantic ,-eater ...
    which is correct; "ant" and "eater" are two English words, connected
    with a [representation of a] hyphen. The member of the myrmecophaga is
    the "anteater", a single word. Even Webster gets that right.

    Dr J R Stockton, Sep 1, 2009
    #12
  13. Dr J R Stockton wrote:
    > In comp.lang.javascript message <>, Mon, 31
    > Aug 2009 11:52:57, Michael Wojcik <> posted:
    >> Bart Lateur wrote:
    >>> Dr J R Stockton wrote:
    >>>
    >>>> Method .split with a string cannot reliably find words;
    >>>> "A frantic anteater will eat an infant ant".split("ant").length-1
    >>>> gives 4 (FF3.0.13).
    >>> This particular piece of code can be fixed with a regex:
    >>>
    >>> "A frantic anteater will eat an infant ant".split(/\bant\b/).length-1

    >> Yes, but obviously that also fails in plausible circumstances:
    >>
    >> "A frantic ant-eater ...".split(/\bant\b/).length-1

    >
    > "A frantic ant-eater ...".split(/\bant\b/)
    > gives (FF 3.0.13)
    > A frantic ,-eater ...
    > which is correct;


    No, it isn't, by definition. I've defined the problem implicitly by
    posing the example, and the code in question fails to produce the
    correct solution according to the definition of the problem.

    Whether a similar problem *you* define is solved correctly by the code
    is immaterial.

    > "ant" and "eater" are two English words, connected
    > with a [representation of a] hyphen.


    Indeed. And they are thus combined into a single word, which is not
    the word I wanted the code to count, and thus the code is wrong.

    > The member of the myrmecophaga is
    > the "anteater", a single word.


    That is a convention. English lacks any authority to enforce such
    conventions. There are, as I originally claimed, plausible
    circumstances under which that convention is not maintained. The usage
    "ant-eater" appears in practice,[1] and thus may be present in the
    text passed to the code snippet.

    > Even Webster gets that right.


    Webster is not a governing authority.

    A prescriptivist stance on English usage may be comforting to some,
    but it's of little value in this problem area - machine parsing of
    real English text.


    [1] See for example
    http://www.encyclopedia.com/doc/1O8-bandedanteater.html?jse=0

    --
    Michael Wojcik
    Micro Focus
    Rhetoric & Writing, Michigan State University
    Michael Wojcik, Sep 4, 2009
    #13