Re: Finding keywords

Discussion in 'Python' started by Heather Brown, Mar 8, 2011.

  1. On 03/08/2011, Cross wrote:
    > Hello
    >
    > I have got a project in which I have to extract keywords given a URL. I
    > would like to know methods for extracting keywords. Frequency of
    > occurrence is one, but it seems naive. I would prefer something more
    > robust. Please suggest.
    >
    > Regards
    > Cross
    >


    The keywords are an attribute of a tag called <meta>, in the section
    called <head>. Are you having trouble parsing the xhtml to that point?

    Be more specific in your question, and somebody is likely to chime in,
    although I'm not the one if it's a question of parsing the xhtml.
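
    As a minimal sketch of that kind of parsing (Python 3, using the
    standard-library HTMLParser; real-world pages may need a more
    forgiving parser):

    from html.parser import HTMLParser

    class MetaKeywordParser(HTMLParser):
        """Collect the content of <meta name="keywords" content="..."> tags."""
        def __init__(self):
            super().__init__()
            self.keywords = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and attrs.get("name", "").lower() == "keywords":
                # Keyword lists are conventionally comma-separated.
                self.keywords.extend(
                    k.strip() for k in attrs.get("content", "").split(","))

    parser = MetaKeywordParser()
    parser.feed('<head><meta name="keywords" content="python, parsing"></head>')
    print(parser.keywords)  # ['python', 'parsing']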

    DaveA
     
    Heather Brown, Mar 8, 2011
    #1

  2. Matt Chaput (Guest)

    On 08/03/2011 8:58 AM, Cross wrote:
    > I know meta tags contain keywords, but they are not always reliable. I
    > can parse xhtml to obtain keywords from meta tags, but how do I verify
    > them? To obtain reliable keywords, I have to parse the plain text
    > obtained from the URL.


    I think maybe what the OP is asking about is extracting key words from a
    text, i.e. a short list of words that characterize the text. This is an
    information retrieval problem, not really a Python problem.

    One simple way to do this is to calculate word frequency histograms for
    each document in your corpus, and then for a given document, select
    words that are frequent in that document but infrequent in the corpus as
    a whole. Whoosh does this. There are different ways of calculating the
    importance of words, and stemming and conflating synonyms can give you
    better results as well.
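
    As a minimal sketch of that idea (essentially TF-IDF scoring, shown
    here in Python 3 with a small hypothetical corpus):

    import math
    import re
    from collections import Counter

    def tokenize(text):
        # Crude lowercase word tokens; a real system would also stem
        # and drop stopwords.
        return re.findall(r"[a-z']+", text.lower())

    def keywords(doc, corpus, n=5):
        # Term frequencies within the document of interest.
        tf = Counter(tokenize(doc))
        total = sum(tf.values())
        # Document frequencies: how many corpus documents contain each word?
        df = Counter()
        for other in corpus:
            df.update(set(tokenize(other)))
        # Reward words frequent in the document but rare in the corpus.
        score = {w: (c / total) * math.log(len(corpus) / (1 + df[w]))
                 for w, c in tf.items()}
        return sorted(score, key=score.get, reverse=True)[:n]

    corpus = ["the cat sat on the mat",
              "python is a programming language",
              "the dog chased the cat"]
    print(keywords("python code is readable python code", corpus))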

    A more sophisticated method uses "part of speech" tagging. See the
    Python Natural Language Toolkit (NLTK) and topia.termextract for more
    information.

    http://pypi.python.org/pypi/topia.termextract/
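
    For instance, a rough noun-based filter (a sketch only; it assumes the
    NLTK tokenizer and tagger data have already been downloaded):

    import nltk
    from collections import Counter

    def noun_keywords(text, n=10):
        # Tag each token with its part of speech; NN* tags mark nouns.
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        nouns = [w.lower() for w, tag in tagged if tag.startswith("NN")]
        return Counter(nouns).most_common(n)

    print(noun_keywords("Python is a programming language. "
                        "Python emphasizes code readability."))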

    Yahoo has a web service for key word extraction:

    http://developer.yahoo.com/search/content/V1/termExtraction.html

    You might want to investigate these resources and try Google searches
    for e.g. "extracting key terms from documents", and then come back if
    you have a question about the Python implementation.

    Cheers,

    Matt
     
    Matt Chaput, Mar 8, 2011
    #2

  3. Vlastimil Brom (Guest)

    2011/3/8 Cross <>:
    > On 03/08/2011 06:09 PM, Heather Brown wrote:
    >>
    >> The keywords are an attribute in a tag called <meta>, in the section
    >> called
    >> <head>. Are you having trouble parsing the xhtml to that point?
    >>
    >> Be more specific in your question, and somebody is likely to chime in.
    >> Although
    >> I'm not the one, if it's a question of parsing the xhtml.
    >>
    >> DaveA

    >
    > I know meta tags contain keywords but they are not always reliable. I can
    > parse xhtml to obtain keywords from meta tags; but how do I verify them. To
    > obtain reliable keywords, I have to parse the plain text obtained from the
    > URL.
    >
    > Cross


    Hi,
    if you need to extract meaningful keywords in the data-mining sense,
    using natural language processing, it can become quite a complex
    task, depending on the requirements; the NLTK toolkit may help with
    some approaches [ http://www.nltk.org/ ].
    One possibility would be to filter out the more frequent but less
    meaningful words ("stopwords") and extract the more frequent words
    from the remainder, e.g. (with some simplifications/hacks in the
    interactive mode):

    >>> import re, urllib2, nltk
    >>> page_src = urllib2.urlopen("http://www.python.org/doc/essays/foreword/").read().decode("utf-8")
    >>> page_plain = nltk.clean_html(page_src).lower()
    >>> stopwords = set(nltk.corpus.stopwords.words("english"))
    >>> txt_filtered = nltk.Text(word for word in re.findall(r"(?u)\w+", page_plain) if word not in stopwords)
    >>> frequency_dist = nltk.FreqDist(txt_filtered)
    >>> [(word, freq) for (word, freq) in frequency_dist.items() if freq > 2]

    [(u'python', 39), (u'abc', 11), (u'code', 10), (u'c', 7),
    (u'language', 7), (u'programming', 7), (u'unix', 7), (u'foreword', 5),
    (u'new', 5), (u'would', 5), (u'1st', 4), (u'book', 4), (u'ed', 4),
    (u'features', 4), (u'many', 4), (u'one', 4), (u'programmer', 4),
    (u'time', 4), (u'use', 4), (u'community', 3), (u'documentation', 3),
    (u'early', 3), (u'enough', 3), (u'even', 3), (u'first', 3), (u'help',
    3), (u'indentation', 3), (u'instance', 3), (u'less', 3), (u'like', 3),
    (u'makes', 3), (u'personal', 3), (u'programmers', 3), (u'readability',
    3), (u'readable', 3), (u'write', 3)]
    >>>


    Another possibility would be to extract certain parts of speech (e.g.
    nouns, adjectives, verbs) using nltk.pos_tag(input_txt) etc.;
    for more convoluted HTML code, BeautifulSoup might be used, and
    there are likely many other options.
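
    For completeness, a rough BeautifulSoup-based variant of the session
    above (sketched for Python 3, where urllib2 has become urllib.request;
    nltk.clean_html was later removed in NLTK 3 in favour of exactly this
    kind of markup stripping):

    import re
    import urllib.request
    from collections import Counter

    import nltk
    from bs4 import BeautifulSoup

    url = "http://www.python.org/doc/essays/foreword/"
    page_src = urllib.request.urlopen(url).read().decode("utf-8")
    # Strip the markup with BeautifulSoup instead of nltk.clean_html.
    page_plain = BeautifulSoup(page_src, "html.parser").get_text().lower()
    stopwords = set(nltk.corpus.stopwords.words("english"))
    words = [w for w in re.findall(r"(?u)\w+", page_plain)
             if w not in stopwords]
    freq = Counter(words)
    print([(word, n) for word, n in freq.most_common() if n > 2])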

    hth,
    vbr
     
    Vlastimil Brom, Mar 8, 2011
    #3
  4. Terry Reedy (Guest)

    On 3/8/2011 2:00 PM, Matt Chaput wrote:
    > On 08/03/2011 8:58 AM, Cross wrote:
    >> I know meta tags contain keywords but they are not always reliable. I
    >> can parse xhtml to obtain keywords from meta tags; but how do I verify
    >> them. To obtain reliable keywords, I have to parse the plain text
    >> obtained from the URL.


    This, of course, is a problem for all search engines, especially given
    'search optimization' games.

    > I think maybe what the OP is asking about is extracting key words from a
    > text, i.e. a short list of words that characterize the text. This is an
    > information retrieval problem, not really a Python problem.
    >
    > One simple way to do this is to calculate word frequency histograms for
    > each document in your corpus, and then for a given document, select
    > words that are frequent in that document but infrequent in the corpus as
    > a whole. Whoosh does this.


    I believe Google does something like this also. I have seen a claim that
    Google only looks at the first x words, hence the advice 'Make sure your
    target keywords are in the first x words.' You, of course, can and
    should process entire documents.


    --
    Terry Jan Reedy
     
    Terry Reedy, Mar 8, 2011
    #4
  5. Guest

    Hi, if you have found a solution, please let me know too. I have to implement this ASAP.
    On Wednesday, 9 March 2011 23:43:26 UTC+5:30, Cross wrote:
    > On 03/09/2011 01:21 AM, Vlastimil Brom wrote:
    > > [Vlastimil's reply, including the NLTK example above, quoted in full; snipped]
    > I had considered nltk. That is why I said that straightforward frequency
    > calculation of words would be naive. I have to look into this BeautifulSoup thing.
    >
     
    Guest, Dec 5, 2013
    #5
