grabbing random words

Jay

How would I be able to grab random words from an internet source? I'd
like to grab a random word from a comprehensive internet dictionary.
What would be the best source and the best way to go about this?
Thanks.

(Sorry if this sounds/is super noobish.)
 
Bjoern Schliessmann

Jay said:
How would I be able to grab random words from an internet source?
I'd like to grab a random word from a comprehensive internet
dictionary. What would be the best source and the best way to go
about this?

The *best* source would be a function of the internet dictionary
that selects a random word and passes it to you. Otherwise you'd
have to read a fairly large number of words and select one yourself.
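
As a rough sketch of the "read the words and select one yourself" route:
if the words arrive as a stream (a downloaded list, say), one of them can
be picked uniformly without holding them all in memory. This uses
reservoir sampling, which is a technique swapped in here and not one named
in the thread; the function name pick_random_word is made up for
illustration.

```python
import random

def pick_random_word(words, rng=random):
    """Return one uniformly random item from an iterable of unknown length.

    Hypothetical helper: returns None if the iterable is empty.
    """
    chosen = None
    for count, word in enumerate(words, start=1):
        # Keep the new word with probability 1/count. After the loop,
        # every word seen has had an equal chance of being the survivor.
        if rng.randrange(count) == 0:
            chosen = word
    return chosen

print(pick_random_word(["apple", "banana", "cherry", "damson"]))
```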

(Sorry if this sounds/is super noobish.)

It's quite difficult to make a readable and complete question (with a
meaningful subject, too) sound noobish ;)

Regards,


Björn
 
bearophileHUGS

Jay:
How would I be able to grab random words from an internet source? I'd
like to grab a random word from a comprehensive internet dictionary.
What would be the best source and the best way to go about this?

Why do you need to grab them from the net?
A simpler solution is to keep a local file containing the list of
words. You can find some open source word lists. Then you can read all
the words into a list and use random.choice to take one of them at
random. If you don't want to keep the whole dictionary/lexicon (which
can be up to 20 MB if it's a lexicon) in memory, you can (knowing the
length of the file) seek to a random position, read 20-30 bytes, and
take the word inside them (or you can create a dictionary file where
each word is stored in a fixed number of chars, so you can seek to
exactly one word).
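
A minimal sketch of the two seek-based ideas above, assuming a local file
with one word per line (for the first function) or with every word padded
to the same number of bytes (for the second). The function names and the
file layout are assumptions for illustration. Note that the first version
is not perfectly uniform: a word is more likely to be picked when the line
before it is long. The fixed-width version does not have that bias.

```python
import os
import random

def random_word_by_seek(path, rng=random):
    """Pick a word by seeking to a random byte in a one-word-per-line file."""
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("word file is empty")
    with open(path, "rb") as f:
        while True:
            f.seek(rng.randrange(size))
            f.readline()         # discard the line we landed in (likely partial)
            line = f.readline()  # next complete line, or b"" at end of file
            if not line:
                f.seek(0)        # landed in the last line: wrap to the first
                line = f.readline()
            word = line.strip()
            if word:             # skip blank lines and try again
                return word.decode("utf-8")

def random_word_fixed_width(path, width, rng=random):
    """Pick a word from a file of records padded to exactly `width` bytes."""
    count = os.path.getsize(path) // width
    with open(path, "rb") as f:
        f.seek(rng.randrange(count) * width)  # jump straight to one record
        return f.read(width).decode("utf-8").strip()
```

With a word list saved as, say, words.txt (a placeholder name),
random_word_by_seek("words.txt") reads only a few dozen bytes per call.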

Bye,
bearophile
 
MonkeeSage

Another approach would be to just scrape a CS's random (5.75 x 10^30)
word haiku generator. ;)

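# Python 2 code; libxml2 here is the Python binding for the libxml2 library.
import urllib
import libxml2
import random

uri = 'http://www.cs.indiana.edu/cgi-bin/haiku'

# Download the generated haiku page.
sock = urllib.urlopen(uri)
data = sock.read()
sock.close()

# Parse the HTML and collect the text of the links; the [8:-3] slice
# skips links at the start and end of the page.
doc = libxml2.htmlParseDoc(data, None)
words = [p.content for p in doc.xpathEval('//a')[8:-3]]
doc.freeDoc()

# Print one of the collected words at random.
print random.choice(words)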

Regards,
Jordan
 
Steven D'Aprano

Another approach would be to just scrape a CS's random (5.75 x 10^30)
word haiku generator. ;)

That isn't 5.75e30 words, it is the number of possible haikus. There
aren't that many words in all human languages combined.

Standard English working vocabulary is about 800 words in typical daily
use, and 5000 words that most people can understand. Particularly
well-read people might understand a dozen times that, about 60,000 words.
The total number of words in English is hard to count, but the Oxford
English Dictionary estimates about three quarters of a million words.

http://www.askoxford.com/asktheexperts/faq/aboutenglish/numberwords


Call it a million; and let's say that there are, or have ever been, a
million distinct human languages (which is surely a large overestimate,
even including dialects and pidgins). That gives only a "mere" 10**12
words, about a million million million times smaller than the number of
haikus.
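
As a quick check of that arithmetic, using only the figures quoted above
(the variable names are just for illustration):

```python
haikus = 5.75e30            # number of possible haikus quoted for the generator
words_per_language = 10**6  # "call it a million"
languages = 10**6           # the deliberately generous overestimate

total_words = words_per_language * languages
ratio = haikus / total_words

print(total_words)      # 1000000000000, i.e. 10**12
print(f"{ratio:.2e}")   # 5.75e+18
```

A million million million is 10**18, so the haikus outnumber the words by
a little under six times that.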

(Note however that there are languages like Finnish which allow you to
stick together words into a single "word" of indefinite length, sort of as
if we could say in English "therearelanguageswhichallowyou" to
"sticktogetherwordsintoasinglewordofindefinitelength". Such languages
might be said to have an infinite number of words, in some sense.)
 
Nick Vatamaniuc

Jay,

Your problem is specific to a particular internet dictionary provider.


UNLESS

1) The dictionary page has some specific link that gets you a
random word, OR

2) After you click through a couple of word definitions you find in
the URLs of the pages that the words are indexed using integers and
there are no gaps in the sequence, OR

3) The dictionary somehow exposes its database for all to access,

THEN you cannot really get random words from it.

If you need random words, find yourself lists of such words online
(sites devoted to natural language processing or linguistics might have
them), then load them up into a list and randomly choose between the
indices of the list to get your words.
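
Something along these lines, assuming the list has been saved locally as
a plain text file with one word per line. The file layout and the function
names load_words and random_word are assumptions, not part of any
particular word list.

```python
import random

def load_words(path):
    """Read a one-word-per-line file into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def random_word(words, rng=random):
    """Pick a word by choosing a random index into the list."""
    index = rng.randrange(len(words))
    return words[index]
```

Picking the index by hand is equivalent to calling random.choice(words),
which was suggested earlier in the thread.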

Nick V.
 
MonkeeSage

Steven said:
That isn't 5.75e30 words, it is the number of possible haikus. There
aren't that many words in all human languages combined.

Doh! This is why _I'm_ not a computer scientist. I'm kinda slow. ;)

(Note however that there are languages like Finnish which allow you to
stick together words into a single "word" of indefinite length, sort of as
if we could say in English "therearelanguageswhichallowyou" to
"sticktogetherwordsintoasinglewordofindefinitelength". Such languages
might be said to have an infinite number of words, in some sense.)

Imagine an agglutinating (sp?) programming language, heh!

Regards,
Jordan
 
