Open source English dictionary to use programmatically w/ python

dgoldsmith_89 · Jan 7, 2008

Can anyone point me to a downloadable open source English dictionary
suitable for programmatic use with python: I'm programming a puzzle
generator, and I need to be able to generate more or less complete
lists of English words, alphabetized. Thanks! DG

Rick Dooling · Jan 7, 2008

On Linux? WordNet and Dict and many others.

On Windows, maybe try WordWeb?

rd

Fredrik Lundh · Jan 7, 2008

here's one:

http://www.dcs.shef.ac.uk/research/ilash/Moby/

</F>

dgoldsmith_89 · Jan 7, 2008

Rick Dooling said:
On Linux? WordNet and Dict and many others.

On Windows, maybe try WordWeb?

rd

Sorry, didn't know it would make a difference: on Mac, actually.

DG

mensanator · Jan 7, 2008

www.puzzlers.org has numerous word lists & dictionaries in text
format that can be downloaded. I recommend you insert them into
some form of database. I have most of them in an Access db and
it's 95 MB. That's a worst case, as I also have some value-added
stuff; the OSPD alone would be a lot smaller.

<http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:start>

Tobiah · Jan 7, 2008

If all you want are the words themselves, then any Linux box
has a fairly complete list. I put mine here:

http://tobiah.org/words.zip
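
A plain word file like this may already be enough for an alphabetized list. The following is only a minimal sketch, assuming a text file with one word per line; the helper name `load_words` and the path mentioned in the comment are illustrative assumptions, not something from this thread.

```python
import io

def load_words(lines):
    """Return a sorted list of unique, lowercase, purely alphabetic words."""
    words = set()
    for line in lines:
        word = line.strip().lower()
        if word.isalpha():          # skips blank lines, "cat's", numbers, etc.
            words.add(word)
    return sorted(words)

# Demo on in-memory data. For a real file, pass an open file object instead,
# e.g. open("/usr/share/dict/words") (assumed path; it varies by system).
sample = io.StringIO("Zebra\napple\ncat's\n\nApple\nmango\n")
word_list = load_words(sample)
print(word_list)
```

Whether to drop capitalized or hyphenated entries depends on the puzzle; the filter above is just one possible choice.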

dgoldsmith_89 · Jan 7, 2008

Fredrik Lundh said:
here's one:

http://www.dcs.shef.ac.uk/research/ilash/Moby/

</F>

Excellent, that'll do nicely! Thanks!!!

DG

dgoldsmith_89 · Jan 7, 2008

mensanator said:
www.puzzlers.org has numerous word lists & dictionaries in text
format that can be downloaded. I recommend you insert them into
some form of database. I have most of them in an Access db and
it's 95 MB. That's a worst case, as I also have some value-added
stuff; the OSPD alone would be a lot smaller.

<http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:start>

Sorry for my ignorance: I can query an Access DB w/ standard SQL
queries (and this is how I would access it w/ Python)?

DG

Paul McGuire · Jan 7, 2008

dgoldsmith_89 said:
Sorry for my ignorance: I can query an Access DB w/ standard SQL
queries (and this is how I would access it w/ Python)?

DG

If you are running on a Mac, just use sqlite. It's built into Python
as of v2.5, and you will find more help, documentation, and fellow
Python+sqlite users than you will for Python+Access.

-- Paul
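
A minimal sketch of the sqlite route, using the standard-library `sqlite3` module. The table name `words`, the sample data, and the in-memory database are assumptions made for illustration; a real word list would be loaded from a file.

```python
import sqlite3

words = ["reactions", "creations", "aah", "aal"]   # stand-in for a real word list

con = sqlite3.connect(":memory:")     # pass a filename instead to keep the data
con.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")
con.executemany("INSERT OR IGNORE INTO words VALUES (?)", [(w,) for w in words])
con.commit()

# Alphabetized list of all 9-letter words, with the length passed as a parameter
rows = con.execute(
    "SELECT word FROM words WHERE length(word) = ? ORDER BY word", (9,)
).fetchall()
nine_letter = [r[0] for r in rows]
print(nine_letter)
```

The `?` placeholder lets the criteria change at run time without rebuilding the SQL string by hand.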

mensanator · Jan 7, 2008

dgoldsmith_89 said:
Sorry for my ignorance: I can query an Access DB w/ standard SQL
queries (and this is how I would access it w/ Python)?

Yes, if you have the appropriate way to link to the DB.
I use Windows and ODBC from Win32. I don't know what you
would use on a Mac.

As Paul McGuire said, you could easily do this with SQLite3.
Personally, I always use Access since my job requires it
and I find it much more convenient. I often use Crosstab
tables, which I think SQLite3 doesn't support. Typically,
I'll write complex queries in Access and simple select SQL
statements in Python to grab them.
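
SQLite has no TRANSFORM/PIVOT syntax, but a crosstab can be approximated with conditional aggregation. This is a rough sketch of that swapped-in technique, not the Access approach described here; the table `anagram_sets`, its columns, and the sample rows are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE anagram_sets (word_len INTEGER, set_size INTEGER)")
# Made-up rows: one row per anagram set (word length, number of members)
con.executemany(
    "INSERT INTO anagram_sets VALUES (?, ?)",
    [(3, 2), (3, 2), (3, 3), (4, 2), (4, 3), (4, 3)],
)

# The crosstab's column headers come from the data itself
sizes = [r[0] for r in con.execute(
    "SELECT DISTINCT set_size FROM anagram_sets ORDER BY set_size")]

# One SUM(CASE ...) column per distinct set size. The values are integers
# read back from the database, so formatting them into the SQL is safe here.
cols = ", ".join(
    "SUM(CASE WHEN set_size = %d THEN 1 ELSE 0 END) AS size_%d" % (s, s)
    for s in sizes
)
sql = ("SELECT word_len, %s FROM anagram_sets "
       "GROUP BY word_len ORDER BY word_len" % cols)
crosstab = con.execute(sql).fetchall()
print(sizes)
print(crosstab)
```

Unlike an Access crosstab, empty cells come back as 0 rather than None, and the column list has to be generated in Python before the query runs.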

Here's my anagram locator. (the [signature] is an example
of the value-added I mentioned).

## I took a somewhat different approach. Instead of in a file,
## I've got my word list (562456 words) in an MS-Access database.
## And instead of calculating the signature on the fly, I did it
## once and added the signature as a second field:
##
## TABLE CONS_alpha_only_signature_unique
## --------------------------------------
## CONS text 75
## signature text 26
##
## The signature is a 26 character string where each character is
## the count of occurrences of the matching letter. Luckily, in
## only a single case were there more than 9 occurrences of any
## given letter, which turned out not to be a word but a series of
## words concatenated, so I just deleted it from the database
## (lots of crap in the original word list I used).
##
## Example:
##
## CONS      signature
## aah       20000001000000000000000000 # 'a' occurs twice & 'h' once
## aahed     20011001000000000000000000
## aahing    20000011100001000000000000
## aahs      20000001000000000010000000
## aaii      20000000200000000000000000
## aaker     20001000001000000100000000
## aal       20000000000100000000000000
## aalborg   21000010000100100100000000
## aalesund  20011000000101000010100000
##
## Any words with identical signatures must be anagrams.
##
## Once this was set up, I wrote a whole bunch of queries
## to use this table. I use the normal Access drag and drop
## design, but the SQL can be extracted from each, so I can
## simply open the query from Python or I can grab the SQL
## and build it inside the program. The example
##
## signatures_anagrams_select_signature
##
## is hard coded for criteria 9 & 10 and should be cast inside
## Python so the criteria can be changed dynamically.
##
##
## QUERY signatures_anagrams_longest
## ---------------------------------
## SELECT Len([CONS]) AS Expr1,
##        Count(Cons_alpha_only_signature_unique.CONS) AS CountOfCONS,
##        Cons_alpha_only_signature_unique.signature
## FROM Cons_alpha_only_signature_unique
## GROUP BY Len([CONS]),
##          Cons_alpha_only_signature_unique.signature
## HAVING (((Count(Cons_alpha_only_signature_unique.CONS))>1))
## ORDER BY Len([CONS]) DESC ,
##          Count(Cons_alpha_only_signature_unique.CONS) DESC;
##
## This is why I don't use SQLite3, must have crosstab queries.
##
## QUERY signatures_anagram_summary
## --------------------------------
## TRANSFORM Count(signatures_anagrams_longest.signature) AS CountOfsignature
## SELECT signatures_anagrams_longest.Expr1 AS [length of word]
## FROM signatures_anagrams_longest
## GROUP BY signatures_anagrams_longest.Expr1
## PIVOT signatures_anagrams_longest.CountOfCONS;
##
##
## QUERY signatures_anagrams_select_signature
## ------------------------------------------
## SELECT Len([CONS]) AS Expr1,
##        Count(Cons_alpha_only_signature_unique.CONS) AS CountOfCONS,
##        Cons_alpha_only_signature_unique.signature
## FROM Cons_alpha_only_signature_unique
## GROUP BY Len([CONS]),
##          Cons_alpha_only_signature_unique.signature
## HAVING (((Len([CONS]))=9) AND
##         ((Count(Cons_alpha_only_signature_unique.CONS))=10))
## ORDER BY Len([CONS]) DESC ,
##          Count(Cons_alpha_only_signature_unique.CONS) DESC;
##
## QUERY signatures_lookup_by_anagram_select_signature
## ---------------------------------------------------
## SELECT signatures_anagrams_select_signature.Expr1,
##        signatures_anagrams_select_signature.CountOfCONS,
##        Cons_alpha_only_signature_unique.CONS,
##        Cons_alpha_only_signature_unique.signature
## FROM signatures_anagrams_select_signature
## INNER JOIN Cons_alpha_only_signature_unique
##         ON signatures_anagrams_select_signature.signature
##          = Cons_alpha_only_signature_unique.signature;
##
##
## Now it's a simple matter to use the ODBC from Win32 to extract
## the query output into Python.

import dbi
import odbc

con = odbc.odbc("words")
cursor = con.cursor()

## This first section grabs the anagram summary. Note that
## queries act just like tables (as long as they don't have
## internal dependencies). I read somewhere you can get the
## field names, but here I put them in by hand.

##cursor.execute("SELECT * FROM signatures_anagram_summary")
##
##results = cursor.fetchall()
##
##for i in results:
##    for j in i:
##        print '%4s' % (str(j)),
##    print

##         2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   23
##    2  259 None None None None None None None None None None None None None None None None None
##    3  487  348  218  150  102 None None None None None None None None None None None None None
##    4 1343  718  398  236  142  101   51   26   25    9    8    3    2 None None None None None
##    5 3182 1424  777  419  274  163  106   83   53   23   20   10    6    4    5    1    3    1
##    6 5887 2314 1051  545  302  170  114   54   43   21   15    6    5    4    4    2 None None
##    7 7321 2251  886  390  151   76   49   37   14    7    5    1    1    1 None None None None
##    8 6993 1505  452  166   47   23    8    6    4    2    2 None None None None None None None
##    9 5127  830  197   47   17    6 None None    1 None None None None None None None None None
##   10 2975  328   66    8    2 None None None None None None None None None None None None None
##   11 1579  100    5    4    2 None None None None None None None None None None None None None
##   12  781   39    2    1 None None None None None None None None None None None None None None
##   13  326   11    2 None None None None None None None None None None None None None None None
##   14  166    2 None None None None None None None None None None None None None None None None
##   15   91 None    1 None None None None None None None None None None None None None None None
##   16   60 None None None None None None None None None None None None None None None None None
##   17   35 None None None None None None None None None None None None None None None None None
##   18   24 None None None None None None None None None None None None None None None None None
##   19   11 None None None None None None None None None None None None None None None None None
##   20    6 None None None None None None None None None None None None None None None None None
##   21    6 None None None None None None None None None None None None None None None None None
##   22    4 None None None None None None None None None None None None None None None None None

## From the query we have the word size as row header and size of
## anagram set as column header. The data value is the count of
## how many different anagram sets match the row/column header.
##
## For example, there are 7321 different 7-letter signatures that
## have 2-member anagram sets. There is one 5-letter signature with
## a 23-member anagram set.
##
## We can then pick any of these, say the single 10-member anagram
## set of 9-letter words, and query out the anagrams:

cursor.execute("SELECT * FROM signatures_lookup_by_anagram_select_signature")
results = cursor.fetchall()
for i in results:
    for j in i:
        print j,
    print

## 9 10 anoretics 10101000100001100111000000
## 9 10 atroscine 10101000100001100111000000
## 9 10 certosina 10101000100001100111000000
## 9 10 creations 10101000100001100111000000
## 9 10 narcotise 10101000100001100111000000
## 9 10 ostracine 10101000100001100111000000
## 9 10 reactions 10101000100001100111000000
## 9 10 secration 10101000100001100111000000
## 9 10 tinoceras 10101000100001100111000000
## 9 10 tricosane 10101000100001100111000000

## Nifty, eh?
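
The signature idea above can also be tried without a database. This is a minimal sketch in Python 3 (the post itself uses Python 2); the function names `signature` and `anagram_sets` and the small word list are illustrative assumptions, not the author's code.

```python
from collections import defaultdict
import string

def signature(word):
    """26-character string: the count of each letter a..z, one digit each."""
    word = word.lower()
    counts = [word.count(letter) for letter in string.ascii_lowercase]
    if max(counts) > 9:
        # A count of 10 or more would not fit in a single digit
        raise ValueError("a letter occurs more than 9 times: %r" % word)
    return "".join(str(c) for c in counts)

def anagram_sets(words):
    """Group words by signature; keep only groups with more than one member."""
    groups = defaultdict(list)
    for w in words:
        groups[signature(w)].append(w)
    return {sig: sorted(ws) for sig, ws in groups.items() if len(ws) > 1}

print(signature("aah"))
sets = anagram_sets(["reactions", "creations", "aah", "aal", "narcotise"])
for sig, members in sets.items():
    print(sig, members)
```

Computing signatures on the fly like this trades the one-time precomputation described in the post for simplicity; for a list of several hundred thousand words, storing the signature once is likely the better fit.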

dgoldsmith_89 · Jan 8, 2008

Yes, nifty. Thanks for all the help, all!

DG
