Wikipedia XML Dump

kevingloveruk

Hi

I have downloaded and unzipped the XML dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".

I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:

Is what I am trying to do actually feasible?

Are there any example programs or code snippets that would help me?

Any advice or guidance would be gratefully received.

Best regards,
Kevin Glover
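
As a quick feasibility check on the counting step itself, a minimal sketch (the sample text and phrases are just placeholders): word-boundary regexes keep "dog's" from matching inside longer words, and re.escape handles the apostrophe.

import re

def count_phrase(text, phrase):
    # \b word boundaries keep "dog's" from matching inside longer words;
    # re.escape protects the apostrophe and any other special characters.
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

sample = "The tail of the dog wagged. The dog's tail, that is."
for phrase in ("of the dog", "dog's"):
    print(phrase, "->", count_phrase(sample, phrase))

The hard part is not the counting but getting the article text out of 40+GB of XML, which is what the replies below address.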
 
Rustom Mody

> I have downloaded and unzipped the XML dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".
> I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:
> Is what I am trying to do actually feasible?

Can't really visualize what you've got...
When you 'download' Wikipedia, what do you get?
One 40GB file?
A zillion files?
Some other database format?

Another point:
SAX is painful to use compared to full lxml (DOM),
but then SAX is the only choice once files cross a certain size.
That's why the above question.

Also, you may want to explore nltk.
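
To make the SAX route concrete, here is a minimal sketch, assuming the usual pages-articles dump layout (one <page> per article, with the wikitext in a <text> element under <revision>); the filename and phrase are placeholders:

import re
import xml.sax

class PhraseCounter(xml.sax.ContentHandler):
    def __init__(self, phrase):
        super().__init__()
        self.pattern = re.compile(r"\b" + re.escape(phrase) + r"\b")
        self.count = 0
        self.in_text = False
        self.buffer = []

    def startElement(self, name, attrs):
        if name == "text":          # wikitext body of one revision
            self.in_text = True
            self.buffer = []

    def characters(self, content):
        if self.in_text:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "text":
            self.in_text = False
            page_text = "".join(self.buffer)
            self.count += len(self.pattern.findall(page_text))

handler = PhraseCounter("of the dog")
# xml.sax streams the file, so the 40+GB dump never has to fit in memory.
xml.sax.parse("enwiki-pages-articles.xml", handler)
print(handler.count)

The handler buffers only one article's text at a time, which is why this works on a file far larger than RAM.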
 
Skip Montanaro

> Another point:
> SAX is painful to use compared to full lxml (DOM),
> but then SAX is the only choice once files cross a certain size.
> That's why the above question.

No matter what the choice of XML parser, I suspect you'll want to convert it to some other form for processing.

Skip
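
One hedged sketch of that conversion: stream the dump once with the stdlib's iterparse, append every article's wikitext to a flat text file, and run all later phrase searches against that file instead of the XML (the filenames are placeholders):

import xml.etree.ElementTree as ET

def localname(tag):
    # The dump puts everything in a default namespace; strip it.
    return tag.rsplit("}", 1)[-1]

# One-off pass: copy each article's wikitext into a flat text file.
with open("wikipedia.txt", "w", encoding="utf-8") as out:
    context = ET.iterparse("enwiki-pages-articles.xml", events=("start", "end"))
    _, root = next(context)            # keep a handle on the root element
    for event, elem in context:
        if event == "end":
            if localname(elem.tag) == "text" and elem.text:
                out.write(elem.text)
                out.write("\n")
            elif localname(elem.tag) == "page":
                root.clear()           # free the finished <page> subtree

After the one-off pass, each search is a sequential scan of wikipedia.txt with the kind of regex count sketched earlier, rather than a fresh 40+GB XML parse.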
 
Kevin Glover

Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1GB. Any further thoughts?

Kevin
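
(A hedged aside on the unzipping: xml.sax.parse accepts any file-like object, so a future run could stream the compressed download directly; the filename below is a placeholder and the no-op handler stands in for a real one.)

import bz2
import xml.sax

# The dump can be parsed straight from the compressed download, without
# first unzipping 40+GB to disk.
with bz2.open("enwiki-pages-articles.xml.bz2", "rb") as f:
    xml.sax.parse(f, xml.sax.ContentHandler())  # swap in a real handler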
 
alex23

> I have downloaded and unzipped the XML dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".
>
> I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:
>
> Is what I am trying to do actually feasible?

Rather than parsing through 40GB+ every time you need to do a search, you should get better performance using an XML database, which will allow you to run queries directly on the XML data.

http://basex.org/ is one such db, and comes with a Python API:

http://docs.basex.org/wiki/Clients
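
For a sense of what that looks like, a minimal sketch, assuming the BaseXClient module from the docs page above, a local BaseX server with its default credentials, and a database named wikipedia already created from the dump; note that this particular query counts pages containing the phrase, not total occurrences:

import BaseXClient  # client module from http://docs.basex.org/wiki/Clients

# Connect to a locally running BaseX server (default port/credentials).
session = BaseXClient.Session("localhost", 1984, "admin", "admin")
try:
    session.execute("open wikipedia")  # a database built from the dump
    # Counts pages whose <text> contains the phrase, not the total
    # number of occurrences.
    query = "count(//*[local-name() = 'text'][contains(., 'of the dog')])"
    print(session.execute("xquery " + query))
finally:
    session.close()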
 
