Trouble writing to database: RSS-reader

Arne

Hi!

I'm trying to make an RSS reader in Python just for fun, and I'm almost
finished. I don't have any syntax errors, but when I run my program,
nothing happens.

This program is supposed to download an .xml file, save the contents in
a buffer file (buffer.txt) and parse the file looking for start tags.
When it has found a start tag, it assumes that the content (between the
start tag and the end tag) is on the same line, so it removes the
start tag and the end tag and saves the content into a
database.

The problem is that I can't find the data in the database! If I watch
my program while I'm running it, I can see that it successfully
downloads the .xml file from the web and saves it in the buffer.

But I don't think that I save the data in the correct way, so it would
be nice if someone had some time to help me.

Full code: http://pastebin.com/m56487698
Saving to database: http://pastebin.com/m7ec69e1b
Retrieving from database: http://pastebin.com/m714c3ef8

And yes, I know that there are RSS parsers already built, but this is
only for learning.
 
Dennis Lee Bieber

> The problem is that I can't find the data in the database! If I watch
> my program while I'm running it, I can see that it successfully
> downloads the .xml file from the web and saves it in the buffer.

Did you COMMIT the transaction with the database?

The DB-API specification is that connections do NOT perform auto-commit;
so if you do a string of INSERTs and just close the connection, the
changes are supposed to be rolled back (discarded).
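The commit behaviour described above can be tried out with a small sketch. It assumes Python 3 and the standard sqlite3 module (the thread does not say which database module Arne uses), and the file name, table layout and count_rows helper are made up for the demo:

```python
import os
import sqlite3
import tempfile


def count_rows(path):
    """Open a fresh connection and count the rows in the demo table."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT COUNT(*) FROM main").fetchone()[0]
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "rss_demo.db")  # hypothetical file name

    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE main (id INTEGER PRIMARY KEY, title TEXT)")
    conn.commit()

    # INSERT without commit: the pending transaction is discarded on close.
    conn.execute("INSERT INTO main (title) VALUES (?)", ("lost",))
    conn.close()
    rows_without_commit = count_rows(db_path)

    # INSERT followed by commit: the row survives the close.
    conn = sqlite3.connect(db_path)
    conn.execute("INSERT INTO main (title) VALUES (?)", ("kept",))
    conn.commit()
    conn.close()
    rows_with_commit = count_rows(db_path)

print(rows_without_commit, rows_with_commit)
```

Other DB-API modules may differ in the details, so check the documentation of the one you actually use.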
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 
Bruno Desthuilliers

Arne wrote:
> Hi!
>
> I'm trying to make an RSS reader in Python just for fun, and I'm almost
> finished.

Bad news: you're not.

> I don't have any syntax errors, but when I run my program,
> nothing happens.
>
> This program is supposed to download an .xml file, save the contents in
> a buffer file (buffer.txt) and parse the file looking for start tags.
> When it has found a start tag, it assumes that the content (between the
> start tag and the end tag) is on the same line,

Very hazardous assumption. FWIW, you can more safely assume this will
almost never be the case. Don't assume *anything* wrt/ newlines
when it comes to XML - you can even have newlines between two attributes
of the same tag...

> so it removes the
> start tag and the end tag and saves the content into a
> database.
>
> The problem is that I can't find the data in the database! If I watch
> my program while I'm running it, I can see that it successfully
> downloads the .xml file from the web and saves it in the buffer.
>
> But I don't think that I save the data in the correct way, so it would
> be nice if someone had some time to help me.
>
> Full code: http://pastebin.com/m56487698
> Saving to database: http://pastebin.com/m7ec69e1b
> Retrieving from database: http://pastebin.com/m714c3ef8

1/ you don't need to make each and every variable an attribute of the
class - only use attributes for what constitutes the object state (ie:
what needs to be maintained between different, possibly unrelated method
calls). In your update_sql method, for example, besides self.connection
and _possibly_ self.cursor, you don't need any attribute - local variables
are enough.

2/ you don't need these <xxx>Stored variables at all - just reset
title/link/description to None *when needed* (cf below), then test these
variables against None.

3/ learn to use if/elif properly !-)

4/ *big* logic flaw (and probably the first cause of your problem): on
*each* iteration, you reset your <xxx>Stored flags to False - whether
you stored something in the database or not. Since you don't expect to
have all three pieces of data on a single line (another wrong assumption:
you might get a whole rss stream as one single big line), I bet you never
write anything into the database.

5/ other big flaw: either use an autoincrement for your primary key -
and *don't* pass any value for it in your query - or provide a
(*unique*) id yourself.

6/ FWIW, also learn to properly use the DB API - don't build your SQL
query using string formatting, but pass the arguments as a tuple, IOW:

    # bad:
    cursor.execute(
        '''INSERT INTO main VALUES(null, %s, %s, %s)'''
        % (title, link, description)
    )

    # good (assuming you're using an autoincrementing key for your id):
    cursor.execute(
        "INSERT INTO main VALUES(<X>, <X>, <X>)",
        (title, link, description)
    )

NB: replace <X> with the appropriate placeholder for your database - cf
your db module documentation (usually either '?' or '%s'). If you leave
the id out, most databases also want the target columns named explicitly
after the table name - adjust that to your own schema.

This will make the db module properly escape and convert values.

7/ str.replace() doesn't modify the string in-place (Python strings are
immutable), but returns a new string. so you want:
line = line.replace('x', 'y')

8/ you don't need to explicitly call connection.commit on each and
every statement, and you don't need to call it at all on SELECT
statements !-)

9/ have you tried calling print_rss *twice* on the same instance ?-)

10/ are you sure it's useful to open the same 'buffer.txt' file for
writing *twice* (once in __init__, the other in update_sql)? BTW, use
open(), not file().

11/ are you sure you need to use this buffer file at all ?

12/ are you really *sure* you want to *destroy* your table and recreate
it each time you call your script ?
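Points 5, 6, 8 and 12 above can be put together in a rough sketch. It assumes Python 3 and the standard sqlite3 module (whose placeholder is '?'); the column names and sample entries are invented here, since the real schema is only in the pastebin:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for the demo only
cursor = conn.cursor()

# Point 12: create the table only if it is missing, instead of dropping it.
# Point 5: INTEGER PRIMARY KEY lets SQLite assign the id by itself.
cursor.execute(
    "CREATE TABLE IF NOT EXISTS main ("
    "id INTEGER PRIMARY KEY, title TEXT, link TEXT, description TEXT)"
)

# Hypothetical feed entries; note the quote and percent characters.
entries = [
    ("Bob's post", "http://example.com/1", "Quotes ' and % are safe here"),
    ("Second post", "http://example.com/2", "Another entry"),
]

# Point 6: the values travel as tuples, never through string formatting.
cursor.executemany(
    "INSERT INTO main (title, link, description) VALUES (?, ?, ?)",
    entries,
)

# Point 8: one commit for the whole batch, none for the SELECT.
conn.commit()

rows = cursor.execute("SELECT id, title FROM main ORDER BY id").fetchall()
conn.close()
print(rows)
```

The id column is simply left out of the INSERT, so the database picks the next free value.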

> And yes, I know that there are RSS parsers already built, but this is
> only for learning.

This should not prevent you from learning how to properly parse XML
(hint: with an XML parser). XML is *not* a line-oriented format, so you
just can't get anywhere trying to parse it this way.
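To illustrate why line-by-line parsing breaks down, here is a small sketch (Python 3, standard xml.etree.ElementTree; the two sample documents and the summarize helper are made up). The same item is written once on a single line and once spread over several lines, with a newline between two attributes, and a real parser gives the same result for both:

```python
import xml.etree.ElementTree as ET

# The same (invented) item, laid out in two different ways.
one_line = (
    "<item><title>Hello</title>"
    "<link href='http://example.com/1' rel='alternate'/></item>"
)

multi_line = """<item>
  <title>
    Hello
  </title>
  <link href='http://example.com/1'
        rel='alternate'/>
</item>"""


def summarize(xml_text):
    """Return (title, href) for one item, whatever its line layout."""
    item = ET.fromstring(xml_text)
    return item.findtext("title").strip(), item.find("link").get("href")


same = summarize(one_line) == summarize(multi_line)
print(summarize(multi_line), same)
```

A find-and-replace over individual lines would handle the first layout at best, and neither of them reliably.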



HTH
 
Bruno Desthuilliers

Dennis Lee Bieber wrote:
> Did you COMMIT the transaction with the database?

Did you READ the code ?-)

NB : Yes, he did. The problem*s* are elsewhere.
 
Gabriel Genellina

> I'm trying to make an RSS reader in Python just for fun, and I'm almost
> finished. I don't have any syntax errors, but when I run my program,
> nothing happens.
>
> This program is supposed to download an .xml file, save the contents in
> a buffer file (buffer.txt) and parse the file looking for start tags.
> When it has found a start tag, it assumes that the content (between the
> start tag and the end tag) is on the same line, so it removes the
> start tag and the end tag and saves the content into a
> database.

That's a gratuitous assumption and may not hold for many sources; you
should use a proper XML parser instead (using ElementTree, for example, is
even easier than your sequence of find and replace).

> The problem is that I can't find the data in the database! If I watch
> my program while I'm running it, I can see that it successfully
> downloads the .xml file from the web and saves it in the buffer.

OK. So the problem should be either when you read the buffer again, when
processing it, or when saving to the database.

It's very strange to create the table each time you want to save anything,
but this gives you another clue: the table is created and remains empty,
else the select statement in print_rss would have failed. So you know that
those lines are executed. Now, the print statement is your friend:

    self.buffer = file('buffer.txt')
    for line in self.buffer.readline():
        print "line=",line # add this and see what you get
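What that print is likely to reveal can be reproduced without the real file. A minimal sketch, assuming Python 3 and using io.StringIO with invented feed text in place of buffer.txt:

```python
import io

# Invented stand-in for the contents of buffer.txt.
feed_text = "<title>First</title>\n<link>http://example.com/1</link>\n"

# readline() returns ONE string, so looping over it walks the
# characters of the first line only.
chars = [piece for piece in io.StringIO(feed_text).readline()]

# Looping over the file object itself yields one line per iteration.
lines = [piece for piece in io.StringIO(feed_text)]

print(chars[:3])
print(lines)
```

So each "line" seen by the loop would be a single character, and none of the tag searches could ever match.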

Once you get your code working, it's time to analyze it. I think someone
told you "in Python, you have to use self. everywhere" and you read it
literally. Let's see:

    def update_buffer(self):
        self.buffer = file('buffer.txt', 'w')
        self.temp_buffer = urllib2.urlopen(self.rssurl).read()
        self.buffer.write(self.temp_buffer)
        self.buffer.close()

All those "self." are unneeded and wrong. You *can*, and *should*, use
local variables. Perhaps it's a bit hard to grasp at first, but local
variables, instance attributes and global variables are different things
used for different purposes. I'll try an example: you [an object] have a
diary, where you record things that you have to remember [your instance
attributes, or "data members" as they are called in other languages]. You
also carry a tiny notepad in your pocket, where you make a few notes when
you are doing something, but you always throw away the page once the job
is finished [local variables]. Your brothers, sisters and parents [other
objects] use the same scheme, but there is a whiteboard in the kitchen
where important things that all of you have to know are recorded [global
variables] (anybody can read and write on the board).

Now, back to the code, why "self." everywhere? Let's see, self.buffer is a
file: opened, written, and closed, all inside the same function. Once it's
closed, there is no need to keep a reference to the file elsewhere. It's
discardable, like your notepad pages: use a local variable instead. In fact,
*all* your variables should be locals; the *only* things you should keep
inside your object are rssurl and the database location, and perhaps
temp_buffer (with another, more meaningful name, rssdata for example).
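As a rough sketch of that split between attributes and locals, assuming Python 3 (where urllib2 became urllib.request) and a made-up class name RssReader, update_buffer could look something like this:

```python
import urllib.request


class RssReader:
    """Hypothetical reader that keeps only real object state in self."""

    def __init__(self, rssurl):
        self.rssurl = rssurl   # needed by several methods: an attribute
        self.rssdata = None    # last downloaded feed, kept for later parsing

    def update_buffer(self):
        # 'response' and 'data' are notepad pages: locals, gone on return.
        with urllib.request.urlopen(self.rssurl) as response:
            data = response.read()
        self.rssdata = data
        return data
```

Usage would be along the lines of `reader = RssReader('http://www.jabber.org/news/rss.xml')` followed by `reader.update_buffer()`; the buffer file is left out here because, as noted elsewhere in the thread, it may not be needed at all.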

Other -more or less random- remarks:

    if self.titleStored == True and self.linkStored == True and descriptionStored == True:

Don't compare against True/False. Just use their boolean value:

    if titleStored and linkStored and descriptionStored:

Your code resets those flags at *every* line read, and since a line
contains at most one tag, they will never be True at the same time. You
should reset the flags only after you got the three items and wrote them
onto the database.
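The reset-only-after-storing logic might be sketched as follows (Python 3). It deliberately keeps the one-tag-per-line assumption, purely to show the flag handling; collect_items and the sample lines are invented, and a list stands in for the database:

```python
def collect_items(lines):
    """Group title/link/description values into complete items."""
    items = []
    current = {"title": None, "link": None, "description": None}
    for line in lines:
        line = line.strip()
        for tag in current:
            start, end = "<%s>" % tag, "</%s>" % tag
            if line.startswith(start) and line.endswith(end):
                current[tag] = line[len(start):-len(end)]
        # Store and reset ONLY once all three values have been seen.
        if all(value is not None for value in current.values()):
            items.append(current)  # this is where the INSERT would go
            current = dict.fromkeys(current)
    return items


sample = [
    "<title>First post</title>",
    "<link>http://example.com/1</link>",
    "<description>Hello</description>",
    "<title>Second post</title>",
    "<link>http://example.com/2</link>",
    "<description>World</description>",
]
items = collect_items(sample)
print(len(items))
```

Testing against None replaces the three separate Stored flags, as suggested earlier in the thread.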

The rss feed, after being read, is available in self.temp_buffer; why do
you read it again from the buffer file? If you want to iterate over the
individual lines, use:

    for line in self.temp_buffer.splitlines():
 
Arne

> This should not prevent you from learning how to properly parse XML
> (hint: with an XML parser). XML is *not* a line-oriented format, so you
> just can't get anywhere trying to parse it this way.
>
> HTH

Do you think I should use xml.dom.minidom for this? I've never used
it, and I don't know how to use it, but I've heard it's useful.

So, I shouldn't use this techinicke (probably wrong spelled) trying to
parse XML? Should I rather use minidom?

Thank you for answering, I've learnt a lot from both of you,
Desthuilliers and Genellina! :)
 
Bruno Desthuilliers

Arne wrote:
> Do you think I should use xml.dom.minidom for this?

I'd rather go for a SAX parser. A DOM parser is only useful if you need
an in-memory representation of the whole document tree.

> So, I shouldn't use this techinicke (probably wrong spelled)

May I suggest "technic" ?-)
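For what the SAX route could look like, here is a minimal sketch using the standard xml.sax module under Python 3. The ItemHandler class and the sample feed are invented for the example; the handler collects the three fields of each item as the parser streams through the document, without building a tree:

```python
import xml.sax


class ItemHandler(xml.sax.ContentHandler):
    """Hypothetical handler collecting title/link/description per item."""

    FIELDS = ("title", "link", "description")

    def __init__(self):
        super().__init__()
        self.items = []
        self._item = None    # dict for the <item> being read, if any
        self._field = None   # name of the field being read, if any

    def startElement(self, name, attrs):
        if name == "item":
            self._item = dict.fromkeys(self.FIELDS, "")
        elif self._item is not None and name in self.FIELDS:
            self._field = name

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._field is not None:
            self._item[self._field] += content

    def endElement(self, name):
        if name == "item":
            self.items.append(self._item)
            self._item = None
        elif name == self._field:
            self._field = None


sample = b"""<rss><channel><title>Feed</title>
<item><title>First
post</title><link>http://example.com/1</link>
<description>Hello</description></item>
</channel></rss>"""

handler = ItemHandler()
xml.sax.parseString(sample, handler)
print(handler.items)
```

The channel-level title is ignored because no item is open at that point, and the title that spans two lines is handled without any special casing.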
 
Gabriel Genellina

> Do you think I should use xml.dom.minidom for this? I've never used
> it, and I don't know how to use it, but I've heard it's useful.
>
> So, I shouldn't use this techinicke (probably wrong spelled) trying to
> parse XML? Should I rather use minidom?
>
> Thank you for answering, I've learnt a lot from both of you,
> Desthuilliers and Genellina! :)

Try ElementTree instead; there is an implementation included with Python
2.5, documentation at http://effbot.org/zone/element.htm and another
implementation available at http://codespeak.net/lxml/

import xml.etree.cElementTree as ET
import urllib2

rssurl = 'http://www.jabber.org/news/rss.xml'
rssdata = urllib2.urlopen(rssurl).read()
rssdata = rssdata.replace('&', '&amp;') # ouch!

tree = ET.fromstring(rssdata)
for item in tree.getiterator('item'):
    print item.find('link').text
    print item.find('title').text
    print item.find('description').text
    print

Note that this particular RSS feed is NOT a well formed XML document - I
had to replace the & with &amp; to make the parser happy.
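One caveat with the blanket replace: it also rewrites ampersands that already start a valid entity, so an existing &amp; turns into &amp;amp;. A slightly narrower heuristic, sketched here for Python 3 with an invented helper name and sample string, only escapes ampersands that do not look like the start of an entity or character reference:

```python
import re
import xml.etree.ElementTree as ET

# An ampersand NOT followed by something shaped like 'name;', '#123;'
# or '#x1F;'. This is a heuristic, not a real XML repair tool.
BARE_AMP = re.compile(r"&(?!#\d+;|#x[0-9A-Fa-f]+;|\w+;)")


def escape_bare_ampersands(text):
    """Escape stray ampersands, leaving existing references alone."""
    return BARE_AMP.sub("&amp;", text)


# Invented sample: one stray '&', one proper entity, one char reference.
broken = "<item><title>Tom & Jerry &amp; friends &#169;</title></item>"
fixed = escape_bare_ampersands(broken)
title = ET.fromstring(fixed).findtext("title")
print(fixed)
```

Named entities that XML does not define (the HTML-only ones) are left untouched by this and would still be rejected by the parser, so it is no cure for every broken feed.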
 
MRAB

Bruno Desthuilliers wrote:
> I'd rather go for a SAX parser. A DOM parser is only useful if you need
> an in-memory representation of the whole document tree.
> [...]
> May I suggest "technic" ?-)

That should be "technique"; just ask a Francophone! :)
 
Dennis Lee Bieber

> Did you READ the code ?-)

I was in a bit of a hurry to get to work (why, I don't know -- since
I'm due to be surplused in three weeks)... so I took a quick grab at the
most common reason for not finding "expected" data in a database -- the
common /lack/ of commits.
 
Arne

Gabriel Genellina wrote:
> On Mon, 21 Jan 2008 18:38:48 -0200, Arne <[email protected]> wrote:
> [...]
>
> Try ElementTree instead; there is an implementation included with Python
> 2.5, documentation at http://effbot.org/zone/element.htm and another
> implementation available at http://codespeak.net/lxml/
> [...]
> Note that this particular RSS feed is NOT a well formed XML document - I
> had to replace the & with &amp; to make the parser happy.

This looks very interesting! But it looks like no document is
well-formed! I've tried several RSS feeds, but they all fail with either
"undefined entity" or "not well-formed". This is not how it should be,
right? :)
 
Gabriel Genellina

> This looks very interesting! But it looks like no document is
> well-formed! I've tried several RSS feeds, but they all fail with either
> "undefined entity" or "not well-formed". This is not how it should be,
> right? :)

Well, the RSS feed "should" be valid XML...
Try a more forgiving parser like BeautifulStoneSoup (the XML class in
BeautifulSoup), or preprocess the input with Tidy or a similar program
before feeding it to ElementTree.
 
member thudfoo

On Mon, 21 Jan 2008 18:38:48 -0200, Arne <[email protected]> wrote:
> [...]
>
> This looks very interesting! But it looks like no document is
> well-formed! I've tried several RSS feeds, but they all fail with either
> "undefined entity" or "not well-formed". This is not how it should be,
> right? :)

Go to http://www.feedparser.org
Download feedparser.py
Read the documentation, at least: you will find out a lot about
working with RSS.
 
