file.getvalue() with _ or other characters

martijn

Hi!

I do this to convert HTML to text:

class mvbHTMLParser(htmllib.HTMLParser):

    def __init__(self, formatter, verbose=0):
        htmllib.HTMLParser.__init__(self, formatter, verbose)
        self.imglist = []

    def handle_image(self, src, alt, *args):
        self.imglist.append(src)


file = StringIO.StringIO()
f = formatter.AbstractFormatter(formatter.DumbWriter(file))
p = mvbHTMLParser(f)
p.feed(html)
p.close()

print file.getvalue()

But then the _ characters are gone.
Is it possible to keep that character in file.getvalue()?

[p.anchorlist is OK: test_bla.html]
 
Peter Hansen

file = StringIO.StringIO()
f = formatter.AbstractFormatter(formatter.DumbWriter(file))
p = mvbHTMLParser(f)
p.feed(html)
p.close()

print file.getvalue()

But then the _ characters are gone.
Is it possible to keep that character in file.getvalue()?

I consider this a defect in StringIO, but it's pretty easy to
fix it, at least for the narrow usage you describe:

class PreservingStringIO(StringIO.StringIO):
    def close(self):
        pass

file = PreservingStringIO()
....etc

The problem is (if I'm right about this) that the close()
method on the object returned by mvbHTMLParser() will actually
call close() on the file object in the formatter (whether
directly or not, I don't know). One might consider _this_
a bug as well, but if the above approach works, in the end
it's no big deal.

So basically redefine close() to do nothing (or have it save
a copy of the buffer's getvalue() results first) and you
should be good to go.
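
Both variants of the idea above can be sketched as follows. This is only an illustration, not the thread's exact code: it assumes Python 3, where `io.StringIO` stands in for the `StringIO.StringIO` class used in this thread. `SnapshotStringIO`, its `saved` attribute, and `read_after_close` are hypothetical names introduced for the example.

```python
import io


class PreservingStringIO(io.StringIO):
    """Buffer whose contents stay readable after close()."""

    def close(self):
        # Deliberately do nothing, so getvalue() keeps working afterwards.
        pass


class SnapshotStringIO(io.StringIO):
    """Buffer that stores a copy of its contents before really closing."""

    saved = None  # hypothetical attribute holding the final contents

    def close(self):
        # Only read the buffer while it is still open; close() may be
        # called more than once.
        if not self.closed:
            self.saved = self.getvalue()
        super().close()


def read_after_close(buffer_class):
    """Write to a buffer, close it, then try to read it back."""
    buf = buffer_class()
    buf.write("text_text")
    buf.close()
    try:
        return buf.getvalue()
    except ValueError:
        # A plain, closed StringIO refuses any further I/O.
        return None
```

With the first class, only the line that creates the buffer changes; the rest of the code can stay as it is. The second class suits cases where the buffer really should be closed but its final contents are still needed.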

-Peter
 
martijn

Hmm, I'm a newbie with Python.

I did this, but it doesn't work:

class mvbHTMLParser(htmllib.HTMLParser):

    def __init__(self, formatter, verbose=0):
        htmllib.HTMLParser.__init__(self, formatter, verbose)
        self.imglist = []

    def handle_image(self, src, alt, *args):
        self.imglist.append(src)

class PreservingStringIO(StringIO.StringIO):
    def close(self):
        pass

file = PreservingStringIO()
f = formatter.AbstractFormatter(formatter.DumbWriter(file))
p = mvbHTMLParser(f)
p.feed(html)
p.close()

print file.getvalue()


I will try some things.
 
Peter Otten

I do this to convert HTML to text:
[...]

But then the _ characters are gone.
Is it possible to keep that character in file.getvalue()?

Just to make sure: you did look into the HTML file and verified that there
are actually underscores and not spaces that are _rendered_ similar to "_"
via <u>some text</u> or CSS?
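
One rough way to run that check is to count literal underscores and `<u>` tags in the raw HTML before it reaches the parser. The sketch below is an assumption-laden illustration: `underscore_report` is a hypothetical helper, and it does not detect underlining applied through CSS.

```python
import re


def underscore_report(html):
    """Return (literal underscore count, <u> tag count) for an HTML string."""
    literal = html.count("_")
    # "<u" followed by a word boundary matches <u> and <u class=...>,
    # but not <ul>.
    u_tags = len(re.findall(r"<u\b", html, flags=re.IGNORECASE))
    return literal, u_tags
```

If the first number is zero, the "underscores" seen in a browser come from markup or styling rather than from characters in the text.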

Peter
 
Peter Hansen

I did this, but it doesn't work:

It is quite possible I misunderstood the problem you
were having. I am familiar with a problem with StringIO
whereby if you call close() on it, you can no longer call
getvalue() afterwards. Perhaps that's not the problem
you were seeing.

Can you clarify your comment "But then the _ characters are
away. is it possible to keep that character in file.getvalue()"?

Please show actual (small!) examples of the sort of input
you are dealing with, and the output which you are getting
(if any).

-Peter
 
martijn

Sorry, I needed some sleep.
It works OK.

But I have another question, if you don't mind.

I use this code:
----------------------------------------------------------
import StringIO
import re
import urllib2, htmllib, formatter

class mvbHTMLParser(htmllib.HTMLParser):
    def __init__(self, formatter, verbose=0):
        htmllib.HTMLParser.__init__(self, formatter, verbose)

def getContent(url):
    try:
        line = urllib2.urlopen(url)
        htmlToText(line.read().lower())
    except IOError,(strerror):
        print strerror

def htmlToText(html):
    file = StringIO.StringIO()
    f = formatter.AbstractFormatter(formatter.DumbWriter(file))
    p = mvbHTMLParser(f)
    p.feed(html)
    p.close()

    print file.getvalue()

getContent('http://www.zquare.nl/test.html')
----------------------------------------------------------
Then the output is:
text_text
a_link[1]

That's OK, but how do I delete the [n]?
Like this?: del = re.compile(r'[0-9]',).sub

Thanks for the fast help,
GC-Martijn
 
Peter Otten

class mvbHTMLParser(htmllib.HTMLParser):
    def __init__(self, formatter, verbose=0):
        htmllib.HTMLParser.__init__(self, formatter, verbose)

    def anchor_end(self):
        self.anchor = None

[...]
Then the output is:
text_text
a_link[1]

That's OK, but how do I delete the [n]?
Like this?: del = re.compile(r'[0-9]',).sub

Overriding the anchor_end() method as shown above will suppress the [n]
suffix after links.
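
If stripping the markers from the finished text is preferred over changing the parser, a regular expression can do it as well. The sketch below is an alternative to the override, not part of the answer above, and `strip_link_markers` is a hypothetical helper name. Two things go wrong in the attempt quoted above: `del` is a reserved word in Python and cannot be used as a variable name, and `[0-9]` matches any single digit rather than a number in square brackets.

```python
import re

# Matches a literal "[", one or more digits, and a literal "]".
LINK_MARKER = re.compile(r"\[\d+\]")


def strip_link_markers(text):
    """Remove every [n] marker, e.g. 'a_link[1]' -> 'a_link'."""
    return LINK_MARKER.sub("", text)
```

Note that this also removes bracketed numbers that belong to the page's own text, which is one reason the anchor_end() override may be the safer choice.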

Peter
 
