file.getvalue() with _ or other characters

martijn · Mar 3, 2005

Hi!

I do this to convert an HTML file to text:

class mvbHTMLParser(htmllib.HTMLParser):

    def __init__(self, formatter, verbose=0):
        htmllib.HTMLParser.__init__(self,formatter,verbose)
        self.imglist = []

    def handle_image(self,src,alt,*args):
        self.imglist.append(src)

file = StringIO.StringIO()
f = formatter.AbstractFormatter(formatter.DumbWriter(file))
p = mvbHTMLParser(f)
p.feed(html)
p.close()

print file.getvalue()

But then the _ characters are gone.
Is it possible to keep that character in file.getvalue()?

[p.anchorlist is OK: test_bla.html]

Peter Hansen · Mar 3, 2005

file = StringIO.StringIO()
f = formatter.AbstractFormatter(formatter.DumbWriter(file))
p = mvbHTMLParser(f)
p.feed(html)
p.close()

print file.getvalue()

But then the _ characters are away.
is it possible to keep that character in file.getvalue()

I consider this a defect in StringIO, but it's pretty easy to
fix it, at least for the narrow usage you describe:

class PreservingStringIO(StringIO.StringIO):
    def close(self):
        pass

file = PreservingStringIO()
....etc

The problem is (if I'm right about this) that the close()
method on the object returned by mvbHTMLParser() will actually
call close() on the file object in the formatter (whether
directly or not, I don't know). One might consider _this_
a bug as well, but if the above approach works, in the end
it's no big deal.

So basically redefine close() to do nothing (or have it save
a copy of the buffer's getvalue() results first) and you
should be good to go.
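
The "save a copy first" variant could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses Python 3's io.StringIO rather than the Python 2 StringIO module from this thread (so that it runs on a current interpreter), and the attribute name `saved` is made up for the example.

```python
import io


class PreservingStringIO(io.StringIO):
    """StringIO variant that keeps a copy of its contents when closed.

    Assumption: Python 3's io.StringIO stands in for the Python 2
    StringIO.StringIO class used in the thread.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.saved = None  # hypothetical attribute, filled in by close()

    def close(self):
        # getvalue() fails on a closed buffer, so copy the text first.
        if not self.closed:
            self.saved = self.getvalue()
        super().close()


buf = PreservingStringIO()
buf.write("text_text\n")
buf.write("a_link\n")
buf.close()

# The buffer itself is closed, but the saved copy is still readable.
print(buf.saved)
```

Unlike the do-nothing close() above, this version still closes the buffer, so anything that checks `closed` afterwards sees the expected state.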

-Peter

martijn · Mar 3, 2005

Mmm, I'm a newbie with Python.

I did this, but it doesn't work:

class mvbHTMLParser(htmllib.HTMLParser):

    def __init__(self, formatter, verbose=0):
        htmllib.HTMLParser.__init__(self,formatter,verbose)
        self.imglist = []

    def handle_image(self,src,alt,*args):
        self.imglist.append(src)

class PreservingStringIO(StringIO.StringIO):
    def close(self):
        pass

file = PreservingStringIO()
f = formatter.AbstractFormatter(formatter.DumbWriter(file))
p = mvbHTMLParser(f)
p.feed(html)
p.close()

print file.getvalue()

---- I will try some things

Peter Otten · Mar 3, 2005

I do this to get a htmlTOtext file
[...]

But then the _ characters are away.
is it possible to keep that character in file.getvalue()

Just to make sure: you did look into the HTML file and verified that there
are actually underscores and not spaces that are _rendered_ similar to "_"
via <u>some text</u> or CSS?

Peter

Peter Hansen · Mar 3, 2005

I did this but don't work:

It is quite possible I misunderstood the problem you
were having. I am familiar with a problem with StringIO
whereby if you call close() on it, you can no longer call
getvalue() afterwards. Perhaps that's not the problem
you were seeing.
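
The close()/getvalue() behaviour described above can be reproduced with a few lines. This is a minimal sketch that assumes Python 3's io.StringIO, where the failure shows up as a ValueError; the Python 2 StringIO module used elsewhere in this thread may fail differently.

```python
import io

buf = io.StringIO()
buf.write("a_link")

# Reading works while the buffer is still open.
before = buf.getvalue()

buf.close()

# After close() the contents are no longer accessible.
try:
    buf.getvalue()
    error = None
except ValueError as exc:
    error = exc

print(before)
print(type(error).__name__)
```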

Can you clarify your comment "But then the _ characters are
away. is it possible to keep that character in file.getvalue()"?

Please show actual (small!) examples of the sort of input
you are dealing with, and the output which you are getting
(if any).

-Peter

martijn · Mar 4, 2005

Sorry, I needed some sleep.
It works OK now.

But I have another question, if you want to answer it.

I use this code:
----------------------------------------------------------
import StringIO
import re
import urllib2,htmllib, formatter

class mvbHTMLParser(htmllib.HTMLParser):
    def __init__(self, formatter, verbose=0):
        htmllib.HTMLParser.__init__(self,formatter,verbose)

def getContent(url):
    try:
        line = urllib2.urlopen(url)
        htmlToText(line.read().lower())
    except IOError,(strerror):
        print strerror

def htmlToText(html):
    file = StringIO.StringIO()
    f = formatter.AbstractFormatter(formatter.DumbWriter(file))
    p = mvbHTMLParser(f)
    p.feed(html)
    p.close()

    print file.getvalue()

getContent('http://www.zquare.nl/test.html')
----------------------------------------------------------
then the output is:
text_text
a_link[1]

That's OK, but how do I delete the [n]?
Like this?: del = re.compile(r'[0-9]',).sub

Thanks for the fast help,
GC-Martijn

Peter Otten · Mar 4, 2005

class mvbHTMLParser(htmllib.HTMLParser):
    def __init__(self, formatter, verbose=0):
        htmllib.HTMLParser.__init__(self,formatter,verbose)

    def anchor_end(self):
        self.anchor = None

[...]

then the output is:
text_text
a_link[1]

that's oke but how to delete [n]
like this? : del = re.compile(r'[0-9]',).sub

Overriding the anchor_end() method as shown above will suppress the [n]
suffix after links.
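
If the text has already been generated, a regular expression is another way to drop the markers afterwards. The sketch below is only an illustration: the names `LINK_MARKER` and `strip_link_markers` are made up, and it is written for Python 3. Two things differ from the attempt quoted above: `del` is a reserved word in Python and cannot be used as a variable name, and the pattern has to match the brackets as well as the digits.

```python
import re

# Matches a bracketed run of digits such as "[1]" or "[12]".
LINK_MARKER = re.compile(r"\[\d+\]")


def strip_link_markers(text):
    """Return text with every "[n]" marker removed (hypothetical helper)."""
    return LINK_MARKER.sub("", text)


sample = "text_text\na_link[1]\n"
cleaned = strip_link_markers(sample)
print(cleaned)
```

Note that this also removes any "[1]" that was part of the page's own text, which is why overriding anchor_end() is the cleaner fix when the parser is under your control.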

Peter
