Clean "Durty" strings

Ulysse

Hello,

I need to clean a string like this:

string =
"""
bonne mentalit&eacute; mec!:) \n <br>bon pour
info moi je suis un serial posteur arceleur dictateur ^^*
\n <br>mais pour avoir des resultats probant il
faut pas faire les mariolles, comme le &quot;fondateur&quot; de bvs
krew \n
mais pour avoir des resultats probant il faut pas faire les mariolles,
comme le &quot;fondateur&quot; de bvs krew \n
"""

into :
bonne mentalité mec!:) bon pour info moi je suis un serial posteur
arceleur dictateur ^^* mais pour avoir des resultats probant il faut
pas faire les mariolles, comme le "fondateur" de bvs krew
mais pour avoir des resultats probant il faut pas faire les mariolles,
comme le "fondateur" de bvs krew

To do this I would like to use only standard libraries.

Thanks
 
Diez B. Roggisch

The obvious way that has been suggested to you at other places is to use
BeautifulSoup.

Ulysse said:
To do this I would like to use only standard libraries.

Then you need to reprogram what BeautifulSoup does. Happy hacking!

Diez
 
rzed

Diez B. Roggisch said:
The obvious way that has been suggested to you at other places
is to use BeautifulSoup.

Then you need to reprogram what BeautifulSoup does. Happy
hacking!

If the OP is constrained to standard libraries, then it may be a
question of defining what should be done more clearly. The extraneous
spaces can be removed by tokenizing the string and rejoining the
tokens. Replacing portions of a string with equivalents is standard
stuff. It might be preferable to create a function that will accept
lists of from and to strings and translate the entire string by
successively applying the replacements. From what I've seen so far,
that would be all the OP needs for this task. It might take a
half-dozen lines of code, plus the from/to table definition.
 
Diez B. Roggisch

rzed said:
If the OP is constrained to standard libraries, then it may be a
question of defining what should be done more clearly. The extraneous
spaces can be removed by tokenizing the string and rejoining the
tokens. Replacing portions of a string with equivalents is standard
stuff. It might be preferable to create a function that will accept
lists of from and to strings and translate the entire string by
successively applying the replacements. From what I've seen so far,
that would be all the OP needs for this task. It might take a
half-dozen lines of code, plus the from/to table definition.

The OP had <br>-tags in his text. Which is _more_ than a half dozen lines of
code to clean up. Because your simple replacement-approach won't help here:

<br>foo <br> bar </br>

Which is perfectly legal HTML, but nasty to parse.

Diez
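
For anyone restricted to the standard library, the tag handling discussed here does not have to be rewritten entirely from scratch. A minimal sketch, assuming Python 3 and its html.parser module (not something proposed in this thread); TextExtractor and strip_tags are hypothetical names chosen for illustration:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text between tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True lets the parser decode &eacute;, &quot; etc.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text nodes only; start and end tags are ignored.
        self.chunks.append(data)


def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # split() and join() collapse runs of whitespace, newlines included.
    return " ".join("".join(parser.chunks).split())


print(strip_tags("<br>foo <br> bar </br>"))  # foo bar
```

Here the parser, rather than a string replacement, decides where each tag begins and ends, so the stray </br> is dropped along with the <br> tags. It is likely less forgiving than BeautifulSoup on badly broken markup.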
 
irstas

Diez B. Roggisch said:
The OP had <br>-tags in his text. Which is _more_ than a half dozen lines of
code to clean up. Because your simple replacement-approach won't help here:

<br>foo <br> bar </br>

Which is perfectly legal HTML, but nasty to parse.

Diez

But it could be that he just wants all HTML tags to disappear, like in
his example. Code like this might be sufficient then:

    re.sub(r'<[^>]+>', '', s)

For whitespace:

    re.sub(r'\s+', ' ', s)

For character entities like &eacute;:

    re.sub(r'&(\w+);', lambda mo: unichr(htmlentitydefs.name2codepoint[mo.group(1)]), s)

and for numeric character references:

    re.sub(r'&#(\d+);', lambda mo: unichr(int(mo.group(1))), s)

That's pretty much it.
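
The expressions above are written for Python 2 (unichr, htmlentitydefs). As a rough sketch of the same three steps on Python 3, assuming html.unescape from the standard library stands in for the two entity substitutions; clean is a hypothetical name:

```python
import html
import re


def clean(s):
    # 1. Drop anything that looks like a tag (naive, see the caveats in this thread).
    s = re.sub(r"<[^>]+>", "", s)
    # 2. Decode named and numeric references: &eacute; -> é, &#233; -> é.
    s = html.unescape(s)
    # 3. Collapse whitespace runs (spaces, newlines) into single spaces.
    return re.sub(r"\s+", " ", s).strip()


print(clean("bonne mentalit&eacute; mec!:) \n <br>bon pour"))
```

Tags are removed before entities are decoded so that an escaped &lt;br&gt; in the text is kept as literal text instead of being stripped as a tag.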

I'd like to see how this transformation can be done with
BeautifulSoup. Well, the last two regexps can be replaced with this:

unicode(BeautifulStoneSoup(s,convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0])
 
Marc 'BlackJack' Rintsch

irstas said:
I'd like to see how this transformation can be done with
BeautifulSoup. Well, the last two regexps can be replaced with this:

unicode(BeautifulStoneSoup(s,convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0])

Completely without regular expressions:

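from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

def main():
    # 'source' is the HTML string to clean
    soup = BeautifulSoup(source, convertEntities=BeautifulSoup.HTML_ENTITIES)
    print ' '.join(''.join(soup(text=True)).split())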

Ciao,
Marc 'BlackJack' Rintsch
 
Michael Hoffman

irstas said:
But it could be that he just wants all HTML tags to disappear, like in
his example. Code like this might be sufficient then:
re.sub(r'<[^>]+>', '', s).

Won't work for, say, this:

<img src="src" alt="<text>">
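
To make the failure concrete, here is a small check, assuming Python 3; the variable names are only illustrative. The character class [^>] stops at the first '>' it meets, which here is the one inside the alt value:

```python
import re

tag_pattern = re.compile(r"<[^>]+>")
markup = '<img src="src" alt="<text>">'

# The match ends at the '>' inside the quoted attribute value,
# so the tail of the tag is left behind as if it were text.
leftover = tag_pattern.sub("", markup)
print(repr(leftover))  # '">'
```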
 
irstas

irstas said:
But it could be that he just wants all HTML tags to disappear, like in
his example. Code like this might be sufficient then:
re.sub(r'<[^>]+>', '', s).

Michael Hoffman said:
Won't work for, say, this:

<img src="src" alt="<text>">

True, but is that legal? I think the alt attribute needs to use &lt;
and &gt;. Although I know what you're going to reply: that
BeautifulSoup probably parses it even if it's invalid HTML. And I
agree that using BeautifulSoup is a better solution than custom
regexps.
 
rzed

Diez B. Roggisch said:
The OP had <br>-tags in his text. Which is _more_ than a half
dozen lines of code to clean up. Because your simple
replacement-approach won't help here:

<br>foo <br> bar </br>

Which is perfectly legal HTML, but nasty to parse.

Well, as I said, given the input the OP supplied, it's not even
necessary to parse it. It isn't clear what the true desired
operation is, but this seems to meet the criteria given:

<code -- the string 's' is wrapped nastily, but ...>
s ="""\
bonne mentalit&eacute; mec!:) \n <br>bon
pour
info moi je suis un serial posteur arceleur dictateur ^^*
\n <br>mais pour avoir des resultats
probant il
faut pas faire les mariolles, comme le &quot;fondateur&quot; de
bvs
krew \n
mais pour avoir des resultats probant il faut pas faire les
mariolles,
comme le &quot;fondateur&quot; de bvs krew \n"""

fromlist = ['<br>', '&eacute;', '&quot;']
tolist = ['', 'é', '"' ]


def withReplacements(s, flist, tlist):
    # Apply each from/to replacement in order.
    for f, t in zip(flist, tlist):
        s = s.replace(f, t)
    return s

print(withReplacements(' '.join(s.split()), fromlist, tolist))

</code>

If the question is about efficiency or robustness or generality,
then that's another set of issues, but that's for the 1.1 version
to handle.
 
