Clean "Durty" strings

Ulysse · Apr 1, 2007

Hello,

I need to clean a string like this:

string =
"""
bonne mentalité mec!

\n bon pour
info moi je suis un serial posteur arceleur dictateur ^^*
\n mais pour avoir des resultats probant il
faut pas faire les mariolles, comme le "fondateur" de bvs
krew \n
mais pour avoir des resultats probant il faut pas faire les mariolles,
comme le "fondateur" de bvs krew \n
"""

into:
bonne mentalité mec!

bon pour info moi je suis un serial posteur
arceleur dictateur ^^* mais pour avoir des resultats probant il faut
pas faire les mariolles, comme le "fondateur" de bvs krew
mais pour avoir des resultats probant il faut pas faire les mariolles,
comme le "fondateur" de bvs krew

To do this I would like to use only standard libraries.

Thanks

Diez B. Roggisch · Apr 2, 2007

Ulysse said:
Hello,

I need to clean a string like this:

string =
"""
bonne mentalité mec! \n bon pour
info moi je suis un serial posteur arceleur dictateur ^^*
\n mais pour avoir des resultats probant il
faut pas faire les mariolles, comme le "fondateur" de bvs
krew \n
mais pour avoir des resultats probant il faut pas faire les mariolles,
comme le "fondateur" de bvs krew \n
"""

into:
bonne mentalité mec! bon pour info moi je suis un serial posteur
arceleur dictateur ^^* mais pour avoir des resultats probant il faut
pas faire les mariolles, comme le "fondateur" de bvs krew
mais pour avoir des resultats probant il faut pas faire les mariolles,
comme le "fondateur" de bvs krew

The obvious way that has been suggested to you at other places is to use
BeautifulSoup.

To do this I would like to use only standard libraries.

Then you need to reprogram what BeautifulSoup does. Happy hacking!

Diez

rzed · Apr 2, 2007

The obvious way that has been suggested to you at other places
is to use BeautifulSoup.

Then you need to reprogram what BeautifulSoup does. Happy
hacking!

If the OP is constrained to standard libraries, then it may be a
question of defining what should be done more clearly. The extraneous
spaces can be removed by tokenizing the string and rejoining the
tokens. Replacing portions of a string with equivalents is standard
stuff. It might be preferable to create a function that will accept
lists of from and to strings and translate the entire string by
successively applying the replacements. From what I've seen so far,
that would be all the OP needs for this task. It might take a half-
dozen lines of code, plus the from/to table definition.

Diez B. Roggisch · Apr 2, 2007

If the OP is constrained to standard libraries, then it may be a
question of defining what should be done more clearly. The extraneous
spaces can be removed by tokenizing the string and rejoining the
tokens. Replacing portions of a string with equivalents is standard
stuff. It might be preferable to create a function that will accept
lists of from and to strings and translate the entire string by
successively applying the replacements. From what I've seen so far,
that would be all the OP needs for this task. It might take a half-
dozen lines of code, plus the from/to table definition.

The OP had HTML tags in his text. Which is _more_ than a half dozen lines of
code to clean up. Because your simple replacement-approach won't help here:

 foo bar 

Which is perfectly legal HTML, but nasty to parse.

Diez

irstas · Apr 2, 2007

The OP had HTML tags in his text. Which is _more_ than a half dozen lines of
code to clean up. Because your simple replacement-approach won't help here:

 foo bar 

Which is perfectly legal HTML, but nasty to parse.

Diez

But it could be that he just wants all HTML tags to disappear, like in
his example. Code like this might be sufficient then:
re.sub(r'<[^>]+>', '', s). For whitespace, re.sub(r'\s+', ' ', s). For
character entities like &eacute;,
re.sub(r'&(\w+);', lambda mo: unichr(htmlentitydefs.name2codepoint[mo.group(1)]), s)
and re.sub(r'&#(\d+);', lambda mo: unichr(int(mo.group(1))), s). That's
pretty much it.
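
For readers on Python 3, where unichr and htmlentitydefs no longer exist, the regex steps above could be sketched with only the standard library roughly as follows. This is a minimal sketch, not the poster's code: it assumes html.unescape is an acceptable stand-in for the two entity regexes, the function name clean is made up for illustration, and the sample input assumes the OP's text contained a <br> tag and an &eacute; entity.

```python
import html
import re

def clean(s):
    """Strip tag-like spans, decode entities, collapse whitespace."""
    # Remove anything that looks like a tag. This runs before entity
    # decoding so that an escaped "&lt;b&gt;" is not mistaken for markup.
    s = re.sub(r'<[^>]+>', '', s)
    # Decode named and numeric character references in one call.
    s = html.unescape(s)
    # Collapse every run of whitespace (including newlines) to one space.
    return ' '.join(s.split())

# Hypothetical sample in the spirit of the OP's string.
print(clean('bonne mentalit&eacute; mec! \n <br>bon pour\ninfo'))
```

Because tags are replaced by the empty string, text on either side of a tag is only separated if the input already had whitespace there.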

I'd like to see how this transformation can be done with
BeautifulSoup. Well, the last two regexps can be replaced with this:

unicode(BeautifulStoneSoup(s,convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0])

Marc 'BlackJack' Rintsch · Apr 2, 2007

I'd like to see how this transformation can be done with
BeautifulSoup. Well, the last two regexps can be replaced with this:

unicode(BeautifulStoneSoup(s,convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0])

Completely without regular expressions:

def main():
    soup = BeautifulSoup(source, convertEntities=BeautifulSoup.HTML_ENTITIES)
    print ' '.join(''.join(soup(text=True)).split())

Ciao,
Marc 'BlackJack' Rintsch

Michael Hoffman · Apr 2, 2007

But it could be that he just wants all HTML tags to disappear, like in
his example. Code like this might be sufficient then:
re.sub(r'<[^>]+>', '', s).

Won't work for, say, this:

<img src="src" alt="<text>">
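
To make the failure concrete, here is a small check of what the tag-stripping regex does with that input. It is only an illustration; the names TAG and broken are made up for this sketch.

```python
import re

# The tag-stripping pattern proposed earlier in the thread.
TAG = re.compile(r'<[^>]+>')

broken = '<img src="src" alt="<text>">'

# The match stops at the first '>' (the one inside the alt value),
# so the tail of the tag is left in the output.
print(TAG.sub('', broken))   # prints: ">
```

The pattern has no notion of quoted attribute values, so any '>' inside an attribute ends the match early.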

irstas · Apr 2, 2007


But it could be that he just wants all HTML tags to disappear, like in
his example. Code like this might be sufficient then:
re.sub(r'<[^>]+>', '', s).


Won't work for, say, this:

<img src="src" alt="<text>">

True, but is that legal? I think the alt attribute needs to use &lt;
and &gt;. Although I know what you're going to reply: that
BeautifulSoup probably parses it even if it's invalid HTML. And I'd
say that I agree, using BeautifulSoup is a better solution than custom
regexps.
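
If a third-party package is not an option, a middle ground between custom regexps and BeautifulSoup might be the standard library's own HTML parser. The following is a hedged Python 3 sketch, not something posted in the thread: TextExtractor and clean are hypothetical names, and joining text chunks with a space (so that a tag acts as a word break) is an assumption about what the OP wants.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes and ignore all tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &eacute;,
        # &#233; and friends before handing text to handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def clean(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    # Assumption: tags separate words, so join chunks with a space,
    # then collapse all whitespace runs to single spaces.
    return ' '.join(' '.join(parser.chunks).split())

# Hypothetical sample in the spirit of the OP's string.
print(clean('bonne mentalit&eacute; mec! \n <br>bon pour\ninfo'))
```

Unlike the regex, a parser keeps track of quoted attribute values. Note that this sketch also keeps the contents of script and style elements as text, which may or may not be wanted.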

rzed · Apr 2, 2007

The OP had HTML tags in his text. Which is _more_ than a half
dozen lines of code to clean up. Because your simple
replacement-approach won't help here:

 foo bar 

Which is perfectly legal HTML, but nasty to parse.

Well, as I said, given the input the OP supplied, it's not even
necessary to parse it. It isn't clear what the true desired
operation is, but this seems to meet the criteria given:

<code -- the string 's' is wrapped nastily, but ...>
s ="""\
bonne mentalité mec!

\n bon
pour
info moi je suis un serial posteur arceleur dictateur ^^*
\n mais pour avoir des resultats
probant il
faut pas faire les mariolles, comme le "fondateur" de
bvs
krew \n
mais pour avoir des resultats probant il faut pas faire les
mariolles,
comme le "fondateur" de bvs krew \n"""

fromlist = ['<br>', '&eacute;', '&quot;']
tolist = ['', 'é', '"']

def withReplacements(s, flist, tlist):
    # apply each from/to replacement in turn
    for f, t in zip(flist, tlist):
        s = s.replace(f, t)
    return s

print withReplacements(' '.join(s.split()),fromlist,tolist)

</code>

If the question is about efficiency or robustness or generality,
then that's another set of issues, but that's for the 1.1 version
to handle.
