Is there a function to remove escape characters from a string?


Stef Mientki

hello,

Is there a function to remove escape characters from a string?
(preferably all escape characters except "\n").

thanks,
Stef
 

James Stroud

Stef said:
hello,

Is there a function to remove escape characters from a string?
(preferably all escape characters except "\n").

thanks,
Stef


import string

WANTED = string.printable[:-5] + "\n"

def descape(s, w=WANTED):
    return "".join(c for c in s if c in w)
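
As a quick check of what this whitelist keeps, here is a minimal sketch assuming Python 3 (the sample string is only illustrative). Tabs and carriage returns are dropped along with everything else outside WANTED:

```python
import string

# Printable ASCII minus the last five whitespace characters
# (tab, newline, carriage return, vertical tab, form feed), plus "\n" again.
WANTED = string.printable[:-5] + "\n"

def descape(s, w=WANTED):
    """Keep only the characters of s that appear in w."""
    return "".join(c for c in s if c in w)

sample = "1\tStef\r\n2\tJames\r\n"
cleaned = descape(sample)
print(repr(cleaned))  # '1Stef\n2James\n'
```

Because the filter is a whitelist, any non-ASCII character (accented letters, for example) is removed as well.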


James


--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com
 

John Machin

hello,

Is there a function to remove escape characters from a string?
(preferably all escape characters except "\n").

"\n" is not what most people would call an escape character. The "\"
is what most people would call an escape character when it is used in
a manner like in a Python non-raw string (e.g. "1\tStef\r\n2\tJames\r
\n").

Assuming (as James has done) that you meant you want to remove all but
"truly visible ASCII characters, plus newline", I'd have to ask: Are
you sure?? Do you really want to throw away tabs, when they might be
separating fields, as in the above example?

Let's start at the beginning:

Python 2.x or 3.x?
Type of your data objects is str/bytes or unicode/str or both?
If str/bytes, what encoding(s)?
What exactly are these "escape characters"?
Are you sure that you need to remove them all i.e. you don't want to
replace some with other characters?

HTH,
John
 

Steven D'Aprano

hello,

Is there a function to remove escape characters from a string?
(preferably all escape characters except "\n").


Can you explain what you mean? I can think of at least four alternatives:

(1) Remove literal escape sequences (backslash-char):
"abc\\t\\ad" => "abcd"
r"abc\t\ad" => "abcd"


(2) Replace literal escape sequences with the character they represent:
"abc\\t\\ad" => "abc\t\ad"


(3) Remove characters generated by escape sequences:
"abc\t\ad" => "abcd"
"abc" => "abc" but "a\x62c" => "ac"

This is likely to be impossible without deep magic.


(4) Remove so-called binary characters which are typically inserted using
escape sequences:
"abc\t\ad" => "abcd"
"abc" => "abc" but "a\x62c" => "abc"

This is probably the easiest, assuming you have bytes instead of unicode.

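import string  # Python 2 only: string.maketrans/string.translate are gone in Python 3

table = string.maketrans('', '')               # identity translation table
delchars = ''.join(chr(n) for n in range(32))  # control characters 0..31, "\n" included

s = string.translate(s, table, delchars)       # delete every char in delchars from s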
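
For readers on Python 3, alternatives (1), (2) and (4) might be sketched as follows. The function names are made up for illustration, and the control-character version deliberately keeps "\n", since that is what the original question asked for. Treat this as one possible reading of each alternative, not a definitive implementation:

```python
import re

def remove_literal_escapes(s):
    """Alternative (1): drop each backslash together with the character after it."""
    return re.sub(r"\\.", "", s)

def interpret_literal_escapes(s):
    """Alternative (2): turn backslash sequences into the characters they denote.

    Encoding with latin-1 plus backslashreplace keeps non-ASCII text intact
    before the unicode_escape codec interprets the backslash sequences.
    """
    return s.encode("latin-1", "backslashreplace").decode("unicode_escape")

# Alternative (4): delete every control character below 32, except newline.
DELETE_CONTROLS = {n: None for n in range(32) if n != ord("\n")}

def remove_control_chars(s):
    """Remove control characters (code points 0..31) but keep "\\n"."""
    return s.translate(DELETE_CONTROLS)

print(repr(remove_literal_escapes(r"abc\t\ad")))     # 'abcd'
print(repr(interpret_literal_escapes(r"abc\t\ad")))  # 'abc\t\x07d'
print(repr(remove_control_chars("abc\t\ad")))        # 'abcd'
```

In Python 3, str.translate takes a mapping from code points to replacements, and a value of None deletes the character, so no separate delchars argument is needed.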
 

Stef Mientki

Steven said:
Can you explain what you mean? I can think of at least four alternatives:

I have the following kind of strings,
the funny "þ" is ASCII character 254, used as a separator character

[FSM]
Counts = "1þ11þ16" ==> 1,11,16
Init1 = "1þ\BCtrl" ==> 1,Ctrl
State5 = "8þ\BJUMP_COMPL\b\n>PCWrite = 1\n>PCSource = 10"
==> 8, JUMP_COMPL\n>PCWrite = 1\n>PCSource = 10

Seeing and testing all your answers, with great solutions that I've
never seen before,
knowing nothing of escape sequences (I'm a windows guy ;-)
I now see that the characters I need to remove, like \B and \b are
not "official" escape sequences.
So in this case the best (easiest to understand) method is a few replace
statements:
s = s.replace ( '\b', '' ).replace( '\B', '' )

Nevertheless, thank you all for the other examples,

cheers,
Stef
 

John Machin

Stef said:
I have the following kind of strings,
the funny "þ" is ASCII character 254, used as a separator character

ASCII ends at 127. Just refer to it as chr(254).

Stef said:
[FSM]
Counts = "1þ11þ16"     ==>   1,11,16
Init1 = "1þ\BCtrl"     ==>    1,Ctrl
State5 = "8þ\BJUMP_COMPL\b\n>PCWrite = 1\n>PCSource = 10"
         ==> 8, JUMP_COMPL\n>PCWrite = 1\n>PCSource = 10

After making those substitutions, what are you going to do with it?
Split it up into fields using the csv module or stuff.split(",") or
some other DIY method? Is there a possibility that whoever "designed"
that data format used chr(254) as a separator because the data fields
contained "," sometimes and so "," could not be used as a separator?
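
To make the concern concrete, here is a small Python 3 sketch. It assumes the value has already been decoded so that the separator is the single character chr(254); the field containing a comma is hypothetical. Splitting directly on the separator keeps such a field whole, while substituting commas first does not:

```python
SEP = chr(254)  # the separator character used in the ini values

def split_fields(value):
    """Split one ini value into fields on the chr(254) separator."""
    return value.split(SEP)

# Hypothetical value whose second field happens to contain a comma.
value = "8" + SEP + "JUMP, then COMPL"

print(split_fields(value))                  # ['8', 'JUMP, then COMPL']
print(value.replace(SEP, ",").split(","))   # ['8', 'JUMP', ' then COMPL']
```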

Stef said:
Seeing and testing all your answers, with great solutions that I've
never seen before,

As far as str methods and built-ins that work on str objects are
concerned, there is no corpus of secret knowledge known only to a
cabal of wizards; it's all in the manual, and you don't need special
magical spectacles to see it :)

Stef said:
knowing nothing of escape sequences (I'm a windows guy ;-)

Why do you think that whether or not you are a "windows guy" is
relevant to knowing anything about escape sequences?

Stef said:
I now see that the characters I need to remove, like  \B  and \b  are
not "official" escape sequences.

\b *is* an "official" escape sequence, just like \n; see below:

| >>> x = '\b'; print len(x), repr(x)
| 1 '\x08'
| >>> x = r'\b'; print len(x), repr(x)
| 2 '\\b'
| >>> x = '\B'; print len(x), repr(x)
| 2 '\\B'
| >>> x = r'\B'; print len(x), repr(x)
| 2 '\\B'

Stef said:
So in this case the best (easiest to understand) method is a few replace
statements:
s = s.replace ( '\b', '' ).replace( '\B',  '' )

It's probable that \b and \B are both TWO-byte sequences, in which
case you should use r'\b' so that it does what you want it to do, and
use r'\B' for consistency.
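
A minimal Python 3 sketch of that advice, assuming (as guessed above) that the data really contains a backslash followed by a letter, i.e. two characters. The helper name strip_markers is made up for illustration:

```python
SEP = chr(254)  # the separator character used in the ini values

def strip_markers(s):
    """Remove the two-character sequences backslash-b and backslash-B."""
    return s.replace(r"\b", "").replace(r"\B", "")

# '\b' is one character (backspace); r'\b' is a backslash followed by 'b'.
print(len("\b"), len(r"\b"))  # 1 2

init1 = "1" + SEP + r"\BCtrl"
print(strip_markers(init1).split(SEP))  # ['1', 'Ctrl']
```

With the non-raw '\b', the first replace would look for a backspace character and leave the backslash-b text in the data untouched.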
 

Stef Mientki

I have the following kind of strings,
the funny "þ" is ASCII character 254, used as a separator character

ASCII ends at 127. Just refer to it as chr(254).

note 1)
[FSM]
Counts = "1þ11þ16" ==> 1,11,16
Init1 = "1þ\BCtrl" ==> 1,Ctrl
State5 = "8þ\BJUMP_COMPL\b\n>PCWrite = 1\n>PCSource = 10"
==> 8, JUMP_COMPL\n>PCWrite = 1\n>PCSource = 10

After making those substitutions, what are you going to do with it?
Split it up into fields using the csv module or stuff.split(",") or
some other DIY method? Is there a possibility that whoever "designed"
that data format used chr(254) as a separator because the data fields
contained "," sometimes and so "," could not be used as a separator?

Yep, chr(254), because it's not in the human range of characters
and it's accepted by windows ini-files.

As far as str methods and built-ins that work on str objects are
concerned, there is no corpus of secret knowledge known only to a
cabal of wizards; it's all in the manual, and you don't need special
magical spectacles to see it :)

note 2)

Why do you think that whether or not you are a "windows guy" is
relevant to knowing anything about escape sequences?

Just a Windows guy,
or maybe better, "being a Windows guy for many years".
Windows users are WYSIWYG users; they do not deal with individual bits.
I personally left escape sequences and values of ASCII characters behind
me more than 25 years ago.
And now maybe you might understand note 1) and note 2).

cheers,
Stef
 

John Machin

Yep, chr(254), because it's not in the human range of characters
and it's accepted by windows ini-files.

...     s = chr(254)
...     enc = 'cp125' + str(i)
...     try:
...         u = s.decode(enc)
...     except UnicodeDecodeError:
...         continue
...     print enc, 'U+%04X' % ord(u), ucd.name(u)
...
cp1250 U+0163 LATIN SMALL LETTER T WITH CEDILLA
cp1251 U+044E CYRILLIC SMALL LETTER YU
cp1252 U+00FE LATIN SMALL LETTER THORN
cp1253 U+03CE GREEK SMALL LETTER OMEGA WITH TONOS
cp1254 U+015F LATIN SMALL LETTER S WITH CEDILLA
cp1257 U+017E LATIN SMALL LETTER Z WITH CARON
cp1258 U+20AB DONG SIGN

Either you have a strange and narrow definition of "human", or you are
so brave as to cheerfully insult (inter alia) Romanians, Russians,
Icelanders, Greeks, Turks, Czechs, Estonians, Finns, Slovaks,
Slovenians, and Vietnamese :)
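
The same experiment can be sketched in Python 3, where bytes and text are separate types. The helper name names_for_byte is made up, and the list of code pages (cp1250 to cp1258) is taken from the output above:

```python
import unicodedata

def names_for_byte(byte_value, encodings):
    """Decode a single byte under each encoding and describe the result."""
    result = {}
    for enc in encodings:
        try:
            ch = bytes([byte_value]).decode(enc)
        except UnicodeDecodeError:
            continue  # this code page does not assign the byte
        # Fall back to a placeholder if the character has no Unicode name.
        name = unicodedata.name(ch, "<unnamed>")
        result[enc] = "U+%04X %s" % (ord(ch), name)
    return result

encodings = ["cp125%d" % i for i in range(9)]
for enc, description in names_for_byte(254, encodings).items():
    print(enc, description)
```

The point is the same either way: byte 254 is an ordinary letter in several Windows code pages, so whether it is safe as a separator depends on the encoding of the data.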
 

Stef Mientki

John said:
...     s = chr(254)
...     enc = 'cp125' + str(i)
...     try:
...         u = s.decode(enc)
...     except UnicodeDecodeError:
...         continue
...     print enc, 'U+%04X' % ord(u), ucd.name(u)
...
cp1250 U+0163 LATIN SMALL LETTER T WITH CEDILLA
cp1251 U+044E CYRILLIC SMALL LETTER YU
cp1252 U+00FE LATIN SMALL LETTER THORN
cp1253 U+03CE GREEK SMALL LETTER OMEGA WITH TONOS
cp1254 U+015F LATIN SMALL LETTER S WITH CEDILLA
cp1257 U+017E LATIN SMALL LETTER Z WITH CARON
cp1258 U+20AB DONG SIGN

Either you have a strange and narrow definition of "human", or you are
so brave as to cheerfully insult (inter alia) Romanians, Russians,
Icelanders, Greeks, Turks, Czechs, Estonians, Finns, Slovaks,
Slovenians, and Vietnamese :)

Sorry if I offended someone, that was certainly not my intention.
And I guess you will be surprised if I tell you that I don't (want to)
understand any bit of the above code ;-)
Come on, the home computer was invented about 1980.
Hardware follows Moore's law;
for software I would expect at least 0.1 of Moore's law ;-)
I hope that clarifies my point.

cheers,
Stef
 

Steven D'Aprano

Sorry if I offended someone, that was certainly not my intention. And I
guess you will be surprised if I tell you that I don't (want to) understand
any bit of the above code ;-) Come on, the home computer was invented
about 1980. Hardware follows Moore's law; for
software I would expect at least 0.1 of Moore's law ;-) I hope that
clarifies my point.

No, that only makes it even more confusing. What does Moore's Law have to
do with your willful ignorance about the existence of human languages
other than English?
 

Stef Mientki

Steven said:
No, that only makes it even more confusing. What does Moore's Law have to
do with your willful ignorance about the existence of human languages
other than English?

Nothing.
I don't even (want to) see what bits / bytes / escape sequences have to
do with modern programming techniques,
so I certainly don't see any relation between these and human languages.

But the lack of Moore's law in software explains why we still need to
concern ourselves with bits and bytes ;-)

cheers,
Stef
 

Martin

2008/12/27 Stef Mientki said:
Nothing.
I don't even (want to) see what bits / bytes / escape sequences have to do
with modern programming techniques,
so I certainly don't see any relation between these and human languages.

But the lack of Moore's law in software explains why we still need to
concern ourselves with bits and bytes ;-)

http://www.joelonsoftware.com/articles/Unicode.html



--
http://soup.alt.delete.co.at
http://www.xing.com/profile/Martin_Marcher
http://www.linkedin.com/in/martinmarcher

You are not free to read this message,
by doing so, you have violated my licence
and are required to urinate publicly. Thank you.

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
 
