Re: Stripping ASCII codes when parsing

Discussion in 'Python' started by David Pratt, Oct 17, 2005.

  1. David Pratt

    David Pratt Guest

    This is very nice :) Thank you, Tony. I think this will be the way to
    go. My concern at the moment is where it is best to convert to unicode.
    After this step the data goes into a dict, through a few processes, and
    then into a database. Because the input source has no explicit encoding,
    I believe I will have to assume ISO-8859-1, though it could well be
    cp1252 for the most part (the format disallows ASCII 0-30 but allows
    characters 128-254, and most of the users are on Windows). I am thinking
    of stripping these characters and validating the text first, then
    converting to unicode (utf-8) so that it is unicode in the dict. The
    other processes should then see uniform data, and it will go into the
    database as unicode. I think this should be ok.

    Regards,
    David

    On Monday, October 17, 2005, at 01:48 PM, Tony Nelson wrote:

    > In article <>,
    > David Pratt <> wrote:
    >
    >> I am working with a text format that advises stripping any ASCII
    >> control characters (0 - 30), and also the ASCII pipe character (124),
    >> from the data as part of parsing. I think many of these characters
    >> are from a different time. Since I have never seen most of them in
    >> text, I am not sure how these control characters are all represented
    >> (other than, say, tab (\t), newline (\n), and carriage return (\r)),
    >> so what should I do to remove these characters if they are ever
    >> encountered? Many thanks.
    >
    > Most of those characters are hard to see.
    >
    > Represent arbitrary characters in a string in hex: "\x00\x01\x02" or
    > with chr(n).
    >
    > If you just want to remove some characters, look into "".translate().
    >
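    > # identity table: translate() maps every byte to itself (Python 2 str)
    > nullxlate = "".join([chr(n) for n in xrange(256)])
    > # bytes to delete: control characters 0-30 plus the pipe (124)
    > delchars = nullxlate[:31] + chr(124)
    > outputstr = inputstr.translate(nullxlate, delchars)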
    > ________________________________________________________________________
    > TonyN.:' *firstname*nlsnews@georgea*lastname*.com
    > ' <http://www.georgeanelson.com/>
    >
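
For readers on Python 3, where xrange() and the two-argument str.translate() quoted above no longer exist, the same deletion can be sketched on raw bytes. This is only an illustration under the thread's stated rule (drop bytes 0-30 and the pipe, 124); the names DELCHARS and strip_control are made up for this example and do not come from the original posts.

```python
# Bytes to delete: control characters 0-30 plus the pipe character (124).
# DELCHARS and strip_control are illustrative names, not from the thread.
DELCHARS = bytes(range(31)) + b"|"


def strip_control(data: bytes) -> bytes:
    """Return data with the bytes listed in DELCHARS removed."""
    # With None as the table, bytes.translate() performs no mapping and
    # only deletes the bytes given in the second argument.
    return data.translate(None, DELCHARS)


if __name__ == "__main__":
    # NUL, pipe and tab are dropped; byte 31 (\x1f) is outside 0-30 and stays.
    print(strip_control(b"a\x00b|c\td\x1fe"))
```

Working on bytes before decoding mirrors the order discussed in the thread: strip first, then convert to text.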
    David Pratt, Oct 17, 2005

  2. Tony Nelson

    Tony Nelson Guest

    In article <>,
    David Pratt <> wrote:

    > This is very nice :) Thank you, Tony. I think this will be the way to
    > go. My concern at the moment is where it is best to convert to unicode.
    > After this step the data goes into a dict, through a few processes, and
    > then into a database. Because the input source has no explicit encoding,
    > I believe I will have to assume ISO-8859-1, though it could well be
    > cp1252 for the most part (the format disallows ASCII 0-30 but allows
    > characters 128-254, and most of the users are on Windows). I am thinking
    > of stripping these characters and validating the text first, then
    > converting to unicode (utf-8) so that it is unicode in the dict. The
    > other processes should then see uniform data, and it will go into the
    > database as unicode. I think this should be ok.


    Definitely "".translate() first, then unicode(). See the docs for
    "".translate(). As for the charset: if you can't know it in advance,
    you'll want some way to configure it for when the guess is wrong. Also,
    maybe 255 is not allowed and should be checked for?
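
The order suggested here (strip, then decode, with the charset left configurable) might look roughly like the following in Python 3. It is a sketch under assumptions, not a definitive implementation: clean_and_decode, its encoding default, and the reject_255 switch are hypothetical names, and falling back to ISO-8859-1 when cp1252 cannot decode a byte is one possible policy rather than anything the thread prescribes.

```python
# Bytes to delete: control characters 0-30 plus the pipe character (124).
DELCHARS = bytes(range(31)) + b"|"


def clean_and_decode(raw: bytes, encoding: str = "cp1252",
                     reject_255: bool = False) -> str:
    """Strip unwanted bytes, then decode with a configurable charset.

    The function name, the default encoding and the reject_255 flag are
    assumptions made for this sketch.
    """
    # Optional check for byte 255, which the format may not allow.
    if reject_255 and b"\xff" in raw:
        raise ValueError("byte 255 is not allowed")
    stripped = raw.translate(None, DELCHARS)
    try:
        return stripped.decode(encoding)
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few bytes (such as 0x81) undefined;
        # ISO-8859-1 maps every byte, so it can serve as a last resort.
        return stripped.decode("iso-8859-1")


if __name__ == "__main__":
    print(clean_and_decode(b"caf\xe9|\x07"))
```

Keeping the encoding as a parameter gives a single place to change the guess if the data turns out not to be cp1252.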
    ________________________________________________________________________
    TonyN.:' *firstname*nlsnews@georgea*lastname*.com
    ' <http://www.georgeanelson.com/>
    Tony Nelson, Oct 18, 2005
