Fastest way to detect a non-ASCII character in a list of strings.

Discussion in 'Python' started by Dun Peal, Oct 17, 2010.

  1. Dun Peal

    Dun Peal Guest

    `all_ascii(L)` is a function that accepts a list of strings L, and
    returns True if all of those strings contain only ASCII chars, False
    otherwise.

    What's the fastest way to implement `all_ascii(L)`?

    My ideas so far are:

    1. Match against a regexp with a character range: `[ -~]`
    2. Use s.decode('ascii')
    3. `return all(31< ord(c) < 127 for s in L for c in s)`
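
For comparison, the three ideas above can be sketched side by side. This is a minimal sketch assuming Python 3, where `str` is Unicode text and has `encode()` rather than `decode()`, so idea 2 is adapted accordingly. The function names are made up for illustration. Note that the three do not agree on control characters: the regex and the `ord` check accept only space through tilde, while the codec check accepts the whole 0-127 range.

```python
import re

# Printable ASCII only: space (0x20) through tilde (0x7E), as in idea 1.
_PRINTABLE = re.compile(r"[ -~]*")


def all_ascii_regex(strings):
    """Idea 1: every string must match the character range end to end."""
    return all(_PRINTABLE.fullmatch(s) is not None for s in strings)


def all_ascii_codec(strings):
    """Idea 2, adapted for Python 3 (assumption): encode instead of decode.

    Accepts every code point in 0-127, control characters included.
    """
    try:
        for s in strings:
            s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def all_ascii_ord(strings):
    """Idea 3: per-character range check, 32 through 126 inclusive."""
    return all(31 < ord(c) < 127 for s in strings for c in s)
```

Which one is fastest depends on the interpreter version and the input, so it is worth timing each with `timeit` on representative data rather than guessing.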

    Any other ideas? Which one do you think will be fastest?

    Will reply with final benchmarks and implementations if there's any interest.

    Thanks, D
     
    Dun Peal, Oct 17, 2010
    #1

  2. Seebs

    Seebs Guest

    On 2010-10-17, Dun Peal wrote:
    > What's the fastest way to implement `all_ascii(L)`?


    Start by defining it.

    > 1. Match against a regexp with a character range: `[ -~]`


    What about tabs and newlines? For that matter, what about DEL and
    BEL? Seems to me that the entire 0-127 range is "ASCII characters".
    Perhaps you mean "printable"?
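
To make the distinction concrete, here is a small sketch (Python 3 assumed, helper name hypothetical) that classifies a character under both readings: "ASCII" as the full 0-127 range, and "printable" as space through tilde.

```python
def classify(ch):
    """Return (is_ascii, is_printable_ascii) for a single character."""
    code = ord(ch)
    is_ascii = code < 128            # full 7-bit range, 0 through 127
    is_printable = 32 <= code <= 126  # space through tilde, i.e. [ -~]
    return is_ascii, is_printable


# Tab, newline, BEL and DEL are ASCII but fall outside the printable range.
for name, ch in [("tab", "\t"), ("newline", "\n"), ("BEL", "\x07"), ("DEL", "\x7f")]:
    print(name, classify(ch))
```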

    > Any other ideas? Which one do you think will be fastest?


    I'd guess that a suitable regex (and see whether there's an
    existing character class that already has the right semantics) will
    be by far the fastest. Just anchor it on both ends and nothing will
    have to do any fancy evaluation to test it.
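
As a sketch of the anchored-regex suggestion, assuming Python 3 and the "entire 0-127 range" definition: the first function anchors the pattern on both ends, and the second is an alternative phrasing (not from the post) that searches for the first character outside the range instead. The function names are illustrative only.

```python
import re

# Anchored on both ends: the whole string must consist of code points 0-127.
_ANCHORED = re.compile(r"\A[\x00-\x7f]*\Z")

# Alternative: look for any single character outside 0-127.
_NON_ASCII = re.compile(r"[^\x00-\x7f]")


def all_ascii_anchored(strings):
    """True if every string matches the anchored 0-127 pattern."""
    return all(_ANCHORED.match(s) is not None for s in strings)


def all_ascii_negated(strings):
    """True if no string contains a character outside 0-127."""
    return not any(_NON_ASCII.search(s) for s in strings)
```

Swapping `[\x00-\x7f]` for `[ -~]` gives the printable-only variant. Whether either beats the other approaches is something to measure, not assume.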

    -s
    --
    Copyright 2010, all wrongs reversed. Peter Seebach /
    http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
    http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
    I am not speaking for my employer, although they do rent some of my opinions.
     
    Seebs, Oct 17, 2010
    #2

  3. Carl Banks

    Carl Banks Guest

    On Oct 17, 12:59 pm, Dun Peal wrote:
    > `all_ascii(L)` is a function that accepts a list of strings L, and
    > returns True if all of those strings contain only ASCII chars, False
    > otherwise.
    >
    > What's the fastest way to implement `all_ascii(L)`?
    >
    > My ideas so far are:
    >
    > 1. Match against a regexp with a character range: `[ -~]`
    > 2. Use s.decode('ascii')
    > 3. `return all(31< ord(c) < 127 for s in L for c in s)`
    >
    > Any other ideas?  Which one do you think will be fastest?


    If you use numpy, the fastest way might be something like:

    ns = np.ndarray(len(s),np.uint8,s)
    return np.all(np.logical_and(ns>=32,ns<=127))
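
A self-contained version of the same idea might look like the sketch below. It assumes Python 3 and a current numpy, encodes `str` input to UTF-8 so that any non-ASCII character shows up as bytes of 128 or above, and uses `np.frombuffer` in place of the `np.ndarray` constructor. The upper bound here is 126 to match the `[ -~]` range from the question; the snippet above uses 127, which also admits DEL. The function name is hypothetical.

```python
import numpy as np


def all_printable_ascii_np(strings):
    """True if every string contains only bytes in 32..126 (space to tilde)."""
    for s in strings:
        # Assumption: str input is encoded as UTF-8; bytes input is used as-is.
        data = s if isinstance(s, bytes) else s.encode("utf-8")
        if not data:
            continue  # an empty string has nothing to check
        codes = np.frombuffer(data, dtype=np.uint8)
        if not np.all((codes >= 32) & (codes <= 126)):
            return False
    return True
```

For many short strings, the per-string overhead of building an array may outweigh the vectorized comparison, so this is likely to pay off mainly on long inputs.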


    Carl Banks
     
    Carl Banks, Oct 18, 2010
    #3
