Re: Need to know if a file has only ASCII characters

Discussion in 'Python' started by Dave Angel, Jun 16, 2009.

  1. Dave Angel

    Dave Angel Guest

    Jorge wrote:
    > Hi there,
    > I'm making an application that reads third-party generated ASCII files, but
    > sometimes the files are corrupted totally or partially and I need to know if
    > it's an ASCII file with *nix line terminators.
    > In Linux I can run the file command, but the application should run on
    > Windows.
    >
    > Any help will be great.
    >
    > Thank you in advance.
    >
    >

    So, which is the assignment:
    1) determine if a file has non-ASCII characters
    2) determine whether the line-endings are crlf or just lf

    In the former case, look at translating the file contents to Unicode,
    specifying ASCII as the source encoding. If it fails, you have non-ASCII
    characters. In the latter case, investigate the 'U' flag of the mode
    parameter in the open() function.

    You also need to ask yourself whether you're doing a validation of the
    file, or doing a "best guess" like the file command.
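
    Something along these lines might work (a rough sketch in Python 2; the
    function names and the blanket read() are my own, not from the thread):

    def is_ascii(name):
        # Decoding as ASCII fails with UnicodeDecodeError on any byte >= 128.
        data = open(name, 'rb').read()
        try:
            data.decode('ascii')
        except UnicodeDecodeError:
            return False
        return True

    def newline_style(name):
        # Universal-newline mode records which conventions were actually seen.
        f = open(name, 'rU')
        f.read()
        kinds = f.newlines   # '\n', '\r\n', '\r', a tuple of them, or None
        f.close()
        return kinds

    A clean *nix ASCII file should then decode without error and report
    newlines of just '\n'.
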
    Dave Angel, Jun 16, 2009
    #1

  2. pdpi

    pdpi Guest

    On Jun 16, 2:17 pm, Dave Angel <> wrote:
    > Jorge wrote:
    > > Hi there,
    > > I'm making an application that reads third-party generated ASCII files, but
    > > sometimes the files are corrupted totally or partially and I need to know if
    > > it's an ASCII file with *nix line terminators.
    > > In Linux I can run the file command, but the application should run on
    > > Windows.

    >
    > > Any help will be great.

    >
    > > Thank you in advance.

    >
    > So, which is the assignment:
    >    1) determine if a file has non-ASCII characters
    >    2) determine whether the line-endings are crlf or just lf
    >
    > In the former case, look at translating the file contents to Unicode,
    > specifying ASCII as the source encoding.  If it fails, you have non-ASCII
    > characters. In the latter case, investigate the 'U' flag of the mode
    > parameter in the open() function.
    >
    > You also need to ask yourself whether you're doing a validation of the
    > file, or doing a "best guess" like the file command.


    From your requirements, you're already assuming something that _should_
    be ASCII, so it's easiest to check for ASCII-ness at the binary level:

    Open the file in binary mode.
    Loop over the bytes.
    Exit with an error upon reading a byte outside the printable range
    (32-126 decimal), unless it is one of a few whitelisted control
    characters (\n, \t -- not \r, since you want UNIX-style line endings).
    Exit with success if the loop ends cleanly.

    This supposes you're dealing strictly with ASCII, and not a full 8-bit
    codepage, of course.
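
    A minimal sketch of that loop, assuming Python 2.6+ and reading in 64 KB
    chunks rather than single bytes (the chunk size, whitelist, and function
    name are my own choices):

    def looks_like_unix_ascii(name):
        allowed = set('\t\n')            # whitelisted control characters; no \r
        with open(name, 'rb') as f:
            for chunk in iter(lambda: f.read(65536), ''):
                for ch in chunk:
                    if ch not in allowed and not (32 <= ord(ch) <= 126):
                        return False     # byte outside printable ASCII
        return True
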
    pdpi, Jun 16, 2009
    #2

  3. norseman

    norseman Guest

    Scott David Daniels wrote:
    > Dave Angel wrote:
    >> Jorge wrote:
    >>> Hi there,
    >>> I'm making an application that reads third-party generated ASCII files,
    >>> but sometimes the files are corrupted totally or partially and I need to
    >>> know if it's an ASCII file with *nix line terminators.
    >>> In Linux I can run the file command, but the application should run on
    >>> Windows.


    you are looking for a \x0D (the Carriage Return) \x0A (the Line feed)
    combination. If present you have Microsoft compatibility. If not you
    don't. If you think High Bits might be part of the corruption, filter
    each byte with byte && \x7F (byte AND'ed with hex 7F or 127 base 10)
    then check for the \x0D \x0A combination.
    Run the test on a known text setup. Intel uses one order and the SUN and
    the internet another. The BIG/Little ending confuses many. Intel
    reverses the order of multibyte numerics. Thus - Small machine has big
    ego or largest byte value last. Big Ending. Big machine has small ego.
    Little Ending. Some coders get the 0D0A backwards, some don't. You
    might want to test both.
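
    In Python that check might look roughly like this (a sketch only; the
    optional masking flag and the return labels are mine):

    def terminator_style(name, mask_high_bit=False):
        data = open(name, 'rb').read()
        if mask_high_bit:
            # byte AND'ed with 0x7F, as described above
            data = ''.join(chr(ord(c) & 0x7F) for c in data)
        if '\r\n' in data:
            return 'dos'      # CR+LF pairs present
        if '\n' in data:
            return 'unix'     # bare LF only
        return 'unknown'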

    (2^32)(2^24)(2^16)(2^8) 4 bytes correct math order little ending
    Intel stores them (2^8)(2^16)(2^24)(2^32) big ending
    SUN/Internet stores them in correct math order.


    Python will use \r\n (0D0A) and \n\r (0A0D) correctly.

    HTH

    Steve
    >>>
    >>> Any help will be great.
    >>>
    >>> Thank you in advance.
    >>>
    >>>

    >> So, which is the assignment:
    >> 1) determine if a file has non-ASCII characters
    >> 2) determine whether the line-endings are crlf or just lf
    >>
    >> In the former case, look at translating the file contents to Unicode,
    >> specifying ASCII as the source encoding. If it fails, you have non-ASCII
    >> characters. In the latter case, investigate the 'U' flag of the mode
    >> parameter in the open() function.
    >>
    >> You also need to ask yourself whether you're doing a validation of the
    >> file, or doing a "best guess" like the file command.
    >>
    >>

    > Also, realize that ASCII is a 7-bit code, with printing characters all
    > greater than space, and very few people use delete ('\x7F'), so you
    > can define a function to determine if a file contains only printing
    > ASCII and a few control characters. This one is False unless some ink
    > would be printed.
    >
    > Python 3.X:
    > def ascii_file(name, controls=b'\t\n'):
    >     ctrls = set(controls + b' ')
    >     with open(name, 'rb') as f:
    >         chars = set(f.read())
    >     return min(chars) >= min(ctrls) and ord('~') >= max(chars
    >         ) and min(chars - ctrls) > ord(' ')
    >
    > Python 2.X:
    > def ascii_file(name, controls='\t\n'):
    >     ctrls = set(controls + ' ')
    >     with open(name, 'rb') as f:
    >         chars = set(f.read())
    >     return min(chars) >= min(ctrls) and '~' >= max(chars
    >         ) and min(chars - ctrls) > ' '
    >
    > For potentially more performance (at least on 2.X), you could do min
    > and max on the data read, and only do the set(data) if the min and
    > max are OK.
    >
    > --Scott David Daniels
    >
    norseman, Jun 16, 2009
    #3
  4. MRAB

    MRAB Guest

    norseman wrote:
    > Scott David Daniels wrote:
    >> Dave Angel wrote:
    >>> Jorge wrote:
    >>>> Hi there,
    >>>> I'm making an application that reads third-party generated ASCII files,
    >>>> but sometimes the files are corrupted totally or partially and I need to
    >>>> know if it's an ASCII file with *nix line terminators.
    >>>> In Linux I can run the file command, but the application should run on
    >>>> Windows.

    >
    > you are looking for a \x0D (the Carriage Return) \x0A (the Line feed)
    > combination. If present you have Microsoft compatibility. If not you
    > don't. If you think High Bits might be part of the corruption, filter
    > each byte with byte && \x7F (byte AND'ed with hex 7F or 127 base 10)
    > then check for the \x0D \x0A combination.
    > Run the test on a known text setup. Intel uses one order and the SUN and
    > the internet another. The BIG/Little ending confuses many. Intel
    > reverses the order of multibyte numerics. Thus - Small machine has big
    > ego or largest byte value last. Big Ending. Big machine has small ego.
    > Little Ending. Some coders get the 0D0A backwards, some don't. You
    > might want to test both.
    >

    In an ASCII file endianness is irrelevant.
    MRAB, Jun 16, 2009
    #4
  5. norseman

    norseman Guest

    Scott David Daniels wrote:
    > norseman wrote:
    >> Scott David Daniels wrote:
    >>> Dave Angel wrote:
    >>>> Jorge wrote: ...
    >>>>> I'm making an application that reads third-party generated ASCII files,
    >>>>> but sometimes the files are corrupted totally or partially and I
    >>>>> need to know if it's an ASCII file with *nix line terminators.
    >>>>> In Linux I can run the file command, but the application should run on
    >>>>> Windows.

    >> you are looking for a \x0D (the Carriage Return) \x0A (the Line feed)
    >> combination. If present you have Microsoft compatibility. If not you
    >> don't. If you think High Bits might be part of the corruption, filter
    >> each byte with byte && \x7F (byte AND'ed with hex 7F or 127 base 10)
    >> then check for the \x0D \x0A combination.

    >
    > Well ASCII defines a \x0D as the return code, and \x0A as line feed.
    > It is unix that is wrong, not Microsoft (don't get me wrong, I know
    > Microsoft has often redefined what it likes invalidly). If you
    > open the file with 'U', Python will return lines w/o the \r character
    > whether or not they started with it, equally well on both unix and
    > Microsoft systems.


    Yep - but if you are on Microsoft systems you will usually need the \r.

    Remove them and open the file in Notepad to see what I mean.
    Wordpad handles the lack of \r OK. Handles larger files too.

    > Many moons ago the high order bit was used as a
    > parity bit, but few communication systems do that these days, so
    > anything with the high bit set is likely corruption.
    >


    OH? How did one transfer binary files over the phone?
    I used PIP or Kermit and it got there just fine, high bits and all.
    Mail and other so-called "text only" programs CAN (but do not necessarily)
    use 7-bit transfer protocols. Can we say MIME? FTP transfers the high
    bit just fine too.
    Set the protocol to 8,1 and none (8 bits, 1 stop bit, no parity).
    As to how his third-party ASCII files are generated? He does not know, I
    do not know, we do not know (or care), so test before use.
    Filter out the high bits, remove all control characters except CR, LF (and
    perhaps keep FF too), then test what's left.

    ASCII
    CR - carriage return       ^M  x0D  \r
    LF - line feed             ^J  x0A  \n
    FF - form feed (new page)  ^L  x0C  \f
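
    A sketch of that filter-then-test step (Python 2; the helper name is
    hypothetical):

    def scrub(data):
        keep = '\r\n\f'                      # CR, LF, FF survive the filter
        out = []
        for c in data:
            c = chr(ord(c) & 0x7F)           # drop the high bit
            if ' ' <= c < '\x7f' or c in keep:
                out.append(c)                # other control chars and DEL are removed
        return ''.join(out)

    The scrubbed text can then be fed to whichever ASCII or line-ending test
    you prefer.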


    >> .... Intel uses one order and the SUN and the internet another. The
    >> BIG/Little ending confuses many. Intel reverses the order of multibyte
    >> numerics. Thus - Small machine has big ego or largest byte value last.
    >> Big Ending. Big machine has small ego.
    >> Little Ending. Some coders get the 0D0A backwards, some don't. You
    >> might want to test both.
    >> (2^32)(2^24)(2^16)(2^8) 4 bytes correct math order little ending
    >> Intel stores them (2^8)(2^16)(2^24)(2^32) big ending
    >> SUN/Internet stores them in correct math order.
    >> Python will use \r\n (0D0A) and \n\r (0A0D) correctly.

    >
    > This is the most confused summary of byte sex I've ever read.
    > There is no such thing as "correct math order" (numbers are numbers).


    "...number are numbers..." Nope! Numbers represented as characters may
    be in ASCII but you should take a look at at IBM mainframes. They use
    EBCDIC and the 'numbers' are different bit patterns. Has anyone taken
    the time to read the IEEE floating point specs? To an electronic
    calculating machine, internally everything is a bit. Bytes are a group
    of bits and the CPU structure determines what a given bit pattern is.
    The computer has no notion of number, character or program instruction.
    It only knows what it is told. Try this - set the next instruction
    (jump) to a data value and watch the machine try to execute it as a
    program instruction. (I assume you can program in assembly. If not -
    don't tell because 'REAL programmers do assembly'. I think the last time
    I used it was 1980 or so. The program ran until the last of the hardware
    died and replacements could not be found. The client hired another to
    write for the new machines and closed shop shortly after. I think the
    owner was tired and found an excuse to retire. :)


    > The '\n\r' vs. '\r\n' has _nothing_ to do with little-endian vs.
    > big-endian. By the way, there are great arguments for each order,
    > and no clear winner.


    I don't care. Not the point. Point is some people get it fouled up and
    cause others problems. Test for both. You will save yourself a great
    deal of trouble in the long run.

    > Network order was defined for sending numbers
    > across a wire, the idea was that you'd unpack them to native order
    > as you pulled the data off the wire.


    "... sending BINARY FORMATTED numbers..." (verses character - type'able)

    Network order was defined to reduce machine time. Since the servers that
    worked day in and day out were SUN, SUN order won.
    I haven't used EBCDIC in so long I really don't remember for sure but it
    seems to me they used SUN order before SUN was around. Same for the
    VAX, I think.

    >
    > The '\n\r' vs. '\r\n' differences harken back to the days when they were
    > format effectors (carriage return moved the carriage to the extreme
    > left, line feed advanced the paper). You needed both to properly
    > position the print head.


    Yep. There wasn't enough intelligence in the old printers to "cook" the
    stream.

    > ASCII uses the pair, and defined the effect
    > of each.


    Actually the Teletype people defined most of the \x00 - \x1f concepts.
    If I remember the trivia correctly - the original teletype used 6-bit bytes.
    The bit pattern was neither ASCII nor EBCDIC. Both of those adopted the
    teletype control-character concept.

    > As ASCII was being worked out, MIT even defined a "line
    > starve" character to move up one line just as line feed went down one.
    > The order of the format effectors most used was '\r\n' because the
    > carriage return involved the most physical motion on many devices, and
    > the vertical motion time of the line feed could happen while the
    > carriage was moving.


    True. My experiment with reversing the two instructions would sometimes
    cause the printer to malfunction. One of my first 'black boxes'
    (filters) included instructions to see and correct the "wrong" pattern.
    Then I had to modify it to allow pure binary to get 'pictures' on the
    dot matrix types.

    > After that, you often added padding bytes
    > (typically ASCII NUL ('\x00') or DEL ('\x7F')) to allow the hardware
    > time to finish before you the did spacing and printing.
    >


    If I remember correctly:
    ASCII NULL        x00   In my opinion, NULL should be no bits set :)
    IBM NULL          x80   IBM card, 80 columns
    Sperry-Rand NULL  x90   S/R card, 90 columns

    Trivia question:
    Why is a byte 8 bits?

    Ans: people have 10 fingers and the hardware to handle morse code
    (single wire - serial transfers) needed timers. 1-start, 8 data, 1-stop
    makes it a count by ten. Burroughs had 10 bits but counting by 12s just
    didn't come 'naturally'.
    That was the best answer I've heard to date. In reality - who knows?

    '...padding...'
    I never did. Never had to. The printers I used had enough buffer to avoid
    that practice. A thirty-two character buffer seemed to be enough to
    prevent overflow. Of course we were using 300 to 1200 baud and DTR
    (pin 19 in most cases) -OR- the RTS and CTS pair of wires to control
    flow, since ^S/^Q could be valid dot matrix bytes. Same for hardwired
    PIP or Kermit transfers.


    > --Scott David Daniels
    >
    >
    norseman, Jun 17, 2009
    #5
  6. On Tue, 16 Jun 2009 10:42:58 -0700, Scott David Daniels wrote:

    > Dave Angel wrote:
    >> Jorge wrote:
    >>> Hi there,
    >>> I'm making an application that reads third-party generated ASCII files,
    >>> but sometimes the files are corrupted totally or partially and I need to
    >>> know if it's an ASCII file with *nix line terminators. In Linux I can run
    >>> the file command, but the application should run on Windows.
    >>>
    >>> Any help will be great.
    >>>
    >>> Thank you in advance.
    >>>
    >>>

    >> So, which is the assignment:
    >> 1) determine if a file has non-ASCII characters
    >> 2) determine whether the line-endings are crlf or just lf
    >>
    >> In the former case, look at translating the file contents to Unicode,
    >> specifying ASCII as the source encoding. If it fails, you have non-ASCII
    >> characters. In the latter case, investigate the 'U' flag of the mode
    >> parameter in the open() function.
    >>
    >> You also need to ask yourself whether you're doing a validation of the
    >> file, or doing a "best guess" like the file command.
    >>
    >>

    > Also, realize that ASCII is a 7-bit code, with printing characters all
    > greater than space, and very few people use delete ('\x7F'), so you can
    > define a function to determine if a file contains only printing ASCII
    > and a few control characters. This one is False unless some ink would
    > be printed.
    >
    > Python 3.X:
    > def ascii_file(name, controls=b'\t\n'):
    >     ctrls = set(controls + b' ')
    >     with open(name, 'rb') as f:
    >         chars = set(f.read())
    >     return min(chars) >= min(ctrls) and ord('~') >= max(chars
    >         ) and min(chars - ctrls) > ord(' ')
    >
    > Python 2.X:
    > def ascii_file(name, controls='\t\n'):
    >     ctrls = set(controls + ' ')
    >     with open(name, 'rb') as f:
    >         chars = set(f.read())
    >     return min(chars) >= min(ctrls) and '~' >= max(chars
    >         ) and min(chars - ctrls) > ' '
    >
    > For potentially more performance (at least on 2.X), you could do min and
    > max on the data read, and only do the set(data) if the min and max are
    > OK.



    You're suggesting that running through the entire data three times
    instead of once is an optimization? Boy, I'd hate to see what you
    consider a pessimization! *wink*

    I think the best solution will probably be a lazy function which stops
    processing as soon as it hits a character that isn't ASCII.


    # Python 2.5, and untested
    def ascii_file(name):
        from string import printable
        with open(name, 'rb') as f:
            for c in f.read(1):
                if c not in printable: return False
        return True

    This only reads the entire file if it needs to, and only walks the data
    once if it is ASCII.

    In practice, you may actually get better performance by reading in a
    block at a time, rather than a byte at a time:

    # Python 2.5, and still untested
    def ascii_file(name, bs=65536):  # 64K default blocksize
        from string import printable
        with open(name, 'rb') as f:
            text = f.read(bs)
            while text:
                for c in text:
                    if c not in printable: return False
                text = f.read(bs)
        return True




    --
    Steven
    Steven D'Aprano, Jun 17, 2009
    #6
  7. On Wednesday, 17. June 2009, Steven D'Aprano wrote:
    > while text:
    >     for c in text:
    >         if c not in printable: return False


    that is one loop per character.

    wouldn't it be faster to apply a regex to text?
    something like

    while text:
        if re.search(r'\W',text): return False

    --
    Wolfgang
    Wolfgang Rohdewald, Jun 17, 2009
    #7
  8. Lie Ryan

    Lie Ryan Guest

    Scott David Daniels wrote:
    > norseman wrote:
    >> Scott David Daniels wrote:
    >>> Dave Angel wrote:
    >>>> Jorge wrote: ...
    >>>>> I'm making an application that reads third-party generated ASCII files,
    >>>>> but sometimes the files are corrupted totally or partially and I
    >>>>> need to know if it's an ASCII file with *nix line terminators.
    >>>>> In Linux I can run the file command, but the application should run on
    >>>>> Windows.

    >> you are looking for a \x0D (the Carriage Return) \x0A (the Line feed)
    >> combination. If present you have Microsoft compatibility. If not you
    >> don't. If you think High Bits might be part of the corruption, filter
    >> each byte with byte && \x7F (byte AND'ed with hex 7F or 127 base 10)
    >> then check for the \x0D \x0A combination.

    >
    > Well ASCII defines a \x0D as the return code, and \x0A as line feed.
    > It is unix that is wrong, not Microsoft (don't get me wrong, I know
    > Microsoft has often redefined what it likes invalidly).


    The \r\n was originally a hack because teletype machines could only do one
    thing at a time (i.e. do a line feed NAND carriage return), and trying to
    do both at the same time, or in the wrong order, would trigger a bug that
    sent an HCF instruction on many ancient teletypes.

    Unix decided that in virtual terminals \r\n is unnecessary and redundant,
    since VTs can do both in a single instruction.

    We can argue that Microsoft is showing "foolish consistency" here or that
    Unix changed the standard just to save a few bytes, but objectively neither
    side is right or wrong, since the whole thing was a hack in the first place.

    If anyone is wrong, it is the Mac camp, which decided to use \r when Unix
    had already decided which character to abandon (or maybe it's their usual
    reason: "different just because we want to be different").

    <ducks avoiding rotten tomato from Mac fanboys>
    Lie Ryan, Jun 17, 2009
    #8
  9. Lie Ryan

    Lie Ryan Guest

    Wolfgang Rohdewald wrote:
    > On Wednesday, 17. June 2009, Steven D'Aprano wrote:
    >> while text:
    >>     for c in text:
    >>         if c not in printable: return False

    >
    > that is one loop per character.


    unless printable is a set

    > wouldn't it be faster to apply a regex to text?
    > something like
    >
    > while text:
    > >     if re.search(r'\W',text): return False
    >


    regex? Don't even start...

    Anyway, only cProfile and profile would know which is the fastest...

    And, is speed really that important for this case? Seems like premature
    optimization to me.
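
    For what it's worth, the "printable as a set" point is easy to measure
    with timeit rather than a full profile (a quick sketch of my own; the
    numbers will of course vary by machine):

    from string import printable
    import timeit

    PRINTABLE_SET = frozenset(printable)   # hashed membership test instead of a substring scan

    # '\x00' is a worst case for the plain string: the whole string gets scanned.
    print timeit.timeit("'\\x00' in printable", "from string import printable")
    print timeit.timeit("'\\x00' in PRINTABLE_SET", "from __main__ import PRINTABLE_SET")
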
    Lie Ryan, Jun 17, 2009
    #9
  10. On Wednesday 17 June 2009, Lie Ryan wrote:
    > Wolfgang Rohdewald wrote:
    > > On Wednesday, 17. June 2009, Steven D'Aprano wrote:
    > >> while text:
    > >>     for c in text:
    > >>         if c not in printable: return False

    > >
    > > that is one loop per character.

    >
    > unless printable is a set


    that would still execute the line "if c not in..."
    once for every single character, against just one
    regex call. With bigger block sizes, the advantage
    of regex should increase.

    > > wouldn't it be faster to apply a regex to text?
    > > something like
    > >
    > > while text:
    > >     if re.search(r'\W',text): return False
    > >

    >
    > regex? Don't even start...


    Here comes a cProfile test. Note that Steven's first variant would
    always have stopped after the first char. After fixing that, making it
    look like variant 2 with a block size of 1, I now have 3 variants:

    Variant 1 Blocksize 1
    Variant 2 Blocksize 65536
    Variant 3 Regex on Blocksize 65536

    Testing a file with 400k bytes shows the regex as a clear winner.
    Doing the same for an 8k file: variant 2 takes 3 ms, the regex takes 5 ms.

    Variants 2 and 3 take about the same time for a 20k file.


    python ascii.py | grep CPU
    398202 function calls in 1.597 CPU seconds
    13 function calls in 0.104 CPU seconds
    1181 function calls in 0.012 CPU seconds

    import re
    import cProfile

    from string import printable

    def ascii_file1(name):
        with open(name, 'rb') as f:
            c = f.read(1)
            while c:
                if c not in printable: return False
                c = f.read(1)
        return True

    def ascii_file2(name):
        bs = 65536
        with open(name, 'rb') as f:
            text = f.read(bs)
            while text:
                for c in text:
                    if c not in printable: return False
                text = f.read(bs)
        return True

    def ascii_file3(name):
        bs = 65536
        search = r'[^%s]' % re.escape(printable)
        reco = re.compile(search)
        with open(name, 'rb') as f:
            text = f.read(bs)
            while text:
                if reco.search(text): return False
                text = f.read(bs)
        return True

    def test(fun):
        if fun('/tmp/x'):
            print 'is ascii'
        else:
            print 'is not ascii'

    cProfile.run("test(ascii_file1)")
    cProfile.run("test(ascii_file2)")
    cProfile.run("test(ascii_file3)")




    --
    Wolfgang
    Wolfgang Rohdewald, Jun 17, 2009
    #10
