Detecting Unicode encodings

Jason Diamond

Hi.

Is it possible to decode a UTF-8 (with or without a BOM), UTF-16 (BE or
LE with a BOM), or UTF-32 (BE or LE with a BOM) byte stream without
knowing what encoding the stream is in?

I know how to use the codecs module to get StreamReader classes that can
decode a specific encoding, but I have to know what that encoding is
beforehand.

If I read up to four bytes from the byte stream, I can figure out what
encoding the stream is in but that has problems for UTF-8 streams
without BOMs--I would have just eaten one or more bytes that might need
to be decoded by the StreamReader. I could seek back to the beginning of
the stream but what if the file-like object I was reading from didn't
support seeking?
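The four-byte check described above can be sketched as follows. This is a minimal sketch assuming Python 3 and the BOM constants from the codecs module; the function name `sniff_bom` and the choice of UTF-8 as the fallback are illustrative assumptions, not something from the original post.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so checking UTF-16 first would misreport it.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(prefix, default="utf-8"):
    """Return (encoding, bom_length) for the first bytes of a stream.

    `sniff_bom` is a hypothetical helper; `default` is what gets assumed
    when no BOM is present.
    """
    for bom, name in _BOMS:
        if prefix.startswith(bom):
            return name, len(bom)
    return default, 0


print(sniff_bom(b"\xef\xbb\xbfabc"))    # ('utf-8', 3)
print(sniff_bom(b"\xff\xfe\x00\x00"))   # ('utf-32-le', 4)
print(sniff_bom(b"abcd"))               # ('utf-8', 0)
```

One caveat with this approach: a UTF-16 LE stream whose first character after the BOM is U+0000 has the same first four bytes as a UTF-32 LE BOM, so the check is a heuristic rather than a guarantee.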

Thanks.

-- Jason
 
Christos TZOTZIOY Georgiou

Jason said:
If I read up to four bytes from the byte stream, I can figure out what
encoding the stream is in but that has problems for UTF-8 streams
without BOMs--I would have just eaten one or more bytes that might need
to be decoded by the StreamReader. I could seek back to the beginning of
the stream but what if the file-like object I was reading from didn't
support seeking?

Two options pop up instantly:

1. "Programmers do it byte by byte" (mainly a joke, so go to option 2 :)

2. wrap your file-like object in a custom object, which implements a
pushback method and its read method returns first from the push-back
buffer. If you read data that you shouldn't, push them back and give
your custom object to the StreamReader.
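Option 2 might look roughly like the sketch below. It assumes Python 3 and a byte-oriented source; the class name `PushbackFile` and the method name `pushback` are made up for illustration and are not an existing library API.

```python
import io


class PushbackFile:
    """Wrap a file-like object so that already-read bytes can be returned to it."""

    def __init__(self, source):
        self.source = source
        self.pushed = b""  # bytes waiting to be handed out again

    def pushback(self, data):
        # The most recently pushed data is the first to be read again.
        self.pushed = data + self.pushed

    def read(self, size=-1):
        if size is None or size < 0:
            # Read everything: pushed-back bytes first, then the rest.
            data = self.pushed + self.source.read()
            self.pushed = b""
            return data
        data = self.pushed[:size]
        self.pushed = self.pushed[size:]
        if len(data) < size:
            # Buffer exhausted; get the remainder from the real source.
            data += self.source.read(size - len(data))
        return data


f = PushbackFile(io.BytesIO(b"abcdef"))
head = f.read(4)     # look at the first four bytes
f.pushback(head)     # nothing is lost: hand them back
print(f.read())      # b'abcdef'
```

Because only `read` is called on the underlying object, the wrapper should work with sources that cannot seek, such as sockets or pipes.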
 
Jason Diamond

Christos said:
2. wrap your file-like object in a custom object, which implements a
pushback method and its read method returns first from the push-back
buffer. If you read data that you shouldn't, push them back and give
your custom object to the StreamReader.

Thanks for the suggestion.

Instead of a pushback method, I added a peek method. Below is what I
came up with.

-- Jason

class PeekableFile:
    """File-like wrapper that can look ahead without consuming data."""

    def __init__(self, source):
        self.source = source
        self.buffer = None  # data read from source but not yet consumed

    def peek(self, size):
        # Return up to size characters without consuming them.
        if self.buffer:
            n = len(self.buffer)
            if size > n:
                self.buffer += self.source.read(size - n)
        else:
            self.buffer = self.source.read(size)
        return self.buffer[:size]

    def read(self, size=-1):
        # Serve buffered data first, then fall through to the source.
        if self.buffer:
            if size >= 0:
                n = len(self.buffer)
                if size < n:
                    s = self.buffer[:size]
                    self.buffer = self.buffer[size:]
                elif size == n:
                    s = self.buffer
                    self.buffer = None
                else:
                    s = self.buffer + self.source.read(size - n)
                    self.buffer = None
            else:
                s = self.buffer + self.source.read()
                self.buffer = None
        else:
            s = self.source.read(size)
        return s

def main():

    import StringIO
    import unittest

    class PeekableFileTests(unittest.TestCase):

        def setUp(self):
            f = StringIO.StringIO('abc')
            self.pf = PeekableFile(f)

        def testPeek0(self):
            self.failUnlessEqual(self.pf.peek(0), '')

        def testPeek1(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')

        def testPeek1Read1(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')
            self.failUnlessEqual(self.pf.read(1), 'a')

        def testPeek1Read2(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')
            self.failUnlessEqual(self.pf.read(2), 'ab')

        def testPeek1ReadAll(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')
            self.failUnlessEqual(self.pf.read(), 'abc')

        def testPeek1Read1Read1(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')
            self.failUnlessEqual(self.pf.read(1), 'a')
            self.failUnlessEqual(self.pf.read(1), 'b')

        def testPeek1Read1ReadAll(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')
            self.failUnlessEqual(self.pf.read(1), 'a')
            self.failUnlessEqual(self.pf.read(), 'bc')

        def testPeek1Peek1(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')
            self.failUnlessEqual(self.pf.peek(1), 'a')

        def testPeek1Peek2(self):
            self.failUnlessEqual(self.pf.peek(1), 'a')
            self.failUnlessEqual(self.pf.peek(2), 'ab')

        def testPeek2Peek1(self):
            self.failUnlessEqual(self.pf.peek(2), 'ab')
            self.failUnlessEqual(self.pf.peek(1), 'a')

        def testPeek2Read1Peek1(self):
            self.failUnlessEqual(self.pf.peek(2), 'ab')
            self.failUnlessEqual(self.pf.read(1), 'a')
            self.failUnlessEqual(self.pf.peek(1), 'b')

        def testRead0(self):
            self.failUnlessEqual(self.pf.read(0), '')

        def testRead1(self):
            self.failUnlessEqual(self.pf.read(1), 'a')

        def testReadAll(self):
            self.failUnlessEqual(self.pf.read(), 'abc')

        def testRead1Peek1(self):
            self.failUnlessEqual(self.pf.read(1), 'a')
            self.failUnlessEqual(self.pf.peek(1), 'b')

        def testReadAllPeek1(self):
            self.failUnlessEqual(self.pf.read(), 'abc')
            self.failUnlessEqual(self.pf.peek(1), '')

    unittest.TextTestRunner().run(unittest.makeSuite(PeekableFileTests))

if __name__ == '__main__':
    main()
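To connect the peek approach back to the original question, here is one way a peekable wrapper could be handed to a StreamReader. This is a sketch assuming Python 3 and a byte stream; `Peekable` is a compact stand-in with the same peek/read interface as the class posted above (included only so the block runs on its own), and `open_decoded` is a hypothetical helper name. Since peeking leaves the BOM in the stream, the sketch relies on the standard "utf-16", "utf-32" and "utf-8-sig" codecs, which consume a BOM themselves.

```python
import codecs
import io


class Peekable:
    """Compact stand-in offering the same peek/read interface as PeekableFile."""

    def __init__(self, source):
        self.source = source
        self.buffer = b""

    def peek(self, size):
        # Top up the buffer if needed, but consume nothing.
        if size > len(self.buffer):
            self.buffer += self.source.read(size - len(self.buffer))
        return self.buffer[:size]

    def read(self, size=-1):
        if size is None or size < 0:
            data, self.buffer = self.buffer + self.source.read(), b""
            return data
        data, self.buffer = self.buffer[:size], self.buffer[size:]
        if len(data) < size:
            data += self.source.read(size - len(data))
        return data


def open_decoded(stream):
    """Return a StreamReader for stream, choosing the codec from its BOM."""
    pf = Peekable(stream)
    head = pf.peek(4)
    # UTF-32 must be tested first: its LE BOM starts with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        encoding = "utf-32"
    elif head.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        encoding = "utf-16"
    else:
        # Assumption: anything without a UTF-16/32 BOM is UTF-8;
        # "utf-8-sig" accepts UTF-8 with or without a BOM.
        encoding = "utf-8-sig"
    return codecs.getreader(encoding)(pf)


raw = codecs.BOM_UTF16_BE + "hello".encode("utf-16-be")
print(open_decoded(io.BytesIO(raw)).read())   # hello
```

The source never needs to seek, because the bytes examined by `peek` are still returned by the first `read` the StreamReader issues.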
 
