Python 2.6 StreamReader.readline()

Discussion in 'Python' started by cpppwner@gmail.com, Jul 24, 2012.

  1. Guest

    Hi,

    I have a simple question. I'm using something like the following lines in Python 2.6.2:

    import codecs

    reader = codecs.getreader(encoding)
    lines = []
    with open(filename, 'rb') as f:
        lines = reader(f, 'strict').readlines(keepends=False)

    where encoding == 'utf-16-be'
    Everything works fine, except that lines[0] is equal to codecs.BOM_UTF16_BE
    Is this behaviour correct, that the BOM is still present?

    Thanks in advance for your help.

    Best,
    Stefan
    , Jul 24, 2012
    #1

  2. On 24.07.2012 17:01, wrote:
    > reader = codecs.getreader(encoding)
    > lines = []
    > with open(filename, 'rb') as f:
    >     lines = reader(f, 'strict').readlines(keepends=False)
    >
    > where encoding == 'utf-16-be'
    > Everything works fine, except that lines[0] is equal to codecs.BOM_UTF16_BE
    > Is this behaviour correct, that the BOM is still present?


    Yes, assuming the first line only contains that BOM. Technically it's a
    space character, and why should those be removed?

    Uli
    Ulrich Eckhardt, Jul 25, 2012
    #2

  3. On 25.07.12 08:09, Ulrich Eckhardt wrote:

    > On 24.07.2012 17:01, wrote:
    >> reader = codecs.getreader(encoding)
    >> lines = []
    >> with open(filename, 'rb') as f:
    >>     lines = reader(f, 'strict').readlines(keepends=False)
    >>
    >> where encoding == 'utf-16-be'
    >> Everything works fine, except that lines[0] is equal to
    >> codecs.BOM_UTF16_BE
    >> Is this behaviour correct, that the BOM is still present?

    >
    > Yes, assuming the first line only contains that BOM. Technically it's a
    > space character, and why should those be removed?


    If the first "character" in the file is a BOM the file encoding is
    probably not utf-16-be but utf-16.
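
As a minimal sketch of that difference (Python 3 syntax; the sample bytes are made up for illustration and are not taken from the original poster's file), decoding the same BOM-prefixed data with both codec names shows which one keeps the U+FEFF character:

```python
import codecs

# Hypothetical sample data: a big-endian BOM followed by "abc\n" in UTF-16-BE.
data = codecs.BOM_UTF16_BE + "abc\n".encode("utf-16-be")

# 'utf-16-be' fixes the byte order up front, so the BOM bytes are decoded
# as an ordinary U+FEFF character and stay in the text.
kept = data.decode("utf-16-be")

# 'utf-16' uses a leading BOM to detect the byte order and then drops it.
stripped = data.decode("utf-16")

print(repr(kept))
print(repr(stripped))
```

Only the 'utf-16' result comes back without the leading U+FEFF.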

    Servus,
    Walter
    Walter Dörwald, Jul 25, 2012
    #3
  4. Guest

    On Wednesday, July 25, 2012 11:02:01 AM UTC+2, Walter Dörwald wrote:
    > On 25.07.12 08:09, Ulrich Eckhardt wrote:
    >
    > > On 24.07.2012 17:01, wrote:
    > >> reader = codecs.getreader(encoding)
    > >> lines = []
    > >> with open(filename, 'rb') as f:
    > >>     lines = reader(f, 'strict').readlines(keepends=False)
    > >>
    > >> where encoding == 'utf-16-be'
    > >> Everything works fine, except that lines[0] is equal to
    > >> codecs.BOM_UTF16_BE
    > >> Is this behaviour correct, that the BOM is still present?
    > >
    > > Yes, assuming the first line only contains that BOM. Technically it's a
    > > space character, and why should those be removed?
    >
    > If the first "character" in the file is a BOM the file encoding is
    > probably not utf-16-be but utf-16.
    >
    > Servus,
    > Walter


    The byte order mark, if present, is nothing other than
    an encoded 'ZERO WIDTH NO-BREAK SPACE' *code point*:

    >>> import unicodedata as ud
    >>> ud.name('\ufeff')
    'ZERO WIDTH NO-BREAK SPACE'

    Five "BOMs" are possible (Unicode Consortium): utf-8-sig, utf-16-be,
    utf-16-le, utf-32-be, utf-32-le. The codecs module provides many
    aliases.

    Whether utf-16/32 corresponds to -le or to -be may
    vary with the platform, the compiler, ...

    >>> import codecs, sys
    >>> sys.version
    '3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
    >>> codecs.BOM_UTF16_BE
    b'\xfe\xff'
    >>> codecs.BOM_UTF16_LE
    b'\xff\xfe'
    >>> codecs.BOM_UTF16
    b'\xff\xfe'
    >>>
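
The session above comes from a little-endian machine. A small sketch, assuming Python 3, of how the same relationship could be checked on any platform through sys.byteorder:

```python
import codecs
import sys

# codecs.BOM_UTF16 is the BOM for the native byte order of the machine.
if sys.byteorder == "little":
    native_bom = codecs.BOM_UTF16_LE
else:
    native_bom = codecs.BOM_UTF16_BE
print(codecs.BOM_UTF16 == native_bom)

# Encoding with plain 'utf-16' writes that native BOM first.
encoded = "abc".encode("utf-16")
print(encoded.startswith(codecs.BOM_UTF16))
```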


    ---

    As far as I know, neither Py 2.7 nor Py 3.2 ever returns a "BOM" when
    a file is read correctly.

    >>> with open('a-utf-16-be.txt', 'r', encoding='utf-16-be') as f:
    ...     r = f.readlines()
    ...     for zeile in r:
    ...         print(zeile.rstrip())
    ...
    abc
    élève
    cœur
    €uro
    >>>
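
If a file read with 'utf-16-be' does start with a BOM, one possible workaround is to drop a single leading U+FEFF after decoding. The sketch below assumes Python 3; the helper name read_lines_without_bom and the temporary test file are hypothetical and only serve the illustration. It reuses the codecs.getreader() approach from the original question.

```python
import codecs
import os
import tempfile


def read_lines_without_bom(path, encoding):
    """Read decoded lines and drop one leading U+FEFF if the codec kept it."""
    reader_factory = codecs.getreader(encoding)
    with open(path, "rb") as f:
        lines = reader_factory(f, "strict").readlines(keepends=False)
    if lines and lines[0].startswith("\ufeff"):
        lines[0] = lines[0][1:]
    return lines


# Hypothetical test file: a big-endian BOM followed by two UTF-16-BE lines.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF16_BE + "abc\nxyz\n".encode("utf-16-be"))

result = read_lines_without_bom(path, "utf-16-be")
os.remove(path)
print(result)
```

Opening the file with the 'utf-16' codec instead would avoid the manual check, since that codec consumes the BOM itself.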



    jmf
    , Jul 25, 2012
    #4
