Sniffing encoding type by looking at file BOM header

Discussion in 'Python' started by python@bdurham.com, Mar 24, 2010.

  1. Guest

    I assume there's no standard library function that wraps
    codecs.open() to sniff a file's BOM header and open the file with
    the appropriate encoding?

    My reading of the docs leads me to believe that there are 5
    possible BOM headers, several of which have multiple names
    (synonyms?) for the same BOM.

    BOM = '\xff\xfe'
    BOM_LE = '\xff\xfe'
    BOM_UTF16 = '\xff\xfe'
    BOM_UTF16_LE = '\xff\xfe'

    BOM_BE = '\xfe\xff'
    BOM32_BE = '\xfe\xff'
    BOM_UTF16_BE = '\xfe\xff'

    BOM64_BE = '\x00\x00\xfe\xff'
    BOM_UTF32_BE = '\x00\x00\xfe\xff'

    BOM64_LE = '\xff\xfe\x00\x00'
    BOM_UTF32 = '\xff\xfe\x00\x00'
    BOM_UTF32_LE = '\xff\xfe\x00\x00'

    BOM_UTF8 = '\xef\xbb\xbf'
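
    One way to check the aliasing is to group the codecs.BOM* constants
    by value. This is only a sketch for Python 3; the names groups and
    native16 are made up for the illustration. One caveat it makes
    visible: BOM, BOM_UTF16 and BOM_UTF32 are native-endian aliases, so
    the values listed above are what a little-endian machine reports.

```python
import codecs
import sys

# Group every codecs.BOM* constant by its byte value.
groups = {}
for name in dir(codecs):
    if name.startswith("BOM"):
        groups.setdefault(getattr(codecs, name), []).append(name)

# Print shortest BOMs first, with all the names that share each value.
for value, names in sorted(groups.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(repr(value), sorted(names))

# codecs.BOM follows the byte order of the machine running the code.
native16 = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
```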

    Is the process of writing a BOM sniffer really as simple
    as detecting one of these 5 header types and then calling
    codecs.open() with the appropriate encoding= parameter?

    Note: I'm only interested in Unicode encodings. I am not
    interested in any of the non-Unicode encodings supported
    by the codecs module.
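
    A minimal sketch of such a sniffer, assuming Python 3 and bytes
    already read from the file in binary mode. The helper names
    sniff_bom and decode_sniffed, and the fallback to UTF-8 when no
    BOM is found, are assumptions made for the example and not a
    standard library API. The one non-obvious step is the order of
    the checks: the UTF-32 LE BOM begins with the same two bytes as
    the UTF-16 LE BOM, so the 4-byte BOMs have to be tested first.

```python
import codecs

# Longest BOMs first: BOM_UTF32_LE starts with the bytes of BOM_UTF16_LE.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(raw):
    """Return (encoding, bom_length) for the leading bytes, or (None, 0)."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_sniffed(raw, default="utf-8"):
    """Decode raw bytes using the BOM if present, else the default encoding."""
    encoding, skip = sniff_bom(raw)
    # The endian-specific codecs do not strip the BOM, so slice it off here.
    return raw[skip:].decode(encoding or default)
```

    Typical use would be to open the file with open(path, "rb"), read
    it, and pass the bytes to decode_sniffed(). A remaining ambiguity:
    a UTF-16 LE file whose first character after the BOM is U+0000
    looks exactly like a UTF-32 LE BOM, so the guess can be wrong in
    that corner case.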

    Thank you,
    Malcolm
    , Mar 24, 2010
    #1

  2. In message <>,
    wrote:

    > BOM_UTF8 = '\xef\xbb\xbf'


    Since when does UTF-8 need a BOM?
    Lawrence D'Oliveiro, Mar 25, 2010
    #2

  3. On 26-3-2010 0:16, Lawrence D'Oliveiro wrote:
    > In message <>,
    > wrote:
    >
    >> BOM_UTF8 = '\xef\xbb\xbf'
    >
    > Since when does UTF-8 need a BOM?

    It doesn't, but it is allowed. It is not recommended, though.
    Unfortunately several tools, such as notepad.exe, tend to
    add it silently when saving files.
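
    For files like that, Python's "utf-8-sig" codec may be the least
    painful option: it drops a leading UTF-8 BOM when decoding and
    also accepts input without one. A small sketch, assuming Python 3
    (the variable names are just for illustration):

```python
import codecs

# Bytes as an editor that prepends a UTF-8 BOM might have saved them.
raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
plain = raw.decode("utf-8")

# "utf-8-sig" strips the BOM if it is there...
stripped = raw.decode("utf-8-sig")

# ...and decodes BOM-less input unchanged.
no_bom = b"hello".decode("utf-8-sig")
```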

    -irmen
    Irmen de Jong, Mar 25, 2010
    #3
