Possible bug with stability of mimetypes.guess_* function output

Discussion in 'Python' started by Johannes Bauer, Feb 7, 2014.

  1. Hi group,

    I'm using Python 3.3.2+ (default, Oct 9 2013, 14:50:09) [GCC 4.8.1] on
    linux and have found what is very peculiar behavior at best and a bug at
    worst. It regards the mimetypes module and in particular the
    guess_all_extensions and guess_extension functions.

    I've found that these do not return stable output. When running the
    following commands, it returns one of:

    $ python3 -c 'import mimetypes;
    print(mimetypes.guess_all_extensions("text/html"),
    mimetypes.guess_extension("text/html"))'
    ['.htm', '.html', '.shtml'] .htm

    $ python3 -c 'import mimetypes;
    print(mimetypes.guess_all_extensions("text/html"),
    mimetypes.guess_extension("text/html"))'
    ['.html', '.htm', '.shtml'] .html

    So guess_extension(x) seems to always return guess_all_extensions(x)[0].
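
A minimal sketch for checking that observation within a single process. The helper name `first_guess_matches` is made up for illustration; only `mimetypes.guess_extension` and `mimetypes.guess_all_extensions` come from the standard library. It says nothing about ordering across separate interpreter runs, which is the actual problem reported here.

```python
import mimetypes


def first_guess_matches(mime_type):
    """Return True if guess_extension() agrees with guess_all_extensions()[0]."""
    all_exts = mimetypes.guess_all_extensions(mime_type)
    first = mimetypes.guess_extension(mime_type)
    if not all_exts:
        # No known extension: guess_extension() is expected to return None.
        return first is None
    return first == all_exts[0]


if __name__ == "__main__":
    for mime_type in ("text/html", "application/msword", "image/jpeg"):
        print(mime_type, first_guess_matches(mime_type))
```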

    Curiously, "shtml" is never the first element. The other two are mixed
    with a probability of around 50% which leads me to believe they're
    internally managed as a set and are therefore affected by the
    (relatively new) nondeterministic hashing function initialization.

    I don't know if stable output is guaranteed for these functions, but it
    sure would be nice. Messes up a whole bunch of things otherwise :-/

    Please let me know if this is a bug or expected behavior.
    Best regards,
    Johannes

    --
    >> Where EXACTLY did you predict the quake again?

    > At least not publicly!

    Ah, the latest and to this day most ingenious stunt of our great
    cosmologists: the secret prediction.
    - Karl Kaos about Rüdiger Thomas in dsa <hidbv3$om2$>
    Johannes Bauer, Feb 7, 2014
    #1

  2. Asaf Las

    On Friday, February 7, 2014 8:06:36 PM UTC+2, Johannes Bauer wrote:
    > Hi group,
    >
    > I'm using Python 3.3.2+ (default, Oct 9 2013, 14:50:09) [GCC 4.8.1] on
    > linux and have found what is very peculiar behavior at best and a bug at
    > worst. It regards the mimetypes module and in particular the
    > guess_all_extensions and guess_extension functions.
    >
    > I've found that these do not return stable output. When running the
    > following commands, it returns one of:
    >
    > $ python3 -c 'import mimetypes;
    > print(mimetypes.guess_all_extensions("text/html"),
    > mimetypes.guess_extension("text/html"))'
    > ['.htm', '.html', '.shtml'] .htm
    >
    > $ python3 -c 'import mimetypes;
    > print(mimetypes.guess_all_extensions("text/html"),
    > mimetypes.guess_extension("text/html"))'
    > ['.html', '.htm', '.shtml'] .html
    >
    > So guess_extension(x) seems to always return guess_all_extensions(x)[0].
    >
    > Curiously, "shtml" is never the first element. The other two are mixed
    > with a probability of around 50% which leads me to believe they're
    > internally managed as a set and are therefore affected by the
    > (relatively new) nondeterministic hashing function initialization.
    >
    >
    > I don't know if stable output is guaranteed for these functions, but it
    > sure would be nice. Messes up a whole bunch of things otherwise :-/
    >
    > Please let me know if this is a bug or expected behavior.
    >
    > Best regards,
    >
    > Johannes


    Internally it's a dictionary. Same for v3.3.3 as well.

    You could try querying with the sequence below:

    import mimetypes
    mimetypes.init()
    mimetypes.guess_extension("text/html")

    I got only 'htm' in 5 consecutive attempts.

    /Asaf
    Asaf Las, Feb 7, 2014
    #2

  3. Asaf Las

    By the way, I saw this after my own post:
    the example usage calls mimetypes.init()
    before calling the module functions.
    Asaf Las, Feb 7, 2014
    #3
  4. On 07/02/2014 19:17, Asaf Las wrote:
    > btw, had seen this after own post -
    > example usage includes mimetypes.init()
    > before call to module functions.
    >


    From http://docs.python.org/3/library/mimetypes.html#module-mimetypes
    third paragraph "The functions described below provide the primary
    interface for this module. If the module has not been initialized, they
    will call init() if they rely on the information init() sets up." Draw
    your own conclusions :)

    --
    My fellow Pythonistas, ask not what our language can do for you, ask
    what you can do for our language.

    Mark Lawrence

    Mark Lawrence, Feb 7, 2014
    #4
  5. On 07.02.2014 20:09, Asaf Las wrote:

    > it might be you could try to query using sequence below :
    >
    > import mimetypes
    > mimetypes.init()
    > mimetypes.guess_extension("text/html")
    >
    > i got only 'htm' for 5 consecutive attempts


    Doesn't change anything. With this:

    #!/usr/bin/python3
    import mimetypes
    mimetypes.init()
    print(mimetypes.guess_extension("application/msword"))

    And a call like this:

    $ for i in `seq 100`; do ./x.py ; done | sort | uniq -c

    I get

    35 .doc
    24 .dot
    41 .wiz

    Regards,
    Johannes

    Johannes Bauer, Feb 7, 2014
    #5
  6. Peter Otten

    Asaf Las wrote:

    > On Friday, February 7, 2014 8:06:36 PM UTC+2, Johannes Bauer wrote:
    >> Hi group,
    >>
    >> I'm using Python 3.3.2+ (default, Oct 9 2013, 14:50:09) [GCC 4.8.1] on
    >> linux and have found what is very peculiar behavior at best and a bug at
    >> worst. It regards the mimetypes module and in particular the
    >> guess_all_extensions and guess_extension functions.
    >>
    >> I've found that these do not return stable output. When running the
    >> following commands, it returns one of:
    >>
    >> $ python3 -c 'import mimetypes;
    >> print(mimetypes.guess_all_extensions("text/html"),
    >> mimetypes.guess_extension("text/html"))'
    >> ['.htm', '.html', '.shtml'] .htm
    >>
    >> $ python3 -c 'import mimetypes;
    >> print(mimetypes.guess_all_extensions("text/html"),
    >> mimetypes.guess_extension("text/html"))'
    >> ['.html', '.htm', '.shtml'] .html
    >>
    >> So guess_extension(x) seems to always return guess_all_extensions(x)[0].
    >>
    >> Curiously, "shtml" is never the first element. The other two are mixed
    >> with a probability of around 50% which leads me to believe they're
    >> internally managed as a set and are therefore affected by the
    >> (relatively new) nondeterministic hashing function initialization.
    >>
    >>
    >> I don't know if stable output is guaranteed for these functions, but it
    >> sure would be nice. Messes up a whole bunch of things otherwise :-/
    >>
    >> Please let me know if this is a bug or expected behavior.
    >>
    >> Best regards,
    >>
    >> Johannes

    >
    > dictionary. same for v3.3.3 as well.
    >
    > it might be you could try to query using sequence below :
    >
    > import mimetypes
    > mimetypes.init()
    > mimetypes.guess_extension("text/html")
    >
    > i got only 'htm' for 5 consecutive attempts


    As Johannes mentioned, this depends on the hash seed:

    $ PYTHONHASHSEED=0 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
    .html
    $ PYTHONHASHSEED=1 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
    .htm
    $ PYTHONHASHSEED=2 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
    .shtml
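
The same experiment can be scripted from Python. This is only a sketch: the helper name `pop_by_seed` is invented for illustration, it assumes a Python 3.7+ interpreter (for `capture_output`), and which element comes out for which seed depends on the interpreter build, so no particular output is claimed.

```python
import os
import subprocess
import sys
from collections import Counter

# The one-liner from the shell session above, run in a fresh interpreter.
SNIPPET = 'print({".htm", ".html", ".shtml"}.pop())'


def pop_by_seed(seeds):
    """Map each PYTHONHASHSEED value to the element that set.pop() returns."""
    results = {}
    for seed in seeds:
        # Copy the current environment and pin the hash seed for the child.
        env = dict(os.environ, PYTHONHASHSEED=str(seed))
        proc = subprocess.run(
            [sys.executable, "-c", SNIPPET],
            env=env,
            capture_output=True,
            text=True,
            check=True,
        )
        results[seed] = proc.stdout.strip()
    return results


if __name__ == "__main__":
    results = pop_by_seed(range(10))
    for seed, ext in sorted(results.items()):
        print(seed, ext)
    # Tally how often each extension was popped across the seeds.
    print(Counter(results.values()))
```

With a fixed seed the result is repeatable; only an unset (random) seed makes it vary from run to run.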

    You never see ".shtml" as the guessed extension because it is not in the
    original mimetypes.types_map dict, but instead programmatically read from a
    file like /etc/mime.types and then added to a list of extensions.

    Johannes,
    I'd like the guessed extension to be consistent, too, but even if that is
    rejected the current behaviour should be documented.

    Please file a bug report.
    Peter Otten, Feb 7, 2014
    #6
  7. Asaf Las

    On Friday, February 7, 2014 9:40:06 PM UTC+2, Peter Otten wrote:
    > As Johannes mentioned, this depends on the hash seed:
    > $ PYTHONHASHSEED=0 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
    > .html
    > $ PYTHONHASHSEED=1 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
    > .htm
    > $ PYTHONHASHSEED=2 python3 -c 'print({".htm", ".html", ".shtml"}.pop())'
    > .shtml
    >
    > You never see ".shtml" as the guessed extension because it is not in the
    > original mimetypes.types_map dict, but instead programmatically read from a
    > file like /etc/mime.types and then added to a list of extensions.
    >

    As there are a bunch of files listed in mimetypes.py, repeatability could
    only be achieved at the level of a particular machine.

    "/etc/mime.types",
    "/etc/httpd/mime.types",
    "/etc/httpd/conf/mime.types",
    "/etc/apache/mime.types",
    "/etc/apache2/mime.types",
    "/usr/local/etc/httpd/conf/mime.types",
    "/usr/local/lib/netscape/mime.types",
    "/usr/local/etc/httpd/conf/mime.types",
    "/usr/local/etc/mime.types"
    Asaf Las, Feb 7, 2014
    #7
  8. Peter Otten

    Asaf Las wrote:

    > On Friday, February 7, 2014 9:40:06 PM UTC+2, Peter Otten wrote:


    >> You never see ".shtml" as the guessed extension because it is not in the
    >> original mimetypes.types_map dict, but instead programmatically read from
    >> a file like /etc/mime.types and then added to a list of extensions.


    > as there are bunch of files in mimetypes.py the only repeatability could
    > be achieved on particular machine level.


    At least the mimetypes already defined in the module could easily produce
    the same guessed extension consistently.
    Peter Otten, Feb 8, 2014
    #8
  9. Asaf Las

    On Saturday, February 8, 2014 9:51:48 AM UTC+2, Peter Otten wrote:
    >
    > At least the mimetypes already defined in the module could easily produce
    > the same guessed extension consistently.


    IMHO one workaround for the OP could be to supply his own map file to init(),
    thus ensuring an unambiguous mapping across every platform and distribution.
    I guess some libraries already do that. Or write a wrapper that processes all
    guesses (guess_all_extensions) to eliminate ambiguity as far as required.
    That is in case the bug report is rejected.
    Asaf Las, Feb 8, 2014
    #9
  10. Peter Otten

    Asaf Las wrote:

    > On Saturday, February 8, 2014 9:51:48 AM UTC+2, Peter Otten wrote:
    >>
    >> At least the mimetypes already defined in the module could easily produce
    >> the same guessed extension consistently.

    >
    > imho one workaround for OP could be to supply own map file in init() thus
    > ensure unambiguous mapping across every platform and distribution. guess
    > some libraries already doing that. or write wrapper and process
    > all_guesses to eliminate ambiguity up to needed requirement.
    > that is in case if bug request will be rejected.


    You also have to set mimetypes.types_map and mimetypes.common_types to an
    empty dict (or an OrderedDict).
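
A rough sketch of the "own map" idea that avoids touching module-level state by using a private `mimetypes.MimeTypes` instance instead. Assumptions, not part of the thread: the mapping text in `CUSTOM_MIME_TYPES` is invented, the helper name `build_private_db` is hypothetical, and rebinding the instance's `types_map` / `types_map_inv` tuples to empty dicts is a hack to drop the preloaded defaults rather than a documented recipe.

```python
import io
import mimetypes

# Hypothetical mapping in mime.types syntax: "type ext1 ext2 ...".
# The first extension listed is the one we want to be "preferred".
CUSTOM_MIME_TYPES = """\
text/html           html htm shtml
application/msword  doc dot wiz
"""


def build_private_db(mime_types_text):
    """Build a MimeTypes instance that knows only the given mappings."""
    db = mimetypes.MimeTypes()
    # Assumption: discard whatever the constructor preloaded by rebinding
    # the two (non-strict, strict) dict pairs on this instance only.
    db.types_map = ({}, {})
    db.types_map_inv = ({}, {})
    # readfp() parses mime.types-style lines and registers each extension
    # in the order it appears on the line.
    db.readfp(io.StringIO(mime_types_text))
    return db


if __name__ == "__main__":
    db = build_private_db(CUSTOM_MIME_TYPES)
    print(db.guess_extension("text/html"))
    print(db.guess_all_extensions("application/msword"))
```

Because the extension lists are filled in file order, the first guess no longer depends on dict or set iteration order.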
    Peter Otten, Feb 8, 2014
    #10
  11. Asaf Las

    On Saturday, February 8, 2014 10:39:06 AM UTC+2, Peter Otten wrote:
    > Asaf Las wrote:
    > > On Saturday, February 8, 2014 9:51:48 AM UTC+2, Peter Otten wrote:
    > >> At least the mimetypes already defined in the module could easily produce
    > >> the same guessed extension consistently.

    > > imho one workaround for OP could be to supply own map file in init() thus
    > > ensure unambiguous mapping across every platform and distribution. guess
    > > some libraries already doing that. or write wrapper and process
    > > all_guesses to eliminate ambiguity up to needed requirement.
    > > that is in case if bug request will be rejected.

    >
    > You also have to set mimetypes.types_map and mimetypes.common_types to an
    > empty dict (or an OrderedDict).


    Hmmm, yes. Then the quickest workaround is to get the list of all guesses,
    sort it, and use the element at index 0.
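
That workaround might look like the sketch below. The function name `stable_extension` and its optional `preferred` parameter are made up here; the parameter is an addition beyond what was suggested, for callers who want ".html" rather than whatever sorts first.

```python
import mimetypes


def stable_extension(mime_type, preferred=()):
    """Return a repeatable extension for mime_type, or None if unknown.

    Sorting removes any dependence on the order in which the module
    happens to list the candidates.
    """
    candidates = sorted(mimetypes.guess_all_extensions(mime_type))
    # Honour an explicit preference first, if the caller gave one.
    for ext in preferred:
        if ext in candidates:
            return ext
    return candidates[0] if candidates else None


if __name__ == "__main__":
    print(stable_extension("text/html"))
    print(stable_extension("text/html", preferred=(".html",)))
    print(stable_extension("application/msword"))
```

The result is stable for a given machine, but it can still differ between machines whose mime.types files list different extensions.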
    Asaf Las, Feb 8, 2014
    #11
