Where to contribute Unicode General Category encoding/decoding

Discussion in 'Python' started by Pander Musubi, Dec 13, 2012.

  1. Hi all,

    I have created some handy code to encode and decode Unicode General Categories. To which Python Package should I contribute this?

    Regards,

    Pander
    Pander Musubi, Dec 13, 2012
    #1

  2. Pander Musubi

    Bruno Dupuis Guest

    On Thu, Dec 13, 2012 at 01:51:00AM -0800, Pander Musubi wrote:
    > Hi all,
    >
    > I have created some handy code to encode and decode Unicode General Categories. To which Python Package should I contribute this?
    >


    Hi,

    As said in a recent thread (a graph data structure IIRC), talking about
    new features is far better if we see the code, so anyone can figure out what
    the code really does.

    Can you provide a public repository URI or something?

    Standard library inclusions are not trivial; they mostly happen for well-known,
    mature PyPI packages or battle-tested code patterns. Therefore, it's
    often better to make a package on PyPI or, if the code is too short, to submit
    your handy chunks to ActiveState. If it earns general approval, it
    may be included in the Python stdlib.

    Cheers

    --
    Bruno
    Bruno Dupuis, Dec 13, 2012
    #2

  3. On Thursday, December 13, 2012 2:22:57 PM UTC+1, Bruno Dupuis wrote:
    > On Thu, Dec 13, 2012 at 01:51:00AM -0800, Pander Musubi wrote:
    > > Hi all,
    > >
    > > I have created some handy code to encode and decode Unicode General Categories. To which Python Package should I contribute this?
    >
    > Hi,
    >
    > As said in a recent thread (a graph data structure IIRC), talking about
    > new features is far better if we see the code, so anyone can figure what
    > the code really does.
    >
    > Can you provide a public repository uri or something?
    >
    > Standard lib inclusions are not trivial, it most likely happens for well-known,
    > mature, PyPI packages, or battle-tested code patterns. Therefore, it's
    > often better to make a package on PyPI, or, if the code is too short, to submit
    > your handy chunks on ActiveState. If it deserves a general approbation, it
    > may be included in Python stdlib.


    I was expecting PyPI. Here is the code, please advise on where to submit it:
    http://pastebin.com/dbzeasyq

    Pander Musubi, Dec 13, 2012
    #3
  5. On Thu, 13 Dec 2012 07:30:57 -0800, Pander Musubi wrote:

    > I was expecting PyPI. Here is the code, please advise on where to submit
    > it:
    > http://pastebin.com/dbzeasyq


    If anywhere, either a third-party module, or the unicodedata standard
    library module.


    Some unanswered questions:

    - when would somebody need this function?

    - why is it called "decodeUnicodeGeneralCategory" when it
    doesn't seem to have anything to do with decoding?

    - why is the parameter "sortable" called sortable, when it
    doesn't seem to have anything to do with sorting?


    If this is useful at all, it would be more useful to just expose the data
    as a dict, and forget about an unnecessary wrapper function:


    from collections import namedtuple
    r = namedtuple("record", "other name desc") # better field names needed!

    GC = {
        'C' : r('Other', 'Other', 'Cc | Cf | Cn | Co | Cs'),
        'Cc': r('Control', 'Control',
                'a C0 or C1 control code'), # a.k.a. cntrl
        'Cf': r('Format', 'Format', 'a format control character'),
        'Cn': r('Unassigned', 'Unassigned',
                'a reserved unassigned code point or a noncharacter'),
        'Co': r('Private Use', 'Private_Use', 'a private-use character'),
        'Cs': r('Surrogate', 'Surrogate', 'a surrogate code point'),
        'L' : r('Letter', 'Letter', 'Ll | Lm | Lo | Lt | Lu'),
        'LC': r('Letter, Cased', 'Cased_Letter', 'Ll | Lt | Lu'),
        'Ll': r('Letter, Lowercase', 'Lowercase_Letter',
                'a lowercase letter'),
        'Lm': r('Letter, Modifier', 'Modifier_Letter', 'a modifier letter'),
        'Lo': r('Letter, Other', 'Other_Letter',
                'other letters, including syllables and ideographs'),
        'Lt': r('Letter, Titlecase', 'Titlecase_Letter',
                'a digraphic character, with first part uppercase'),
        'Lu': r('Letter, Uppercase', 'Uppercase_Letter',
                'an uppercase letter'),
        'M' : r('Mark', 'Mark', 'Mc | Me | Mn '), # a.k.a. Combining_Mark
        'Mc': r('Mark, Spacing', 'Spacing_Mark',
                'a spacing combining mark (positive advance width)'),
        'Me': r('Mark, Enclosing', 'Enclosing_Mark',
                'an enclosing combining mark'),
        'Mn': r('Mark, Nonspacing', 'Nonspacing_Mark',
                'a nonspacing combining mark (zero advance width)'),
        'N' : r('Number', 'Number', 'Nd | Nl | No'),
        'Nd': r('Number, Decimal', 'Decimal_Number',
                'a decimal digit'), # a.k.a. digit
        'Nl': r('Number, Letter', 'Letter_Number',
                'a letterlike numeric character'),
        'No': r('Number, Other', 'Other_Number',
                'a numeric character of other type'),
        'P' : r('Punctuation', 'Punctuation',
                'Pc | Pd | Pe | Pf | Pi | Po | Ps'), # a.k.a. punct
        'Pc': r('Punctuation, Connector', 'Connector_Punctuation',
                'a connecting punctuation mark, like a tie'),
        'Pd': r('Punctuation, Dash', 'Dash_Punctuation',
                'a dash or hyphen punctuation mark'),
        'Pe': r('Punctuation, Close', 'Close_Punctuation',
                'a closing punctuation mark (of a pair)'),
        'Pf': r('Punctuation, Final', 'Final_Punctuation',
                'a final quotation mark'),
        'Pi': r('Punctuation, Initial', 'Initial_Punctuation',
                'an initial quotation mark'),
        'Po': r('Punctuation, Other', 'Other_Punctuation',
                'a punctuation mark of other type'),
        'Ps': r('Punctuation, Open', 'Open_Punctuation',
                'an opening punctuation mark (of a pair)'),
        'S' : r('Symbol', 'Symbol', 'Sc | Sk | Sm | So'),
        'Sc': r('Symbol, Currency', 'Currency_Symbol', 'a currency sign'),
        'Sk': r('Symbol, Modifier', 'Modifier_Symbol',
                'a non-letterlike modifier symbol'),
        'Sm': r('Symbol, Math', 'Math_Symbol',
                'a symbol of mathematical use'),
        'So': r('Symbol, Other', 'Other_Symbol', 'a symbol of other type'),
        'Z' : r('Separator', 'Separator', 'Zl | Zp | Zs'),
        'Zl': r('Separator, Line', 'Line_Separator',
                'U+2028 LINE SEPARATOR only'),
        'Zp': r('Separator, Paragraph', 'Paragraph_Separator',
                'U+2029 PARAGRAPH SEPARATOR only'),
        'Zs': r('Separator, Space', 'Space_Separator',
                'a space character (of various non-zero widths)'),
    }

    del r


    Usage is then trivially the same as normal dict and attribute access:

    py> GC['Ps'].desc
    'an opening punctuation mark (of a pair)'
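
A table like this pairs naturally with the standard library: `unicodedata.category()` returns the two-letter abbreviation, which can then serve as the dict key. The following is only a minimal sketch under assumptions; `GC_NAMES` (a small subset holding just the long names) and the `describe` helper are hypothetical names, not part of the code posted in this thread.

```python
import unicodedata

# Hypothetical subset of the table above, keeping only the long names.
GC_NAMES = {
    'Lu': 'Uppercase_Letter',
    'Ll': 'Lowercase_Letter',
    'Nd': 'Decimal_Number',
    'Ps': 'Open_Punctuation',
    'Pe': 'Close_Punctuation',
    'Zs': 'Space_Separator',
}

def describe(char):
    """Return (abbreviation, long name) for a single character."""
    abbr = unicodedata.category(char)
    # Fall back to 'unknown' for categories missing from the subset.
    return abbr, GC_NAMES.get(abbr, 'unknown')

for ch in ('A', 'a', '7', '(', ')', ' '):
    print(repr(ch), *describe(ch))
```

With the full table in place of the subset, the fallback would only trigger for abbreviations the table does not list.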



    --
    Steven
    Steven D'Aprano, Dec 14, 2012
    #5
  6. On Friday, December 14, 2012 1:06:23 AM UTC+1, Steven D'Aprano wrote:
    > On Thu, 13 Dec 2012 07:30:57 -0800, Pander Musubi wrote:
    >
    > > I was expecting PyPI. Here is the code, please advise on where to submit
    > > it:
    > > http://pastebin.com/dbzeasyq
    >
    > If anywhere, either a third-party module, or the unicodedata standard
    > library module.
    >
    > Some unanswered questions:
    >
    > - when would somebody need this function?


    When working with Unicode metadata, see below.

    > - why is it called "decodeUnicodeGeneralCategory" when it
    > doesn't seem to have anything to do with decoding?


    It is actually a simple LUT. I like your improvements below.

    > - why is the parameter "sortable" called sortable, when it
    > doesn't seem to have anything to do with sorting?


    The values returned are alphabetically sortable.

    > If this is useful at all, it would be more useful to just expose the data
    > as a dict, and forget about an unnecessary wrapper function:

    Thank you for the improvements. I have some more dicts like this, for data such as:
    http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
    where this general category is being used. This information is useful when handling Unicode metadata.
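
For readers unfamiliar with that file: UnicodeData.txt is a semicolon-separated table in which the first field is the code point in hexadecimal, the second is the character name, and the third is the General Category abbreviation. A minimal sketch of pulling those three fields out follows; the `parse_general_categories` helper and the embedded `SAMPLE` records are illustrative assumptions, not code from this thread.

```python
from collections import Counter

# Three sample records in the semicolon-separated UnicodeData.txt format.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0028;LEFT PARENTHESIS;Ps;0;ON;;;;;Y;OPENING PARENTHESIS;;;;
0020;SPACE;Zs;0;WS;;;;;N;;;;;
"""

def parse_general_categories(lines):
    """Yield (code point, name, general category) per non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(';')
        # Field 0: hex code point, field 1: name, field 2: General Category.
        yield int(fields[0], 16), fields[1], fields[2]

records = list(parse_general_categories(SAMPLE.splitlines()))
counts = Counter(gc for _, _, gc in records)

for code_point, name, gc in records:
    print('U+%04X %s %s' % (code_point, gc, name))
```

The same generator could be fed an open file object for the real data file, since it only iterates over lines.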

    I think I will approach both
    http://pypi.python.org/pypi/unicodeblocks/
    and
    http://pypi.python.org/pypi/unicodescript/
    to see who will adopt this.

    Perhaps it might be in their mutual interest to merge their packages into e.g. unicodemetadata or something similar. Further ideas on this are still welcome.

    Thanks for all your help,

    Pander

    Pander Musubi, Dec 14, 2012
    #6
  7. On Friday, December 14, 2012 2:07:51 PM UTC+1, Pander Musubi wrote:
    > I think I will approach both
    > http://pypi.python.org/pypi/unicodeblocks/
    > and
    > http://pypi.python.org/pypi/unicodescript/
    > to see who will adopt this.


    Ah, it will become a feature request for http://docs.python.org/3/library/unicodedata.html
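
For context, `unicodedata.category()` returns only the two-letter abbreviation, so membership in a grouping such as `LC` (listed as `Ll | Lt | Lu` in the table earlier in this thread) has to be derived by the caller. A minimal sketch, assuming a hand-written `GROUPS` mapping and an `in_group` helper, both hypothetical names:

```python
import unicodedata

# Hypothetical mapping from group abbreviations to member categories,
# copied from the grouping rows of the table earlier in this thread.
GROUPS = {
    'LC': {'Ll', 'Lt', 'Lu'},
    'N': {'Nd', 'Nl', 'No'},
    'Z': {'Zl', 'Zp', 'Zs'},
}

def in_group(char, group):
    """Return True if char's general category belongs to the named group."""
    return unicodedata.category(char) in GROUPS[group]

print(in_group('a', 'LC'))
print(in_group('\u01c5', 'LC'))  # U+01C5 is a titlecase letter (Lt)
print(in_group('5', 'N'))
print(in_group('5', 'LC'))
```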
    Pander Musubi, Dec 14, 2012
    #7
  8. On Friday, December 14, 2012 5:22:31 PM UTC+1, Pander Musubi wrote:
    > Ah, it will become a feature request for http://docs.python.org/3/library/unicodedata.html


    Please see:
    http://bugs.python.org/issue16684
    Pander Musubi, Dec 14, 2012
    #8