Three questions: UTF-8, DBM, hash of lists, ...

Discussion in 'Perl Misc' started by Wes Groleau, Jan 12, 2005.

  1. Wes Groleau

    Wes Groleau Guest

    I've been rooting around in perlutf8, perlencoding, perlunicode,
    and other such things. I think I follow most of it, but there
    are some contradictions. Or I thought there were.

    1. At the moment, my source is pure ASCII, but I want to
    treat it as UTF-8 because the text I work with is UTF-8
    and my editor is configured accordingly. (And data
    can easily become literals in source). I put -CSD on
    my bang-line, which one man page said covers everything
    (except -CL which I did not want for some reason). But
    another man page seemed to say that "use utf8;" covered
    something that -CSD did not, so I put that in, too. Is
    either one interfering with the other in any way?

    2. One of my applications is reading in a large file, finding
    certain patterns, and using them as keys to store everything
    else in a DBM hash (use DBM_File; dbmopen %hash, etc.)
    The input is 99.5% ASCII--only a few French diacritics, one
    copyright symbol, and two Polish characters. Yet adding
    the utf-8 constructs to the script and regenerating the DBM
    made a HUGE difference in the size of the file. Why is
    that?

    3. Say an input file contains key and value pairs, BUT
    there is more than one possible value for a key.

    For example, occupations.

    Key Value
    ----------- ---------
    firefighter Fred
    chef Charlotte
    firefighter Felicia

    Can I store a list at the key, or do I have to append
    to a string and split on output?

    If I can store a list, what is the syntax? The following
    is not allowed:


    push (@the_hash{$the_job}, $the_name);


    If the hash is tied with

    use DBM_File;
    dbmopen %the_hash .......

    does that change the answer?


    OK, more than three. :)

    --
    Wes Groleau

    In any formula, constants (especially those obtained
    from handbooks) are to be treated as variables.
     
    Wes Groleau, Jan 12, 2005
    #1

  2. Jim Keenan

    Jim Keenan Guest

    Wes Groleau wrote:

    >
    > 3. Say an input file contains key and value pairs, BUT
    > there is more than one possible value for a key.
    >
    > For example, occupations.
    >
    > Key Value
    > ----------- ---------
    > firefighter Fred
    > chef Charlotte
    > firefighter Felicia
    >
    > Can I store a list at the key, or do I have to append
    > to a string and split on output?
    >
    > If I can store a list, what is the syntax? The following
    > is not allowed:
    >
    >
    > push (@the_hash{$the_job}, $the_name);
    >
    >

    But wouldn't this be appropriate?

    push @{$the_hash{$the_job}}, $the_name;
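
    A minimal sketch of that idiom, using the occupations from the question. The @pairs list is only a stand-in for the real input file (an assumption for illustration); the push line is the part that matters. The first push to a new key creates the anonymous array automatically.

```perl
use strict;
use warnings;

my %the_hash;

# Stand-in for the input file: (job, name) pairs.
my @pairs = (
    [ firefighter => 'Fred' ],
    [ chef        => 'Charlotte' ],
    [ firefighter => 'Felicia' ],
);

for my $pair (@pairs) {
    my ($the_job, $the_name) = @$pair;
    # Each hash value is a reference to an array; push onto the array it refers to.
    push @{ $the_hash{$the_job} }, $the_name;
}

for my $job (sort keys %the_hash) {
    print "$job: ", join(', ', @{ $the_hash{$job} }), "\n";
}
# prints:
#   chef: Charlotte
#   firefighter: Fred, Felicia
```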



    > If the hash is tied with
    >
    > use DBM_File;
    > dbmopen %the_hash .......


    Shouldn't that be ...?

    use DB_File;
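
    On whether tying changes the answer: a DBM file stores plain strings, so an array reference assigned to a tied hash ends up as a useless string such as "ARRAY(0x...)". One workaround is the "append and split" approach from the question. Below is a rough sketch; add_value, get_values and the separator are hypothetical choices, and an ordinary hash stands in for the tied one, since the helpers only read and write string values. (CPAN's MLDBM is another route, as it serializes nested data before storing it.)

```perl
use strict;
use warnings;

# Assumed separator: ASCII unit separator, which must never occur in a value.
my $SEP = "\x1F";

# Append $value to the joined string stored under $key.
sub add_value {
    my ($hash, $key, $value) = @_;
    $hash->{$key} = exists $hash->{$key}
        ? $hash->{$key} . $SEP . $value
        : $value;
    return;
}

# Return the list of values stored under $key (empty list if none).
sub get_values {
    my ($hash, $key) = @_;
    return () unless exists $hash->{$key};
    return split /\Q$SEP\E/, $hash->{$key};
}

my %the_hash;    # stand-in for the hash tied to the DBM file
add_value(\%the_hash, 'firefighter', 'Fred');
add_value(\%the_hash, 'chef',        'Charlotte');
add_value(\%the_hash, 'firefighter', 'Felicia');

print join(', ', get_values(\%the_hash, 'firefighter')), "\n";   # Fred, Felicia
```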

    Jim Keenan
     
    Jim Keenan, Jan 12, 2005
    #2

  3. utf-8, was Re: Three questions: UTF-8, DBM, hash of lists, ...

    On Tue, 11 Jan 2005, Wes Groleau wrote:

    > Three questions


    There are no special awards for folding several questions into one
    posting. All that it achieves is: several unrelated subthreads
    hanging-off the original posting. Confusion all round.

    The key to effective problem-solving is to break up a complex problem
    into manageable parts, and deal with each separately, until one
    understands it well enough to use it as a component of the whole. In
    that sense, I'd commend to you the strategy of asking detailed
    questions one at a time (with enough context for the group to
    understand the detailed question). If, on the other hand, you can't
    decide how to partition a complex problem, then ask about the problem
    itself, at a higher level, without pre-judging the lower-level
    implementation detail. IMHO and YMMV, anyway.

    > I've been rooting around in perlutf8, perlencoding, perlunicode,
    > and other such things. I think I follow most of it, but there
    > are some contradictions. Or I thought there were.
    >
    > 1. At the moment, my source is pure ASCII, but I want to
    > treat it as UTF-8 because the text I work with is UTF-8
    > and my editor is configured accordingly.


    Please distinguish carefully between your program source and your
    data.

    As a matter of fact, us-ascii -is- a subset of utf-8 - utf-8 was
    deliberately designed that way - but you *don't* have to use utf-8
    encoding in your program source in order to process unicode data.

    In any case, Perl's unicode implementation is supposed to be
    transparent, i.e. you shouldn't normally need to know that its internal
    representation happens to be utf-8. What you /do/ need to know is
    what encoding is used in your /external data/, and to tell Perl about
    it at the appropriate time (e.g. by an encoding layer on an I/O
    statement).
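
    To illustrate what "an encoding layer on an I/O statement" can look like, here is a small sketch. The temporary file and its contents are assumptions made only so the example is self-contained; the point is the ':encoding(UTF-8)' layer in the open mode, which encodes on output and decodes on input.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file, removed automatically when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Writing: characters are encoded to UTF-8 bytes by the layer.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} "caf\x{e9}\n";          # \x{e9} is e-acute
close $out or die "Cannot close $path: $!";

# Reading: UTF-8 bytes are decoded back into characters.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# length() counts characters, not bytes, once the data is decoded.
printf "%d characters read\n", length $line;
```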

    > (And data can easily become literals in source).


    In many situations, you might be better advised to write unicode
    characters into the source by means of their \x{..} representation.
    Which is not to deny that there can also be situations where you'd
    want to write unicode characters directly - but then you have to be a
    lot more careful with how you edit and transfer your source code.
    See
    http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Effects-of-Character-Semantics
    for more details.
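
    A short sketch of the \x{..} style, with the source file staying pure ASCII. The sample name is an assumption for illustration. Encode's encode() is used only to show how many bytes the string occupies once written out as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+0142 is the Polish l with stroke; the source itself contains no non-ASCII bytes.
my $name = "Micha\x{142}";

my $chars = length $name;                    # counted in characters
my $bytes = length encode('UTF-8', $name);   # counted in UTF-8 bytes

printf "last code point U+%04X, %d characters, %d bytes as UTF-8\n",
    ord(substr($name, -1)), $chars, $bytes;
# prints: last code point U+0142, 6 characters, 7 bytes as UTF-8
```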

    > I put -CSD on
    > my bang-line, which one man page said covers everything
    > (except -CL which I did not want for some reason).


    Could we have a cite on that?

    -C is a request to use wide system calls. It doesn't influence Perl's
    interpretation of your program source or data "as such".

    > But
    > another man page seemed to say that "use utf8;" covered
    > something that -CSD did not, so I put that in, too.


    The perlunicode pod, for the version of Perl that you're using, should
    be your "bible". Don't go tossing-in arbitrary bits and pieces that
    you may have acquired from elsewhere - treat them as possibly
    misleading clues, but check with the authoritative documentation to
    make sure that they really do what you want.

    See what
    http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Important-Caveats
    says about "use utf8;".

    > Is either one interfering with the other in any way?


    I don't know of any reason why they should.

    good luck
     
    Alan J. Flavell, Jan 12, 2005
    #3
  4. Wes Groleau

    Wes Groleau Guest

    Re: utf-8, was Re: Three questions: UTF-8, DBM, hash of lists, ...

    Alan J. Flavell wrote:
    > There are no special awards for folding several questions into one


    No rewards expected or requested.

    > hanging-off the original posting. Confusion all round.


    Welcome to Usenet.

    >>1. At the moment, my source is pure ASCII, but I want to
    >> treat it as UTF-8 because the text I work with is UTF-8
    >> and my editor is configured accordingly.

    >
    > Please distinguish carefully between your program source and your
    > data.


    I did. When I said "source," I meant "source" and when
    I said "text" I meant what you apparently call "data."

    > As a matter of fact, us-ascii -is- a subset of utf-8 - utf-8 was
    > deliberately designed that way - but you *don't* have to use utf-8
    > encoding in your program source in order to process unicode data.


    I know that. However, I prefer that everything on my system
    be interpreted as UTF-8, as I work with French, Spanish, Polish,
    and Japanese. The script is all ASCII _now_ but I could add
    literals for searching or whatever at any time.

    > In any case, Perl's unicode implementation is supposed to be
    > transparent, i.e you shouldn't normally need to know that its internal
    > representation happens to be utf-8. What you /do/ need to know is


    I don't want to know what it does internally, as long as everything
    comes out UTF-8 and is decoded as such going in.

    > what encoding is used in your /external data/, and to tell Perl about
    > it at the appropriate time (e.g by an encoding layer on an I/O
    > statement).


    Since I want _everything_ UTF-8, the appropriate time
    is (if possible) at the beginning of the script.

    > In many situations, you might be better advised to write unicode
    > characters into the source by means of their \x{..} representation.


    My terminal renders the glyphs correctly when I 'cat' UTF-8.
    Why should I have to look up the codes every time instead?
    And although I can compose characters in hex, why should
    I do that instead of cut-and-paste from the editor?

    > Which is not to deny that there can also be situations where you'd
    > want to write unicode characters directly - but then you have to be a
    > lot more careful with how you edit and transfer your source code.
    > See
    > http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Effects-of-Character-Semantics
    > for more details.


    Yes, I read that. I'm trying to minimize the need for "being careful"
    about all those ten zillion details by specifying "everything is UTF-8."

    > -C is a request to use wide system calls. It doesn't influence Perl's
    > interpretation of your program source or data "as such".


    You're right:

    man perlrun
    .....

    As of 5.8.1, the "-C" can be followed either by a number or a list
    of option letters. The letters, their numeric values, and effects
    are as follows; listing the letters is equal to summing the numbers.

    I 1 STDIN is assumed to be in UTF-8
    O 2 STDOUT will be in UTF-8
    E 4 STDERR will be in UTF-8
    S 7 I + O + E
    i 8 UTF-8 is the default PerlIO layer for input streams
    o 16 UTF-8 is the default PerlIO layer for output streams
    D 24 i + o

    Seems to say -CSDA should handle all my IO (I left off the A because
    I still have a little bit of resistance to overcome from the shell)
    except for the script itself. A detail I missed. Not an issue yet,
    but I'd like to fix it before it becomes one.

    >> But
    >> another man page seemed to say that "use utf8;" covered
    >> something that -CSD did not, so I put that in, too.

    >
    > The perlunicode pod, for the version of Perl that you're using, should
    > be your "bible". Don't go tossing-in arbitrary bits and pieces that


    I have 5.8.1 but no pod, so my 'elsewhere' is the man pages
    derived from the pod.

    > See what
    > http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Important-Caveats
    > says about "use utf8;".


    It says the same as my man page: that the pragma is needed
    to "enable UTF-8" in scripts. It doesn't say whether
    "enable" means the script itself or the IO or both.
    However, 'man perlrun' says the -CSD handles the IO,
    and perlunicode says for script encoding, see encoding
    which says that UTF-8 already works in scripts.

    So, things are a little unclear. I put in both, and
    was able to read UTF-8 text, put it in a DBM hash, and
    get it back out. That's good enough for now.

    --
    Wes Groleau
    "Beware the barrenness of a busy life."
    -- George Verwer
     
    Wes Groleau, Jan 15, 2005
    #4
  5. Re: utf-8, was Re: Three questions: UTF-8, DBM, hash of lists, ...

    On Sat, 15 Jan 2005, Wes Groleau wrote:

    > Welcome to Usenet.


    Indeed. It seems from your response, and the rarity of responses from
    other contributors, that you're in the position to offer us all a
    valuable tutorial on the topic.

    > I don't want to know what it does internally, as long as everything
    > comes out UTF-8 and is decoded as such going in.


    Fine, then we're pretty much up to speed already, and I'm sorry that I
    misinterpreted your original posting.

    > > Which is not to deny that there can also be situations where you'd
    > > want to write unicode characters directly - but then you have to
    > > be a lot more careful with how you edit and transfer your source
    > > code. See
    > > http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Effects-of-Character-Semantics
    > > for more details.

    >
    > Yes, I read that. I'm trying to minimize the need for "being
    > careful" about all those ten zillion details by specifying
    > "everything is UTF-8."


    Point made. If you're really in control of all that data then you're
    in a much happier position than I've ever been ;-)

    > I 1 STDIN is assumed to be in UTF-8
    > O 2 STDOUT will be in UTF-8
    > E 4 STDERR will be in UTF-8
    > S 7 I + O + E
    > i 8 UTF-8 is the default PerlIO layer for input streams
    > o 16 UTF-8 is the default PerlIO layer for output streams
    > D 24 i + o
    >
    > Seems to say -CSDA should handle all my IO


    It does, doesn't it? Did I miss the specific problem you were having,
    and your test case that demonstrated it?

    > > > But
    > > > another man page seemed to say that "use utf8;" covered
    > > > something that -CSD did not, so I put that in, too.

    > >
    > > The perlunicode pod, for the version of Perl that you're using,
    > > should be your "bible". Don't go tossing-in arbitrary bits and
    > > pieces that

    >
    > I have 5.8.1 but no pod, so my 'elsewhere' is the man pages
    > derived from the pod.


    No disagreement there. More than one way to...read the documentation.

    > > See what
    > > http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Important-Caveats
    > > says about "use utf8;".

    >
    > It says the same as my man page: that the pragma is needed
    > to "enable UTF-8" in scripts.


    Hmmm? At 5.8.4 (and I don't remember it being different in recent
    versions before that) it says [this'll need monospace display, and go
    sadly wrong with these newfangled usenet-ish interfaces, sorry]:

    As a compatibility measure, the use utf8 pragma must be explicitly
    included to enable recognition of UTF-8 in the Perl scripts
                                            ^^^^^^^^^^^^^^^^^^^
    themselves (in string or regular expression literals, or in
    ^^^^^^^^^^
    identifier names) on ASCII-based machines or to recognize UTF-EBCDIC
    on EBCDIC-based machines. These are the only times when an explicit
                                            ^^^^^^^^^^
    use utf8 is needed.

    > However, 'man perlrun' says the -CSD handles the IO,


    Indeed, and (fwiw) I don't see anything there about encoding of the
    script's source code itself.
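
    For what it's worth, the two concerns can also be stated separately inside the script. This is only a sketch, assuming the file is saved as UTF-8: "use utf8" covers literals in the source, and the open pragma with :std covers the standard handles and the default layers for handles opened later, which is roughly the ground -CSD covers.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                   # the source text itself (literals, identifiers) is UTF-8
use open qw(:std :utf8);    # STDIN/STDOUT/STDERR and default layers for open() are UTF-8

# With "use utf8" the literal below is four characters, not five bytes.
my $word = "café";

print length($word), " characters: $word\n";
# prints: 4 characters: café
```

    Without the "use utf8" line the same literal would be seen as five separate bytes, whatever the I/O settings are.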

    > and perlunicode says for script encoding, see encoding
    > which says that UTF-8 already works in scripts.


    It "works", yes, but (as I understand it, anyway) I think you have to
    ask for it. It could just be that if you call for locale-awareness
    with -CL, and you have utf-8 in your locale, it will come out in the
    wash; but I don't see any harm in asking for it directly, if you're so
    certain that you'll never not want it (sorry for the double-negative).

    > So, things are a little unclear. I put in both,


    Looks as if you're (a) right and (b) unlikely to cause any harm.

    > was able to read UTF-8 text, put it in a DBM hash, and
    > get it back out. That's good enough for now.


    Good luck
     
    Alan J. Flavell, Jan 15, 2005
    #5
  6. Wes Groleau

    Wes Groleau Guest

    perl 5.8 bug ? (was Re: utf-8, was Re: Three questions: ....)

    Alan J. Flavell wrote:
    [re UTF-8 in perl scripts]

    > It "works", yes, but (as I understand it, anyway) I think you have to
    > ask for it. It could just be that if you call for locale-awareness
    > with -CL, and you have utf-8 in your locale, it will come out in the
    > wash; but I don't see any harm in asking for it directly, if you're so
    > certain that you'll never not want it (sorry for the double-negative).


    I also left the L off of -C because I don't think I have that completely
    coerced to UTF-8

    >>So, things are a little unclear. I put in both,

    >
    > Looks as if you're (a) right and (b) unlikely to cause any harm.


    Sigh, now it starts getting weird. Kind of long, summary at the bottom.

    The script with -CSD and use utf8 created a database,
    and a test script pulled the records out of the database
    and printed them. The non-ASCII characters rendered
    correctly BUT that doesn't mean anything, since the test
    script had the same -CSD and use utf8. (Right?)

    So I figured I needed to eyeball inside the DB file
    and see if I could find some nonASCII and see how it was encoded.

    But a series of unfortunate events resulted in my having
    to re-create the script, and then it crashed (bus error
    or segmentation fault). Figured out which record it
    was crashing on, put it in its own file, and ....
    well to skip over the long tedious details, I eventually
    had a version of the script that would crash and one that
    would not crash on the same input file.

    'diff' showed only one difference:

    wgroleau$ diff ~/bin/GEDCOM_DB ./tempGCDB
    1c1
    < #!/usr/bin/perl -w -CSD
    ---
    > #!/usr/bin/perl -w -CSD


    od -xc revealed that the extra space is indeed a (hex 20)
    regular space and not a UTF-8 construct.

    More study showed that the space made a difference on the only
    two systems I currently have access to:

    wgroleau$ uname -a
    Darwin Groleau.local 7.7.0 Darwin Kernel Version 7.7.0: Sun Nov 7
    16:06:51 PST 2004; root:xnu/xnu-517.9.5.obj~1/RELEASE_PPC Power
    Macintosh powerpc
    wgroleau$ perl -v

    This is perl, v5.8.1-RC3 built for darwin-thread-multi-2level
    (with 1 registered patch, see perl -V for more detail)

    Copyright 1987-2003, Larry Wall

    AND

    [0:ag/g/groleau> uname -a
    NetBSD otaku 1.6.2_STABLE NetBSD 1.6.2_STABLE (sdf) #0: Sun Jul 25
    04:17:09 UTC 2004 root@ol:/var/src/src/sys/arch/alpha/compile/sdf alpha

    [0:ag/g/groleau> perl -v

    This is perl, v5.8.0 built for alpha-netbsd

    Copyright 1987-2002, Larry Wall


    On Darwin/PPC, the extra space prevents bus error/segmentation fault.
    On NetBSD/Alpha, it prevents the following:

    [0:ag/g/groleau> rm wgroleau.DB; ./tempGCDB < bad.record.GED
    Recompile perl with -DDEBUGGING to use -D switch
    Can't emulate -S on #! line at ./tempGCDB line 1.
    [255:ag/g/groleau> head -1 ./tempGCDB
    #!/usr/pkg/bin/perl -w -CSD


    Summary: On two different platforms, in

    #!/usr/bin/perl -w -CSD

    the extra space is required.

    If anyone wants to try it on a different system, I can provide
    the script and the input file.

    --
    Wes Groleau
    -----------

    "Thinking I'm dumb gives people something to
    feel smug about. Why should I disillusion them?"
    -- Charles Wallace
    (in _A_Wrinkle_In_Time_)
     
    Wes Groleau, Jan 16, 2005
    #6
  7. Re: utf-8, was Re: Three questions: UTF-8, DBM, hash of lists, ...

    Wes Groleau wrote:
    > Alan J. Flavell wrote:
    >> There are no special awards for folding several questions into one

    >
    > No rewards expected or requested.
    >
    >> hanging-off the original posting. Confusion all round.

    >
    > Welcome to Usenet.



    So long then.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Jan 16, 2005
    #7
