Parse using Text::CSV into Hash

Discussion in 'Perl Misc' started by batrams, Nov 10, 2011.

  1. batrams

    batrams Guest

    Hi guys - this hunk of code seems to function as intended. After I get
    code to work I'm often curious to run it by you gurus to see how you
    might have done it. Please don't laugh too hard but any comments are
    welcome.




    open (TEXTFILE, "USES.txt") || die "not found\n";
    my $csv = Text::CSV->new();

    while ( <TEXTFILE> ) {
        $csv->parse($_);
        push(@H_USES, [$csv->fields]);
    }
    close(TEXTFILE);

    # load up the hash
    for my $i ( 0 .. $#H_USES ) {
        $H_USES{$H_USES[$i][0]} = $H_USES[$i];
    }




    DB
     
    batrams, Nov 10, 2011
    #1

  2. Jim Gibson

    Jim Gibson Guest

    In article <j9gs9g$e1r$>, batrams <> wrote:

    > Hi guys - this hunk of code seems to function as intended. After I get
    > code to work I'm often curious to run it by you gurus to see how you
    > might have done it. Please don't laugh too hard but any comments are
    > welcome.
    >
    >
    >
    >
    > open (TEXTFILE, "USES.txt") || die "not found\n";


    Lexical variables with lower-case names are better than package globals
    in uppercase. The 'or' operator has better precedence for this use than
    '||'. It is always good to include the open error ($!) in the die
    message. I would usually put the file name in a scalar variable and
    include that in the error message as well.
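    Concretely, something like this (a sketch; the variable names are
    mine):

    use strict;
    use warnings;

    # keep the file name in a variable so the error message can use it
    my $file = 'USES.txt';
    open( my $fh, '<', $file )
        or die "Couldn't open '$file': $!\n";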

    See the examples in the documentation for Text::CSV for a better style
    of programming to read in a text file.

    > my $csv = Text::CSV->new();
    >
    > while ( <TEXTFILE> ) {
    > $csv->parse($_);
    > push(@H_USES, [$csv->fields]);


    I would use lower-case variable names, and I always include the
    parentheses for function/subroutine/method calls ($csv->fields()), even
    if they are not required, as it gives me a better clue of what is going
    on.

    You should always 'use strict;' and declare your variables.

    > }
    > close(TEXTFILE);
    >
    > # load up the hash
    > for my $i ( 0 .. $#H_USES ) {
    > $H_USES{$H_USES[$i][0]} = $H_USES[$i];
    > }


    Loops should be indented, and the integer index is not needed:

    for my $row ( @H_USES ) {
        $H_USES{$row->[0]} = $row;
    }

    But I would populate the hash in the while loop, rather than in a
    separate loop afterwards.
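    Something like this (a sketch using the getline() method shown in the
    Text::CSV documentation; the variable names are mine):

    use strict;
    use warnings;
    use Text::CSV;

    my $file = 'USES.txt';
    my $csv  = Text::CSV->new( { binary => 1 } )
        or die "Cannot construct Text::CSV object\n";

    open( my $fh, '<', $file )
        or die "Couldn't open '$file': $!\n";

    my %uses;
    while ( my $row = $csv->getline($fh) ) {
        # key each record on its first column, as in the original code
        $uses{ $row->[0] } = $row;
    }
    close($fh);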

    --
    Jim Gibson
     
    Jim Gibson, Nov 10, 2011
    #2

  3. J. Gleixner

    J. Gleixner Guest

    On 11/10/11 15:57, batrams wrote:
    > Hi guys - this hunk of code seems to function as intended. After I get
    > code to work I'm often curious to run it by you gurus to see how you
    > might have done it. Please don't laugh too hard but any comments are
    > welcome.


    If you want a hash, see the getline_hr method.
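    For example (a sketch; USES.txt has no header row in the original
    post, so the column names here are made up):

    use strict;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new( { binary => 1 } );

    open( my $fh, '<', 'USES.txt' )
        or die "Couldn't open 'USES.txt': $!\n";

    # getline_hr() needs column names; these ones are hypothetical
    $csv->column_names(qw( id name type description ));

    while ( my $row = $csv->getline_hr($fh) ) {
        print "id $row->{id} has name $row->{name}\n";
    }
    close($fh);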
     
    J. Gleixner, Nov 10, 2011
    #3
  4. Ben Morrow <> writes:

    [...]

    > open (my $TEXTFILE, "<", $file)
    > or die "Couldn't open '$file': $!\n";


    [...]

    > - I put back the "\n" on the die message: file and line are useful
    > for a programmer, and should be left in on messages you don't
    > expect to ever see in production, but messages addressed to the
    > end user don't want them.


    The end-user is going to report the problem as "It crashed" no matter
    what actually occurred and any information said end-user can just
    copy&paste into an e-mail is possibly helpful to the person who is
    supposed to deal with the issue.

    [...]

    >> close(TEXTFILE);

    >
    > Another advantage of lexical filehandles is they auto-close when they go
    > out of scope. It's risky taking advantage of this if you have written to
    > the handle: in that case you should be calling close explicitly and
    > checking the return value.


    Or so the Linux close(2) manpage claims. The reason it does this is
    because data written to a file residing on a NFS filesystem (and
    possibly, other network file systems) isn't necessarily sent to the
    server until the (NFS) filehandle is closed. Because of this, the
    return value of a close of such a file may be some error which occurred
    during network communication. For practical purposes, this is often
    irrelevant and - much more importantly - checking the return value of
    the close doesn't help except if reporting "This didn't work. Bad for
    you!" is considered to be useful. Programs which care about the
    integrity of data written to some persistent medium, at least to the
    degree this is possible, need to use fsync before closing the file and
    need to implement some strategy for dealing with errors returned by
    that (for Perl, this means invoking IO::Handle::flush, followed by
    IO::Handle::sync).

    The actual write operation can fail even for local files after close
    has returned success.
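    In code, the flush-then-sync sequence looks something like this (a
    sketch with a hypothetical output file):

    use strict;
    use warnings;
    use IO::Handle;    # provides the flush() and sync() methods

    my $file = 'output.dat';    # hypothetical file name
    open( my $out, '>', $file )
        or die "Couldn't open '$file': $!\n";

    print {$out} "important data\n"
        or die "write to '$file' failed: $!\n";

    # push PerlIO's buffer into the kernel, then ask the kernel to
    # commit the data to the storage medium before relying on close()
    $out->flush or die "flush of '$file' failed: $!\n";
    $out->sync  or die "fsync of '$file' failed: $!\n";

    close($out) or die "close of '$file' failed: $!\n";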
     
    Rainer Weikusat, Nov 10, 2011
    #4
  5. my %H_USES;
    while (<DATA>) {
        $_=[split ','];$H_USES{$_->[0]}=$_
    }

    use Data::Dumper; print Dumper(\%H_USES);

    __DATA__
    a1,b1,c1,d1
    a2,b2,c2,d2
    a3,b3,c3,d3
    a3,b3,c3,d3
     
    George Mpouras, Nov 11, 2011
    #5
  6. Justin C

    Justin C Guest

    On 2011-11-10, Ben Morrow <> wrote:

    [snip]

    > The only important improvement here is the 'my $TEXTFILE'. Bareword
    > filehandles are always global


    Hey! And there's the answer to a question I was going to post! Couldn't
    figure out why I had to pass a sub a ref to a hash from the calling
    code, but it figured out the FH all by itself. Now I know... and I know
    why not to do it too. TY!

    [snip]

    Justin.

    --
    Justin C, by the sea.
     
    Justin C, Nov 11, 2011
    #6
  7. Dr.Ruud

    Dr.Ruud Guest

    On 2011-11-11 08:18, George Mpouras wrote:

    > split ','


    Hmm. What do you think that

    split '[,]'

    does?

    --
    Ruud
     
    Dr.Ruud, Nov 11, 2011
    #7
  8. batrams

    batrams Guest

    On 11/10/2011 2:10 PM, J. Gleixner wrote:
    > hash, see the getline_hr



    Thanks! That looks like what I need :)

    DB
     
    batrams, Nov 11, 2011
    #8
  9. Ben Morrow <> writes:
    > Quoth Rainer Weikusat <>:
    >> Ben Morrow <> writes:
    >>
    >> > open (my $TEXTFILE, "<", $file)
    >> > or die "Couldn't open '$file': $!\n";

    >>
    >> > - I put back the "\n" on the die message: file and line are useful
    >> > for a programmer, and should be left in on messages you don't
    >> > expect to ever see in production, but messages addressed to the
    >> > end user don't want them.

    >>
    >> The end-user is going to report the problem as "It crashed" no matter
    >> what actually occurred and any information said end-user can just
    >> copy&paste into an e-mail is possibly helpful to the person who is
    >> supposed to deal with the issue.

    >
    > You are certainly right for a large (and important) class of end-users.
    > Programs aimed at such people should, as you say, simply give a message
    > saying 'there was a problem' plus some way to send someone competent a
    > logfile with as much detailed information as possible.


    That wasn't what I was saying. I intended to argue in favor of
    omitting the \n because an event which causes a program to die is
    usually one some kind of developer will need to deal with and hence,
    including information which is only relevant to a developer in the
    diagnostic makes sense.
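    For reference, the difference under discussion (the file and line in
    the second case are illustrative):

    die "cannot open foo\n";    # prints exactly: cannot open foo
    die "cannot open foo";      # Perl appends: " at script.pl line 42."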

    [...]

    > Look at the error messages given by the standard Unix utilities. They
    > (usually) do include filenames and strerror messages, but very rarely
    > include __FILE__ and __LINE__ information.


    In C-code, I always include the name of the function where the error
    occurred. Granted, the default 'sudden death message suffix' is more
    verbose than I consider necessary, OTOH, it contains enough
    information to locate the operation that failed in the code and it is
    already built in.

    [...]

    >> > Another advantage of lexical filehandles is they auto-close when they go
    >> > out of scope. It's risky taking advantage of this if you have written to
    >> > the handle: in that case you should be calling close explicitly and
    >> > checking the return value.

    >>
    >> Or so the Linux close(2) manpage claims. The reason it does this is
    >> because data written to a file residing on a NFS filesystem (and
    >> possibly, other network file systems) isn't necessarily sent to the
    >> server until the (NFS) filehandle is closed. Because of this, the
    >> return value of a close of such a file may be some error which occurred
    >> during network communication.

    >
    > You don't need to go anywhere near that far. PerlIO (and stdio in C)
    > buffers writes, so when you call close (that is, the equivalent of
    > fclose(3), not close(2)) there is still a partial bufferful of data
    > which hasn't been sent to the kernel yet. If that write(2) gives an
    > error, that error will be returned from close.


    So, more generally put: Using hidden buffering mechanisms is risky in
    the sense that they decouple the operation which provides the data and
    the operation which moves the data to its destination. As side effects
    of that, the time window during which data loss could happen becomes
    much larger and an application can no longer determine where in the
    produced 'output data stream' the problem actually occurred (and try
    to recover from that). The justification for hidden buffering is
    'usually, it works and it usually improves the performance of the
    system while it is working significantly'.

    'Reliable I/O' or even just 'reliably reporting I/O errors' and
    'hidden buffering' can't go together and ...

    > Additionally, close will return an error if *any* of the previous
    > writes failed (unless you call IO::Handle->clearerr), so checking
    > the return value of close means you don't need to check the return
    > value of print.


    ... checking the return value of close is just a fig leaf covering the
    problem: If everything worked, it is redundant and when something
    didn't work, reporting that "something didn't work" (but I know neither
    what nor when and can't do anything to repair that) is not helpful.
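    For reference, the idiom being discussed (a sketch with a hypothetical
    file and data):

    use strict;
    use warnings;

    my @records = ( "one\n", "two\n" );    # hypothetical data
    open( my $out, '>', 'out.txt' )
        or die "open 'out.txt': $!\n";
    print {$out} $_ for @records;    # write errors are latched on the handle...
    close($out)
        or die "close of 'out.txt' failed: $!\n";    # ...and surface here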

    >> For practical purposes, this is often irrelevant and - much more
    >> importantly - checking the return value of the close doesn't help
    >> except if reporting "This didn't work. Bad for you!" is considered
    >> to be useful.

    >
    > It is. Always. It's better to *know* something failed, and have some
    > idea as to why, than to think it succeeded when it didn't.


    Let's assume that somebody runs a long-running data processing task
    supposed to generate a lot of important output. On day three, the disk
    is full but the program doesn't notice that. On day fifteen, after the
    task has completed, it checks the return value of close and prints
    "The disk became full! You lost!".

    Do you really think a user will appreciate that?

    >> Programs which care about the integrity of data written to some
    >> persistent medium, at least to the degree this is possible, need to
    >> use fsync before closing the file and need to implement some strategy
    >> for dealing with errors returned by that (for Perl, this means
    >> invoking IO::Handle::flush, followed by IO::Handle::sync).

    >
    > Yes, if you have strict integrity requirements simply checking close
    > isn't enough. It does guard against the two most common causes of error,
    > though: disk full (or quota exceeded) and network failure.


    Or so you hope: In case of a regular file residing on some local
    persistent storage medium, a write system call encountering a 'too
    little free space to write that' problem is supposed to write as many
    bytes as can be written and return this count to the caller. This
    means a final write on close will not return ENOSPC but will silently
    truncate the file. Since close doesn't return a count, there's no way
    to detect that. And there are - of course - people who think "Well, 0
    is as good a number as any other number, so, in the case that write
    really can't write anything, it is supposed to write as many bytes as
    it can, namely, 0 bytes, and report this as successful write of size 0
    to the caller" (I'm not making this up). Then, there are 'chamber of
    horrors' storage media where 'successfully written' doesn't mean 'can
    be read back again' and 'can be read back now' doesn't mean 'can still
    be read back five minutes from now', aka 'flash ROM'. In case of
    network communication, a remote system is perfectly capable of sending
    an acknowledgment for something and die a sudden death before the
    acknowledged data actually hit the persistent storage medium and so
    on.

    E.g., so far, I have written one program supposed to work reliably
    even though it runs in Germany and writes to an NFS-mounted
    filesystem residing in the UK. This program not only checks the
    return value of each I/O operation but additionally checks that the
    state of the remote file system is what it was supposed to be if the
    operation was actually performed successfully, and retries everything
    (with exponential backoff) until success or administrative
    intervention --- network programming remains network programming, no
    matter if the file system API is used for high-level convenience.
     
    Rainer Weikusat, Nov 11, 2011
    #9
  10. Ben Morrow <> writes:
    > Quoth Rainer Weikusat <>:
    >> Ben Morrow <> writes:
    >> > Quoth Rainer Weikusat <>:
    >> >>
    >> >> The end-user is going to report the problem as "It crashed" no matter
    >> >> what actually occurred and any information said end-user can just
    >> >> copy&paste into an e-mail is possibly helpful to the person who is
    >> >> supposed to deal with the issue.
    >> >
    >> > You are certainly right for a large (and important) class of end-users.
    >> > Programs aimed at such people should, as you say, simply give a message
    >> > saying 'there was a problem' plus some way to send someone competent a
    >> > logfile with as much detailed information as possible.

    >>
    >> That wasn't what I was saying. I intended to argue in favor of
    >> omitting the \n because an event which causes a program to die is
    >> usually one some kind of developer will need to deal with

    >
    > I disagree.
    >
    > ~% grep mauzo /etc/paswd
    > grep: /etc/paswd: No such file or directory
    >
    > That message would *not* be improved by including the file and line in
    > grep's source where the error happened to be discovered.


    As I already wrote:

    ,----
    |In C-code, I always include the name of the function where the error
    |occurred. Granted, the default 'sudden death message suffix' is more
    |verbose than I consider necessary, OTOH, it contains enough
    |information to locate the operation that failed in the code and it is
    |already built in.
    `----

    >> >> For practical purposes, this is often irrelevant and - much more
    >> >> importantly - checking the return value of the close doesn't help
    >> >> except if reporting "This didn't work. Bad for you!" is considered
    >> >> to be useful.
    >> >
    >> > It is. Always. It's better to *know* something failed, and have some
    >> > idea as to why, than to think it succeeded when it didn't.

    >>
    >> Let's assume that somebody runs a long-running data processing task
    >> supposed to generate a lot of important output. On day three, the disk
    >> is full but the program doesn't notice that. On day fifteen, after the
    >> task has completed, it checks the return value of close and prints
    >> "The disk became full! You lost!".
    >>
    >> Do you really think a user will appreciate that?

    >
    > As opposed to the program falsely claiming it completed successfully?
    > Yes, absolutely.


    I agree with that: Should the program print "I'm happy to inform you
    that your task ran to completion" even though it didn't, the person
    whose time got lost because of the lack of a sensible error-handling
    strategy will certainly be even more pissed than when the program
    just prints "Sorry. Couldn't be bothered to check print return
    values. Something failed".

    To state this again: Either the application doesn't care for the
    result of an I/O operation (and there are valid reasons for that,
    e.g., short-running, interactive tasks where any failure will be
    obvious to the user), in which case checking the return value of
    close is pointless. Or the application does care for the results of
    I/O operations, in which case the hidden buffering facility whose
    shortcomings are supposed to be plastered over by checking the return
    value of close must not be used. "I'll just say 'Abracadabra' and
    firmly believe everything's fine" doesn't work.

    >> >> Programs which care about the integrity of data written to some
    >> >> persistent medium, at least to the degree this is possible, need to
    >> >> use fsync before closing the file and need to implement some strategy
    >> >> for dealing with errors returned by that (for Perl, this means
    >> >> invoking IO::Handle::flush, followed by IO::Handle::sync).
    >> >
    >> > Yes, if you have strict integrity requirements simply checking close
    >> > isn't enough. It does guard against the two most common causes of error,
    >> > though: disk full (or quota exceeded) and network failure.

    >>
    >> Or so you hope: In case of a regular file residing on some local
    >> persistent storage medium, a write system call encountering a 'too
    >> little free space to write that' problem is supposed to write as many
    >> bytes as can be written and return this count to the caller. This
    >> means a final write on close will not return ENOSPC but will silently
    >> truncate the file. Since close doesn't return a count, there's no way
    >> to detect that.

    >
    > RTFS, of either perl or libc (or, you know, the manpages). fclose(3)
    > calls fflush(3), which will repeatedly call write(2) until the whole of
    > the remaining buffer has been successfully written,


    Or so you hope. You cannot possibly have checked the source code of
    each past, present and future hidden buffering facility and I
    seriously doubt that you did check even a large minority of them, not
    least because the source code of many of them isn't available. And
    - of course - since the kernel will usually provide another layer of
    hidden buffering, this is still technically equivalent to "La la la. I
    didn't hear a thing!".

    > Perl's PerlIO functions do exactly the same thing. stdio certainly has
    > some design bugs, but it wasn't designed by idiots.


    This is not a matter of the design but of the implementation.

    >> And there are - of course - people who think "Well, 0
    >> is as good a number as any other number, so, in the case that write
    >> really can't write anything, it is supposed to write as many bytes as
    >> it can, namely, 0 bytes, and report this as successful write of size 0
    >> to the caller" (I'm not making this up).

    >
    > SUSv4 says:
    >
    > | Where this volume of POSIX.1-2008 requires -1 to be returned and errno
    > | set to [EAGAIN], most historical implementations return zero [...].
    >
    > so, while this behaviour is not strictly POSIX-conforming, a decently-
    > written application should handle that case correctly.


    And that [EAGAIN] case is:

    ,----
    | These functions shall fail if:
    |
    | [EAGAIN]
    |     The O_NONBLOCK flag is set for the file descriptor and the thread
    |     would be delayed in the write() operation.
    `----

    Not applicable. In case of the 'write returning zero instead of
    ENOSPC' issue (and this is a real implementation), the program will
    loop forever in the flushing routine.
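    For illustration, an explicit syswrite flushing loop of the kind under
    discussion; a write that keeps returning zero without setting an error
    would make a naive version of this spin forever (a sketch, names mine):

    use strict;
    use warnings;

    sub write_all {
        my ( $fh, $buf ) = @_;
        my $off = 0;
        while ( $off < length $buf ) {
            my $n = syswrite( $fh, $buf, length($buf) - $off, $off );
            return undef unless defined $n;    # real error; caller checks $!
            return undef if $n == 0;           # bail out rather than loop forever
            $off += $n;
        }
        return 1;
    }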
     
    Rainer Weikusat, Nov 11, 2011
    #10
