Parse using Text::CSV into a Hash


batrams

Hi guys - this hunk of code seems to function as intended. After I get
code to work I'm often curious to run it by you gurus to see how you
might have done it. Please don't laugh too hard but any comments are
welcome.




open (TEXTFILE, "USES.txt") || die "not found\n";
my $csv = Text::CSV->new();

while ( <TEXTFILE> ) {
$csv->parse($_);
push(@H_USES, [$csv->fields]);
}
close(TEXTFILE);

# load up the hash
for my $i ( 0 .. $#H_USES ) {
$H_USES{$H_USES[$i][0]} = $H_USES[$i];
}




 

Jim Gibson

Hi guys - this hunk of code seems to function as intended. After I get
code to work I'm often curious to run it by you gurus to see how you
might have done it. Please don't laugh too hard but any comments are
welcome.




open (TEXTFILE, "USES.txt") || die "not found\n";

Lexical variables with lower-case names are better than package globals
in uppercase. The low-precedence 'or' operator is better suited to this
use than '||'. It is always good to include the open error ($!) in the
die message. I would usually put the file name in a scalar variable and
include that in the error message as well.

See the examples in the documentation for Text::CSV for a better style
of programming to read in a text file.
my $csv = Text::CSV->new();

while ( <TEXTFILE> ) {
$csv->parse($_);
push(@H_USES, [$csv->fields]);

I would use lower-case variable names, and I always include the
parentheses for function/subroutine/method calls ($csv->fields()), even
if they are not required, as it gives me a better clue of what is going
on.

You should always 'use strict;' and declare your variables.
}
close(TEXTFILE);

# load up the hash
for my $i ( 0 .. $#H_USES ) {
$H_USES{$H_USES[$i][0]} = $H_USES[$i];
}

Loops should be indented, and the integer index is not needed:

for my $row ( @H_USES ) {
    $H_USES{$row->[0]} = $row;
}

But I would populate the hash in the while loop, rather than in a
separate loop afterwards.
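
Pulling those suggestions together, a rough sketch might look like the
following (the file name comes from the original post; the lower-case
variable names and the Text::CSV options are just placeholders):

use strict;
use warnings;
use Text::CSV;

my $file = 'USES.txt';
open(my $fh, '<', $file) or die "Cannot open '$file': $!";

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

my %uses;
while ( my $row = $csv->getline($fh) ) {
    # Key each record by its first field, as the original code did,
    # but build the hash while reading instead of in a second loop.
    $uses{ $row->[0] } = $row;
}
close($fh);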
 

J. Gleixner

Hi guys - this hunk of code seems to function as intended. After I get
code to work I'm often curious to run it by you gurus to see how you
might have done it. Please don't laugh too hard but any comments are
welcome.

If you want a hash, see the getline_hr method.
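
A rough sketch of that approach, with invented column names (with a real
header line you could feed the first row to column_names instead):

use strict;
use warnings;
use Text::CSV;

my $file = 'USES.txt';
open(my $fh, '<', $file) or die "Cannot open '$file': $!";

my $csv = Text::CSV->new({ binary => 1 });

# The column names here are made up for the example; a real header line
# could be used via $csv->column_names($csv->getline($fh)).
$csv->column_names(qw(id name qty note));

my %uses;
while ( my $row = $csv->getline_hr($fh) ) {
    $uses{ $row->{id} } = $row;    # each value is now a hash reference
}
close($fh);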
 

Rainer Weikusat

[...]
open (my $TEXTFILE, "<", $file)
    or die "Couldn't open '$file': $!\n";
[...]

- I put back the "\n" on the die message: file and line are useful
for a programmer, and should be left in on messages you don't
expect to ever see in production, but messages addressed to the
end user don't want them.

The end-user is going to report the problem as "It crashed" no matter
what actually occurred and any information said end-user can just
copy&paste into an e-mail is possibly helpful to the person who is
supposed to deal with the issue.

[...]
Another advantage of lexical filehandles is they auto-close when they go
out of scope. It's risky taking advantage of this if you have written to
the handle: in that case you should be calling close explicitly and
checking the return value.

Or so the Linux close(2) manpage claims. The reason it does this is
that data written to a file residing on an NFS filesystem (and
possibly other network file systems) isn't necessarily sent to the
server until the (NFS) filehandle is closed. Because of this, the
return value of a close of such a file may be some error which occurred
during network communication. For practical purposes, this is often
irrelevant and - much more importantly - checking the return value of
the close doesn't help except if reporting "This didn't work. Bad for
you!" is considered to be useful. Programs which care about the
integrity of data written to some persistent medium, at least to the
degree this is possible, need to use fsync before closing the file and
need to implement some strategy for dealing with errors returned by
that (for Perl, this means invoking IO::Handle::flush, followed by
IO::Handle::sync).
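
A rough sketch of that flush-then-sync sequence, for a program that does
care whether its output reached the medium (file name and data are made
up):

use strict;
use warnings;
use IO::Handle;

my $out     = 'important.log';
my @records = ('first record', 'second record');

open(my $fh, '>', $out) or die "Cannot open '$out': $!";

for my $record (@records) {
    print {$fh} "$record\n" or die "Write to '$out' failed: $!";
}

# Push PerlIO's buffer to the kernel, ask the kernel to commit the data
# to the medium, and only then close and check the result.
$fh->flush or die "Flush of '$out' failed: $!";
$fh->sync  or die "fsync of '$out' failed: $!";
close($fh) or die "Close of '$out' failed: $!";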

The actual write operation can still fail after close has returned
success, even for local files.
 

George Mpouras

my %H_USES;
while (<DATA>) {
    chomp;                      # drop the newline so the last field stays clean
    $_ = [ split ',' ];
    $H_USES{ $_->[0] } = $_;
}

use Data::Dumper; print Dumper(\%H_USES);

__DATA__
a1,b1,c1,d1
a2,b2,c2,d2
a3,b3,c3,d3
a3,b3,c3,d3
 

Justin C

[snip]
The only important improvement here is the 'my $TEXTFILE'. Bareword
filehandles are always global.

Hey! And there's the answer to a question I was going to post! Couldn't
figure out why I had to pass a sub a ref to a hash from the calling
hash, but it figured out the FH all by itself. Now I know... and I know
why not to do it too. TY!

[snip]

Justin.
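
The behaviour Justin describes can be seen in a small sketch (the file
name is from the original post; the sub names are invented):

use strict;
use warnings;

open(LOG, '<', 'USES.txt') or die "Cannot open USES.txt: $!";

sub first_line_from_global {
    # LOG is a package-global bareword handle, so this sub can read
    # from it without being handed anything.
    return scalar <LOG>;
}

sub first_line_from_lexical {
    # A lexical handle created elsewhere has to be passed in explicitly.
    my ($handle) = @_;
    return scalar <$handle>;
}

print first_line_from_global();

{
    # This lexical handle only exists inside the block ...
    open(my $fh, '<', 'USES.txt') or die "Cannot open USES.txt: $!";
    print first_line_from_lexical($fh);    # ... so it is passed to the sub
}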
 

Rainer Weikusat

Ben Morrow said:
You are certainly right for a large (and important) class of end-users.
Programs aimed at such people should, as you say, simply give a message
saying 'there was a problem' plus some way to send someone competent a
logfile with as much detailed information as possible.

That wasn't what I was saying. I intended to argue in favor of
omitting the \n because an event which causes a program to die is
usually one that some kind of developer will need to deal with, and
hence including information which is only relevant to a developer in
the diagnostic makes sense.

[...]
Look at the error messages given by the standard Unix utilities. They
(usually) do include filenames and strerror messages, but very rarely
include __FILE__ and __LINE__ information.

In C-code, I always include the name of the function where the error
occurred. Granted, the default 'sudden death message suffix' is more
verbose than I consider necessary, OTOH, it contains enough
information to locate the operation that failed in the code and it is
already built in.

[...]
You don't need to go anywhere near that far. PerlIO (and stdio in C)
buffers writes, so when you call close (that is, the equivalent of
fclose(3), not close(2)) there is still a partial bufferful of data
which hasn't been sent to the kernel yet. If that write(2) gives an
error, that error will be returned from close.

So, more generally put: Using hidden buffering mechanisms is risky in
the sense that they decouple the operation which provides the data and
the operation which moves the data to its destination. As side effects
of that, the time window during which data loss could happen becomes
much larger and an application can no longer determine where in the
produced 'output data stream' the problem actually occurred (and try
to recover from that). The justification for hidden buffering is
'usually it works, and while it is working it usually improves the
performance of the system significantly'.

'Reliable I/O' or even just 'reliably reporting I/O errors' and
'hidden buffering' can't go together and ...
Additionally, close will return an error if *any* of the previous
writes failed (unless you call IO::Handle->clearerr), so checking
the return value of close means you don't need to check the return
value of print.

.... checking the return value of close is just a fig leaf covering the
problem: If everything worked, it is redundant and when something
didn't work, reporting that "something didn't work" (but I know neither
what nor when and can't do anything to repair that) is not helpful.
It is. Always. It's better to *know* something failed, and have some
idea as to why, than to think it succeeded when it didn't.

Let's assume that somebody runs a long-running data processing task
supposed to generate a lot of important output. On day three, the disk
is full but the program doesn't notice that. On day fifteen, after the
task has completed, it checks the return value of close and prints
"The disk became full! You lost!".

Do you really think a user will appreciate that?
Yes, if you have strict integrity requirements simply checking close
isn't enough. It does guard against the two most common causes of error,
though: disk full (or quota exceeded) and network failure.

Or so you hope: In case of a regular file residing on some local
persistent storage medium, a write system call encountering a 'too
little free space to write that' problem is supposed to write as many
bytes as can be written and return this count to the caller. This
means a final write on close will not return ENOSPC but will silently
truncate the file. Since close doesn't return a count, there's no way
to detect that. And there are - of course - people who think "Well, 0
is as good a number as any other number, so, in the case that write
really can't write anything, it is supposed to write as many bytes as
it can, namely, 0 bytes, and report this as successful write of size 0
to the caller" (I'm not making this up). Then, there are 'chamber of
horrors' storage media where 'successfully written' doesn't mean 'can
be read back again' and 'can be read back now' doesn't mean 'can still
be read back five minutes from now', aka 'flash ROM'. In case of
network communication, a remote system is perfectly capable of sending
an acknowledgment for something and dying a sudden death before the
acknowledged data actually hits the persistent storage medium, and so
on.

Eg, so far, I have written one program supposed to work reliably even
though it runs in Germany and writes to an NFS-mounted filesystem
residing in the UK, and this program not only checks the return value
of each I/O operation but additionally checks that the state of the
remote file system is what it was supposed to be if the operation was
actually performed successfully, and retries everything (with
exponential backoff) until success or administrative intervention ---
network programming remains network programming, no matter if the file
system API is used for high-level convenience.
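
A toy sketch of that retry-and-verify pattern (this is not the actual
program; the path, the back-off cap and the whole-file verification are
simplified stand-ins):

use strict;
use warnings;

sub write_with_retry {
    my ($path, $data) = @_;
    my $delay = 1;

    while (1) {
        my $ok = eval {
            open(my $fh, '>', $path) or die "open '$path': $!";
            print {$fh} $data        or die "print to '$path': $!";
            close($fh)               or die "close '$path': $!";

            # Read the file back and compare, instead of trusting the
            # return values alone.
            open(my $check, '<', $path) or die "reopen '$path': $!";
            my $back = do { local $/; <$check> };
            close($check);
            die "verification failed for '$path'\n"
                unless defined $back && $back eq $data;
            1;
        };
        return 1 if $ok;

        warn "writing '$path' failed ($@); retrying in ${delay}s\n";
        sleep $delay;
        $delay *= 2;                     # exponential backoff
        $delay = 300 if $delay > 300;    # arbitrary cap
    }
}

write_with_retry('/mnt/remote/report.txt', "important output\n");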
 

Rainer Weikusat

Ben Morrow said:
I disagree.

~% grep mauzo /etc/paswd
grep: /etc/paswd: No such file or directory

That message would *not* be improved by including the file and line in
grep's source where the error happened to be discovered.

As I already wrote:

,----
|In C-code, I always include the name of the function where the error
|occurred. Granted, the default 'sudden death message suffix' is more
|verbose than I consider necessary, OTOH, it contains enough
|information to locate the operation that failed in the code and it is
|already built in.
`----
As opposed to the program falsely claiming it completed successfully?
Yes, absolutely.

I agree with that: Should the program print "I'm happy to inform you
that your task ran to completion" even though it didn't, the person
whose time got lost because of the lack of a sensible error handling
strategy will certainly be even more pissed than when the program just
prints "Sorry. Couldn't be bothered to check print return
values. Something failed".

To state this again: Either the application doesn't care for the
result of an I/O operation (and there are valid reasons for that, eg,
short-running, interactive tasks where any failure will be obvious to
the user). In this case, checking the return value of close is
pointless. Or the application does care for the results of I/O
operations. In this case, the hidden buffering facility whose
shortcomings are supposed to be plastered over by checking the return
value of close must not be used. "I'll just say 'Abracadabra' and firmly
believe everything's fine" doesn't work.
RTFS, of either perl or libc (or, you know, the manpages). fclose(3)
calls fflush(3), which will repeatedly call write(2) until the whole of
the remaining buffer has been successfully written,

Or so you hope. You cannot possibly have checked the source code of
each past, present and future hidden buffering facility and I
seriously doubt that you did check even a large minority of them, not
the least because the source code of many of them isn't available. And
- of course - since the kernel will usually provide another layer of
hidden buffering, this is still technically equivalent to "La la la. I
didn't hear a thing!".
Perl's PerlIO functions do exactly the same thing. stdio certainly has
some design bugs, but it wasn't designed by idiots.

This is not a matter of the design but of the implementation.
And there are - of course - people who think "Well, 0
is as good a number as any other number, so, in the case that write
really can't write anything, it is supposed to write as many bytes as
it can, namely, 0 bytes, and report this as successful write of size 0
to the caller" (I'm not making this up).

SUSv4 says:

| Where this volume of POSIX.1-2008 requires -1 to be returned and errno
| set to [EAGAIN], most historical implementations return zero [...].

so, while this behaviour is not strictly POSIX-conforming, a decently-
written application should handle that case correctly.

And that is:

These functions shall fail if:

[EAGAIN]
The O_NONBLOCK flag is set for the file descriptor and the thread
would be delayed in the write() operation.

Not applicable. In case of the 'write returning zero instead of
ENOSPC' issue (and this is a real implementation), the program will
loop forever in the flushing routine.
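
For comparison, a sketch of a manual flushing loop that refuses to spin
forever on a zero-byte write ($fh and $buf are assumed to come from
elsewhere; this is not how any particular libc implements it):

use strict;
use warnings;

# Write all of $buf to $fh with syswrite, treating a zero-byte result
# as a hard error instead of retrying forever.
sub write_all {
    my ($fh, $buf) = @_;
    my $offset = 0;

    while ($offset < length $buf) {
        my $written = syswrite($fh, $buf, length($buf) - $offset, $offset);
        die "write failed: $!"                    unless defined $written;
        die "write returned 0 bytes; giving up\n" if $written == 0;
        $offset += $written;
    }
    return $offset;
}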
 
