suggestions for printing out a few records of a lengthy file

ccc31807

The input is a flat file (pipe separated) with thousands of records
and tens of columns, similar to this. The first column is a unique
key.

42546|First|Middle|Last|Street|City|State|Zip|Country|Attr1|Attr2|Attr3 ...

The input is processed and the output consists of multi-page PDF
documents that combine the input file with other files. The other
files reference the unique key. I build a hash from the input file,
like this:

my %records;
while (<IN>)
{
    my ($key, $first, $middle, $last ...) = split /\|/;
    $records{$key} = {
        first  => $first,
        middle => $middle,
        last   => $last,
        ...
    };
}

Running this script results in thousands of PDF files. The client has
a need for individual documents, so I modified the script to accept a
unique key as a command-line argument; it still reads the input
file until it matches the key, creates one hash element for that
key, and exits, like this:

# in the while loop
if ($key == $command_line_argument) {
    # create hash element as above
    last;
}

The client now has a need to create a small number of documents. I
capture the unique keys in @ARGV, but I don't know the best way to
select just those records. I can pre-create the hash like this:

foreach my $key (@ARGV)
{
    $records{$key} = 1;
}

and in the while loop, doing this:

if (exists $records{$key})
{
    # create hash element as above
}

but this still reads through the entire input file.

Is there a better way?

Thanks, CC.
 
Uri Guttman

c> The client now has a need to create a small number of documents. I
c> capture the unique keys in @ARGV, but I don't know the best way to
c> select just those records. I can pre-create the hash like this:

c> foreach my $key (@ARGV)
c> {
c>     $records{$key} = 1;
c> }

c> and in the while loop, doing this:

c> if(exists $records{$key})
c> {
c> #create hash element as above
c> }

c> but this still reads through the entire input file.

use a real database or even a DBD for a csv (pipe separated is ok)
file.

or you could save a lot of space by reading in each row and only saving
that text line in a hash with the key (you extract only the key). then
you can locate the rows of interest, parse out the fields and do the
usual stuff.
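
a rough sketch of that (untested, and assuming the same IN filehandle and
pipe layout as your code):

my %lines;
while (my $line = <IN>) {
    chomp $line;
    my ($key) = split /\|/, $line, 2;   # pull out only the key
    $lines{$key} = $line;               # keep the raw line, nothing else
}

for my $key (@ARGV) {
    next unless exists $lines{$key};
    my (undef, $first, $middle, $last, @rest) = split /\|/, $lines{$key};
    # ... build the pdf for this record ...
}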

uri
 
ccc31807

use a real database or even a DBD for a csv (pipe separated is ok)
file.

We get the input file dumped on us every other month or so, as an
ASCII file, and use it just once to create the PDFs. We never do any
update, delete, or insert queries, and only a few select queries, so
putting it into an RDB just to print maybe two dozen documents out of
thousands seems like a lot of effort for very little benefit.

or you could save a lot of space by reading in each row and only saving
that text line in a hash with the key (you extract only the key). then
you can locate the rows of interest, parse out the fields and do the
usual stuff.

This is what I thought I was doing. However, it occurs to me that I
can use a counter initially set to the size of @ARGV, decrement it for
every match, and exit when the counter reaches zero.
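
A rough sketch of that counter idea, building on the exists() test above
(untested):

my $remaining = scalar @ARGV;           # one countdown per requested key

# in the while loop
if (exists $records{$key}) {
    # create hash element as above
    last if --$remaining == 0;          # all requested records found, stop reading
}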

Thanks for your response, CC.
 
Martijn Lievaart

The input is a flat file (pipe separated) with thousands of records and
tens of columns, similar to this. The first column is a unique key.
(snip)

The client now has a need to create a small number of documents. I
capture the unique keys in @ARGV, but I don't know the best way to
select just those records. I can pre-create the hash like this:

foreach my $key (@ARGV)
{
    $records{$key} = 1;
}

and in the while loop, doing this:

if(exists $records{$key})
{
#create hash element as above
}

but this still reads through the entire input file.

Is there a better way?

Better first ask yourself whether there really is a problem. "Thousands"
of records sounds to me like peanuts, and very small peanuts at that.

[martijn@cow t]$ time perl -ne '($x, $y, $z) = split; $h{$x}{y}=$y; $h{$x}{z}=$z' t.log

real 0m2.804s
user 0m2.750s
sys 0m0.043s
[martijn@cow t]$ wc -l t.log
670365 t.log
[martijn@cow t]$

YMMV, and the more you do in the loop the longer it takes. But still, the
seconds (at most!) you might shave off aren't worth your programmer time.

That said, there isn't a really good way to optimize it either. Only if
you do dozens of runs with the same input file might it make sense to
create an index file, put it in a database, or read it once and store
the hash with the Storable module for fast rereading. (And all those
solutions amount to the same thing: keep an index on disk.)
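
For the Storable route, a minimal sketch (the file name here is made up):

use Storable qw(store retrieve);

# one-time pass: after building %records from the flat file, freeze it to disk
store \%records, 'records.stor';

# later runs: skip the parsing and reload the whole hash in one call
my $records = retrieve 'records.stor';
print $records->{42546}{first}, "\n";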

M4
 
J. Gleixner

ccc31807 said:
We get the input file dumped on us every other month or so, as an
ASCII file, and use it just once to create the PDFs. We never do any
update, delete, or insert queries, and only a few select queries, so
putting it into an RDB just to print maybe two dozen documents out of
thousands seems like a lot of effort for very little benefit.


This is what I thought I was doing. However, it occurs to me that I
can use a counter initially set to the size of @ARGV, decrement it for
every match, and exit when the counter reaches zero.

No need for all that. You could create a hash of the keys passed in
via ARGV.

my %ids = map { $_ => 1 } @ARGV;

Then test if the key is one you're interested in:

while(<IN>)
{
my ($key, $first, $middle, $last ...) = split /\|/;
next unless $ids{ $key };
...
 
Dr.Ruud

J. Gleixner said:
You could create a hash of the keys passed in
via ARGV.

my %ids = map { $_ => 1 } @ARGV;

Then test if the key is one you're interested in:

while(<IN>)
{
my ($key, $first, $middle, $last ...) = split /\|/;
next unless $ids{ $key };
...

You could additionally test first whether the line starts with something
interesting. If the key is, for example, at least 3 characters long,
like: C<next unless $short{ substr $_, 0, 3 };>.

You can also (pre)process the file with a grep command.
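
A rough sketch of that prefilter (%ids is the same hash as in the
previous post; the 3-character prefix is only an example):

my %ids   = map { $_ => 1 } @ARGV;
my %short = map { substr($_, 0, 3) => 1 } @ARGV;

while (<IN>)
{
    next unless $short{ substr $_, 0, 3 };   # cheap reject before any split
    my ($key) = split /\|/, $_, 2;
    next unless $ids{ $key };                # exact check
    ...
}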
 
Uri Guttman

c> We get the input file dumped on us every other month or so, as an
c> ASCII file, and use it just once to create the PDFs. We never do any
c> update, delete, or insert queries, and only a few select queries, so
c> putting it into an RDB just to print maybe two dozen documents out of
c> thousands seems like a lot of effort for very little benefit.

c> This is what I thought I was doing. However, it occurs to me that I
c> can use a counter initially set to the size of @ARGV, decrement it for
c> every match, and exit when the counter reaches zero.

that would save some time, but an unknown amount, as you don't know where
in the file the needed keys are and one could be on the last line. if you
want to do it that
way, even simpler is to make a hash of the needed keys from @ARGV. then
when you see a line with that key, process it to a pdf and delete that
entry from the hash. when the hash is empty, exit.

this also could run to the end of the file but it won't ever store more
than one line at a time so it is ram efficient.
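
a rough sketch of that (untested):

my %needed = map { $_ => 1 } @ARGV;

while (my $line = <IN>) {
    my ($key) = split /\|/, $line, 2;
    next unless delete $needed{$key};   # skip unwanted lines; wanted keys are removed as found
    # ... parse $line fully and generate the pdf ...
    last unless %needed;                # every requested key has been seen, stop reading
}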

uri
 
Uri Guttman

JG> No need for all that. You could create a hash of the keys passed in
JG> via ARGV.

JG> my %ids = map { $_ => 1 } @ARGV;

JG> Then test if the key is one you're interested in:

JG> while(<IN>)
JG> {
JG> my ($key, $first, $middle, $last ...) = split /\|/;
JG> next unless $ids{ $key };
JG> ...

same idea i had but you didn't add in deleting found keys so you can
exit early.

also no need to do a full split on the line unless you know it was in
the hash. only split after you find a needed line. you can easily grab
the key from the front of each line as it comes in.
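
something like this inside the while loop (a sketch; %ids built from
@ARGV as above):

my $key = substr $_, 0, index($_, '|');   # text up to the first pipe
next unless exists $ids{$key};
my @fields = split /\|/;                  # full split only for wanted lines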

uri
 
ccc31807

Better first ask yourself whether there really is a problem. "Thousands"
of records sounds to me like peanuts, and very small peanuts at that.

You are right about that. Printing the PDFs takes far more time than
creating the hash in memory, and even if creating the full hash took
as much as a second it would be acceptable. My concern was really more
theoretical: why create a hash of some 50K elements when you only need
three?

YMMV, and the more you do in the loop the longer it takes. But still, the
seconds (at most!) you might shave off aren't worth your programmer time.

That's worth a smiley! I could just create the individual documents, as
I have a modified script that will do just that. Again, though, it
offends my sense of frugality.

That said, there isn't a really good way to optimize it either. Only if
you do dozens of runs with the same input file might it make sense to
create an index file, put it in a database, or read it once and store
the hash with the Storable module for fast rereading. (And all those
solutions amount to the same thing: keep an index on disk.)

Agreed. I don't have that much experience in development, and there
isn't a real functional need for optimization.

Again, thanks for your comments, CC.
 
Dr.Ruud

ccc31807 said:
Printing the PDFs takes far more time than
creating the hash in memory

On the related subject of creating nice PDFs:
we have been using WebKit for that for the past few years,
we create many, many thousands a day,
and we are very happy with the results.

WebKit interprets HTML with decent CSS support,
which makes it really easy to generate the source
from which the PDF will be created.
 
C.DeRykus

  JG> ccc31807 wrote:
  >>> use a real database or even a DBD for a csv (pipe separated is ok)
  >>> file.
  >>
  >> We get the input file dumped on us every other month or so, as an
  >> ASCII file, and use it just once to create the PDFs. We never do any
  >> update, delete, or insert queries, and only a few select queries, so
  >> putting it into an RDB just to print maybe two dozen documents out of
  >> thousands seems like a lot of effort for very little benefit.
  >>
  >>> or you could save a lot of space by reading in each row and only saving
  >>> that text line in a hash with the key (you extract only the key). then
  >>> you can locate the rows of interest, parse out the fields and do the
  >>> usual stuff.
  >>
  >> This is what I thought I was doing. However, it occurs to me that I
  >> can use a counter initially set to the size of @ARGV, decrement it for
  >> every match, and exit when the counter reaches zero.

  JG> No need for all that. You could create a hash of the keys passed in
  JG> via ARGV.

  JG> my %ids = map { $_ => 1 } @ARGV;

  JG> Then test if the key is one you're interested in:

  JG> while(<IN>)
  JG> {
  JG>   my ($key, $first, $middle, $last ...) = split /\|/;
  JG>   next unless $ids{ $key };
  JG>   ...

same idea i had but you didn't add in deleting found keys so you can
exit early.

also no need to do a full split on the line unless you know it was in
the hash. only split after you find a needed line. you can easily grab
the key from the front of each line as it comes in.

A match with \G and /gc is an alternative
to avoid a split() of the whole line:

while (<IN>)
{
    my ($key) = m{\G(\d+)\|}gc;
    next unless defined $key and $ids{ $key };
    my @rest = m{\G\|?([^|]+)}gc;   # continue from where the first match left off
    ...
}

A regex plus split() may not be as
efficient, but it's much easier:


while (<IN>)
{
    my ($key) = /^(\d+)/;
    next unless defined $key and $ids{ $key };
    my (undef, @rest) = split /\|/, $_;
    ...
}
 
Peter J. Holzer

On the related subject of creating nice PDFs:
we have been using WebKit for that for the past few years,
we create many, many thousands a day,
and we are very happy with the results.

Sounds interesting. Which perl module do you use (there are several on
CPAN, but the descriptions don't look promising)?

hp
 
Peter J. Holzer

Not a module, per se, but I've had success with wkhtmltopdf. See
http://code.google.com/p/wkhtmltopdf/ for more info.


Thanks, but after playing with it for a bit I found two problems:

1) It pretends to be a screen device, not a printing device (so for a
stylesheet which contains both @media print and @media screen sections
it chooses the wrong ones).
2) It sometimes makes a pagebreak in the middle of a line (so the upper
half of the line is on page 1 and the lower half of the line is on
page 2).

It looks like the tool renders the page the same way as a browser on
screen and then cuts the result into pages.

hp
 
Dr.Ruud

Peter said:
Thanks, but after playing with it for a bit I found two problems:

1) It pretends to be a screen device, not a printing device (so for a
stylesheet which contains both @media print and @media screen sections
it chooses the wrong ones).
2) It sometimes makes a pagebreak in the middle of a line (so the upper
half of the line is on page 1 and the lower half of the line is on
page 2).

It looks like the tool renders the page the same way as a browser on
screen and then cuts the result into pages.

This should help:
--print-media-type
"page-break-inside: avoid;"
http://www.smashingmagazine.com/2007/02/21/printing-the-web-solutions-and-techniques/
http://code.google.com/p/wkhtmltopdf/issues/detail?id=9
http://code.google.com/p/wkhtmltopdf/issues/detail?id=57
http://search.cpan.org/~tbr/WKHTMLTOPDF-0.02/lib/WKHTMLTOPDF.pm
 
Peter J. Holzer

This should help:
--print-media-type

That was the option I was looking for. I guess I didn't expect an option
which I consider extremely important (in fact, I think it should be the
default) to be hidden under "less common command switches".

"page-break-inside: avoid;"

I see that I wasn't clear enough what I meant with "a pagebreak in the
middle of a line", so some screenshots may help:

http://www.hjp.at/junk/ss-wkhtmltopdf1.png
http://www.hjp.at/junk/ss-wkhtmltopdf2.png

As you can see, the last line of the page is split *horizontally*
slightly above the baseline in both cases - the descenders appear at the
top of the next page. That's clearly a bug and not something
"page-break-inside: avoid;" is supposed to fix. "page-break-inside:
avoid;" avoids pagebreaks within an element, e.g. a paragraph, but that
isn't the problem here.


Nice collection of links, although I'm not sure why you mention them.

Yup, my problem number 2 is mentioned in comment 4 here. I already found
that before posting.

Different problem.

Ouch! My eyes! Couldn't he have named the thing WkHTMLtoPDF or
WkHtmlToPdf, or something? ;-).

hp
 
Dr.Ruud

Peter said:
That was the option I was looking for. I guess I didn't expect an option
which I consider extremely important (in fact, I think it should be the
default) to be hidden under "less common command switches".

Yes, I also don't understand why "they" did it like that; it makes it
all unnecessarily harder to understand.
But it all still works reasonably well; we create many thousands of
unique PDFs daily with it.

"page-break-inside: avoid;"

I see that I wasn't clear enough what I meant with "a pagebreak in the
middle of a line" [...]
the last line of the page is split *horizontally*
slightly above the baseline

That's what I understood, and I assumed that you could prevent that by
giving the element that attribute. BTW, the default page size is A4.


The manual says:

<quote>
Page Breaking

The current page breaking algorithm of WebKit leaves much to be
desired. Basically webkit will render everything into one long page,
and then cut it up into pages. This means that if you have two columns
of text where one is vertically shifted by half a line, then webkit
will cut a line into two pieces, displaying the top half on one page and
the bottom half on another page. It will also break images in two, and so
on. If you are using the patched version of Qt you can use the CSS
page-break-inside property to remedy this somewhat. There is no easy
solution to this problem; until it is solved, try organising your HTML
documents such that they contain many lines on which pages can be cut
cleanly.

See also:
<http://code.google.com/p/wkhtmltopdf/issues/detail?id=9>,
<http://code.google.com/p/wkhtmltopdf/issues/detail?id=33> and
<http://code.google.com/p/wkhtmltopdf/issues/detail?id=57>.
</quote>

Fonts (and Qt's QPrinter::ScreenResolution) also can cause issues:
http://code.google.com/p/wkhtmltopdf/issues/detail?id=72
 
