Appropriate technique for altering a text file?

ccc31807

During the discussion of the 9-11 mosque in NYC, several commentators
mentioned Milestones
by Sayed Qutb. I decided to read it to see what the fuss was about,
and ended up with an ASCII text copy generated from a PDF original.

I could have printed the text directly, but it was pretty mangled, and
after attempting and failing to reformat the document using vi, I
decided to write a simple Perl script to reformat it. I wanted to do
several things: join paragraphs together (every line in the file was
terminated by a "\n"), separate paragraphs by a blank line (block
style), remove repeated periods (dots), remove form feeds (which
marked pages in the original), etc.

I first attempted to munge the file line by line, like this:
#FIRST ATTEMPT
open MS, '<', $file;
open OUT, '>', $out;
while (<MS>)
{
#do stuff
print OUT;
}
close MS;
close OUT;

It mostly worked, but I couldn't fine tune it. I then attempted to
munge two lines together, like this:
#SECOND ATTEMPT
open MS, '<', $file;
open OUT, '>', $out;
$line1 = <MS>;
while (<MS>)
{
$line2 = $_;
#do stuff
print OUT;
$line1 = $line2;
}
close MS;
close OUT;

This worked a little better, but it wasn't perfect. I then tried this
and got perfect formatting:
#THIRD ATTEMPT
{
local $/ = undef;
open MS, '<', $file;
$document = <MS>;
close MS;
}
#series of transformations like this
$document =~ s/\r//g;
open OUT, '>', $out;
print OUT $document;
close OUT;
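
For illustration, the kinds of transformations listed at the top of this
post could look roughly like the sketch below in slurp mode. The specific
regexes, and the assumption that blank lines marked paragraph breaks in
the extracted text, are guesses rather than the original script.

#SKETCH ONLY -- assumed regexes, not the original script
use strict;
use warnings;

my ($file, $out) = @ARGV;

my $document = do {
    local $/;                           # undef $/ = slurp the whole file
    open my $ms, '<', $file or die "open $file: $!";
    <$ms>;
};

$document =~ s/\r//g;                   # strip carriage returns
$document =~ s/\f//g;                   # strip form feeds (page markers)
$document =~ s/\.{2,}/./g;              # collapse runs of repeated dots
$document =~ s/(?<!\n)\n(?!\n)/ /g;     # join wrapped lines inside a paragraph
$document =~ s/\n{3,}/\n\n/g;           # one blank line between paragraphs

open my $o, '>', $out or die "open $out: $!";
print {$o} $document;
close $o;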

All of the work I have done in the past has munged the lines one by
one, as in the first example. Occasionally, I have had to use the
second style (e.g., where the formatting of each line depends on the
content of the preceding line.) I've never used the third style at
all.

I liked the third way a lot. It seemed quick, easy, and worked
perfectly. I was actually able to open the resulting document in Word,
fancify it a little, and print a nice finished copy. However, I can't
think of any actual uses of the third style in my day to day work.

My question is this: Is the third attempt, slurping the entire
document into memory and transforming the text by regexs, very common,
or is it considered a last resort when nothing else would work?

CC.
 
Uri Guttman

c> This worked a little better, but it wasn't perfect. I then tried this
c> and got perfect formatting:
c> #THIRD ATTEMPT
c> {
c> local $/ = undef;
c> open MS, '<', $file;
c> $document = <MS>;
c> close MS;
c> }

c> All of the work I have done in the past has munged the lines one by
c> one, as in the first example. Occasionally, I have had to use the
c> second style (e.g., where the formatting of each line depends on the
c> content of the preceding line.) I've never used the third style at
c> all.

it isn't as common as it should be IMNSHO. in the old days reading files
line by line was almost required due to small memory machines. today,
megabyte files can be slurped without fear at all but line by line is
still taught as standard. it takes time to change views.

c> I liked the third way a lot. It seemed quick, easy, and worked
c> perfectly. I was actually able to open the resulting document in
c> Word, fancify it a little, and print a nice finished copy. However,
c> I can't think of any actual uses of the third style in my day to
c> day work.

parsing and text munging is much easier when the entire file is in
ram. there is no need to mix i/o with logic, the i/o is much faster, you
can send/receive whole documents to servers (which could format things
or whatever), etc. slurping whole files makes a lot of sense in many
areas.

c> My question is this: Is the third attempt, slurping the entire
c> document into memory and transforming the text by regexs, very common,
c> or is it considered a last resort when nothing else would work?

it is not a last resort by any imagination today. and use File::Slurp
instead for both reading and writing the file. it is cleaner and faster
than the methods you used.
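
for reference, a minimal sketch of the File::Slurp read/transform/write
pattern being suggested here; the variable names and the example
substitution are assumptions, not code from the original poster:

use strict;
use warnings;
use File::Slurp qw(read_file write_file);

my ($file, $out) = @ARGV;            # input and output paths

my $document = read_file($file);     # whole file as one scalar

$document =~ s/\r//g;                # ... whatever transformations you need ...

write_file($out, $document);         # write the munged document back out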

uri
 
Peter J. Holzer

[ 3 ways of munging a text file: line by line, pairs of lines,
and whole file at once
]
> I liked the third way a lot. It seemed quick, easy, and worked
> perfectly. I was actually able to open the resulting document in Word,
> fancify it a little, and print a nice finished copy. However, I can't
> think of any actual uses of the third style in my day to day work.
>
> My question is this: Is the third attempt, slurping the entire
> document into memory and transforming the text by regexs, very common,
> or is it considered a last resort when nothing else would work?

Uri would probably tell you that's what you always should do unless the
file is too big to fit into memory (and you should use File::Slurp for
it) :).

I do whatever allows the most straightforward implementation. Very
often that means reading the whole data into memory, although not
necessarily as a single scalar.

hp
 
ccc31807

> parsing and text munging is much easier when the entire file is in
> ram. there is no need to mix i/o with logic, the i/o is much faster, you
> can send/receive whole documents to servers (which could format things
> or whatever), etc. slurping whole files makes a lot of sense in many
> areas.

Most of what I do requires me to treat each record as a separate
'document.' In many cases, this even extends to the output, where one
input document results in hundreds of separate output documents, each
of which must be opened, written to, and closed.

I'm not being difficult (or maybe I am) but I'm having a hard time
seeing how this kind of logic which treats each record separately:

while (<IN>)
{
chomp;
my ($var1, $var2, ... $varn) = split;
#do stuff
print OUT qq("$field1","$field2",..."$fieldn"\n);
}

or this:

foreach my $key (sort keys %{$hashref})
{
#do stuff using $hashref{$key}{var1}, $hashref{$key}{var2}, etc.
print OUT qq("$field1","$field2",..."$fieldn"\n);
}

could be made easier by dealing with the entire file at once.

Okay, this is the first time I have had to treat a single file as a
unit, and to be honest the experience was positive. Still, my
worldview consists of record oriented datasets, so I put this in my
nice-to-know-but-not-particularly-useful category.

CC.
 
sln

> My question is this: Is the third attempt, slurping the entire
> document into memory and transforming the text by regexs, very common,
> or is it considered a last resort when nothing else would work?

The answer is no to slurping, and no to using regexes on large
documents that don't need to be all in memory.

There is usually a single drive (say raid). Only one
i/o operation is performed at a time. If hogged, the
other processes will wait until the hog is done and their
i/o is dequeued, done and returned.
The speeds of modern sata2, raid configured drives work well
when reading/writing incremental data; it should always be
used this way on large data that can be worked on incrementally.
The default buffer on read between the api and the device is usually
small, so as not to clog up device i/o and spin locks. So it's still
going to be incremental.

A complex regex will perform more backtracking on large
data than on smaller data. So it depends on the type and complexity.

The third reason is always memory. Sure, there is a lot of memory,
but hogging it all bogs down background file caching and other processing.
 
Xho Jingleheimerschmidt

ccc31807 wrote:
....
> All of the work I have done in the past has munged the lines one by
> one, as in the first example. Occasionally, I have had to use the
> second style (e.g., where the formatting of each line depends on the
> content of the preceding line.) I've never used the third style at
> all.
>
> I liked the third way a lot. It seemed quick, easy, and worked
> perfectly. I was actually able to open the resulting document in Word,
> fancify it a little, and print a nice finished copy. However, I can't
> think of any actual uses of the third style in my day to day work.
>
> My question is this: Is the third attempt, slurping the entire
> document into memory and transforming the text by regexs, very common,

I use the first method, line by line, if the lines are logically
independent (the most common case), or usually if the dependence is
simple and entirely backwards. I use method 3, slurping (either into a
scalar or an array) otherwise. I only use method 2, keeping a look-back
or ring buffer, if the file were so large or had the potential to become
so large that slurping could threaten my memory.

> or is it considered a last resort when nothing else would work?

No, it is the middle method that I consider a last resort.

Xho
 
Uri Guttman

c> Most of what I do requires me to treat each record as a separate
c> 'document.' In many cases, this even extends to the output, where
c> one input document results in hundreds of separate output
c> documents, each of which must be opened, written to, and closed.

it doesn't make a difference what you mostly do. it matters how to best
solve this problem. don't use the same technique to solve all
problems. hammers don't work well with screws.

c> I'm not being difficult (or maybe I am) but I'm having a hard time
c> seeing how this kind of logic which treats each record separately:

c> while (<IN>)
c> {
c> chomp;
c> my ($var1, $var2, ... $varn) = split;
c> #do stuff
c> print OUT qq("$field1","$field2",..."$fieldn"\n);
c> }

if that is fine, then use it. speed can be an issue, state of line to
line data can be an issue, parsing multiline things can be an issue.

c> or this:

c> foreach my $key (sort keys %{$hashref})
c> {
c> #do stuff using $hashref{$key}{var1}, $hashref{$key}{var2}, etc.
c> print OUT qq("$field1","$field2",..."$fieldn"\n);
c> }

that has nothing to do with line by line vs slurping. also why are you
quoting scalars even in pseudo code? it ends with a newline outside a
string too.

c> could be made easier by dealing with the entire file at once.

again it depends on the problem. try to parse a multiline structure line
by line vs slurping. it is much easier to do a single regex on the whole
file (in /g mode usually) and grab the structure then parse that. the
line by line method needs state (possibly using the .. op which i have
done plenty in this style), a variable to hold the stuff, a more complex
loop, etc. slurp style is just so much cleaner.
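
a rough sketch of that contrast, for reference; the BEGIN/END record
format here is invented for the example, not taken from any real file:

use strict;
use warnings;

my $file = shift @ARGV;

# slurped: one /g regex grabs each multiline record in a single pass
my $document = do {
    local $/;
    open my $fh, '<', $file or die "open $file: $!";
    <$fh>;
};

while ( $document =~ /^BEGIN\n(.*?)^END\n/smg ) {
    my $body = $1;                      # everything between the delimiters
    # ... parse $body ...
}

# line by line: needs the .. (flip-flop) op, a buffer and a fancier loop
open my $fh2, '<', $file or die "open $file: $!";
my $record = '';
while ( my $line = <$fh2> ) {
    if ( my $range = ( $line =~ /^BEGIN$/ .. $line =~ /^END$/ ) ) {
        next if $range == 1;            # skip the BEGIN line itself
        if ( $range =~ /E0$/ ) {        # the END line closes the record
            # ... parse $record ...
            $record = '';
        }
        else {
            $record .= $line;           # accumulate the record body
        }
    }
}
close $fh2;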

c> Okay, this is the first time I have had to treat a single file as a
c> unit, and to be honest the experience was positive. Still, my
c> worldview consists of record oriented datasets, so I put this in my
c> nice-to-know-but-not-particularly-useful category.

i would make it a known and very useful when needed tool. it is not how
you think but you just don't have experience seeing problems that are
better slurped. many things work fine line by line but just as many work
better slurping.

uri
 
Martijn Lievaart

> The answer is no to slurping, and no to using regexes on large documents
> that don't need to be all in memory.
>
> There is usually a single drive (say raid). Only one i/o operation is
> performed at a time. If hogged, the other processes will wait until the
> hog is done and their i/o is dequeued, done and returned.
> The speeds of modern sata2, raid configured drives work well when
> reading/writing incremental data; it should always be used this way on
> large data that can be worked on incrementally. The default buffer on
> read between the api and the device is usually small, so as not to clog
> up device i/o and spin locks. So it's still going to be incremental.

Utter BS. Doing incremental reads under load will result in a lot of
seeking and so degrades performance. Slurping the file is
much more efficient.

> A complex regex will perform more backtracking on large data than on
> smaller data. So it depends on the type and complexity.

True, but with modern fast machines the trade-off between programmer time
and computer time more often falls in favor of using more machine time.
Only when it proves too slow should you optimize.

> The third reason is always memory. Sure, there is a lot of memory, but
> hogging it all bogs down background file caching and other processing.

Also true, but text files are often much smaller than memory. However,
this is the only thing you really have to think about up front.

M4
 
Dr.Ruud

Uri said:
> cartercc:
>> print OUT qq("$field1","$field2",..."$fieldn"\n);
>
> [...] why are you
> quoting scalars even in pseudo code? it ends with a newline outside a
> string too.

The qq() seems to be there to output the dquotes and the newline.
The ellipsis looks weird though.
 
Uri Guttman

R> Uri Guttman said:
R> > cartercc:
R> >> print OUT qq("$field1","$field2",..."$fieldn"\n);
R> >
R> > [...] why are you
R> > quoting scalars even in pseudo code? it ends with a newline outside a
R> > string too.

R> The qq() seems to be there to output the dquotes and the newline.
R> The ellipsis looks weird though.

i didn't see the qq but it still looks odd. as i said it is more likely
pseudo code as there were no vars named $field1 and if there were, they
should be an array.

uri
 
ccc31807

> also why are you
> quoting scalars even in pseudo code? it ends with a newline outside a
> string too.

Because I want my output file to look like this:
"George","Washington","1788"\n
"George","Washington","1792"\n
"John","Adams","1796"\n
etc.

> again it depends on the problem. try to parse a multiline structure line
> by line vs slurping. it is much easier to do a single regex on the whole
> file (in /g mode usually) and grab the structure then parse that. the
> line by line method needs state (possibly using the .. op which i have
> done plenty in this style), a variable to hold the stuff, a more complex
> loop, etc. slurp style is just so much cleaner.

AFAICR, this is the first time that I have parsed a file with a
multiline structure, and it was a personal rather than a business
purpose.

I see how one can say that the slurp style is cleaner, but as you say,
it depends on the job.

> i would make it a known and very useful when needed tool. it is not how
> you think but you just don't have experience seeing problems that are
> better slurped. many things work fine line by line but just as many work
> better slurping.

I had an experience several weeks ago. The short version is that I
solved a problem using Perl in less than an hour that some pretty high
powered folks spent two weeks trying to solve with SQL. They asked me
to help because I've had a lot of experience with SQL, and they just
couldn't get the query to work.

What I saw (and I admit that it took me a day to see it -- sometimes
I'm pretty dense) was that the problem wasn't a query problem but a
typical munging problem. The reason SQL didn't work was because the
problem was a screw and they were trying to use a hammer. When I used
a screwdriver, it worked out very nicely.

The downside ;-) was that I had to confess that I used Perl instead of
a database, which caused some heartburn, because EVERYBODY KNOWS that
database technology (using Access) is far, far superior to the grungy
old technology that became obsolete during the last century.

I've been writing some Lisp, and sooner or later I'll solve a problem
with Lisp, and then maybe they'll be more accepting of Perl.

CC.
 
Uri Guttman

c> Because I want my output file to look like this:
c> "George","Washington","1788"\n
c> "George","Washington","1792"\n
c> "John","Adams","1796"\n

then use a CSV module to make sure you do it correctly. or a map join
print line like this:

print join( ',', map qq{"$_"}, @fields ), "\n" ;

that way you can use an array and not need single scalars named $field1
etc. anytime you see number suffixes on scalars, think array or think
the coder is a moron! :)

c> AFAICR, this is the first time that I have parsed a file with a
c> multiline structure, and it was a personal rather than a business
c> purpose.

then you haven't done much parsing so far. plenty of files have
multiline formats.

c> I've been writing some Lisp, and sooner or later I'll solve a
c> problem with Lisp, and then maybe they'll be more accepting of
c> Perl.

a strange way of getting perl accepted! :)

uri
 
Martijn Lievaart

> You underestimate your OS.

No, actually I don't. Obviously, a lot will be buffered, maybe a page,
maybe a cylinder's worth of data. But once you read more than that,
incremental reading greatly increases the chance (we're talking about
a heavily loaded machine) that some other process needs the disk in the
meantime, leading to extra seeks.

In fact, many OSes are very good at optimizing seeks, so if you read
fast enough, the OS may recognize that the heads are in the right
position and avoid some seeks by giving your next read priority.

M4
 
Jürgen Exner

Martijn Lievaart said:
> No, actually I don't. Obviously, a lot will be buffered, maybe a page,
> maybe a cylinder's worth of data. But once you read more than that,
> incremental reading greatly increases the chance (we're talking about
> a heavily loaded machine) that some other process needs the disk in the
> meantime, leading to extra seeks.

If your machines are so loaded that this becomes relevant then you
should have been looking for other storage solutions like RAID0 or SAN
or solid state disks for a while.

jue
 
Peter J. Holzer

c> Because I want my output file to look like this:
c> "George","Washington","1788"\n
c> "George","Washington","1792"\n
c> "John","Adams","1796"\n

> then use a CSV module to make sure you do it correctly. or a map join
> print line like this:
>
> print join( ',', map qq{"$_"}, @fields ), "\n" ;

I took $field1, $field2, etc. to be pseudocode. In reality the variables
were probably called $firstname, $lastname, $year, etc. (of course with
Carter you can never tell). I would probably have used something like

print join( ',', map quote($_), $firstname, $lastname, $year ), "\n" ;

to encapsulate any quoting that might be necessary (what happens when
one of the variables contains double quotes?). I'm ambivalent about CSV
modules. There are so many different CSV formats, but each of them is
very simple: so you need a lot of time to figure out which CSV format is
needed, and once you have done so, it takes more time to figure out the
parameters for your CSV module than just coding it in plain Perl (I
still do use CSV_XS for speed - it's simply much faster than any parser I
could write in Perl. But that's a secondary decision: first I write it
in Perl to get it correct. If it's too slow I use CSV_XS to get it fast).
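
A sketch of one possible quote() along those lines; the quote-doubling
convention is an assumption about the target CSV dialect, not something
prescribed above:

use strict;
use warnings;

sub quote {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/"/""/g;                # escape embedded quotes by doubling them
    return qq{"$value"};
}

my ($firstname, $lastname, $year) = ('George', 'Washington', 1788);
print join( ',', map quote($_), $firstname, $lastname, $year ), "\n";
# prints: "George","Washington","1788"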

hp
 
ccc31807

> print line like this:
>
>         print join( ',', map qq{"$_"}, @fields ), "\n" ;

Thank you. I don't use grep() or map() much ... tidbits like this make
reading c.l.p.m. worthwhile.

> then you haven't done much parsing so far. plenty of files have
> multiline formats.

Not when your data results from a database, which mine does. I'm not a
programmer, but a database guy who writes scripts to munge data which
is read from a database, written to a database, or both.
  c> I've been writing some Lisp, and sooner or later I'll solve a
  c> problem with Lisp, and then maybe they'll be more accepting of
  c> Perl.

> a strange way of getting perl accepted! :)

I once took a graduate level management course in the business school
of a large, state university. Many of the students were essentially
fifth year seniors, but many were junior to mid level management a
decade or so out of school. I was the only SwE guy in the class, and
received abuse from those in management who had had bad experiences
from IT or their SW developers. The accepted Standard was Visual
Basic, because Everybody used Visual Basic, and if you didn't you
couldn't do any Work. The textbook (which was actually a good book - I
kept it and still use it from time to time) was written using VB. The
fact that I used Perl was seen as a sign of the backwardness and
intractability of IT, as opposed to the rational, mature, and forward
looking nature of Management.

CC
 
