Appropriate technique for altering a text file?

ccc31807

During the discussion of the 9-11 mosque in NYC, several commentators
mentioned Milestones
by Sayed Qutb. I decided to read it to see what the fuss was about,
and ended up with an ASCII text copy generated from a PDF original.

I could have printed the text directly, but it was pretty mangled, and
after attempting and failing to reformat the document using vi, I
decided to write a simple Perl script to reformat it. I wanted to do
several things: join paragraphs together (every line in the file was
terminated by a "\n"), separate paragraphs by a blank line (block
style), remove repeated periods (dots), remove form feeds (which
marked pages in the original), etc.

I first attempted to munge the file line by line, like this:
#FIRST ATTEMPT
open MS, '<', $file;
open OUT, '>', $out;
while (<MS>)
{
#do stuff
print OUT;
}
close MS;
close OUT;

It mostly worked, but I couldn't fine tune it. I then attempted to
munge two lines together, like this:
#SECOND ATTEMPT
open MS, '<', $file;
open OUT, '>', $out;
$line1 = <MS>;
while (<MS>)
{
$line2 = $_;
#do stuff
print OUT;
$line1 = $line2;
}
close MS;
close OUT;

This worked a little better, but it wasn't perfect. I then tried this
and got perfect formatting:
#THIRD ATTEMPT
{
local $/ = undef;
open MS, '<', $file;
$document = <MS>;
close MS;
}
#series of transformations like this
$document =~ s/\r//g;
open OUT, '>', $out;
print OUT $document;
close OUT;
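
For illustration, the kinds of transformations listed at the top of this
post could look roughly like the sketch below in slurp mode. The specific
regexes, and the assumption that blank lines marked paragraph breaks in
the extracted text, are guesses rather than the original script.

#SKETCH ONLY -- assumed regexes, not the original script
use strict;
use warnings;

my ($file, $out) = @ARGV;

my $document = do {
    local $/;                           # undef $/ = slurp the whole file
    open my $ms, '<', $file or die "open $file: $!";
    <$ms>;
};

$document =~ s/\r//g;                   # strip carriage returns
$document =~ s/\f//g;                   # strip form feeds (page markers)
$document =~ s/\.{2,}/./g;              # collapse runs of repeated dots
$document =~ s/(?<!\n)\n(?!\n)/ /g;     # join wrapped lines inside a paragraph
$document =~ s/\n{3,}/\n\n/g;           # one blank line between paragraphs

open my $o, '>', $out or die "open $out: $!";
print {$o} $document;
close $o;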

All of the work I have done in the past has munged the lines one by
one, as in the first example. Occasionally, I have had to use the
second style (e.g., where the formatting of each line depends on the
content of the preceding line.) I've never used the third style at
all.

I liked the third way a lot. It seemed quick, easy, and worked
perfectly. I was actually able to open the resulting document in Word,
fancify it a little, and print a nice finished copy. However, I can't
think of any actual uses of the third style in my day to day work.

My question is this: Is the third attempt, slurping the entire
document into memory and transforming the text by regexs, very common,
or is it considered a last resort when nothing else would work?

CC.
 
Uri Guttman

c> This worked a little better, but it wasn't perfect. I then tried this
c> and got perfect formatting:
c> #THIRD ATTEMPT
c> {
c> local $/ = undef;
c> open MS, '<', $file;
c> $document = <MS>;
c> close MS;
c> }

c> All of the work I have done in the past has munged the lines one by
c> one, as in the first example. Occasionally, I have had to use the
c> second style (e.g., where the formatting of each line depends on the
c> content of the preceding line.) I've never used the third style at
c> all.

it isn't as common as it should be IMNSHO. in the old days reading files
line by line was almost required due to small memory machines. today,
megabyte files can be slurped without fear at all but line by line is
still taught as standard. it takes time to change views.

c> I liked the third way a lot. It seemed quick, easy, and worked
c> perfectly. I was actually able to open the resulting document in
c> Word, fancify it a little, and print a nice finished copy. However,
c> I can't think of any actual uses of the third style in my day to
c> day work.

parsing and text munging is much easier when the entire file is in
ram. there is no need to mix i/o with logic, the i/o is much faster, you
can send/receive whole documents to servers (which could format things
or whatever), etc. slurping whole files makes a lot of sense in many
areas.

c> My question is this: Is the third attempt, slurping the entire
c> document into memory and transforming the text by regexs, very common,
c> or is it considered a last resort when nothing else would work?

it is not a last resort by any imagination today. and use File::Slurp
instead for both reading and writing the file. it is cleaner and faster
than the methods you used.
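
for reference, a minimal sketch of the File::Slurp read/transform/write
pattern being suggested here; the variable names and the example
substitution are assumptions, not code from the original poster:

use strict;
use warnings;
use File::Slurp qw(read_file write_file);

my ($file, $out) = @ARGV;            # input and output paths

my $document = read_file($file);     # whole file as one scalar

$document =~ s/\r//g;                # ... whatever transformations you need ...

write_file($out, $document);         # write the munged document back out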

uri
 
Peter J. Holzer

[ 3 ways of munging a text file: line by line, pairs of lines,
and whole file at once
]
> I liked the third way a lot. It seemed quick, easy, and worked
> perfectly. I was actually able to open the resulting document in Word,
> fancify it a little, and print a nice finished copy. However, I can't
> think of any actual uses of the third style in my day to day work.
>
> My question is this: Is the third attempt, slurping the entire
> document into memory and transforming the text by regexs, very common,
> or is it considered a last resort when nothing else would work?

Uri would probably tell you that's what you always should do unless the
file is too big to fit into memory (and you should use File::Slurp for
it) :).

I do whatever allows the most straightforward implementation. Very
often that means reading the whole data into memory, although not
necessarily as a single scalar.

hp
 
ccc31807

> parsing and text munging is much easier when the entire file is in
> ram. there is no need to mix i/o with logic, the i/o is much faster, you
> can send/receive whole documents to servers (which could format things
> or whatever), etc. slurping whole files makes a lot of sense in many
> areas.

Most of what I do requires me to treat each record as a separate
'document.' In many cases, this even extends to the output, where one
input document results in hundreds of separate output documents, each
of which must be opened, written to, and closed.

I'm not being difficult (or maybe I am) but I'm having a hard time
seeing how this kind of logic which treats each record separately:

while (<IN>)
{
chomp;
my ($var1, $var2, ... $varn) = split;
#do stuff
print OUT qq("$field1","$field2",..."$fieldn"\n);
}

or this:

foreach my $key (sort keys %{$hashref})
{
#do stuff using $hashref{$key}{var1}, $hashref{$key}{var2}, etc.
print OUT qq("$field1","$field2",..."$fieldn"\n);
}

could be made easier by dealing with the entire file at once.

Okay, this is the first time I have had to treat a single file as a
unit, and to be honest the experience was positive. Still, my
worldview consists of record oriented datasets, so I put this in my
nice-to-know-but-not-particularly-useful category.

CC.
 
sln

> My question is this: Is the third attempt, slurping the entire
> document into memory and transforming the text by regexs, very common,
> or is it considered a last resort when nothing else would work?

The answer is no to slurping, and no to using regexes on large
documents that don't need to be all in memory.

There is usually a single drive (say raid). Only one
i/o operation is performed at a time. If hogged, the
other processes will wait until the hog is done and their
i/o is dequeued, done and returned.
The speeds of modern sata2, raid configured drives work well
when reading/writing incremental data; it should always be
used this way on large data that can be worked on incrementally.
The default buffer on read between the api and the device is usually
small, so as not to clog up device i/o and spin locks. So it's still
going to be incremental.

A complex regex will perform more backtracking on large
data than on smaller data. So it depends on the type and complexity.

The third reason is always memory. Sure, there is a lot of memory,
but hogging it all bogs down background file caching and other processing.
 
Xho Jingleheimerschmidt

ccc31807 wrote:
....
> All of the work I have done in the past has munged the lines one by
> one, as in the first example. Occasionally, I have had to use the
> second style (e.g., where the formatting of each line depends on the
> content of the preceding line.) I've never used the third style at
> all.
>
> I liked the third way a lot. It seemed quick, easy, and worked
> perfectly. I was actually able to open the resulting document in Word,
> fancify it a little, and print a nice finished copy. However, I can't
> think of any actual uses of the third style in my day to day work.
>
> My question is this: Is the third attempt, slurping the entire
> document into memory and transforming the text by regexs, very common,

I use the first method, line by line, if the lines are logically
independent (the most common case), or usually if the dependence is
simple and entirely backwards. I use method 3, slurping (either into a
scalar or an array) otherwise. I only use method 2, keeping a look-back
or ring buffer, if the file were so large or had the potential to become
so large that slurping could threaten my memory.

> or is it considered a last resort when nothing else would work?

No, it is the middle method that I consider a last resort.

Xho
 
Uri Guttman

c> Most of what I do requires me to treat each record as a separate
c> 'document.' In many cases, this even extends to the output, where
c> one input document results in hundreds of separate output
c> documents, each of which must be opened, written to, and closed.

it doesn't make a difference what you mostly do. it matters how to best
solve this problem. don't use the same technique to solve all
problems. hammers don't work well with screws.

c> I'm not being difficult (or maybe I am) but I'm having a hard time
c> seeing how this kind of logic which treats each record separately:

c> while (<IN>)
c> {
c> chomp;
c> my ($var1, $var2, ... $varn) = split;
c> #do stuff
c> print OUT qq("$field1","$field2",..."$fieldn"\n);
c> }

if that is fine, then use it. speed can be an issue, state of line to
line data can be an issue, parsing multiline things can be an issue.

c> or this:

c> foreach my $key (sort keys %{$hashref})
c> {
c> #do stuff using $hashref{$key}{var1}, $hashref{$key}{var2}, etc.
c> print OUT qq("$field1","$field2",..."$fieldn"\n);
c> }

that has nothing to do with line by line vs slurping. also why are you
quoting scalars even in pseudo code? it ends with a newline outside a
string too.

c> could be made easier by dealing with the entire file at once.

again it depends on the problem. try to parse a multiline structure line
by line vs slurping. it is much easier to do a single regex on the whole
file (in /g mode usually) and grab the structure then parse that. the
line by line method needs state (possibly using the .. op which i have
done plenty in this style), a variable to hold the stuff, a more complex
loop, etc. slurp style is just so much cleaner.
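
a rough sketch of that contrast, for reference; the BEGIN/END record
format here is invented for the example, not taken from any real file:

use strict;
use warnings;

my $file = shift @ARGV;

# slurped: one /g regex grabs each multiline record in a single pass
my $document = do {
    local $/;
    open my $fh, '<', $file or die "open $file: $!";
    <$fh>;
};

while ( $document =~ /^BEGIN\n(.*?)^END\n/smg ) {
    my $body = $1;                      # everything between the delimiters
    # ... parse $body ...
}

# line by line: needs the .. (flip-flop) op, a buffer and a fancier loop
open my $fh2, '<', $file or die "open $file: $!";
my $record = '';
while ( my $line = <$fh2> ) {
    if ( my $range = ( $line =~ /^BEGIN$/ .. $line =~ /^END$/ ) ) {
        next if $range == 1;            # skip the BEGIN line itself
        if ( $range =~ /E0$/ ) {        # the END line closes the record
            # ... parse $record ...
            $record = '';
        }
        else {
            $record .= $line;           # accumulate the record body
        }
    }
}
close $fh2;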

c> Okay, this is the first time I have had to treat a single file as a
c> unit, and to be honest the experience was positive. Still, my
c> worldview consists of record oriented datasets, so I put this in my
c> nice-to-know-but-not-particularly-useful category.

i would make it a known and very useful when needed tool. it is not how
you think but you just don't have experience seeing problems that are
better slurped. many things work fine line by line but just as many work
better slurping.

uri
 
Martijn Lievaart

> The answer is no to slurping, and no to using regexes on large documents
> that don't need to be all in memory.
>
> There is usually a single drive (say raid). Only one i/o operation is
> performed at a time. If hogged, the other processes will wait until the
> hog is done and their i/o is dequeued, done and returned.
> The speeds of modern sata2, raid configured drives work well when
> reading/writing incremental data; it should always be used this way on
> large data that can be worked on incrementally. The default buffer on
> read between the api and the device is usually small, so as not to clog
> up device i/o and spin locks. So it's still going to be incremental.

Utter BS. Doing incremental reads under load will result in a lot of
seeking and so degrades performance. Slurping the file is
much more efficient.

> A complex regex will perform more backtracking on large data than on
> smaller data. So it depends on the type and complexity.

True, but with modern fast machines the trade-off between programmer time
and computer time more often falls in favor of using more machine time.
Only when it proves too slow should you optimize.

> The third reason is always memory. Sure, there is a lot of memory, but
> hogging it all bogs down background file caching and other processing.

Also true, but text files are often much smaller than memory. However,
this is the only thing you really have to think about up front.

M4
 
Dr.Ruud

Uri said:
> cartercc:
>> print OUT qq("$field1","$field2",..."$fieldn"\n);
>
> [...] why are you
> quoting scalars even in pseudo code? it ends with a newline outside a
> string too.

The qq() seems to be there to output the dquotes and the newline.
The ellipsis looks weird though.
 
Uri Guttman

R> Uri Guttman said:
R> > cartercc:
R> >> print OUT qq("$field1","$field2",..."$fieldn"\n);
R> >
R> > [...] why are you
R> > quoting scalars even in pseudo code? it ends with a newline outside a
R> > string too.

R> The qq() seems to be there to output the dquotes and the newline.
R> The ellipsis looks weird though.

i didn't see the qq but it still looks odd. as i said it is more likely
pseudo code as there were no vars named $field1 and if there were, they
should be an array.

uri
 
ccc31807

> also why are you
> quoting scalars even in pseudo code? it ends with a newline outside a
> string too.

Because I want my output file to look like this:
"George","Washington","1788"\n
"George","Washington","1792"\n
"John","Adams","1796"\n
etc.

> again it depends on the problem. try to parse a multiline structure line
> by line vs slurping. it is much easier to do a single regex on the whole
> file (in /g mode usually) and grab the structure then parse that. the
> line by line method needs state (possibly using the .. op which i have
> done plenty in this style), a variable to hold the stuff, a more complex
> loop, etc. slurp style is just so much cleaner.

AFAICR, this is the first time that I have parsed a file with a
multiline structure, and it was a personal rather than a business
purpose.

I see how one can say that the slurp style is cleaner, but as you say,
it depends on the job.

> i would make it a known and very useful when needed tool. it is not how
> you think but you just don't have experience seeing problems that are
> better slurped. many things work fine line by line but just as many work
> better slurping.

I had an experience several weeks ago. The short version is that I
solved a problem using Perl in less than an hour that some pretty high
powered folks spent two weeks trying to solve with SQL. They asked me
to help because I've had a lot of experience with SQL, and they just
couldn't get the query to work.

What I saw (and I admit that it took me a day to see it -- sometimes
I'm pretty dense) was that the problem wasn't a query problem but a
typical munging problem. The reason SQL didn't work was because the
problem was a screw and they were trying to use a hammer. When I used
a screwdriver, it worked out very nicely.

The downside ;-) was that I had to confess that I used Perl instead of
a database, which caused some heartburn, because EVERYBODY KNOWS that
database technology (using Access) is far, far superior to the grungy
old technology that became obsolete during the last century.

I've been writing some Lisp, and sooner or later I'll solve a problem
with Lisp, and then maybe they'll be more accepting of Perl.

CC.
 
Uri Guttman

c> Because I want my output file to look like this:
c> "George","Washington","1788"\n
c> "George","Washington","1792"\n
c> "John","Adams","1796"\n

then use a CSV module to make sure you do it correctly. or a map join
print line like this:

print join( ',', map qq{"$_"}, @fields ), "\n" ;

that way you can use an array and not need single scalars named $field1
etc. anytime you see number suffixes on scalars, think array or think
the coder is a moron! :)

c> AFAICR, this is the first time that I have parsed a file with a
c> multiline structure, and it was a personal rather than a business
c> purpose.

then you haven't done much parsing so far. plenty of files have
multiline formats.

c> I've been writing some Lisp, and sooner or later I'll solve a
c> problem with Lisp, and then maybe they'll be more accepting of
c> Perl.

a strange way of getting perl accepted! :)

uri
 
Martijn Lievaart

> You underestimate your OS.

No, actually I don't. Obviously, a lot will be buffered, maybe a page,
maybe a cylinder's worth of data. But once you read more than that,
incremental reading greatly increases the chance (we're talking about
a heavily loaded machine) that some other process needs the disk in the
meantime, leading to extra seeks.

In fact, many OSes are very good at optimizing seeks, so if you read
fast enough, the OS may recognize that the heads are in the right
position and avoid some seeks by giving your next read priority.

M4
 
Jürgen Exner

Martijn Lievaart said:
> No, actually I don't. Obviously, a lot will be buffered, maybe a page,
> maybe a cylinder's worth of data. But once you read more than that,
> incremental reading greatly increases the chance (we're talking about
> a heavily loaded machine) that some other process needs the disk in the
> meantime, leading to extra seeks.

If your machines are so loaded that this becomes relevant then you
should have been looking for other storage solutions like RAID0 or SAN
or solid state disks for a while.

jue
 
Peter J. Holzer

c> Because I want my output file to look like this:
c> "George","Washington","1788"\n
c> "George","Washington","1792"\n
c> "John","Adams","1796"\n

> then use a CSV module to make sure you do it correctly. or a map join
> print line like this:
>
> print join( ',', map qq{"$_"}, @fields ), "\n" ;

I took $field1, $field2, etc. to be pseudocode. In reality the variables
were probably called $firstname, $lastname, $year, etc. (of course with
Carter you can never tell). I would probably have used something like

print join( ',', map quote($_), $firstname, $lastname, $year ), "\n" ;

to encapsulate any quoting that might be necessary (what happens when
one of the variables contains double quotes?). I'm ambivalent about CSV
modules. There are so many different CSV formats, but each of them is
very simple: so you need a lot of time to figure out which CSV format is
needed, and once you have done so, it takes more time to figure out the
parameters for your CSV module than just coding it in plain Perl (I
still do use CSV_XS for speed - it's simply much faster than any parser I
could write in Perl. But that's a secondary decision: first I write it
in Perl to get it correct. If it's too slow I use CSV_XS to get it fast).
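
A sketch of one possible quote() along those lines; the quote-doubling
convention is an assumption about the target CSV dialect, not something
prescribed above:

use strict;
use warnings;

sub quote {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/"/""/g;                # escape embedded quotes by doubling them
    return qq{"$value"};
}

my ($firstname, $lastname, $year) = ('George', 'Washington', 1788);
print join( ',', map quote($_), $firstname, $lastname, $year ), "\n";
# prints: "George","Washington","1788"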

hp
 
ccc31807

> print line like this:
>
>         print join( ',', map qq{"$_"}, @fields ), "\n" ;

Thank you. I don't use grep() or map() much ... tidbits like this make
reading c.l.p.m. worthwhile.

> then you haven't done much parsing so far. plenty of files have
> multiline formats.

Not when your data results from a database, which mine does. I'm not a
programmer, but a database guy who writes scripts to munge data which
is read from a database, written to a database, or both.
  c> I've been writing some Lisp, and sooner or later I'll solve a
  c> problem with Lisp, and then maybe they'll be more accepting of
  c> Perl.

> a strange way of getting perl accepted! :)

I once took a graduate level management course in the business school
of a large, state university. Many of the students were essentially
fifth year seniors, but many were junior to mid level management a
decade or so out of school. I was the only SwE guy in the class, and
received abuse from those in management who had had bad experiences
from IT or their SW developers. The accepted Standard was Visual
Basic, because Everybody used Visual Basic, and if you didn't you
couldn't do any Work. The textbook (which was actually a good book - I
kept it and still use it from time to time) was written using VB. The
fact that I used Perl was seen as a sign of the backwardness and
intractability of IT, as opposed to the rational, mature, and forward
looking nature of Management.

CC
 
