question concerning pipes and large strings


Rainer Weikusat

Rainer Weikusat said:
Rainer Weikusat said:
Martijn Lievaart said:

man sort

Indeed, I tried sort first; it works. It is more of a scalability
question, really.

This is a really bad idea because sort will reorder the complete input
lines, including the data part, possibly/probably multiple times for
each input line, and this means a lot of copying of data which doesn't
need to be copied, since only the IDs are supposed to be sorted.

As GNU sort is rather optimized, I would benchmark this before making
blanket statements like this.

'Rather optimized' usually means the code is seriously convoluted
because it used to run faster on some piece of ancient hardware in
1997 for a single test case because of that. And no matter how
'optimized', a sort program needs to sort its input. Which involves
reordering it. Completely. In the case of files which are too large for
the memory of a modern computer, this involves a great deal of copying
data around.

I suggest that you make some benchmarks before making blanket
statements like the one above.

On some random computer I just used for that, sorting a 1080M file
(4000000 lines)

Since I'm a curious person, I also tried this with the 'complete'
algorithm, namely: sort the lines, remove the IDs and concatenate the
results. Something like

sort -k1 -S 50% mob-4 | perl -pe 'chop; s/^[^\t]+\t//;' >out

is actually drastically faster than any 'pure Perl' solution. But this
requires keeping the whole file in memory. As soon as sort can't do
that anymore, its performance becomes relatively abysmal while the
code which keeps only the IDs works decently on a larger dataset.
But this is nevertheless sort-of a ghost discussion: Something
'complete' which has been written in C will doubtlessly outperform the
pipeline easily.
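
For reference, a minimal sketch of what such an 'IDs only' approach can
look like in Perl. This is an illustration of the idea, not the code
benchmarked above; it assumes lines of the form ID<TAB>data in a file
named mob-4, and ASCII IDs:

use strict;
use warnings;

my $file = 'mob-4';
open my $in, '<', $file or die "open $file: $!";

# Pass 1: remember each ID together with the byte offset of its data part.
my @index;
while (1) {
    my $pos = tell $in;
    defined(my $line = <$in>) or last;
    my ($id) = $line =~ /^([^\t]+)\t/ or next;
    push @index, [ $id, $pos + length($id) + 1 ];   # offset just past "ID\t"
}

# Pass 2: sort only the IDs, then copy each data part once, in ID order.
for my $rec (sort { $a->[0] cmp $b->[0] } @index) {
    seek $in, $rec->[1], 0 or die "seek: $!";
    my $data = <$in>;
    chomp $data;
    print $data;
}

Only the small ID/offset pairs ever get sorted; each multi-megabyte data
part is read and written exactly once.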
 

Martijn Lievaart

But this is nevertheless sort-of a ghost discussion: Something
'complete' which has been written in C will doubtlessly outperform the
pipeline easily.

What is more valuable, your time or computer time?

M4
 

Martijn Lievaart

Martijn Lievaart said:
man sort

Indeed, I tried sort first; it works. It is more of a scalability
question, really.

This is a really bad idea because sort will reorder the complete input
lines, including the data part, possibly/probably multiple times for
each input line, and this means a lot of copying of data which doesn't
need to be copied, since only the IDs are supposed to be sorted.

As GNU sort is rather optimized, I would benchmark this before making
blanket statements like this.

'Rather optimized' usually means the code is seriously convoluted
because it used to run faster on some piece of ancient hardware in 1997
for a single test case because of that. And no matter how 'optimized',

You have a rather dim view of GNU software, which I think is rather far
from the truth.

a sort program needs to sort its input. Which involves reordering it.
Completely. In the case of files which are too large for the memory of a
modern computer, this involves a great deal of copying data around.

True.


I suggest that you make some benchmarks before making blanket statements
like the one above.

I wasn't the one stating something is a "bad idea". On rereading I see I
was not completely clear and said something different than what I was
trying to say.

What I tried to say was: "Gnu sort is a very optimized program, maybe a
solution with Gnu sort is fast enough".

Efficiency is always relevant except in a single case: The guy who has
to write the code is so busy with getting it to work at all that the
mere thought of having to try to make it work sensibly scares the shit
out of him, and he tries to pass this competence deficit off as a
'secret advantage' when posing for others. Usually, this will also
always involve a dedicated computer for testing and, often, the people
who are going to use the code are not in the position to complain to
the person who wrote it, IOW, run-time efficiency doesn't matter
because it is someone else's problem.

You must live on a different planet than I do. There is always a "good
enough" efficiency. Sometimes you cannot achieve even good enough
efficiency, sometimes it truly does not matter, and sometimes it's not
worth spending 1000 hours to make something run a little faster.

It's a trade-off between use-time and programming time. And as I'm the
primary user of my own software, I make sure it is efficient enough for
my needs.

I query routers with SNMP. It used to take ages, so I parallelized it. It
now runs 10 times faster and I doubt it can be made faster (it's now I/O
bound).

I run reports nightly. I don't care if they take 1 minute or 5 minutes to
run.

Congratulate yourself on the happy situation you happen to be in. Stop
assuming that it is 'the universal situation'. Things might look rather
different if code is written for in-house use and supposed to run on a
computer which also provides VPN services for customers coming from
fifty different companies.

So maybe efficiency is relevant in that case. One example does not make
your statements true.

Although a machine running VPNs for 50 customers should probably be
dedicated. The impact is too big if something goes wrong, e.g. a file
with multi-megabyte lines must be sorted and brings the machine to its
knees :). But maybe it has to run on that machine, e.g. for gathering
statistics. That's my whole point: it depends.
M4
 

Rainer Weikusat

Martijn Lievaart said:
What is more valuable, your time or computer time?

Depending on how you look at it, my time is either infinitely more
valuable than computer time (every second of my finite life which has
passed is gone forever and cannot be replaced) or infinitely less
valuable (People will pay in order to use computer time to solve
problems. Nobody will pay me for my lifetime just because it is
theoretically valuable to me, and if I just starved to death tomorrow,
the world at large wouldn't take note of this loss).

Seems your question doesn't really make sense ...
 

Rainer Weikusat

Martijn Lievaart said:
Martijn Lievaart said:

man sort

Indeed, I tried sort first; it works. It is more of a scalability
question, really.

This is a really bad idea because sort will reorder the complete input
lines, including the data part, possibly/probably multiple times for
each input line, and this means a lot of copying of data which doesn't
need to be copied, since only the IDs are supposed to be sorted.

As GNU sort is rather optimized, I would benchmark this before making
blanket statements like this.

'Rather optimized' usually means the code is seriously convoluted
because it used to run faster on some piece of ancient hardware in 1997
for a single test case because of that. And no matter how 'optimized',

You have a rather dim view of GNU software, which I think is rather far
from the truth.

I have a rather realistic expectation of what 'heavily optimized code'
means in practice. This is not at all specifically related to 'GNU
software': it means that the code contains special-case treatment for
all kinds of special cases which, in turn, strongly suggests that it is
based on an ill-thought-out 'hasty generalization' of lots of
different situations which should rather be handled differently
('malloc' would be the 'classical' example of that).

[...]
I wasn't the one stating something is a "bad idea". On rereading I see I
was not completely clear and said something different than what I was
trying to say.

What I tried to say was: "Gnu sort is a very optimized program, maybe a
solution with Gnu sort is fast enough".

Actually, it isn't (which is supposed to be a real compliment to the
people who wrote the code, who understood that the maintenance cost of
an 'untangleable mess' by far outweighs any runtime efficiency
advantages this mess might have at the moment): The program reads
some amount of data into a fixed-size buffer, sorts these lines via
mergesort, writes them to a temp file, reads the next amount of data
and so forth until all input data has been processed once. Afterwards,
it creates 'complete output' by merging the temporary files. For the
situation the OP describes (a 'line of text' composed of an ID, a tab
and then 'many millions' of data bytes) this is very wasteful, at
least for the sorting step, because these 'many lines with many
millions of data bytes' will needlessly be moved around during the
tempfile merge cycle.
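
To make the data movement concrete, here is a toy Perl version of the
split/sort/merge scheme just described (the chunk size, File::Temp and
the line-at-a-time merge are simplifications for illustration, not GNU
sort's actual code). Every complete line, data part included, gets
copied into a temp file and then copied again while the temp files are
merged:

use strict;
use warnings;
use File::Temp qw(tempfile);

my $chunk_lines = 100_000;
my (@temps, @chunk);

sub flush_chunk {
    return unless @chunk;
    my ($fh, $name) = tempfile(UNLINK => 1);
    print {$fh} sort @chunk;          # full lines are copied here ...
    seek $fh, 0, 0;
    push @temps, $fh;
    @chunk = ();
}

while (my $line = <STDIN>) {
    push @chunk, $line;
    flush_chunk() if @chunk == $chunk_lines;
}
flush_chunk();

# ... and copied again, line by line, for the merge over the temp files.
my @heads = map { scalar readline $_ } @temps;
while (grep { defined } @heads) {
    my $min;
    for my $i (0 .. $#heads) {
        next unless defined $heads[$i];
        $min = $i if !defined($min) || $heads[$i] lt $heads[$min];
    }
    print $heads[$min];
    $heads[$min] = readline $temps[$min];
}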

You must live on a different planet than I do. There is always a "good
enough" efficiency.

The state of certain sciences such as 'physics' or 'astronomy' has
been 'good enough' for any even remotely practical purpose for
something like a century or so. Yet people build 'large hadron
colliders' for the joy of letting infinitesimally small things crash
into infinitesimally small other things which - ha ha ha - causes them
to fragment into yet smaller things (and no one will ever know if any
of these 'particles' actually exist outside of 'large hadron
colliders' and it wouldn't matter to know).

Compared to that, the case at hand is (also) a seriously practical
research problem and a really cheap one as well (among other things, I
have now learnt a sensible way for sorting linked lists and I will
certainly be able to make use of that).
 

Marc Girod

I am enjoying your conversation.
Just a side comment, not taking sides.

The state of certain sciences such as 'physics' or 'astronomy' has
been 'good enough' for any even remotely practical purpose for
something like a century or so.

It had been good enough, in retrospectively radically contradictory
ways, for several thousand years before...
In fact, it is structurally good enough, isn't it?
The problems are always marginal, and ignored for practical purposes.

Marc
 

J. Gleixner

Sorry, I should have pasted my trial code first, I did it right after my initial post, but it got here a bit late. Please see my second post for details (there is an ordering mistake in the snippet: "chomp" should be below the line with the "split").

Indeed, I tried sort first; it works. It is more of a scalability question, really.


Thanks.

Possibly another option is to parse the huge file into many different
files, named by the key. So you'd have files 001, 002, 003, etc. (or
maybe 001.txt, 002.txt, etc.), and each would contain the contents of
the corresponding line from the huge file. Sort the file names and
concatenate the files.
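
A rough sketch of that idea, assuming ID<TAB>data lines, a source file
called huge.txt, one line per ID, and IDs that are safe to use as file
names (all of these are assumptions on my part):

use strict;
use warnings;

mkdir 'parts' unless -d 'parts';

# Split: one small file per key, each holding that line's data part.
open my $in, '<', 'huge.txt' or die "huge.txt: $!";
while (my $line = <$in>) {
    my ($id, $data) = split /\t/, $line, 2;
    open my $out, '>', "parts/$id" or die "parts/$id: $!";
    print {$out} $data;
    close $out or die "close parts/$id: $!";
}

# Join: sort the (small) file names and concatenate their contents.
opendir my $dh, 'parts' or die "parts: $!";
for my $id (sort grep { !/^\./ } readdir $dh) {
    open my $part, '<', "parts/$id" or die "parts/$id: $!";
    print while <$part>;
}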
 

Martijn Lievaart

Depending on how you look at it, my time is either infinitely more
valuable than computer time (every second of my finite life which has
passed is gone forever and cannot be replaced) or infinitely less
valuable (People will pay in order to use computer time to solve
problems. Nobody will pay me for my lifetime just because it is
theoretically valuable to me, and if I just starved to death tomorrow,
the world at large wouldn't take note of this loss).

Seems your question doesn't really make sense ...

I think you just answered the question... :)

But the context is of course one of efficiency. To put it to extremes,
should you put in a few hours' work to shave off a few milliseconds of
runtime? If you work on realtime systems the answer might be a definite
yes. Otherwise it should more often than not be no. Unless your hobby is
making programs run as efficiently as possible.

Remember the three laws of optimization: Don't, don't, don't. Or profile,
profile, profile.
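
In Perl, the 'profile, profile, profile' part is usually cheap to act
on; Devel::NYTProf is one commonly used profiler (my suggestion here,
not a tool mentioned in the thread):

perl -d:NYTProf the_script.pl    # writes nytprof.out
nytprofhtml                      # turns nytprof.out into an HTML report

(the_script.pl is just a placeholder name.)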

Cases in point:

I was asked to speed up an LDAP synchronization program. After looking it
over, I rewrote the algorithm from O(n^3) to O(n^2), resulting in a 300%
speedup. The result was still not fast enough, but lacking a clear point
of attack (programmers are often very bad at estimating what is or isn't
efficient, and I was not going to fall into that trap), I profiled the
program. I found that 40% of the time the program was converting strings
from UTF-8 to Latin-1 and vice versa. Caching the conversion resulted in
another 20% speedup. On a whim, as I know that graphics can take time and
I knew the profiler could not profile the graphics part, I changed the
progress bar to update only once every 16 steps. This resulted in another
20% speedup, at which point the program was fast enough for our datasets
(which we knew would not grow).
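
The conversion cache mentioned above can be as simple as a hash keyed
on the input string. A sketch, assuming the strings repeat often enough
to be worth caching (the function and hash names are made up):

use strict;
use warnings;
use Encode qw(decode encode);

my %latin1_for;
sub utf8_to_latin1 {
    my ($octets) = @_;
    # Do the conversion once per distinct input string, then reuse it.
    $latin1_for{$octets} = encode('ISO-8859-1', decode('UTF-8', $octets))
        unless exists $latin1_for{$octets};
    return $latin1_for{$octets};
}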

A program I wrote was rather efficient at its task, but it did affect
the efficiency of the database involved. Other tasks did not yet suffer
enough that users noticed, but monitoring indicated a much higher
database load. After I talked it over with a colleague, he came up with a
much better algorithm. I found I could also cache some values at program
start. The resulting speedup was huge, from a minute of runtime down to
less than a second. As this was a daemon, the runtime for a single pass
was not important at all; however, the database load went down to
unnoticeable levels in monitoring.

I have to create a report every month. It takes some 2 minutes to run. I
can make it much more efficient, but it's just not worth my time.


M4
 

Rainer Weikusat

Martijn Lievaart said:
[...]

Depending on how you look at it, my time is either infinitely more
valuable than computer time
[...]
or infinitely less valuable (People will pay in order to use computer time to solve
[...]
Seems your question doesn't really make sense ...

I think you just answered the question... :)

I don't think so. On its own, 'my time' has some value to me, and even
this only insofar as I can make some productive use of it.

But the context is of course one of efficiency. To put it to extremes,
should you put in a few hours' work to shave off a few milliseconds of
runtime? If you work on realtime systems the answer might be a definite
yes. Otherwise it should more often than not be no. Unless your hobby is
making programs run as efficiently as possible.

'A realtime system' is the classic example of something which is
either 'fast enough' (capable of operating within the time constraints
associated with a given problem), in which case making it work faster
offers no benefit, or not fast enough, and then it is entirely
unusable.

As with the 'your time versus computer time' issue, the question is
ill-defined. In order to answer it, at least the following pieces of
additional information are necessary: How large is the benefit
achieved in this way compared to the total runtime? How often is the
code going to be executed and/or how many instances of it will be
executed concurrently?
 

brillisoft

Let's roll back to the beginning of the process - if that is feasible, and maybe redesign the process?

How is this file created in the first place?

Whatever logic creates it - could that logic store a lighter mirror file with just the IDs and the positions of the IDs and the text in the large file?

Then that lighter file could be used as a kind of index of the large file.
Or maybe the whole thing could sit in a database in the first place?
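
A sketch of such a 'mirror' file, assuming the writer produces
ID<TAB>data lines; the file names big.dat and big.idx are placeholders
of my own:

use strict;
use warnings;

# Writer side: whatever creates the big file also records, for every
# line, the ID and the byte offset at which that line starts.
open my $big, '>', 'big.dat' or die "big.dat: $!";
open my $idx, '>', 'big.idx' or die "big.idx: $!";

sub append_record {
    my ($id, $data) = @_;
    print {$idx} $id, "\t", tell($big), "\n";
    print {$big} $id, "\t", $data, "\n";
}

# A reader can then sort the small index (in memory or with sort(1))
# and seek straight to the lines it wants in big.dat, instead of ever
# sorting the multi-megabyte lines themselves.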
 
