Virtual files

Roedy Green · May 14, 2008

I have written quite a bit of code that loads an entire file into RAM
then does indexOf, substring, regexes etc, working on the giant
String, often creating a new giant String and writing it out.

I wondered if anyone had developed some sort of package to allow such
code to be easily transformed to work on arbitrarily large text files,
e.g. 10 gigabytes.

Andreas Leitgeb · May 14, 2008

Roedy Green said:
I have written quite a bit of code that loads an entire file into RAM
then does indexOf, substring, regexes etc, working on the giant
String, often creating a new giant String and writing it out.

Don't know of such a library, but depending on the exact actions,
such a library may be impossible to exist.

A regex quite surely needs its "haystack" all in memory, so unless
you can be sure that your particular regex will never match a string
of length greater than some "n", you'd at least have to load the whole
file at once.
If instead you can exclude the possibility that a match ever exceeds
say 10000 chars, then you can apply it to 20000-length blocks, and
then keep the upper 10000, load another 10000 from file and re-apply
the regexp ...

The other operations are probably less of a principial problem.

Another thing is, that for substrings you perhaps do not
always need to create new strings separately in memory to
write out. The OutputStreamWriter has a version of "write"
that takes a String and two ints offset,length to write out
only a substring. I don't know, if that just obtains a
..substring in memory, or if it uses the data directly from
the given String.
If that isn't trustworthy enough, you'll have to obtain
shorter substrings and write them out sequentially one by one.

Finally, you can use JNI and have C-code memory-map the whole
file at once. But then you won't get a String, but at most a byte[]
visible in Java... Paging in/out the appropriate parts of the file
is then made OS's job.

PS: There's some hearsay and speculation behind those hints.
Where necessary: corrections are welcome.

Tom Anderson · May 14, 2008

Don't know of such a library, but depending on the exact actions,
such a library may be impossible to exist.

A regex quite surely needs its "haystack" all in memory, so unless you
can be sure that your particular regex will never match a string of
length greater than some "n", you'd at least have to load the whole file
at once.

Fine, so constrain the length of a match. You lose generality, but i bet
it makes no practical difference. How often do you search text for a
pattern >100 characters long, let alone >1 000 000 000?

If instead you can exclude the possibility that a match ever exceeds say
10000 chars, then you can apply it to 20000-length blocks, and then keep
the upper 10000, load another 10000 from file and re-apply the regexp
...

Something like that. If you were clever at block boundaries, you might be
able to work through the blocks quicker than that.

Back in the olden days, when memory was small (and made of magnetic
doughnuts) and data lived on tapes, a lot of work was done on algorithms
which didn't need to have their whole dataset in memory, and did the
minimal amount of IO, and particularly non-sequential IO, to do their
jobs. Particularly algorithms related to data-crunching, as you might well
want to do with a computer of that period, databases and whatnot, so
plenty of sorting and searching work. I bet there's stuff that could be
applied to this case. Sadly, i think a lot of this knowledge has been lost
- it's all written down, of course, but there isn't a living community of
practice which would help one make use if it.

The other operations are probably less of a principial problem.

Another thing is, that for substrings you perhaps do not always need to
create new strings separately in memory to write out. The
OutputStreamWriter has a version of "write" that takes a String and two
ints offset,length to write out only a substring. I don't know, if that
just obtains a .substring in memory, or if it uses the data directly
from the given String.

Good idea. I have this vague memory of String doing this internally
anyway, possibly in some ancient version of java - there was a thing about
getting memory leaks where you take a 10-char substring of a 10 000
000-char string, drop the big string, but the char[] doesn't get collected
becaue the little string is holding on to it. Maybe that was a dream.

But a CraftyStringBuffer object that worked by maintaining indices into an
existing buffer, and coalescing them or converting them to strings when it
saved memory (ie when they were short) would be cool.

Finally, you can use JNI and have C-code memory-map the whole file at
once. But then you won't get a String, but at most a byte[] visible in
Java... Paging in/out the appropriate parts of the file is then made
OS's job.

No need for JNI (your own, at least) - you can memory-map files using NIO:

String fname = ... ;
File f = new File(fname) ;
MappedByteBuffer buf =
new RandomAccessFile(f, "r").getChannel().map(FileChannel.MapMode.READ_ONLY, 0L, f.length()) ;

However, i note that Buffers use ints for indexing, so they can't handle

2 GB files. Doh.

Okay, but the offset and length for the map *are* long, so you can create
a sequence of Buffers which between them tile the entire file. You can
then wrap these in a GignormoBuffer class which looks like a buffer but
uses long indices and routes calls to the appropriate underlying buffer.

Then you write a GignormoCharBuffer which wraps a GignormoBuffer and
translates bytes to chars (and doesn't handle encodings which are not 1:k,
eg Latin-1 or UTF-16, so no UTF-8).

Then you rewrite string searching and regexps to use a GignormoCharBuffer
instead of a String. String searching is easy, you just rock some
Boyer-Moore action. Regexps are harder. Your best bet might be to let
GignormoCharBuffer produce CharSequences which are windows onto a 2
gigachar region, then apply java's regexps to those, and add logic to
search a whole buffer by doing it in bits and sliding the window,
including handling the joins (which might be quite hard to do cleverly).
Otherwise, write regexps from the ground up to use long-based offsets
(enjoy!).

So no, i don't know of a library for doing this. But it doesn't sound like
an insuperable problem. But i suspect Roedy knew that already.

tom

Roedy Green · May 14, 2008

Since a String can hold no more than two gigachars, such
a package couldn't be a drop-in replacement for your current
techniques. You'll need to divide the file into chunks, and
your code will need to deal with the cracks between them.

I was imagining some sort of package like String/StringBuilder but
with 64-bit offsets. It would cache parts of a backing file in RAM.

The regex would have to work by working on chunks, stopping it before
it got too close to a boundary, before shifting so that it was always
working in the middle of a buffer. It could then do all the logic
with standard Regex, but with a wrapper to handle this buffer shifting
nonsense.

nio memory mapped files are not quite clever enough since they are
limited by address space. nio is probably the tool you need to have a
sliding window.

Jeff Higgins · May 17, 2008

Logan said:
I was going to suggest that, but it appears that all the offsets
are ints, so the same 2^32 limitation (actually, 2^31 limitation,
I guess) would apply.

What's wrong with java.nio.channels.FileChannel?

Tom Anderson · May 17, 2008

What's wrong with java.nio.channels.FileChannel?

It's not memory-mapped.

tom

Jeff Higgins · May 17, 2008

Tom said:
It's not memory-mapped.

That is true. Going back to the OP, I don't a *requirement*
for a memory-mapped, well anything. I suppose my point was that
FileChannel has long indexes.

[quote OP]
I have written quite a bit of code that loads an entire file into RAM
then does indexOf, substring, regexes etc, working on the giant
String, often creating a new giant String and writing it out.

I wondered if anyone had developed some sort of package to allow such
code to be easily transformed to work on arbitrarily large text files,
e.g. 10 gigabytes.
[unquote]

JH

Tom Anderson · May 18, 2008

That is true. Going back to the OP, I don't a *requirement* for a
memory-mapped, well anything.

True! But that's what Logan and whoever came before him were getting at.
If you can memory-map, you avoid having to do explicit IO, which radically
simplifies searching - you just use the buffer, and let the system bring
you the data at the right time.

I suppose my point was that FileChannel has long indexes.

True, but so does RandomAccessFile. A FileInputStream doesn't use indices,
and that means it can read files of any size. The only thing that
FileChannel adds is the ability to do asynchronous IO, which doesn't seem
relevant here. All of these approaches are still hamstrung by the fact
that even with long indices on the file object, you can only load up to 2
GB of data at a time, because arrays and buffers use int indices.

tom

Jeff Higgins · May 18, 2008

Tom said:
All of these approaches are still hamstrung by the fact that even with
long indices on the file object, you can only load up to 2 GB of data at a
time, because arrays and buffers use int indices.

Well, there we are.

1 p4WN 3v3Ry+h1n G!!!

How did your opponent feel about that?

Tom Anderson · May 18, 2008

Well, there we are.
Exactly.

How did your opponent feel about that?

Insubstantial.

tom

Mark Space · May 18, 2008

Kenneth said:
This kind of thing is coming up more an more often. The API is going to
require some major rewriting in the not too distant future. Sun is going
to emphasize backward compatibility and what we are going to end up with
is a rather broken and fragmented language. I don't know any good way to
handle this, maybe a new set of packages for parts of the API based on
longs instead?

That's good point. It's too bad that ints and longs and other basic
data types aren't treated as more like objects and able to be handled
polymorphically.

I wonder how bad it would be just to take the existing API, say to heck
with the spec, and replace all ints with longs, in place, without
changing any method names or class names?

You'd get a lot of "need to cast from long to int here" but the you
could suppress that and it would still run, right? And then you could
upgrade your code at some leisure. I'm not sure of the best way for Sun
to do this. Maybe some future version (Java 8?) and give the current
version an extra long life so everyone has time to react.

The worst part would be in the transition, narrowing casts do not throw
overflow errors, so you could have a big issue and not know it.
(*coughfixthisSuncough*) I'd like to see this changed at the same time,
just so such errors are caught a runtime.

I know some here are against that, because of how "important" and
"convenient" bit slicing integers is, but that's a load of b.s. The
obvious solution is that the idiom of catching and ignoring a cast
related overflow error could be optimized to the older, faster operators
that simulate bit slicing.

int i;
long val = 0;
try {
i = (int)val;
} catch( CastOverflowException x )
{}

means "just bit slice it." For convenience, Sun could add an annotation
that does the same thing:

@SupressCastOverflow
public int longToInt( int i ) {
return val;
}

will also use the "faster bit slice operator." (Sun could also make it a
warning rather than an error to narrow primitives without an explicit
cast, and use SuppressWarnings.) Yes, you'll have to touch your code.
Sorry.

Arne VajhÃ¸j · May 18, 2008

Mark said:
I wonder how bad it would be just to take the existing API, say to heck
with the spec, and replace all ints with longs, in place, without
changing any method names or class names?

You'd get a lot of "need to cast from long to int here" but the you
could suppress that and it would still run, right? And then you could
upgrade your code at some leisure. I'm not sure of the best way for Sun
to do this. Maybe some future version (Java 8?) and give the current
version an extra long life so everyone has time to react.

You can suppress warning not errors.

I think it would be rather bad.

You will either break a lot of existing code or make
Java a lot less type safe.

Arne

Roedy Green · May 21, 2008

"just bit slice it."

could you please define "bit slice".

do you just mean chopping off the high order bits?

Database Manager: A C++ Console Application	14	May 12, 2025
ANN: XMLMax Virtual XML Editor Beta	0	Feb 13, 2009
ANN: XMLMax Virtual XML Editor Beta	0	Mar 30, 2009
Efficiently concatenating contents of multiple files	5	Jul 2, 2008
Pure virtual destructor	11	May 27, 2005
PyWart: PEP8: a seething cauldron of inconsistencies.	1	Jul 28, 2011
PyWart: PEP8: A cauldron of inconsistencies.	7	Jul 27, 2011
The devolution of English language and slothful c.l.p behaviors exposed!	50	Jan 24, 2012

Virtual files

Roedy Green

Andreas Leitgeb

Tom Anderson

Roedy Green

Jeff Higgins

Tom Anderson

Jeff Higgins

Tom Anderson

Jeff Higgins

Tom Anderson

Mark Space

Arne VajhÃ¸j

Roedy Green

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads