Is Java's RandomAccessFile really "random access"?

Kevin

Hi,
I am kind of new to this topic, but does anyone know: is Java's
RandomAccessFile really "random access", or does Java just "simulate"
it to make coding easier for newbies?

The difference is, for example: for a 100G file of many records, if
access to it is REAL random access, then accessing any of its records
should take almost the same (short) time: accessing the first record
should take the same time as accessing the 100th record, the
1,000,000th record, etc., and all should be fast and use few resources.

So I think it comes down to how "seek(position)" works. Will it:
1) just read forward/backward to the position?
or
2) "jump" to that position directly?
From my limited knowledge, I think it could work this way: since each
file's header keeps a linked list (or pointers, whatever) of the blocks
of the file, Java can read in those "informative" blocks that keep
information about the data blocks, do the calculation to find which
data block is the required one, and read that block directly.


Am I right on this point? And why have I seen some articles say
something like "since a random access file needs to access the
underlying OS, its performance is not so good"?

Thank you.
 

Kevin

By the way, another description (which is my real case) would be:

If I have 100,000,000 fixed-size records, each record 100 bytes for
example, and I write them out to a file, the file will be about 10G in
size.

And I need to access those records randomly, about 100 (random)
records every 1 - 5 seconds. And of course I don't have 10G of memory,
so I cannot keep the file in memory.

Using a random access file, can I expect to be able to access them
this way reasonably fast?

Thanks. :)
 

Roedy Green

Kevin wrote:
I am kind of new to this topic, but does anyone know: is Java's
RandomAccessFile really "random access", or does Java just "simulate"
it to make coding easier for newbies?

It is true random access. At the low level, on disk there is a list of
the clusters (head, track and sector) where the various fragments of
the file are stored.

If you seek to offset 345333 of the file, the OS figures out which
fragment it is in, and the offset within that fragment. Then it
calculates the head, track and sector containing that offset and how
many sectors are needed to fulfil your read. Then it schedules the disk
to seek to that location. The CPU does not wait; it gets on with other
things. When the disk arm gets to that location, it reads the data
(with SCSI, without CPU help), and when it is done it taps the CPU on
the shoulder to tell it to look in RAM for the sectors requested. The
CPU copies the bytes you wanted into your buffer.
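
On the Java side, all of that is triggered by a plain seek-and-read. A
minimal sketch (file name and record index are placeholders; the
100-byte record size is taken from the earlier post):

    import java.io.RandomAccessFile;

    public class RecordReader {
        public static void main(String[] args) throws Exception {
            // Fixed-size records make the offset a simple multiplication.
            final int RECORD_SIZE = 100;       // bytes per record, per the original post
            long index = 1_000_000;            // which record we want

            try (RandomAccessFile raf = new RandomAccessFile("records.bin", "r")) {
                raf.seek(index * RECORD_SIZE); // jump straight there; nothing is read on the way
                byte[] record = new byte[RECORD_SIZE];
                raf.readFully(record);         // the OS fetches only the sectors holding this record
            }
        }
    }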

So the computer does not need to read the file sequentially at all.
Even when it reads sequentially, it is just a series of random reads,
one after the other.

Sequential devices are: mag tape, CD writing, DVD writing, TCP/IP,
printers.
 

Roedy Green

Kevin wrote:
And I need to access those records randomly, about 100 (random)
records every 1 - 5 seconds. And of course I don't have 10G of memory,
so I cannot keep the file in memory.

Using a random access file, can I expect to be able to access them
this way reasonably fast?

With nio there is an intermediate alternative: you memory-map the
file. The OS then tries to keep as much of the file as it can in RAM.
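
A minimal sketch (file name and window size are assumptions; a single
MappedByteBuffer is limited to 2G-1 bytes, so a 10G file has to be
mapped in windows):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedRecords {
        public static void main(String[] args) throws Exception {
            final int RECORD_SIZE = 100;      // per the original post
            final long WINDOW = 1L << 30;     // map the file 1G at a time

            try (RandomAccessFile raf = new RandomAccessFile("records.bin", "r");
                 FileChannel ch = raf.getChannel()) {
                long index = 1_000_000;                     // record we want
                long offset = index * RECORD_SIZE;
                long windowStart = (offset / WINDOW) * WINDOW;
                // The OS pages the window in on demand and keeps hot pages in RAM.
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY,
                        windowStart, Math.min(WINDOW, ch.size() - windowStart));
                map.position((int) (offset - windowStart));
                byte[] record = new byte[RECORD_SIZE];
                map.get(record);
            }
        }
    }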
 

Chris Uppal

Kevin said:
So I think it comes down to how "seek(position)" works. Will it:
1) just read forward/backward to the position?
or
2) "jump" to that position directly?

2.

(I suppose that technically it is implementation-dependent, but if the
underlying OS provides random access files and the Java implementation
didn't use them, then we'd have a right to be more than merely
astonished; we could lynch someone ;-) The same goes for an OS with a
"real", general-purpose filesystem that didn't provide random access.)

[...] each file's header keeps a linked list (or pointers, whatever) of
the blocks of the file, Java can read in those "informative" blocks
that keep information about the data blocks, do the calculation to
find which data block is the required one, and read that block directly.

That kind of complexity is implemented in the OS and/or filesystem rather than
in the Java code.

[...] why have I seen some articles say something like "since a random
access file needs to access the underlying OS, its performance is not
so good"?

I /suspect/ that what they mean is that random access and buffering are largely
incompatible. The point of buffering is that by holding data in the process's
own memory, you can avoid going to the OS with lots of small reads/writes. But
that depends on the reads/writes being adjacent. If you read a byte at offset
1, then one at offset 10000000, no implementation has any chance of finding the
second byte in the buffer that it filled to satisfy the first request (unless
you had stupidly big buffers -- which would be /very/ inefficient in this
case). Broadly speaking, if you are doing random access then either you can't
take advantage of buffering at all or you have to do it yourself. It may be
that the Java implementation provides a /little/ buffering so that sequential
reads (with no intervening seek()) will read from a small buffer. That's the
way I'd implement it myself, but I'm afraid that I don't know whether the Java
people did the same (the spec vanishes into a maze of abstract classes, and I
can't be bothered to check the actual code) -- on the whole I'd guess not.
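
A minimal sketch of doing that buffering yourself (block size, cache
size, and class name are all arbitrary choices for illustration, not
anything java.io provides):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Caches whole 4K blocks so that nearby random reads hit memory, not the OS. */
    class BlockCachedFile {
        private static final int BLOCK = 4096;
        private static final int MAX_BLOCKS = 256;          // ~1 MB of cache
        private final RandomAccessFile raf;
        // an access-ordered LinkedHashMap makes a serviceable LRU cache
        private final Map<Long, byte[]> cache =
            new LinkedHashMap<Long, byte[]>(64, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                    return size() > MAX_BLOCKS;
                }
            };

        BlockCachedFile(RandomAccessFile raf) { this.raf = raf; }

        /** Reads len bytes at offset, touching the disk only on a cache miss. */
        byte[] read(long offset, int len) throws IOException {
            byte[] out = new byte[len];
            int done = 0;
            while (done < len) {
                long blockNo = (offset + done) / BLOCK;
                byte[] block = cache.get(blockNo);
                if (block == null) {                        // miss: fetch the whole block
                    block = new byte[BLOCK];
                    raf.seek(blockNo * BLOCK);
                    raf.read(block);                        // a short read near EOF is tolerated here
                    cache.put(blockNo, block);
                }
                int within = (int) ((offset + done) % BLOCK);
                int n = Math.min(len - done, BLOCK - within);
                System.arraycopy(block, within, out, done, n);
                done += n;
            }
            return out;
        }
    }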

BTW1, 10G is a bit on the large side for a file. You may find it
unwieldy, if only for things like backing up, etc. (and hope to Hell
that you don't have a virus checker that insists on scanning the whole
thing after every write ;-) It may also be less efficient than -- say
-- 10 x 1G files, since the OS/filesystem will have to build rather
complex on-disk structures to find each of your blocks on disk.

BTW2. You say you want to handle a peak of approx 100 random reads per
second. That translates to a budget of at worst 10 milliseconds per
read, disk-head seek included. Which is plausible, but you are close
to the hardware limit[*]. If the OS+filesystem has to do another
(internal) seek each time, to find the data defining the location of
your real data on disk, then you are even closer to the hardware
limit. That's another reason why you may find it better to use more
than one file, located on different /physical/ disks.

([*] I haven't been following hard-disk specs for some years, but I
doubt seek time has sped up all that much.)

-- chris
 

Chris Uppal

Roedy Green wrote:

With nio there is an intermediate alternative: you memory-map the
file. The OS then tries to keep as much of the file as it can in RAM.

10 Gig?

Unless the OP's using a 64-bit JVM there won't be enough address space
to map it in.

-- chris
 

Roedy Green


If you have a 32-bit address space, that will limit you to somewhere
between 1 and 4 gig, depending on how they implemented it. If you have
a 64-bit address space, 10 gig is no problem.

How are home-use 64-bit machines coming along?
 

Roedy Green

Chris Uppal wrote:
I /suspect/ that what they mean is that random access and buffering
are largely incompatible [...]

In theory, with sequential i/o, either Java or the OS could presume
you are going to read the next block, and get it ready ahead of time
while you are still computing. That is how it was done in the days of
computers with 16K of RAM. We called it double buffering. You
processed data in one buffer while the next was being read. I know
early versions of Windows did not do this. When they did an i/o,
computation stopped until the i/o completed, and they never did any
read-ahead while you were busy computing. NT was a little cleverer, at
least scheduling i/o for several different tasks. I don't know if XP
has finally graduated to the level of circa-1965 computers.



With random I/O it has no idea what you will read next, so it can't
very well read ahead.
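
A minimal sketch of that double-buffering idea in Java terms (the
class name and buffer handling are made up for illustration; this
shows the technique, not what any version of Windows did):

    import java.io.InputStream;
    import java.util.Arrays;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    /** Classic double buffering: a reader thread fills the next buffer
     *  while the consumer is still computing on the previous one. */
    class ReadAhead {
        private final BlockingQueue<byte[]> filled = new ArrayBlockingQueue<byte[]>(1);

        ReadAhead(final InputStream in, final int bufSize) {
            Thread reader = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            byte[] buf = new byte[bufSize];
                            int n = in.read(buf);
                            if (n < 0) { filled.put(new byte[0]); return; }  // empty array = EOF
                            // Blocks if the consumer still holds the previous buffer,
                            // so at most two buffers are ever in flight.
                            filled.put(n == bufSize ? buf : Arrays.copyOf(buf, n));
                        }
                    } catch (Exception e) {
                        // a real implementation would hand the exception to the consumer
                    }
                }
            });
            reader.setDaemon(true);
            reader.start();
        }

        /** Returns the next filled buffer; a zero-length array signals end of stream. */
        byte[] next() throws InterruptedException {
            return filled.take();
        }
    }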
 

Kevin

So, I think our conclusion is:

1) It is real random access, as far as the OS supports it.

2) The time needed to access each data block is:

a) the time needed to compute the physical location of the data block,
plus
b) the time for the disk head to move to that location,
plus
c) the time needed to read that physical data block from disk.
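
To put rough numbers on b) and c) (ballpark figures for a desktop
drive of this era, not from any particular spec): roughly 9 ms average
seek plus roughly 4 ms rotational latency per scattered read, so a
burst of 100 random reads costs on the order of 1.3 seconds. That fits
the 1 - 5 second budget above, but only just, which is Chris's point
about being close to the hardware limit.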

Thank you all. :)
 

Luc The Perverse

Roedy Green said:
If you have a 32-bit address space, that will limit you to somewhere
between 1 and 4 gig, depending on how they implemented it. If you have
a 64-bit address space, 10 gig is no problem.

How are home-use 64-bit machines coming along?

There are no single-chip x86-64 solutions offering more than 8 GB
right now.

Many dual-Xeon boards are limited to 8 GB total, but a dual-Opteron
board can accept up to 16 GB. (Actually, for Opteron, just multiply
the number of chips by 8 GB and you get the total main-memory
potential.)

It is at least possible that with the new socket switch AMD might
allow more pins for additional main memory. (I haven't heard any
discussion one way or the other.) Hopefully they don't have a 3-socket
blunder again like they did last time. Sheesh! AMD has made PR mistake
after PR mistake at the hands of Mr Ruiz.
 

Richard Wheeldon

Roedy said:
Even when it reads sequentially, it is just a series of random reads,
one after the other.

Which get turned back into sequential reads by the disk controller, to
avoid having to wave the drive arm around too much.

Richard
 

Chris Uppal

Luc said:
There are no single-chip x86-64 solutions offering more than 8 GB
right now.

Oh bugger! And I'd been hoping for 128 Gig in my next laptop too...

BTW, the issue here is actually the size of the address-space, rather than that
of the physical RAM -- unless these chips/boards have limited address lines
too.

-- chris
 

Luc The Perverse

Chris Uppal said:
Oh bugger! And I'd been hoping for 128 Gig in my next laptop too...

BTW, the issue here is actually the size of the address-space, rather
than that of the physical RAM -- unless these chips/boards have
limited address lines too.


If I understand you correctly, yes, this is a problem.

I don't know about Xeon chips, they are way out of my price range -
but Opterons have an on-board memory controller, and are actually
limited by their pins. (I'm an AMD guy anyway.)

128 GB - that seems a little insane. I'm all about technological jumps
- but I'm not sure what you'd fill it up with. Illegally downloading
DVD ISOs ... to RAM?
 

Roedy Green

Richard Wheeldon wrote:
Which get turned back into sequential reads by the disk controller, to
avoid having to wave the drive arm around too much.

Not quite sequential: elevator seeks. It waves the arm back and forth
over the disk like a bus on a route, picking up passengers in order,
different from the order in which the requests were made.
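
A minimal sketch of that elevator (SCAN) ordering (the track numbers
are just made-up examples):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;

    /** Orders pending disk requests like an elevator sweep (SCAN). */
    public class ElevatorSchedule {
        static List<Integer> order(List<Integer> pending, int head) {
            List<Integer> sorted = new ArrayList<Integer>(pending);
            Collections.sort(sorted);
            List<Integer> up = new ArrayList<Integer>();
            List<Integer> down = new ArrayList<Integer>();
            for (int pos : sorted) {
                (pos >= head ? up : down).add(pos);   // split around the head position
            }
            Collections.reverse(down);                // sweep back down in descending order
            up.addAll(down);
            return up;
        }

        public static void main(String[] args) {
            // head at track 50; requests arrived in this (random) order
            System.out.println(order(Arrays.asList(95, 10, 63, 42, 77, 12), 50));
            // prints [63, 77, 95, 42, 12, 10]: one sweep up, then one sweep down
        }
    }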

Some day I want to write a defragger that uses similar "bus" logic.
 

Chris Uppal

Luc The Perverse wrote:

If I understand you correctly, yes, this is a problem.

I don't know about Xeon chips, they are way out of my price range -
but Opterons have an on-board memory controller, and are actually
limited by their pins. (I'm an AMD guy anyway.)

Eek!

Or, maybe not. I suppose (I wish I knew more about this stuff) it depends on
whether the limitation is on addresses passed /in/ to the address-translation
hardware, or on the translated addresses that it emits. I'd hope it's only the
latter.

128 GB - that seems a little insane. I'm all about technological jumps
- but I'm not sure what you'd fill it up with. Illegally downloading
DVD ISOs ... to RAM?

(I'm sure you realised that I was joking, but I'll take that question anyway)

"Insane" ?!? You mean having "only" 1 Gig in a laptop /isn't/ insane ? ;-)

Anyway -- seriously -- with that much RAM there are a number of interesting
things you can do (assuming that program size hasn't grown in proportion --
it's hard to imagine why it should[*]). For instance you could maintain a
seriously useful amount of state in RAM, enough to be able to treat the
hard-disk as merely a stable backup for memory (this whole "file" thing is
really, like, so 20th century). Or you could run every program in its own OS
(I'd really like to run all my network-facing applications -- web browsers etc --
as separate virtual Linuxes).

-- chris

([*] Yeah, right...)
 

Roedy Green

Not quite sequential: elevator seeks. It waves the arm back and forth
over the disk like a bus on a route, picking up passengers in order,
different from the order in which the requests were made.

Some day I want to write a defragger that uses similar "bus" logic.

I keep waiting for two hardware devices that never come.

1. multiarm disks

2. disks with marthaing in firmware -- remapping the logical tracks to
physical ones, with background defragging to minimise head motion,
independent of the OS.
See http://mindprod.com/jgloss/martha.html
 

Gordon Beaton

Roedy Green wrote:
I keep waiting for two hardware devices that never come.

1. multiarm disks

AFAIK, the IBM 3340 (1973) and the Conner Chinook (~1990) had multiple
actuators.

/gordon
 

Monique Y. Mudama

Roedy Green wrote:
I keep waiting for two hardware devices that never come.

1. multiarm disks

I know some folks in the disk drive industry. It seems like they have
enough trouble keeping one arm from misbehaving.
 

Chris Uppal

Monique said:
I know some folks in the disk drive industry. It seems like they have
enough trouble keeping one arm from misbehaving.

Like Dr. Strangelove...

-- chris
 

Tris Orendorff

Monique said:
I know some folks in the disk drive industry. It seems like they have
enough trouble keeping one arm from misbehaving.

Like "Police Inspector Hans Wilhelm Friederich Kemp" in Young Frankenstein?
 
