Reading file lines into large List

iksrazal

Hi all,

I have tried posting this problem a few times looking for a solution.
What I have works but is slow.

I have to read unordered data from a text file. Data A has a
one-to-many relationship with Data B, and Data B has a one-to-many
relationship with Data C. Once I find Data A, Data B and Data C could
be anywhere in the file, so I do a RandomAccessFile.seek(0). This
needs to be repeated hundreds of times for lots of Data B and Data C
relationships.

I'd like to put the file in memory. However, I'm using Scanner and
regex to get the data, and I don't think I can run Scanner on a
ByteBuffer per line.

So my latest idea is to read the file line by line and store it in a
List, so a 10,000-line file would produce a List of size 10,000. That
way I could run Scanner on the Strings in the List and iterate
repeatedly. The data in the List would be read only.

Any advice appreciated.
iksrazal
 
jan V

> I'd like to put the file in memory. However, I'm using Scanner and
> regex to get the data, and I don't think I can run Scanner on a
> ByteBuffer per line.
>
> So my latest idea is to read the file line by line and store it in a
> List, so a 10,000-line file would produce a List of size 10,000. That
> way I could run Scanner on the Strings in the List and iterate
> repeatedly. The data in the List would be read only.

Maybe iterating over a List would itself greatly slow things down... not
because of the pure iteration overhead, but because you would force the
regex engine to start up for every line. Why don't you:

a) store your entire text file in one (1) String
b) scan it once to locate all the line start offsets, storing these in a
separate data structure
c) use the regex engine to find what you need
d) then map the match offset to a line number

At the cost of the somewhat expensive b) and d) steps, you give step c) the
simplest possible data structure in which to search.
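
In sketch form it might look something like this (untested; it reads the
whole file via java.nio.file for brevity, and the pattern is only
illustrative, loosely based on your BSS sample data):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetScan {
    public static void main(String[] args) throws IOException {
        // a) the entire file as one String
        String text = new String(Files.readAllBytes(Paths.get("data.txt")));

        // b) one scan to record the start offset of every line
        int[] starts = lineStarts(text);

        // c) let the regex engine search the single big String
        Pattern p = Pattern.compile("BSS\\d+");   // illustrative pattern
        Matcher m = p.matcher(text);
        while (m.find()) {
            // d) map the match offset back to a line number
            System.out.println("'" + m.group() + "' on line "
                    + lineOf(starts, m.start()));
        }
    }

    // start offsets of all lines, in ascending order
    static int[] lineStarts(String text) {
        int n = 1;
        for (int i = 0; i < text.length(); i++)
            if (text.charAt(i) == '\n') n++;
        int[] starts = new int[n];
        for (int i = 0, j = 1; i < text.length() && j < n; i++)
            if (text.charAt(i) == '\n') starts[j++] = i + 1;
        return starts;
    }

    // binary search for the line containing 'offset', 1-based
    static int lineOf(int[] starts, int offset) {
        int i = Arrays.binarySearch(starts, offset);
        return (i >= 0 ? i : -i - 2) + 1;
    }
}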

[BTW, I'm assuming you've already profiled your app and rejected the current
algorithm as a result of your findings]
 
Chris Uppal

> So my latest idea is to read the file line by line and store it in a
> List, so a 10,000-line file would produce a List of size 10,000. That
> way I could run Scanner on the Strings in the List and iterate
> repeatedly. The data in the List would be read only.
>
> Any advice appreciated.

But what exactly is your problem? How to represent that data in memory?
How to get it into memory? How to search it once it is in memory?
Something else?

-- chris
 
iksrazal

Chris Uppal wrote:
> But what exactly is your problem? How to represent that data in memory?
> How to get it into memory? How to search it once it is in memory?
> Something else?
>
> -- chris

My problem is simply speed. Doing lots of RandomAccessFile.seek(0) is
too slow. I'm using JAXB to collect the data - but that is not the
bottleneck. So searching in memory is what I'd like to try. I currently
use a regex per line via Scanner, and I'd like to accomplish the regex
in a different way, i.e. not using seek(0).

iksrazal
 
iksrazal

That all seems like a good idea. I'm going to try simply putting the
data into a List first, and then profile the speed difference versus
RAF.seek(0). If speed is still an issue, I'll try implementing steps
a, b, c, d that you outlined and give it a shot.

Thanks!
iksrazal
 
John C. Bollinger

> Hi all,
>
> I have tried posting this problem a few times looking for a solution.
> What I have works but is slow.
>
> I have to read unordered data from a text file. Data A has a
> one-to-many relationship with Data B, and Data B has a one-to-many
> relationship with Data C. Once I find Data A, Data B and Data C could
> be anywhere in the file, so I do a RandomAccessFile.seek(0). This
> needs to be repeated hundreds of times for lots of Data B and Data C
> relationships.
>
> I'd like to put the file in memory. However, I'm using Scanner and
> regex to get the data, and I don't think I can run Scanner on a
> ByteBuffer per line.
>
> So my latest idea is to read the file line by line and store it in a
> List, so a 10,000-line file would produce a List of size 10,000. That
> way I could run Scanner on the Strings in the List and iterate
> repeatedly. The data in the List would be read only.

Your latest idea is your original idea in different clothes. Yes,
getting all the data into memory has a chance of speeding things up
some, but if you're talking about a file of 10,000 lines of (say) 256
characters or less, then the total size of the file is less than about
2.5 MB, and your OS' filesystem cache probably holds the whole thing in
OS memory anyway. The margin to be gained by moving it into user memory
is probably not enough to go from "slow" to "fast". That would probably
still be true even if the file *wasn't* in the OS' cache.

You want a solution that requires fewer scans of the data -- preferably
just one. You didn't describe how references from A to B and from B to
C are encoded in the data, nor what distinguishes an A from a B from a
C, but chances are good that you would get a significant speedup from
building an index and then using that instead of repeatedly scanning the
data from the beginning. For instance, read the whole file, once,
constructing as you go structured representations of data A, B, C, D, Q,
and Z. Store these data in memory in a Map, using for the key whatever
substring is used for linking these things together. It may also be
convenient to build separate Lists (or Maps) of all the A's, B's, and
C's as you go, if indeed they are distinguishable. Once you've parsed
the file, use the Map to find and resolve all the references.
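
As a rough, untested sketch -- with the key extraction invented, since
you didn't show how your records are actually encoded:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexBuilder {
    public static void main(String[] args) throws IOException {
        // key -> every line carrying that key; here the key is assumed
        // to be the first token of each line, which is pure invention
        Map<String, List<String>> index = new HashMap<String, List<String>>();

        BufferedReader in = new BufferedReader(new FileReader("data.txt"));
        String line;
        while ((line = in.readLine()) != null) {     // one pass, no seek(0)
            String key = line.split("\\s+", 2)[0];
            List<String> bucket = index.get(key);
            if (bucket == null) {
                bucket = new ArrayList<String>();
                index.put(key, bucket);
            }
            bucket.add(line);
        }
        in.close();

        // resolving a reference is now a hash lookup, not a file scan
        System.out.println(index.get("2056482"));
    }
}

Resolving each A-to-B or B-to-C reference then costs O(1) on average
instead of a full rescan of the file.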

If you haven't used Maps before, don't be scared of HashMap. Many
novices seem to gravitate to TreeMap, perhaps because they understand
trees better than they understand hashing, but HashMap is frequently the
better choice.

I hope this helps.
 
iksrazal

> Your latest idea is your original idea in different clothes. Yes,
> getting all the data into memory has a chance of speeding things up
> some, but if you're talking about a file of 10,000 lines of (say) 256
> characters or less, then the total size of the file is less than about
> 2.5 MB, and your OS' filesystem cache probably holds the whole thing in
> OS memory anyway. The margin to be gained by moving it into user memory
> is probably not enough to go from "slow" to "fast". That would probably
> still be true even if the file *wasn't* in the OS' cache.
>
> You want a solution that requires fewer scans of the data -- preferably
> just one. You didn't describe how references from A to B and from B to
> C are encoded in the data, nor what distinguishes an A from a B from a
> C, but chances are good that you would get a significant speedup from
> building an index and then using that instead of repeatedly scanning
> the data from the beginning.
<snip>

I know the whole collections framework a little bit ;-).

Actually, just changing RandomAccessFile.readLine and
RandomAccessFile.seek to putting all 10,000 lines into a List<String>
and using an Iterator boosted performance 60 times: it went from 14
minutes to 14 seconds, which for practical purposes solves my problem.
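
The change amounts to roughly this (a simplified sketch, not my exact
code; the pattern is just taken from my sample data):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class InMemoryLines {
    public static void main(String[] args) throws IOException {
        // read the file once into a read-only List of lines
        List<String> lines = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader("data.txt"));
        for (String s; (s = in.readLine()) != null; ) {
            lines.add(s);
        }
        in.close();

        // each "rescan" is now an in-memory iteration instead of seek(0)
        for (String s : lines) {
            Scanner sc = new Scanner(s);
            if (sc.findInLine("BSS\\d+") != null) {
                System.out.println(s);
            }
        }
    }
}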

My file:

2056482 BSS0

I have about 20 of these.

I'm using the List interface with an ArrayList, which seems to be fine,
but another List implementation may be better for large indices of
read-only data.

Now what you took the time to describe - thanks - is similar to what
another poster described, i.e. 'building an index'. It is the better
approach. I'm not sure just yet if going, say, from 14 seconds to 4 is
necessary in this app.

Incidentally, the data is stored in a Hibernate-persisted DB using the
composite pattern for the tree, but JAXB is used to build the tree on
another system as a first step.

I hope this helps others - I'm convinced that using a List to store all
the file's lines can achieve much better performance than RAF.seek.
Using an index as described would be a good experiment.

Thanks!
iksrazal
 
jan V

> Actually, just changing RandomAccessFile.readLine and
> RandomAccessFile.seek to putting all 10,000 lines into a List<String>
> and using an Iterator boosted performance 60 times: it went from 14
> minutes to 14 seconds, which for practical purposes solves my problem.

So, additionally, there might be a lot more to gain from further
optimizing using some of the other suggestions in this thread.

> I'm using the List interface with an ArrayList, which seems to be fine,
> but another List implementation may be better for large indices of
> read-only data.

IMO, none of the standard List implementations would improve on
ArrayList, though a custom List implementation just might.

> Now what you took the time to describe - thanks - is similar to what
> another poster described, i.e. 'building an index'. It is the better
> approach. I'm not sure just yet if going, say, from 14 seconds to 4 is
> necessary in this app.

Never optimize beyond what the client is prepared to pay for. ;-)

> Using an index as described would be a good experiment.

I've written a TextBlock class which I subclass in various applications;
it allows a linear character offset to be mapped to a line number. Very
handy... it makes handling line numbers in all kinds of applications a
doddle. TextBlock takes a block of text and does a one-off scan to
create a private data structure that makes offset-to-line-number lookups
very, very fast. Additionally, the constructor takes a performance hint
so that you can tune this lookup: you can have O(1) lookups if you're
prepared to pay in RAM.
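
TextBlock itself is private code, but the O(1) variant of the lookup
amounts to roughly this sketch: one table entry per character, so each
lookup is a single array access:

public class LineTable {
    private final int[] lineOfOffset;   // one entry per character

    public LineTable(String text) {
        lineOfOffset = new int[text.length()];
        int line = 1;                   // 1-based line numbers
        for (int i = 0; i < text.length(); i++) {
            lineOfOffset[i] = line;
            if (text.charAt(i) == '\n') line++;
        }
    }

    // O(1) lookup, at the cost of 4 bytes of RAM per character
    public int lineAt(int offset) {
        return lineOfOffset[offset];
    }

    public static void main(String[] args) {
        LineTable t = new LineTable("first\nsecond\nthird\n");
        System.out.println(t.lineAt(7));   // offset 7 is in "second" -> 2
    }
}

The cheaper setting would be an int[] of line-start offsets plus a
binary search: O(log n) per lookup, but only one int per line of RAM.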
 
Joan

> I'd like to put the file in memory. However, I'm using Scanner and
> regex to get the data, and I don't think I can run Scanner on a
> ByteBuffer per line.

I have an application that reads in unordered text files of about 8,250
lines and puts them into a String[][] array. It uses readLine and split.
Starting and running the prog takes about 1-2 sec. Then I scan the whole
list for a match on user input, casting the Strings to CharSequence and
using a Matcher. It is very fast.
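
Roughly like this sketch (untested; the file layout and the default
pattern are assumed for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TableMatch {
    public static void main(String[] args) throws IOException {
        // read every line once and split it into fields
        List<String[]> rows = new ArrayList<String[]>();
        BufferedReader in = new BufferedReader(new FileReader("data.txt"));
        for (String line; (line = in.readLine()) != null; ) {
            rows.add(line.split("\\s+"));
        }
        in.close();
        String[][] table = rows.toArray(new String[rows.size()][]);

        // compile the user's pattern once; String already implements
        // CharSequence, so the fields can be matched directly
        Pattern p = Pattern.compile(args.length > 0 ? args[0] : "BSS0");
        for (String[] row : table) {
            for (String field : row) {
                if (p.matcher(field).find()) {
                    System.out.println(Arrays.toString(row));
                    break;                      // one report per row
                }
            }
        }
    }
}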
 
