Reading file lines into large List

iksrazal

Hi all,

I have tried posting this problem a few times looking for a solution.
What I have works but is slow.

I have to read unordered data from a text file. Data A has a
one-to-many relationship with Data B, and Data B has a one-to-many
relationship with Data C. Once I find Data A, Data B and Data C could
be anywhere in the file, so I do a RandomAccessFile.seek(0). This
needs to be repeated hundreds of times for lots of Data B and Data C
relationships.

I'd like to put the file in memory. However, I'm using Scanner and
regex to get the data, and I don't think I can run Scanner on a
ByteBuffer per line.

So my latest idea is to read the file line by line and store it in a
List, so a 10,000-line file would produce a List of size 10,000. That
way I could run Scanner on the Strings in the List and iterate
repeatedly. The data in the List would be read only.

Any advice appreciated.
iksrazal
 
jan V

> I'd like to put the file in memory. However, I'm using Scanner and
> regex to get the data, and I don't think I can run Scanner on a
> ByteBuffer per line.
>
> So my latest idea is to read the file line by line and store it in a
> List, so a 10,000-line file would produce a List of size 10,000. That
> way I could run Scanner on the Strings in the List and iterate
> repeatedly. The data in the List would be read only.

Maybe iterating over a List would itself greatly slow things down... not
because of the pure iteration overhead, but because you would force the
regex engine to start up for every line. Why don't you:

a) store your entire text file in one (1) String
b) scan it once to locate all the line start offsets, storing these in a
separate data structure
c) use the regex engine to find what you need
d) then map the match offset to a line number

At the cost of the somewhat expensive b) and d) steps, you give step c) the
simplest possible data structure in which to search.
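
In sketch form it might look something like this (untested; it reads the
whole file via java.nio.file for brevity, and the pattern is only
illustrative, loosely based on your BSS sample data):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetScan {
    public static void main(String[] args) throws IOException {
        // a) the entire file as one String
        String text = new String(Files.readAllBytes(Paths.get("data.txt")));

        // b) one scan to record the start offset of every line
        int[] starts = lineStarts(text);

        // c) let the regex engine search the single big String
        Pattern p = Pattern.compile("BSS\\d+");   // illustrative pattern
        Matcher m = p.matcher(text);
        while (m.find()) {
            // d) map the match offset back to a line number
            System.out.println("'" + m.group() + "' on line "
                    + lineOf(starts, m.start()));
        }
    }

    // start offsets of all lines, in ascending order
    static int[] lineStarts(String text) {
        int n = 1;
        for (int i = 0; i < text.length(); i++)
            if (text.charAt(i) == '\n') n++;
        int[] starts = new int[n];
        for (int i = 0, j = 1; i < text.length() && j < n; i++)
            if (text.charAt(i) == '\n') starts[j++] = i + 1;
        return starts;
    }

    // binary search for the line containing 'offset', 1-based
    static int lineOf(int[] starts, int offset) {
        int i = Arrays.binarySearch(starts, offset);
        return (i >= 0 ? i : -i - 2) + 1;
    }
}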

[BTW, I'm assuming you've already profiled your app and rejected the current
algorithm as a result of your findings]
 
Chris Uppal

> So my latest idea is to read the file line by line and store it in a
> List, so a 10,000-line file would produce a List of size 10,000. That
> way I could run Scanner on the Strings in the List and iterate
> repeatedly. The data in the List would be read only.
>
> Any advice appreciated.

But what exactly is your problem? How to represent that data in memory?
How to get it into memory? How to search it once it is in memory?
Something else?

-- chris
 
iksrazal

Chris Uppal wrote:
> But what exactly is your problem? How to represent that data in memory?
> How to get it into memory? How to search it once it is in memory?
> Something else?
>
> -- chris

My problem is simply speed. Doing lots of RandomAccessFile.seek(0) is
too slow. I'm using JAXB to collect the data - but that is not the
bottleneck. So searching in memory is what I'd like to try. I currently
use a regex per line via Scanner, and I'd like to accomplish the regex
in a different way, i.e. not using seek(0).

iksrazal
 
iksrazal

That all seems like a good idea. I'm going to try simply putting the
data into a List first, and then profile the speed difference versus
RAF.seek(0). If speed is still an issue, I'll try implementing steps
a, b, c, d that you outlined and give it a shot.

Thanks!
iksrazal
 
John C. Bollinger

> Hi all,
>
> I have tried posting this problem a few times looking for a solution.
> What I have works but is slow.
>
> I have to read unordered data from a text file. Data A has a
> one-to-many relationship with Data B, and Data B has a one-to-many
> relationship with Data C. Once I find Data A, Data B and Data C could
> be anywhere in the file, so I do a RandomAccessFile.seek(0). This
> needs to be repeated hundreds of times for lots of Data B and Data C
> relationships.
>
> I'd like to put the file in memory. However, I'm using Scanner and
> regex to get the data, and I don't think I can run Scanner on a
> ByteBuffer per line.
>
> So my latest idea is to read the file line by line and store it in a
> List, so a 10,000-line file would produce a List of size 10,000. That
> way I could run Scanner on the Strings in the List and iterate
> repeatedly. The data in the List would be read only.

Your latest idea is your original idea in different clothes. Yes,
getting all the data into memory has a chance of speeding things up
some, but if you're talking about a file of 10,000 lines of (say) 256
characters or less, then the total size of the file is less than about
2.5 MB, and your OS' filesystem cache probably holds the whole thing in
OS memory anyway. The margin to be gained by moving it into user memory
is probably not enough to go from "slow" to "fast". That would probably
still be true even if the file *wasn't* in the OS' cache.

You want a solution that requires fewer scans of the data -- preferably
just one. You didn't describe how references from A to B and from B to
C are encoded in the data, nor what distinguishes an A from a B from a
C, but chances are good that you would get a significant speedup from
building an index and then using that instead of repeatedly scanning the
data from the beginning. For instance, read the whole file, once,
constructing as you go structured representations of data A, B, C, D, Q,
and Z. Store these data in memory in a Map, using for the key whatever
substring is used for linking these things together. It may also be
convenient to build separate Lists (or Maps) of all the A's, B's, and
C's as you go, if indeed they are distinguishable. Once you've parsed
the file, use the Map to find and resolve all the references.
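
As a rough, untested sketch -- with the key extraction invented, since
you didn't show how your records are actually encoded:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexBuilder {
    public static void main(String[] args) throws IOException {
        // key -> every line carrying that key; here the key is assumed
        // to be the first token of each line, which is pure invention
        Map<String, List<String>> index = new HashMap<String, List<String>>();

        BufferedReader in = new BufferedReader(new FileReader("data.txt"));
        String line;
        while ((line = in.readLine()) != null) {     // one pass, no seek(0)
            String key = line.split("\\s+", 2)[0];
            List<String> bucket = index.get(key);
            if (bucket == null) {
                bucket = new ArrayList<String>();
                index.put(key, bucket);
            }
            bucket.add(line);
        }
        in.close();

        // resolving a reference is now a hash lookup, not a file scan
        System.out.println(index.get("2056482"));
    }
}

Resolving each A-to-B or B-to-C reference then costs O(1) on average
instead of a full rescan of the file.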

If you haven't used Maps before, don't be scared of HashMap. Many
novices seem to gravitate to TreeMap, perhaps because they understand
trees better than they understand hashing, but HashMap is frequently the
better choice.

I hope this helps.
 
iksrazal

> Your latest idea is your original idea in different clothes. Yes,
> getting all the data into memory has a chance of speeding things up
> some, but if you're talking about a file of 10,000 lines of (say) 256
> characters or less, then the total size of the file is less than about
> 2.5 MB, and your OS' filesystem cache probably holds the whole thing in
> OS memory anyway. The margin to be gained by moving it into user memory
> is probably not enough to go from "slow" to "fast". That would probably
> still be true even if the file *wasn't* in the OS' cache.
>
> You want a solution that requires fewer scans of the data -- preferably
> just one. You didn't describe how references from A to B and from B to
> C are encoded in the data, nor what distinguishes an A from a B from a
> C, but chances are good that you would get a significant speedup from
> building an index and then using that instead of repeatedly scanning
> the data from the beginning.
<snip>

I know the whole collections framework a little bit ;-).

Actually, just changing RandomAccessFile.readLine and
RandomAccessFile.seek to putting all 10,000 lines into a List<String>
and using an Iterator boosted performance 60 times: it went from 14
minutes to 14 seconds, which for practical purposes solves my problem.
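
The change amounts to roughly this (a simplified sketch, not my exact
code; the pattern is just taken from my sample data):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class InMemoryLines {
    public static void main(String[] args) throws IOException {
        // read the file once into a read-only List of lines
        List<String> lines = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader("data.txt"));
        for (String s; (s = in.readLine()) != null; ) {
            lines.add(s);
        }
        in.close();

        // each "rescan" is now an in-memory iteration instead of seek(0)
        for (String s : lines) {
            Scanner sc = new Scanner(s);
            if (sc.findInLine("BSS\\d+") != null) {
                System.out.println(s);
            }
        }
    }
}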

My file:

2056482 BSS0

I have about 20 of these.

I'm using the List interface with an ArrayList, which seems to be fine,
but another List implementation may be better for large indices of
read-only data.

Now what you took the time to describe - thanks - is similar to what
another poster described, i.e. 'building an index'. It is the better
approach. I'm not sure just yet if going, say, from 14 seconds to 4 is
necessary in this app.

Incidentally, the data is stored in a Hibernate-persisted DB using the
composite pattern for the tree, but JAXB is used to build the tree on
another system as a first step.

I hope this helps others - I'm convinced that using a List to store all
the file's lines can achieve much better performance than RAF.seek.
Using an index as described would be a good experiment.

Thanks!
iksrazal
 
jan V

> Actually, just changing RandomAccessFile.readLine and
> RandomAccessFile.seek to putting all 10,000 lines into a List<String>
> and using an Iterator boosted performance 60 times: it went from 14
> minutes to 14 seconds, which for practical purposes solves my problem.

So, additionally, there might be a lot more to gain from further
optimizing using some of the other suggestions in this thread.

> I'm using the List interface with an ArrayList, which seems to be fine,
> but another List implementation may be better for large indices of
> read-only data.

IMO, none of the standard List implementations would improve on
ArrayList, though a custom List implementation just might.

> Now what you took the time to describe - thanks - is similar to what
> another poster described, i.e. 'building an index'. It is the better
> approach. I'm not sure just yet if going, say, from 14 seconds to 4 is
> necessary in this app.

Never optimize beyond what the client is prepared to pay for. ;-)

> Using an index as described would be a good experiment.

I've written a TextBlock class which I subclass in various applications;
it allows a linear character offset to be mapped to a line number. Very
handy... it makes handling line numbers in all kinds of applications a
doddle. TextBlock takes a block of text and does a one-off scan to
create a private data structure that makes offset-to-line-number lookups
very, very fast. Additionally, the constructor takes a performance hint
so that you can tune this lookup: you can have O(1) lookups if you're
prepared to pay in RAM.
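
TextBlock itself is private code, but the O(1) variant of the lookup
amounts to roughly this sketch: one table entry per character, so each
lookup is a single array access:

public class LineTable {
    private final int[] lineOfOffset;   // one entry per character

    public LineTable(String text) {
        lineOfOffset = new int[text.length()];
        int line = 1;                   // 1-based line numbers
        for (int i = 0; i < text.length(); i++) {
            lineOfOffset[i] = line;
            if (text.charAt(i) == '\n') line++;
        }
    }

    // O(1) lookup, at the cost of 4 bytes of RAM per character
    public int lineAt(int offset) {
        return lineOfOffset[offset];
    }

    public static void main(String[] args) {
        LineTable t = new LineTable("first\nsecond\nthird\n");
        System.out.println(t.lineAt(7));   // offset 7 is in "second" -> 2
    }
}

The cheaper setting would be an int[] of line-start offsets plus a
binary search: O(log n) per lookup, but only one int per line of RAM.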
 
Joan

> I'd like to put the file in memory. However, I'm using Scanner and
> regex to get the data, and I don't think I can run Scanner on a
> ByteBuffer per line.

I have an application that reads in unordered text files of about 8,250
lines and puts them into a String[][] array. It uses readLine and split.
Starting and running the prog takes about 1-2 sec. Then I scan the whole
list for a match on user input, casting the Strings to CharSequence and
using a Matcher. It is very fast.
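
Roughly like this sketch (untested; the file layout and the default
pattern are assumed for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TableMatch {
    public static void main(String[] args) throws IOException {
        // read every line once and split it into fields
        List<String[]> rows = new ArrayList<String[]>();
        BufferedReader in = new BufferedReader(new FileReader("data.txt"));
        for (String line; (line = in.readLine()) != null; ) {
            rows.add(line.split("\\s+"));
        }
        in.close();
        String[][] table = rows.toArray(new String[rows.size()][]);

        // compile the user's pattern once; String already implements
        // CharSequence, so the fields can be matched directly
        Pattern p = Pattern.compile(args.length > 0 ? args[0] : "BSS0");
        for (String[] row : table) {
            for (String field : row) {
                if (p.matcher(field).find()) {
                    System.out.println(Arrays.toString(row));
                    break;                      // one report per row
                }
            }
        }
    }
}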
 
