Kevin
Hi Guys,
I am wondering if anyone has suggestions on how to code this data
structure and meet these requirements:
The story:
1) There is a large amount of log data, stored line by line (text).
Each line has a line ID (integer). Basically, we can think of each line
as the log for that moment in time; say, each second a line is added to
the log. The total is more than 10 million lines.
2) There are a large number of possible events (say 200K, each
identified by an event ID). When an event occurs, it generates a value
in the log data. Since events can occur concurrently, one line of data
may have many values in it.
The abstract data structure:
It is required that one event ID (Integer) map to the many line IDs
(Integer) in which that event occurs.
If the total size were small, we could use a naive approach: save all
the IDs into a hash, with the event ID as key and an ArrayList (or
Hashtable, since we do not need the line IDs to be in order) as the
value; each item in the ArrayList is a line ID (Integer).
There are some methods that can save memory, such as using a custom
int array instead of boxed Integer objects (which carry per-object
overhead), etc. But at the sizes mentioned above, these tricks alone
are just no help.
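For reference, the naive approach looks roughly like this. It is a minimal sketch in modern Java (Java 1.4 has no generics or computeIfAbsent, so there you would need explicit casts and a containsKey check); the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Naive in-memory index: event ID -> list of line IDs where it occurred.
class EventIndex {
    private final Map<Integer, List<Integer>> index = new HashMap<>();

    // Record that the event with eventId appeared on line lineId.
    void add(int eventId, int lineId) {
        index.computeIfAbsent(eventId, k -> new ArrayList<>()).add(lineId);
    }

    // Operation 1: all line IDs for one event (empty list if unseen).
    List<Integer> linesFor(int eventId) {
        return index.getOrDefault(eventId, new ArrayList<>());
    }
}
```

At 10 million boxed Integers plus list and map overhead, this is exactly the structure that blows past the heap, which is why it only works at small sizes.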
The required operations on the data:
The application needs to build such a data structure which supports
these two operations:
1) Given an event ID, find all the line IDs of that event.
2) Given a group of event IDs, find all the line IDs of the group
(basically a "union" of the set of line IDs of each event ID).
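Operation 2 reduces to collecting each event's list into one set, which also collapses line IDs shared by several events. A minimal sketch against the naive map-of-lists layout (the helper name is mine):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Union query over a naive event->lines map.
class UnionQuery {
    static Set<Integer> union(Map<Integer, List<Integer>> index, int[] eventIds) {
        Set<Integer> result = new HashSet<>();
        for (int id : eventIds) {
            List<Integer> lines = index.get(id);
            if (lines != null) {
                result.addAll(lines); // duplicates across events collapse in the set
            }
        }
        return result;
    }
}
```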
Any idea how to build such a big structure? I don't think there is any
way to fit it all in memory (the max heap size for a 32-bit JVM on
win32 is about 1.3G, I think). If we can swap some of it out to a file
and read it in only when needed, how should we construct the structure
so we can do the job efficiently? Or would it be better (faster) to put
all the IDs into a database table and use SQL to get them?
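One possible file-backed layout, assuming the index is built once and then only queried: store every event's line IDs contiguously in a single data file and keep only a small in-memory table mapping event ID to (offset, count); ~200K table entries fit in RAM easily, and each query is one seek plus one sequential read. This is a sketch of that idea, not a standard API; DiskIndex and its methods are invented names:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Sketch of a file-backed inverted index: postings on disk, offsets in RAM.
class DiskIndex {
    private final RandomAccessFile data;
    private final Map<Integer, long[]> table = new HashMap<>(); // eventId -> {offset, count}

    DiskIndex(File f) throws IOException {
        data = new RandomAccessFile(f, "rw");
    }

    // Append one event's line IDs as big-endian 4-byte ints; in practice
    // you would group the log by event first so each event is written once.
    void writePostings(int eventId, int[] lineIds) throws IOException {
        long off = data.length();
        data.seek(off);
        byte[] buf = new byte[lineIds.length * 4];
        for (int i = 0; i < lineIds.length; i++) {
            int v = lineIds[i];
            buf[4 * i]     = (byte) (v >>> 24);
            buf[4 * i + 1] = (byte) (v >>> 16);
            buf[4 * i + 2] = (byte) (v >>> 8);
            buf[4 * i + 3] = (byte) v;
        }
        data.write(buf); // one bulk write per event, not one per int
        table.put(eventId, new long[]{off, lineIds.length});
    }

    // Operation 1: read one event's line IDs back from disk on demand.
    int[] read(int eventId) throws IOException {
        long[] entry = table.get(eventId);
        if (entry == null) return new int[0];
        int n = (int) entry[1];
        byte[] buf = new byte[n * 4];
        data.seek(entry[0]);
        data.readFully(buf);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = ((buf[4 * i] & 0xff) << 24) | ((buf[4 * i + 1] & 0xff) << 16)
                   | ((buf[4 * i + 2] & 0xff) << 8) | (buf[4 * i + 3] & 0xff);
        }
        return out;
    }
}
```

The union query then just calls read() for each event ID in the group and merges the arrays into a set, so only the lists actually queried ever touch memory.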
Thanks a lot and you have a great day.
By the way, is there any faster way to write/read a large number of
ints to and from a file? Some days ago I did a test using
ObjectOutputStream's writeInt(); if I remember right, it took about 3
seconds to write 10^7 ints to a file, which resulted in a file of
about 38M.
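Note that 10^7 four-byte ints is 40,000,000 bytes, i.e. about 38 MB, so the file size is expected; the question is only speed. One common approach is a plain DataOutputStream wrapped in a BufferedOutputStream, which writes raw big-endian ints without object-serialization framing and batches the small writes (java.nio mapped buffers can be faster still, but this already works on Java 1.4). A minimal sketch, with class and method names of my own invention:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Bulk int file I/O via buffered data streams: length prefix, then raw ints.
class IntFileIO {
    static void writeInts(File f, int[] values) throws IOException {
        DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f), 1 << 16));
        try {
            out.writeInt(values.length);       // count prefix for reading back
            for (int v : values) out.writeInt(v);
        } finally {
            out.close();
        }
    }

    static int[] readInts(File f) throws IOException {
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f), 1 << 16));
        try {
            int n = in.readInt();
            int[] out = new int[n];
            for (int i = 0; i < n; i++) out[i] = in.readInt();
            return out;
        } finally {
            in.close();
        }
    }
}
```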