advice on loading and searching large map in memory


eunever32

Hi

We have a requirement to query across two disparate systems. Both
systems are read-only, so there is no need for updates and, once
loaded, no need to check for updates. I would plan to reload the data
afresh each day. Records on both systems map one-to-one and each has
7 million records.

The first system is legacy and I am reluctant to redevelop it (C code).
The second is standard Java/Tomcat/SQL.

The non-relational query can return up to 1000 records.

This could therefore result in 1000 queries to the relational system
(just one table) before returning to the user.

To avoid 1000 relational queries I was planning to "cache" the entire
relational table in memory. I was planning to have a web service which
would load the entire relational table into memory. The web service,
running in a separate tomcat could then be queried 1000 times or maybe
get a single request with 1000 values and return all results in one
go. Having a separate tomcat process would help to isolate any memory
issues eg JVM heap size.

Can people recommend an approach?

Because the entire set of records would always be in memory, does that
make using something like EhCache pointless?

Issues I would anticipate:
- time to load 7m records each morning
- memory issues
- best Java collection to hold the map (HashMap?); the map would be
(int, int) -> Object
- any suggestions regarding a specialized cache utility, eg EhCache

Thanks in advance.
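On the (int, int) -> Object question, one minimal sketch (editorial illustration only; the class and method names are invented for the example) packs the two int parts into a single long key, which avoids writing a composite key class and keeps one boxed object per entry:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a (int, int) -> Object cache keyed by a packed long.
public class RecordCache {
    // Pack the two key parts into one long; the low part is masked so
    // negative values don't clobber the high 32 bits.
    static long pack(int hi, int lo) {
        return ((long) hi << 32) | (lo & 0xFFFFFFFFL);
    }

    // Pre-size in production (e.g. new HashMap<>(10_000_000)) to avoid rehashing
    private final Map<Long, Object> map = new HashMap<>();

    void put(int hi, int lo, Object record) { map.put(pack(hi, lo), record); }
    Object get(int hi, int lo) { return map.get(pack(hi, lo)); }

    public static void main(String[] args) {
        RecordCache cache = new RecordCache();
        cache.put(42, 7, "row-42-7");
        System.out.println(cache.get(42, 7)); // prints row-42-7
    }
}
```

A plain HashMap with a pre-sized capacity is a reasonable default here; specialized primitive-keyed maps would shave the boxing overhead further.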
 

Daniele Futtorovic

get a single request with 1000 values and return all results in one
go.

If you did that, which to me at first sight seems like a good idea,
would you really have to do the "load in memory" stuff? Couldn't you
just let the underlying system do what it supposedly knows best to do?
 

Martin Gregorie

How big are the items in each collection?
How does the search process recognise an item?
If there are specific key terms, how big are they and how many terms are
there per item?

The answers to these questions can have a large effect on selecting the
optimum approach.
 

Tom Anderson

The non-relational query can return up to 1000 records.

This could therefore result in 1000 queries to the relational system
(just one table) before returning to the user.

Unless you batch them. Can you not do something like:

Collection<LegacyResult> legacyResults = queryLegacySystem();
Iterator<LegacyResult> legacyResultsIterator = legacyResults.iterator();
Collection<CombinedResult> combinedResults = new ArrayList<CombinedResult>();
Connection conn = openDatabaseConnection();
// NB i'm not closing anything after use, but you would have to
PreparedStatement newSystemQuery = conn.prepareStatement("select * from sometable where item_id in (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)");
while (legacyResultsIterator.hasNext()) {
    Map<String, LegacyResult> batch = new HashMap<String, LegacyResult>(10);
    for (int i = 1; i <= 10; ++i) {
        // NB i'm not dealing with the end of the iterator, but you would have to
        LegacyResult legacyResult = legacyResultsIterator.next();
        String id = legacyResult.getItemID();
        batch.put(id, legacyResult);
        newSystemQuery.setString(i, id);
    }
    ResultSet rs = newSystemQuery.executeQuery();
    while (rs.next()) {
        NewSystemResult newResult = makeNewResultFromResultRow(rs);
        LegacyResult legacyResult = batch.get(newResult.getID());
        CombinedResult combinedResult = new CombinedResult(legacyResult, newResult);
        combinedResults.add(combinedResult);
    }
}

Where the batch size might be considerably more than 10?

Because the entire set of records would always be in memory does that
make using something like ehcache pointless?

I think you could use EHCache or similar *instead* of writing your own
cache server.

How big are your objects? If they're a kilobyte each (largeish, for an
object), then seven million will take up seven gigs of memory; if they're
100 bytes (pretty tiny), then they'll take up 700 MB. That's before any
overhead. The former will require you to have a machine with a lot of
memory if you want to avoid thrashing; even the latter means taking a good
chunk of memory just for the cache.

tom
 

Lew

How big are the items in each collection?
How does the search process recognise an item?
If there are specific key terms, how big are they and how many terms are
there per item?

The answers to these questions can have a large effect on selecting the
optimum approach.

No one is asking the all-important questions!

What is the current performance? Is the set of queries a bottleneck? How do
they know? Under what load conditions? Couldn't they just submit a single
query with the 1000 values instead of 1000 individual queries? (Yes, they
could.)

Asking to optimize something with a technique that likely will have severe
performance problems itself, as the OP did, is a likely sign of sketchy
analysis, at best. Let's at least get the evidence that there's a problem to
solve here that so far has been glaringly absent.

So, OP, tell us - how did you determine that there's a bottleneck in the area
you suspect, and how did you reject all the other approaches to its resolution
but the one you ask about?
 

Arved Sandstrom


Just to be clear on this, there are N records in each system, one legacy
(non-relational) and one a RDBMS, and N ~ 7 million. There is also some
kind of key for a 1:1 mapping between records in the two systems, but
the data in each record is different. Correct?

Let me ask a few questions:

1. You've obviously got known criteria for each legacy query (that, as
you say, might return up to 1000 records; let's call this number P). I
take it that it's been asked, and ruled out, whether there are any
criteria at this point that would also identify the correct P matching
records in the RDBMS table. So you need to get the "keys" first. Correct?

2. Have you considered an IN operator in a WHERE condition? We're
looking to avoid P separate queries to the RDBMS here, because that's a
serious performance hit, and make it one SELECT instead. If there are
practical limitations on how many values you can jam into the inlist,
you can chunk it up rather coarsely, and still have very many fewer
queries to get your data than you'd have had otherwise...as in less than
10 for P<=1000. Although with most databases I suspect you don't have to
chunk at all.
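The single-SELECT idea above can be sketched in JDBC roughly as below; the table and column names (sometable, item_id) are placeholders carried over from earlier in the thread, and connection/statement cleanup is omitted for brevity:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;

// Build one "IN (?, ?, ...)" query with a placeholder per key, instead
// of issuing P single-row queries.
public class InListQuery {
    // Generate the SQL text with n placeholders.
    static String sql(int n) {
        StringBuilder sb = new StringBuilder(
                "select * from sometable where item_id in (");
        for (int i = 0; i < n; i++) sb.append(i == 0 ? "?" : ", ?");
        return sb.append(")").toString();
    }

    // Bind the ids and run the single SELECT; the caller closes everything.
    static ResultSet query(Connection conn, List<Integer> ids) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(sql(ids.size()));
        for (int i = 0; i < ids.size(); i++) ps.setInt(i + 1, ids.get(i));
        return ps.executeQuery();
    }
}
```

One trade-off: a different placeholder count produces a different statement text, so the driver's statement cache gets one entry per distinct size, which is what motivates the chunking discussion.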

Any serious objections to that experiment?

AHS
 

Lew



2. Have you considered an IN operator in a WHERE condition?

EXISTS is another alternative.

We're looking to avoid P separate queries to the RDBMS here, because
that's a serious performance hit, and make it one SELECT instead.

Again, no one has measured performance so we don't even know if it's a
performance hit at all, much less a "serious" one.

If there are practical limitations on how many values you can jam into
the inlist, you can chunk it up rather coarsely, and still have very
many fewer queries to get your data than you'd have had otherwise...as
in less than 10 for P<=1000. Although with most databases I suspect you
don't have to chunk at all.

Any serious objections to that experiment?

As an experiment, no. As a production decision, yes. All this complication
is for a so-called "problem" that has not yet been demonstrated to exist.
 

Arved Sandstrom



Any serious objections to that experiment?

As an experiment, no. As a production decision, yes. All this
complication is for a so-called "problem" that has not yet been
demonstrated to exist.

I agree with you in principle. However, if I can write SQL that
generates one SELECT that returns a result set with 1000 rows, as
opposed to having, say, a loop that generates 1000 SELECTs each
returning one row, I don't even consider that a performance
optimization, I call that writing good SQL. I'll do it without thinking
twice.

In practise I've seen N+1 SELECT situations, whether caused by poor SQL
and/or by deliberate loop constructs with fine-grained SQL, cause so
many performance problems that I simply make sure that kind of code
doesn't exist, without waiting for complaints from the field. And
besides, this is simple to do. To put it another way, this is simply
making a decent initial implementation choice as opposed to making a bad
one: performance optimization shouldn't be about replacing bad choices
with good ones, but rather about making good ones even better. IMHO.

Like I said, in principle I agree with you. I wouldn't launch into a
performance optimization that involved serious effort unless it was
called for.

AHS
 

Lew

Arved said:
I agree with you in principle.

I agree with you in practice.
However, if I can write SQL that generates one
SELECT that returns a result set with 1000 rows, as opposed to having, say, a
loop that generates 1000 SELECTs each returning one row, I don't even consider
that a performance optimization, I call that writing good SQL. I'll do it
without thinking twice.

Indeed. Somewhere around chunking IN clauses I believe we cross the line into
excessive effort prior to performance measurement. Similarly, architecting
for multiple app instances in order to hold an entire database in memory just
to avoid an unproven bottleneck without consideration of normal approaches
such as those you suggested bodes poorly for the project.

In practise I've seen N+1 SELECT situations, whether caused by poor SQL and/or
by deliberate loop constructs with fine-grained SQL, cause so many performance
problems that I simply make sure that kind of code doesn't exist, without
waiting for complaints from the field. And besides, this is simple to do. To
put it another way, this is simply making a decent initial implementation
choice as opposed to making a bad one: performance optimization shouldn't be
about replacing bad choices with good ones, but rather about making good ones
even better. IMHO.

You can elevate that from opinion to demonstrable fact.

The problem was that the OP's formulation doesn't fit the "decent initial
implementation choice" description. They should rather follow what you proposed.

Like I said, in principle I agree with you. I wouldn't launch into a
performance optimization that involved serious effort unless it was called for.

As I put forward elsethread, and you endorse here IIUC, up-front good
algorithmic and structural choices aren't so much "optimization" as "good
sense". The dark side is that bad architectural choices, such as abandonment
of modern DBMSes' optimizations at the expense of standard good practices and
algorithms whilst increasing maintenance and deployment effort just to handle
a potentially non-existent problem without actual evidence, must be rejected
early on.
 

Arved Sandstrom

I agree with you in practice.


Indeed. Somewhere around chunking IN clauses I believe we cross the line
into excessive effort prior to performance measurement. Similarly,
architecting for multiple app instances in order to hold an entire
database in memory just to avoid an unproven bottleneck without
consideration of normal approaches such as those you suggested bodes
poorly for the project.
[ SNIP ]

Yeah, I don't know why I threw in that bit about chunking. Definitely a
premature performance optimization. I'd just try the full IN because
it's a sensible thing to do, to try to get one SELECT, and see how it goes.

As far as the suggested in-memory proposal goes, that sent a frisson of
fear up my back. I use caching all the time - EclipseLink Level 1 and
Level 2 caching, named query caching for small "reference" tables, that
kind of thing - but I've never seen this kind of proposed "solution" as
a necessity in any situation. I think there are better ways to "avoid
1000 relational queries". :)

AHS
 

Lew

Arved said:
Yeah, I don't know why I threw in that bit about chunking. Definitely a
premature performance optimization. I'd just try the full IN because it's a
sensible thing to do, to try to get one SELECT, and see how it goes.

Long "IN" clauses risk exceeding statement-length limits and often have
performance issues at the engine level. (Yeah, I know.) It's a minor point,
because other means exist to combine things into a single query (EXISTS, table
replication).
 

Arne Vajhøj

I agree with you in principle. However, if I can write SQL that
generates one SELECT that returns a result set with 1000 rows, as
opposed to having, say, a loop that generates 1000 SELECTs each
returning one row, I don't even consider that a performance
optimization, I call that writing good SQL. I'll do it without thinking
twice.

I think that is an important point.

Some people think that "avoid premature optimization" means
"avoid thinking about performance until you know it is
a problem".

That is not so. It means "avoid writing any ugly or hackish code
until you know the nice code has a problem".

Arne
 

eunever32

How big are the items in each collection?
How does the search process recognise an item?
If there are specific key terms, how big are they and how many terms are
there per item?

The answers to these questions can have a large effect on selecting the
optimum approach.

The proposed Java webservice would have 7m records. Each record would
have a key based on an int (4 bytes) and a short (2 bytes) = 6 bytes,
and a value of 4 * 50 bytes = 200 bytes; total: 206 bytes per record.

7m * 206 = 1.5GB in round figures.

The existing relational table is keyed on the integer + short.
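Those figures can be sanity-checked with a little arithmetic. The 80-bytes-per-entry overhead used below is an assumption (HashMap entry, boxed key, and object headers), not a measurement, but it shows why the raw 1.5GB estimate is optimistic:

```java
// Rough heap estimate for the proposed in-memory cache.
public class HeapEstimate {
    // records * (payload per record + assumed per-entry overhead)
    static long estimate(long records, long payloadBytes, long overheadBytes) {
        return records * (payloadBytes + overheadBytes);
    }

    public static void main(String[] args) {
        long raw = estimate(7_000_000L, 206, 0);            // payload only
        long withOverhead = estimate(7_000_000L, 206, 80);  // ~80 B/entry assumed
        System.out.printf("raw: %.2f GB, with overhead: %.2f GB%n",
                raw / 1e9, withOverhead / 1e9);
        // raw ~1.44 GB, with overhead ~2.00 GB
    }
}
```

The with-overhead figure lands right in the 1.75-2GB range Martin suggests below as a safer planning number.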
 

eunever32


Hi Tom

That's useful. You're suggesting JDBC batch? eg
public static final int SINGLE_BATCH = 1;
public static final int SMALL_BATCH = 4;
public static final int MEDIUM_BATCH = 11;
public static final int LARGE_BATCH = 51;

Does it make more sense to repeatedly query small repeatable numbers
of parameters rather than an arbitrary number of parameters because of
the saving on not having to re-compile the prepared statement? I would
need to check the performance of this kind of query.

In relation to the cache: the size of the cache would be 1.5GB

Cheers
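The fixed batch sizes quoted above can be combined greedily so that at most four distinct prepared statements are ever compiled, which answers the recompilation concern. This is a sketch of that splitting logic only (kept separate from any JDBC code; the class name is invented):

```java
import java.util.ArrayList;
import java.util.List;

// Split a key list into batches whose sizes come from a fixed set, so
// each size maps to one cached PreparedStatement.
public class BatchSizer {
    // LARGE, MEDIUM, SMALL, SINGLE - descending so the largest fitting
    // size is chosen first.
    static final int[] SIZES = {51, 11, 4, 1};

    static List<List<Integer>> batches(List<Integer> ids) {
        List<List<Integer>> out = new ArrayList<>();
        int i = 0;
        while (i < ids.size()) {
            int remaining = ids.size() - i;
            int size = 1;
            for (int s : SIZES) {
                if (s <= remaining) { size = s; break; }
            }
            out.add(new ArrayList<>(ids.subList(i, i + size)));
            i += size;
        }
        return out;
    }
}
```

For 1000 keys this yields 19 batches of 51, two of 11, two of 4 and one of 1 - 24 round trips instead of 1000. An alternative is a single batch size with the last batch padded by repeating one id, since duplicate values in an IN list are harmless.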
 

eunever32

I think that is an important point.

Some people think that "avoid premature optimization" means
"avoid thinking about performance until you know it is
a problem".

That is not so. It means "avoid writing any ugly or hackish code
until you know the nice code has a problem".

Arne

Right Guys (Arne + Lew) I will test the performance of where in
(?, ?, ? .. (x1000)) and let you know the performance.

Currently the "legacy system" returns in seconds (say 10). That's what
we're aiming for: that the user response doesn't degrade.

Cheers.
 

Martin Gregorie

The proposed Java webservice would have 7m records. Each record would
have a key based on an int (4 bytes) and a short (2 bytes) = 6 bytes,
and a value of 4 * 50 bytes = 200 bytes; total: 206 bytes per record.
7m * 206 = 1.5GB in round figures.

1.75 - 2 GB would be a safer assumption by the time you allow space for
overheads such as indexes, references and slack space in each record.

The existing relational table is keyed on the integer + short.

I assume this means that:

- every key occurs twice: once in the the legacy system's data store and
once in the Java system's data store

- no key value can occur on just one of the systems

- each system stores its data in a separate database table

Please correct these assumptions if any are wrong. And now for some extra
questions:

- are
(a) both rows with the same key retrieved by a single query
or
(b) some queries retrieve a row from just one of the databases?

If b can happen, what proportion of the queries are of type a and
type b?

- how difficult would it be to merge both tables into one of the
databases and scrap the other?
 

Lew

Arne said:
I think that is an important point.

Some people think that "avoid premature optimization" means
"avoid thinking about performance until you know it is
a problem".

That is not so. It means "avoid writing any ugly or hackish code
until you know the nice code has a problem".

Right Guys (Arne + Lew) I will test the performance of where in
(?, ?, ? .. (x1000)) and let you know the performance.

Currently the "legacy system" returns in seconds (say 10). That's what
we're aiming for. That the user response doesn't degrade.

You say, "Right," then propose to go in the opposite direction. Interesting ...
 

Abu Yahya

Long "IN" clauses risk exceeding statement-length limits and often have
performance issues at the engine level. (Yeah, I know.) It's a minor
point, because other means exist to combine things into a single query
(EXISTS, table replication).

How would you use EXISTS to replace IN?
 

Lew

Abu said:
How would you use EXISTS to replace IN?

SELECT t.keyc, t.colX, t.colY, ..., t.colZ FROM some_table t
WHERE EXISTS
(SELECT v.keyc
FROM (VALUES (val0), (val1), ..., (val999)) AS v (keyc)
WHERE v.keyc = t.keyc)
;

Actually, that VALUES trick works for IN, too.

SELECT t.keyc, t.colX, t.colY, ..., t.colZ FROM some_table t
WHERE t.keyc IN
(SELECT v.keyc
FROM (VALUES (val0), (val1), ..., (val999)) AS v (keyc)
;

If you have WITH syntax you can alias the VALUES that way.

Then there're JOINs.

SELECT t.keyc, t.colX, t.colY, ..., t.colZ FROM some_table t
JOIN (VALUES (val0), (val1), ..., (val999)) AS v (keyc)
USING (keyc)
;

You can create a cursor or temporary table during a transaction with the rows
from the VALUES, too.

'Ware injection!
 

Lew

Lew said:
Actually, that VALUES trick works for IN, too.

SELECT t.keyc, t.colX, t.colY, ..., t.colZ FROM some_table t
  WHERE t.keyc IN
  (SELECT v.keyc
   FROM (VALUES (val0), (val1), ..., (val999)) AS v (keyc)
  ;

I'm too long away from SQL.

SELECT t.keyc, t.colX, t.colY, ..., t.colZ FROM some_table t
WHERE t.keyc IN
(VALUES (val0), (val1), ..., (val999))
;

I haven't tried these so tweak commas and parens as needed.
 
