advice on loading and searching large map in memory

Discussion in 'Java' started by eunever32@yahoo.co.uk, Feb 20, 2011.

  1. Guest

    Hi

    We have a requirement to query across two disparate systems. Both
    systems are read-only, so once the data is loaded there is no need to
    check for updates. I would plan to reload the data afresh each day.
    Records in the two systems map one-to-one and each system has 7 million
    records.

    The first system is legacy and I am reluctant to redevelop it (C code).
    The second is standard Java/Tomcat/SQL.

    The non-relational query can return up to 1000 records.

    This could therefore result in 1000 queries to the relational system
    (just one table) before returning to the user.

    To avoid 1000 relational queries I was planning to "cache" the entire
    relational table in memory, behind a web service that would load the
    table at startup. The web service, running in a separate Tomcat, could
    then be queried 1000 times, or perhaps receive a single request with
    1000 values and return all the results in one go. Having a separate
    Tomcat process would help to isolate any memory issues, e.g. JVM heap
    size.

    Can people recommend an approach?

    Because the entire set of records would always be in memory, does that
    make using something like Ehcache pointless?

    Issues I would anticipate:
    time to load 7m records each morning
    memory issues
    best Java collection to hold the map (HashMap?); the map would be
    (int, int) -> Object (see the sketch below)
    any suggestions regarding a specialized cache utility, e.g. Ehcache
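
    Roughly the kind of thing I have in mind for the map, as a sketch only
    (made-up table and column names, no error handling or closing): pack
    the two key ints into one long to avoid a composite-key object per
    entry, and stream the table in with a JDBC fetch-size hint.

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.HashMap;
    import java.util.Map;

    public class TableCache {
        // ~7m entries; oversize the initial capacity to avoid rehashing
        private final Map<Long, String[]> cache =
                new HashMap<Long, String[]>(10000000);

        // Pack the two int key parts into one long so no composite
        // key object is allocated per entry.
        static long key(int a, int b) {
            return ((long) a << 32) | (b & 0xFFFFFFFFL);
        }

        // Illustrative table/column names; closing omitted for brevity.
        void load(Connection conn) throws SQLException {
            Statement st = conn.createStatement();
            st.setFetchSize(10000); // hint the driver to stream, not buffer, the rows
            ResultSet rs = st.executeQuery(
                    "select key_a, key_b, col1, col2, col3, col4 from sometable");
            while (rs.next()) {
                String[] value = { rs.getString(3), rs.getString(4),
                                   rs.getString(5), rs.getString(6) };
                cache.put(key(rs.getInt(1), rs.getInt(2)), value);
            }
        }

        String[] lookup(int a, int b) {
            return cache.get(key(a, b));
        }
    }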

    Thanks in advance.
     
    , Feb 20, 2011
    #1

  2. On 20/02/2011 02:43, allegedly wrote:
    > [ SNIP ]
    >
    > receive a single request with 1000 values and return all the results
    > in one go.


    If you did that, which to me at first sight seems like a good idea,
    would you really have to do the "load in memory" stuff? Couldn't you
    just let the underlying system do what it supposedly knows best to do?
     
    Daniele Futtorovic, Feb 20, 2011
    #2

  3. On Sat, 19 Feb 2011 17:43:36 -0800, wrote:

    > [ SNIP ]
    >
    > To avoid 1000 relational queries I was planning to "cache" the entire
    > relational table in memory.
    >
    > Can people recommend an approach?

    How big are the items in each collection?
    How does the search process recognise an item?
    If there are specific key terms, how big are they and how many terms are
    there per item?

    The answers to these questions can have a large effect on selecting the
    optimum approach.


    --
    martin@   | Martin Gregorie
    gregorie. | Essex, UK
    org       |
     
    Martin Gregorie, Feb 20, 2011
    #3
  4. Tom Anderson Guest

    On Sat, 19 Feb 2011, wrote:

    > We have a requirement to query across two disparate systems. Both
    > systems are read-only, so once the data is loaded there is no need to
    > check for updates. I would plan to reload the data afresh each day.
    > Records in the two systems map one-to-one and each system has 7
    > million records.
    >
    > The first system is legacy and I am reluctant to redevelop it (C code).
    > The second is standard Java/Tomcat/SQL.
    >
    > The non-relational query can return up to 1000 records.
    >
    > This could therefore result in 1000 queries to the relational system
    > (just one table) before returning to the user.


    Unless you batch them. Can you not do something like:

    Collection<LegacyResult> legacyResults = queryLegacySystem();
    Iterator<LegacyResult> legacyResultsIterator = legacyResults.iterator();
    Collection<CombinedResult> combinedResults = new ArrayList<CombinedResult>();
    Connection conn = openDatabaseConnection();
    // NB i'm not closing anything after use, but you would have to
    PreparedStatement newSystemQuery = conn.prepareStatement("select * from sometable where item_id in (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)");
    while (legacyResultsIterator.hasNext()) {
        Map<String, LegacyResult> batch = new HashMap<String, LegacyResult>(10);
        for (int i = 1; i <= 10; ++i) {
            // NB i'm not dealing with the end of the iterator, but you would have to
            LegacyResult legacyResult = legacyResultsIterator.next();
            String id = legacyResult.getItemID();
            batch.put(id, legacyResult);
            newSystemQuery.setString(i, id);
        }
        ResultSet rs = newSystemQuery.executeQuery();
        while (rs.next()) {
            NewSystemResult newResult = makeNewResultFromResultRow(rs);
            LegacyResult legacyResult = batch.get(newResult.getID());
            CombinedResult combinedResult = new CombinedResult(legacyResult, newResult);
            combinedResults.add(combinedResult);
        }
    }

    Where the batch size might be considerably more than 10?

    > To avoid 1000 relational queries I was planning to "cache" the entire
    > relational table in memory, behind a web service that would load the
    > table at startup. The web service, running in a separate Tomcat, could
    > then be queried 1000 times, or perhaps receive a single request with
    > 1000 values and return all the results in one go. Having a separate
    > Tomcat process would help to isolate any memory issues, e.g. JVM heap
    > size.
    >
    > Can people recommend an approach?
    >
    > Because the entire set of records would always be in memory, does that
    > make using something like Ehcache pointless?


    I think you could use EHCache or similar *instead* of writing your own
    cache server.

    How big are your objects? If they're a kilobyte each (largeish, for an
    object), then seven million will take up seven gigs of memory; if they're
    100 bytes (pretty tiny), then they'll take up 700 MB. That's before any
    overhead. The former will require you to have a machine with a lot of
    memory if you want to avoid thrashing; even the latter means taking a good
    chunk of memory just for the cache.

    tom

    --
    And the future is certain, give us time to work it out
     
    Tom Anderson, Feb 20, 2011
    #4
  5. Lew Guest

    On 02/20/2011 08:00 AM, Martin Gregorie wrote:
    > On Sat, 19 Feb 2011 17:43:36 -0800, wrote:
    >
    >> [ SNIP ]

    > How big are the items in each collection?
    > How does the search process recognise an item?
    > If there are specific key terms, how big are they and how many terms are
    > there per item?
    >
    > The answers to these questions can have a large effect on selecting the
    > optimum approach.


    No one is asking the all-important questions!

    What is the current performance? Is the set of queries a bottleneck? How do
    they know? Under what load conditions? Couldn't they just submit a single
    query with the 1000 values instead of 1000 individual queries? (Yes, they
    could.)

    Asking to optimize something with a technique that will likely have severe
    performance problems itself, as the OP did, is a sign of sketchy analysis
    at best. Let's at least get some evidence that there's a problem to solve
    here; so far it has been glaringly absent.

    So, OP, tell us - how did you determine that there's a bottleneck in the area
    you suspect, and how did you reject all the other approaches to its resolution
    but the one you ask about?

    --
    Lew
    Honi soit qui mal y pense.
     
    Lew, Feb 20, 2011
    #5
  6. On 11-02-19 09:43 PM, wrote:

    [ SNIP ]

    Just to be clear on this, there are N records in each system, one legacy
    (non-relational) and one a RDBMS, and N ~ 7 million. There is also some
    kind of key for a 1:1 mapping between records in the two systems, but
    the data in each record is different. Correct?

    Let me ask a few questions:

    1. You've obviously got known criteria for each legacy query (which, as
    you say, might return up to 1000 records; let's call this number P). I
    take it that it's been asked, and ruled out, whether there are any
    criteria at this point that would also identify the correct P matching
    records in the RDBMS table. So you need to get the "keys" first. Correct?

    2. Have you considered an IN operator in a WHERE condition? We're
    looking to avoid P separate queries to the RDBMS here, because that's a
    serious performance hit, and make it one SELECT instead. If there are
    practical limitations on how many values you can jam into the inlist,
    you can chunk it up rather coarsely, and still have very many fewer
    queries to get your data than you'd have had otherwise...as in less than
    10 for P<=1000. Although with most databases I suspect you don't have to
    chunk at all.

    Any serious objections to that experiment?
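
    For instance, something like this untested sketch: build one SELECT with
    a placeholder per key (table, column and helper names are made up, and
    it assumes the key list is non-empty):

    import java.sql.*;
    import java.util.*;

    static List<String[]> fetchAll(Connection conn, List<Integer> keys)
            throws SQLException {
        // one placeholder per key; P <= 1000 keys
        StringBuilder sql =
                new StringBuilder("select * from sometable where item_id in (?");
        for (int i = 1; i < keys.size(); i++) {
            sql.append(", ?");
        }
        sql.append(")");
        PreparedStatement ps = conn.prepareStatement(sql.toString());
        for (int i = 0; i < keys.size(); i++) {
            ps.setInt(i + 1, keys.get(i));
        }
        List<String[]> rows = new ArrayList<String[]>();
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            // map whichever columns you actually need
            rows.add(new String[] { rs.getString(1), rs.getString(2) });
        }
        rs.close();
        ps.close();
        return rows;
    }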

    AHS

    --
    We must recognize the chief characteristic of the modern era - a
    permanent state of what I call violent peace.
    -- James D. Watkins
     
    Arved Sandstrom, Feb 20, 2011
    #6
  7. Lew Guest

    On 02/20/2011 11:34 AM, Arved Sandstrom wrote:

    > [ SNIP ]
    >
    > Just to be clear on this, there are N records in each system, one legacy
    > (non-relational) and one a RDBMS, and N ~ 7 million. There is also some kind
    > of key for a 1:1 mapping between records in the two systems, but the data in
    > each record is different. Correct?
    >
    > Let me ask a few questions:
    >
    > 1. You've obviously got known criteria for each legacy query (which, as
    > you say, might return up to 1000 records; let's call this number P). I
    > take it that it's been asked, and ruled out, whether there are any
    > criteria at this point that would also identify the correct P matching
    > records in the RDBMS table. So you need to get the "keys" first. Correct?
    >
    > 2. Have you considered an IN operator in a WHERE condition? We're looking to


    EXISTS is another alternative.

    > avoid P separate queries to the RDBMS here, because that's a serious
    > performance hit, and make it one SELECT instead. If there are practical


    Again, no one has measured performance so we don't even know if it's a
    performance hit at all, much less a "serious" one.

    > limitations on how many values you can jam into the inlist, you can chunk it
    > up rather coarsely, and still have very many fewer queries to get your data
    > than you'd have had otherwise...as in less than 10 for P<=1000. Although with
    > most databases I suspect you don't have to chunk at all.
    >
    > Any serious objections to that experiment?


    As an experiment, no. As a production decision, yes. All this complication
    is for a so-called "problem" that has not yet been demonstrated to exist.

    --
    Lew
    Honi soit qui mal y pense.
     
    Lew, Feb 20, 2011
    #7
  8. On 11-02-20 01:06 PM, Lew wrote:
    > [ SNIP ]

    >
    > As an experiment, no. As a production decision, yes. All this
    > complication is for a so-called "problem" that has not yet been
    > demonstrated to exist.
    >

    I agree with you in principle. However, if I can write SQL that
    generates one SELECT that returns a result set with 1000 rows, as
    opposed to having, say, a loop that generates 1000 SELECTs each
    returning one row, I don't even consider that a performance
    optimization, I call that writing good SQL. I'll do it without thinking
    twice.

    In practice I've seen N+1 SELECT situations, whether caused by poor SQL
    and/or by deliberate loop constructs with fine-grained SQL, cause so
    many performance problems that I simply make sure that kind of code
    doesn't exist, without waiting for complaints from the field. And
    besides, this is simple to do. To put it another way, this is simply
    making a decent initial implementation choice as opposed to making a bad
    one: performance optimization shouldn't be about replacing bad choices
    with good ones, but rather about making good ones even better. IMHO.

    Like I said, in principle I agree with you. I wouldn't launch into a
    performance optimization that involved serious effort unless it was
    called for.

    AHS
    --
    We must recognize the chief characteristic of the modern era - a
    permanent state of what I call violent peace.
    -- James D. Watkins
     
    Arved Sandstrom, Feb 20, 2011
    #8
  9. Lew Guest

    Arved Sandstrom wrote:
    > I agree with you in principle.


    I agree with you in practice.

    > However, if I can write SQL that generates one
    > SELECT that returns a result set with 1000 rows, as opposed to having, say, a
    > loop that generates 1000 SELECTs each returning one row, I don't even consider
    > that a performance optimization, I call that writing good SQL. I'll do it
    > without thinking twice.


    Indeed. Somewhere around chunking IN clauses I believe we cross the line into
    excessive effort prior to performance measurement. Similarly, architecting
    for multiple app instances in order to hold an entire database in memory just
    to avoid an unproven bottleneck without consideration of normal approaches
    such as those you suggested bodes poorly for the project.

    > In practice I've seen N+1 SELECT situations, whether caused by poor SQL and/or
    > by deliberate loop constructs with fine-grained SQL, cause so many performance
    > problems that I simply make sure that kind of code doesn't exist, without
    > waiting for complaints from the field. And besides, this is simple to do. To
    > put it another way, this is simply making a decent initial implementation
    > choice as opposed to making a bad one: performance optimization shouldn't be
    > about replacing bad choices with good ones, but rather about making good ones
    > even better. IMHO.


    You can elevate that from opinion to demonstrable fact.

    The problem was that the OP's formulation doesn't fit the "decent initial
    implementation choice" description. They should rather follow what you proposed.

    > Like I said, in principle I agree with you. I wouldn't launch into a
    > performance optimization that involved serious effort unless it was called for.


    As I put forward elsethread, and you endorse here IIUC, up-front good
    algorithmic and structural choices aren't so much "optimization" as "good
    sense". The dark side is that bad architectural choices, such as abandoning
    modern DBMSes' optimizations and standard good practices whilst increasing
    maintenance and deployment effort just to handle a potentially non-existent
    problem without actual evidence, must be rejected early on.

    --
    Lew
    Honi soit qui mal y pense.
     
    Lew, Feb 20, 2011
    #9
  10. On 11-02-20 02:36 PM, Lew wrote:
    > Arved Sandstrom wrote:
    >> I agree with you in principle.

    >
    > I agree with you in practice.
    >
    >> However, if I can write SQL that generates one
    >> SELECT that returns a result set with 1000 rows, as opposed to having,
    >> say, a
    >> loop that generates 1000 SELECTs each returning one row, I don't even
    >> consider
    >> that a performance optimization, I call that writing good SQL. I'll do it
    >> without thinking twice.

    >
    > Indeed. Somewhere around chunking IN clauses I believe we cross the line
    > into excessive effort prior to performance measurement. Similarly,
    > architecting for multiple app instances in order to hold an entire
    > database in memory just to avoid an unproven bottleneck without
    > consideration of normal approaches such as those you suggested bodes
    > poorly for the project.

    [ SNIP ]

    Yeah, I don't know why I threw in that bit about chunking. Definitely a
    premature performance optimization. I'd just try the full IN because
    it's a sensible thing to do, to try to get one SELECT, and see how it goes.

    As far as the suggested in-memory proposal goes, that sent a frisson of
    fear up my back. I use caching all the time - EclipseLink Level 1 and
    Level 2 caching, named query caching for small "reference" tables, that
    kind of thing - but I've never seen this kind of proposed "solution" as
    a necessity in any situation. I think there are better ways to "avoid
    1000 relational queries". :)

    AHS

    --
    We must recognize the chief characteristic of the modern era - a
    permanent state of what I call violent peace.
    -- James D. Watkins
     
    Arved Sandstrom, Feb 20, 2011
    #10
  11. Lew Guest

    Arved Sandstrom wrote:
    > Yeah, I don't know why I threw in that bit about chunking. Definitely a
    > premature performance optimization. I'd just try the full IN because it's a
    > sensible thing to do, to try to get one SELECT, and see how it goes.


    Long "IN" clauses risk exceeding statement-length limits and often have
    performance issues at the engine level. (Yeah, I know.) It's a minor point,
    because other means exist to combine things into a single query (EXISTS, table
    replication).

    --
    Lew
    Honi soit qui mal y pense.
     
    Lew, Feb 20, 2011
    #11
  12. On 20-02-2011 12:37, Arved Sandstrom wrote:
    > [ SNIP ]

    >>
    >> As an experiment, no. As a production decision, yes. All this
    >> complication is for a so-called "problem" that has not yet been
    >> demonstrated to exist.
    >>

    > I agree with you in principle. However, if I can write SQL that
    > generates one SELECT that returns a result set with 1000 rows, as
    > opposed to having, say, a loop that generates 1000 SELECTs each
    > returning one row, I don't even consider that a performance
    > optimization, I call that writing good SQL. I'll do it without thinking
    > twice.


    I think that is an important point.

    Some people think that "avoid premature optimization" means
    "avoid thinking about performance until you know it is
    a problem".

    That is not so. It means "avoid writing any ugly or hackish code
    until you know the nice code has a problem".

    Arne
     
    Arne Vajhøj, Feb 20, 2011
    #12
  13. Guest

    On Feb 20, 1:00 pm, Martin Gregorie <> wrote:
    > [ SNIP ]
    >
    > How big are the items in each collection?
    > How does the search process recognise an item?
    > If there are specific key terms, how big are they and how many terms are
    > there per item?
    >
    > The answers to these questions can have a large effect on selecting the
    > optimum approach.


    The proposed Java web service would have 7m records. Each record would
    have a key based on an int (4 bytes) and a short (2 bytes) = 6 bytes,
    and a value of 4 * 50 bytes = 200 bytes, for a total of 206 bytes per
    record.

    7m * 206 bytes = 1.5GB in round figures.

    The existing relational table is keyed on the integer + short.
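
    To keep that near the raw 1.5GB rather than roughly doubling it in
    HashMap overhead (an entry object plus a boxed key per record), one
    option might be a sorted parallel-array layout. A rough, untested
    sketch, assuming the int+short key packs into a long and each value can
    be stored as a fixed 200-byte slice:

    import java.util.Arrays;

    public class CompactStore {
        private static final int WIDTH = 200;  // 4 * 50-byte fields
        private final long[] keys;             // 7m * 8 bytes   = ~56MB
        private final byte[] values;           // 7m * 200 bytes = ~1.4GB

        // keys must be sorted ascending, with values[i*WIDTH ..] matching
        public CompactStore(long[] sortedKeys, byte[] values) {
            this.keys = sortedKeys;
            this.values = values;
        }

        // 32-bit int in the high bits, 16-bit short in the low bits
        static long key(int i, short s) {
            return ((long) i << 16) | (s & 0xFFFFL);
        }

        // Returns the 200-byte value for (i, s), or null if absent.
        public byte[] lookup(int i, short s) {
            int idx = Arrays.binarySearch(keys, key(i, s));
            if (idx < 0) {
                return null;
            }
            return Arrays.copyOfRange(values, idx * WIDTH, (idx + 1) * WIDTH);
        }
    }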
     
    , Feb 20, 2011
    #13
  14. Guest

    On Feb 20, 1:25 pm, Tom Anderson <> wrote:
    > [ SNIP ]
    >
    > Unless you batch them. Can you not do something like:
    > [ SNIP ]
    > Where the batch size might be considerably more than 10?
    >
    > [ SNIP ]
    >
    > How big are your objects? If they're a kilobyte each (largeish, for an
    > object), then seven million will take up seven gigs of memory; if
    > they're 100 bytes (pretty tiny), then they'll take up 700 MB. That's
    > before any overhead.


    Hi Tom

    That's useful. You're suggesting JDBC batch? eg
    public static final int SINGLE_BATCH = 1;
    public static final int SMALL_BATCH = 4;
    public static final int MEDIUM_BATCH = 11;
    public static final int LARGE_BATCH = 51;

    Does it make more sense to repeatedly query with a small set of fixed
    batch sizes rather than an arbitrary number of parameters, because of
    the saving from not having to re-prepare the statement each time? I
    would need to check the performance of this kind of query.

    In relation to the cache: its size would be 1.5GB.
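
    Something like this is what I have in mind for the fixed batch sizes
    (untested; it pads the final short batch by repeating the last id, which
    is harmless in an IN list, so the same prepared statement is reused for
    every batch):

    import java.sql.*;
    import java.util.*;

    static final int BATCH = 51;

    static void queryInBatches(Connection conn, List<String> ids)
            throws SQLException {
        StringBuilder sql =
                new StringBuilder("select * from sometable where item_id in (?");
        for (int i = 1; i < BATCH; i++) {
            sql.append(", ?");
        }
        sql.append(")");
        PreparedStatement ps = conn.prepareStatement(sql.toString());
        for (int start = 0; start < ids.size(); start += BATCH) {
            for (int i = 0; i < BATCH; i++) {
                // pad past the end by repeating the last id
                int idx = Math.min(start + i, ids.size() - 1);
                ps.setString(i + 1, ids.get(idx));
            }
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                // combine with the legacy record, as in Tom's sketch
            }
            rs.close();
        }
        ps.close();
    }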

    Cheers
     
    , Feb 20, 2011
    #14
  15. Guest

    On Feb 20, 8:03 pm, Arne Vajhøj <> wrote:
    > [ SNIP ]
    >
    > Some people think that "avoid premature optimization" means
    > "avoid thinking about performance until you know it is
    > a problem".
    >
    > That is not so. It means "avoid writing any ugly or hackish code
    > until you know the nice code has a problem".
    >
    > Arne

    Right, guys (Arne + Lew): I will test the performance of WHERE IN
    (?, ?, ? ... x1000) and let you know how it performs.

    Currently the "legacy system" returns in seconds (say 10). That's what
    we're aiming for: that the user response doesn't degrade.

    Cheers.
     
    , Feb 20, 2011
    #15
  16. On Sun, 20 Feb 2011 15:19:26 -0800, wrote:

    > [ SNIP ]
    >
    > The proposed Java web service would have 7m records. Each record would
    > have a key based on an int (4 bytes) and a short (2 bytes) = 6 bytes,
    > and a value of 4 * 50 bytes = 200 bytes, for a total of 206 bytes per
    > record.
    >
    > 7m * 206 bytes = 1.5GB in round figures.
    >

    1.75 - 2 GB would be a safer assumption by the time you allow space for
    overheads such as indexes, references and slack space in each record.

    > The existing relational table is keyed on the integer + short.
    >

    I assume this means that:

    - every key occurs twice: once in the the legacy system's data store and
    once in the Java system's data store

    - no key value can occur on just one of the systems

    - each system stores its data in a separate database table

    Please correct these assumptions if any are wrong. And now for some extra
    questions:

    - are (a) both rows with the same key retrieved by a single query, or
      (b) do some queries retrieve a row from just one of the databases?

      If (b) can happen, what proportion of the queries are of type (a)
      and what proportion of type (b)?

    - how difficult would it be to merge both tables into one of the
    databases and scrap the other?


    --
    martin@   | Martin Gregorie
    gregorie. | Essex, UK
    org       |
     
    Martin Gregorie, Feb 21, 2011
    #16
  17. Lew Guest

    Arne Vajhøj wrote:
    > I think that is an important point.
    >
    > Some people think that "avoid premature optimization" means
    > "avoid thinking about performance until you know it is
    > a problem".
    >
    > That is not so. It means "avoid writing any ugly or hackish code
    > until you know the nice code has a problem".


    wrote:
    > Right, guys (Arne + Lew): I will test the performance of WHERE IN
    > (?, ?, ? ... x1000) and let you know how it performs.
    >
    > Currently the "legacy system" returns in seconds (say 10). That's what
    > we're aiming for: that the user response doesn't degrade.


    You say, "Right," then propose to go in the opposite direction. Interesting ...

    --
    Lew
    Honi soit qui mal y pense.
     
    Lew, Feb 21, 2011
    #17
  18. Abu Yahya Guest

    On 2/21/2011 12:58 AM, Lew wrote:
    > Arved Sandstrom wrote:
    >> Yeah, I don't know why I threw in that bit about chunking. Definitely a
    >> premature performance optimization. I'd just try the full IN because
    >> it's a
    >> sensible thing to do, to try to get one SELECT, and see how it goes.

    >
    > Long "IN" clauses risk exceeding statement-length limits and often have
    > performance issues at the engine level. (Yeah, I know.) It's a minor
    > point, because other means exist to combine things into a single query
    > (EXISTS, table replication).
    >

    How would you use EXISTS to replace IN?
     
    Abu Yahya, Feb 21, 2011
    #18
  19. Lew Guest

    Abu Yahya wrote:
    > How would you use EXISTS to replace IN?


    SELECT t.keyc, t.colX, t.colY, ..., t.colZ FROM some_table t
      WHERE EXISTS
      (SELECT v.keyc
       FROM (VALUES (val0), (val1), ..., (val999)) AS v (keyc)
       WHERE v.keyc = t.keyc)
    ;

    Actually, that VALUES trick works for IN, too.

    SELECT t.keyc, t.colX, t.colY, ..., t.colZ FROM some_table t
      WHERE t.keyc IN
      (SELECT v.keyc
       FROM (VALUES (val0), (val1), ..., (val999)) AS v (keyc))
    ;

    If you have WITH syntax you can alias the VALUES that way.

    Then there're JOINs.

    SELECT t.keyc, t.colX, t.colY, ..., t.colZ FROM some_table t
      JOIN (VALUES (val0), (val1), ..., (val999)) AS v (keyc)
      USING (keyc)
    ;

    You can create a cursor or temporary table during a transaction with the rows
    from the VALUES, too.

    'Ware injection!
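
    That is, if the valN come from user input, bind them as parameters
    rather than splicing literals into the VALUES list. An untested sketch
    (some engines may want explicit casts on the placeholders inside
    VALUES):

    import java.sql.*;
    import java.util.*;

    static PreparedStatement valuesQuery(Connection conn, List<Integer> keys)
            throws SQLException {
        StringBuilder sql = new StringBuilder(
                "SELECT t.* FROM some_table t JOIN (VALUES (?)");
        for (int i = 1; i < keys.size(); i++) {
            sql.append(", (?)");
        }
        sql.append(") AS v (keyc) ON v.keyc = t.keyc");
        PreparedStatement ps = conn.prepareStatement(sql.toString());
        for (int i = 0; i < keys.size(); i++) {
            ps.setInt(i + 1, keys.get(i));
        }
        return ps; // caller executes and closes
    }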

    --
    Lew
    Honi soit qui mal y pense.
     
    Lew, Feb 21, 2011
    #19
  20. Lew Guest

    Lew wrote:
    > Actually, that VALUES trick works for IN, too.
    >
    > SELECT t.keyc, t.colX, t.colY, ..., t.colZ FROM some_table t
    >   WHERE t.keyc IN
    >   (SELECT v.keyc
    >    FROM (VALUES (val0), (val1), ..., (val999)) AS v (keyc))
    >   ;
    >


    I'm too long away from SQL.

    SELECT t.keyc, t.colX, t.colY, ..., t.colZ FROM some_table t
      WHERE t.keyc IN
      (VALUES (val0), (val1), ..., (val999))
    ;

    I haven't tried these so tweak commas and parens as needed.

    --
    Lew
     
    Lew, Feb 21, 2011
    #20
