Java heap and big database queries

C

Chris Uppal

David said:
I have not seen _any_ argument for preferring MySQL over the other
databases I listed at http://database.profectus.com.au. Why risk a
legal battle when you have several alternatives that do not attempt to
restrict how you use the database?

I have my own opinions on the legal matters, but I think there is -- or at
least may be -- a technical argument for MySQL (in this case).

MySQL is, as far as I can tell, alone amongst the heavyweight databases in that
it can run without full ACID support (and may even defaults to doing so!).
Normally I would say that was a huge argument /against/ MySQL, but for the
particular application under discussion (Patricia's), I think that might be an
advantage. It ought to allow MySQL to run faster, perhaps significantly so.

Whether there is any real advantage, I have no idea. But I have assumed (maybe
wrongly) for a long time that the frequently heard justification: "I use MySQL
'cos it's lightning fast", is (if it's not pure myth) a consequence of running
without a the safety net that sensible DBMSs provide.

-- chris
 
P

Patricia Shanahan

Chris said:
I have my own opinions on the legal matters, but I think there is -- or at
least may be -- a technical argument for MySQL (in this case).

MySQL is, as far as I can tell, alone amongst the heavyweight databases in that
it can run without full ACID support (and may even defaults to doing so!).
Normally I would say that was a huge argument /against/ MySQL, but for the
particular application under discussion (Patricia's), I think that might be an
advantage. It ought to allow MySQL to run faster, perhaps significantly so.

Whether there is any real advantage, I have no idea. But I have assumed (maybe
wrongly) for a long time that the frequently heard justification: "I use MySQL
'cos it's lightning fast", is (if it's not pure myth) a consequence of running
without a the safety net that sensible DBMSs provide.

Avoiding ACID overhead was one of my thoughts in picking MySQL originally.

However, the non-streaming of query results was very discouraging. It's
not just the immediate problem. It is the implication that the way I
want to work is not something MySQL is tuned for.

My Ph.D. adviser has also come down in favor of PostgreSQL, so I would
need a really strong reason to go the other way.

I had a few days delay while I did some other work, but now I'm in the
process of reconstructing the database. I'll soon find out how well it
works in practice.

Patricia
 
M

Mike Schilling

Patricia Shanahan said:
Avoiding ACID overhead was one of my thoughts in picking MySQL originally.

However, the non-streaming of query results was very discouraging.

And the alternative being to transmit one row at a time isn't promising
either; for a large dataset, network overhead will dwarf any other
processing you might do.
It's not just the immediate problem. It is the implication that the way I
want to work is not something MySQL is tuned for.

Or that MySQL isn't tuned, period.
 
A

aloha.kakuikanu

I'm using Java to query a MySQL database through com.mysql.jdbc.Driver.

My Java program calculates a relatively small summary of the query
result, and I'm sure my Java structure would fit in the default heap memory.

However, I run out of heap memory while running with -Xmx700m.

The program runs perfectly on a million line subset of the database, but
fails on the full database, over 88 million lines.

My query is:

String getData = "select count(*), "
+ "sum(net_amt), "
+ "exists(select * from stores where stores.store_id =
transactions.store_id) "
+ "from transactions group by txn_id;";

...

Statement st = con.createStatement();
ResultSet result = st.executeQuery(getData);

There are about 10,160,053 distinct values of txn_id.

Any ideas for what I may be doing wrong? Suggestions for how to obtain
and scan a large query result without Java heap problems?

It can be argued that it is entirely possible to design your client
application is such a way that it never is required to fetch huge
result set. After all query results are presented to the end user, who
is capable digesting only rather limited amount of data. In your
example, why do you need to do partial aggregation and fetch
10,160,053 values, especially that you projected away txn_id column?
 
P

Patricia Shanahan

aloha.kakuikanu wrote:
....
It can be argued that it is entirely possible to design your client
application is such a way that it never is required to fetch huge
result set. After all query results are presented to the end user, who
is capable digesting only rather limited amount of data. In your
example, why do you need to do partial aggregation and fetch
10,160,053 values, especially that you projected away txn_id column?
....

I'm doing data mining, not transaction processing. I am the end user.

If I were more skilled at SQL than Java, I might do more of the
processing in SQL, and have smaller results sets. As it is, I've been
using SQL for a few days, and Java for years...

I know, for example, exactly how to generate histogram tables in Java,
in a form that can be read into a spreadsheet for display.

Patricia
 
V

Vadim Tropashko

I'm doing data mining, not transaction processing. I am the end user.

More power to SQL, then.
If I were more skilled at SQL than Java, I might do more of the
processing in SQL, and have smaller results sets. As it is, I've been
using SQL for a few days, and Java for years...

I know, for example, exactly how to generate histogram tables in Java,
in a form that can be read into a spreadsheet for display.

Here is sample book chapter with section on histograms http://
vadimtropashko.wordpress.com/files/2007/02/ch3.pdf
The whole SQL patterns book is here
http://www.rampant-books.com/book_2006_1_sql_coding_styles.htm
 
?

=?ISO-8859-1?Q?Arne_Vajh=F8j?=

Chris said:
MySQL is, as far as I can tell, alone amongst the heavyweight databases in that
it can run without full ACID support (and may even defaults to doing so!).
Normally I would say that was a huge argument /against/ MySQL, but for the
particular application under discussion (Patricia's), I think that might be an
advantage. It ought to allow MySQL to run faster, perhaps significantly so.

Whether there is any real advantage, I have no idea. But I have assumed (maybe
wrongly) for a long time that the frequently heard justification: "I use MySQL
'cos it's lightning fast", is (if it's not pure myth) a consequence of running
without a the safety net that sensible DBMSs provide.

I think that is correct.

Arne
 
A

aloha.kakuikanu

If I were more skilled at SQL than Java, I might do more of the
processing in SQL, and have smaller results sets. As it is, I've been
using SQL for a few days, and Java for years...

What people usually when they find a challenging query, is they post
it to a proper forum. You write a short description of a problem, a
sample data, and the expected output. Unlike typical java help
request, each SQL query is like a tiny math puzzle. This is why there
would be plenty of people who are eager to take on your challenge.

Unfortunately, the big 3 (and open source) communities has grown
apart, so that there is no single the most appropriate spot where to
post generic SQL problems. For Oracle the best place would be OTN, for
SQLServer public.microsoft.sqlserver.programming. I don't know what is
the best for MySQL.
 
M

Mark Thornton

Mike said:
And the alternative being to transmit one row at a time isn't promising
either; for a large dataset, network overhead will dwarf any other
processing you might do.

Transmitting rows a batch at a time (say 100) avoids that problem and
also has only a modest heap impact.

Mark Thornton
 
M

Mike Schilling

Mark Thornton said:
Transmitting rows a batch at a time (say 100) avoids that problem and also
has only a modest heap impact.

Yes. That MySQL doesn't batch rows is a sign of its primitiveness.
 
M

Mark Thornton

Mike said:
Yes. That MySQL doesn't batch rows is a sign of its primitiveness.

Indeed. However this behaviour is common to quite a lot of JDBC
driver/database combinations. Apparently it suits common enterprise
database activities. It is rather less suitable to
scientific/mathematical use of databases. In these cases you would have
to be barking mad to perform the computation in SQL, but the alternative
involves retrieving a very substantial amount of data. There are also
other issues with this kind of use, notably in locking / transactions.

SQL Server with a suitable driver will stream results. For smallish
datasets you can use SQL Server Express for free (assuming you are
running on Windows).

FireBird is another option worth considering.

Mark Thornton
 
P

Patricia Shanahan

Mark said:
Indeed. However this behaviour is common to quite a lot of JDBC
driver/database combinations. Apparently it suits common enterprise
database activities. It is rather less suitable to
scientific/mathematical use of databases. In these cases you would have
to be barking mad to perform the computation in SQL, but the alternative
involves retrieving a very substantial amount of data. There are also
other issues with this kind of use, notably in locking / transactions.

Ultimately, I will be using this data e.g. as training data in my Ph.D.
research project, so I do feel more comfortable knowing how to get the
data to my Java program.
SQL Server with a suitable driver will stream results. For smallish
datasets you can use SQL Server Express for free (assuming you are
running on Windows).

FireBird is another option worth considering.

Mark Thornton

SQL Server Express is limited to 4 GB total database size, and one of my
tables is about 10 GB, so it is not suitable for my particular situation.

Meanwhile, I've moved my database to PostgreSQL. I had some initial
performance problems with a large query, but I was able to cure it by
increasing the sort buffer size.

It streams data to my Java program, so I no longer run out of memory.

Thanks to everyone for all the advice.

Patricia
 
?

=?ISO-8859-1?Q?Arne_Vajh=F8j?=

David said:
Arne, I want you to like it! If I can achieve this by adding some
brief, relevant information to the web page I would be pleased to do
so and would appreciate it if you would post the information here or
email it to david at segall dot net. I am reluctant to include any
detailed licensing information because each of the products listed
comes with several pages of legal restrictions that I do not wish to
maintain.

I added the licensing qualifications to the MySQL entry in my list
because I was concerned by a previous post from Chris Smith that
contained the information that he has posted in this thread. Because
MySQL is so popular and many people, encouraged by MySQL AB, consider
it to be freely distributable I chose to qualify the entry in my list
rather than simply delete it.

Maybe something like:

You should be aware that the MySQL ODBC, JDBC and ADO.NET drivers
is only freely available under GPL license (not LGPL license) with
a FLOSS exception. The GPL license states that code "linked" to
GPL code must be GPL code itself. If your code is GPL or other
open source, then you do not have any problems. If your code
is closed source, then you should seek legal advice before
using those drivers.

Arne
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,480
Members
44,900
Latest member
Nell636132

Latest Threads

Top