Help me!! Why Java is so popular

  • Thread starter amalikarunanayake
  • Start date

Lew

Ah. Yes, of course, that makes sense. Either way, though, time
needed would be proportional to the number of reachable objects rather
than to the number of discarded objects, which was the idea behind
my speculation that maybe GC might not be so bad.
Correct.

("In the young generation"? Is that making a distinction, as you
say later is done, between objects that stay around a while versus
short-term ones?)
<http://java.sun.com/j2se/reference/whitepapers/memorymanagement_whitepaper.pdf>

It can do that even when the object being created is immutable, as
would be the case with readLine in BufferedReader, which seems to be
the "isn't this a problem?" in the OP's example?

Especially when the object is immutable.

Huh. I've heard some of this before but maybe not all. Maybe I need
to finally read Bloch's "Effective Java" rather than just housing it
on a bookshelf ....

Yes!

I reread it often.

- Lew
 

Lew

Mark said:
Java's compulsory use of Unicode means that any third party tools I use
will also work with data like mine. In systems where use of Unicode is
optional (or non-existent) it is common to find tools that would be nice
if only they made provision for characters beyond ASCII. It also
simplifies internationalization even where the original developer didn't
pay too much attention to these requirements.

Character set conversion can only be avoided if you can be sure of
always working in a single specified character set. Once you have to do
a conversion of some sort, conversion to/from Unicode isn't much (if
any) more expensive than conversion between simple single byte character
sets. Even in the US I expect you are likely to see data in CP-437,
CP-850, CP-1252, and ISO-8859-1 as well as UTF-8, and UTF-16.

Finally just how much difference does the extra overhead of Unicode
actually make? In most substantial applications the overhead will be
negligible.

Java is what it is. It has its reasons to be that way. Not all decisions are
optimal from all points of view. Some programmers don't need everything a
language or an API offers. The language or API still has to offer it.

This is no different from any language. Considering the overall population
that uses it, a language will always make compromises to satisfy the greatest
portion of that population.

If you don't like Java then there are alternatives. For what it does, and
considering Mark's points about the general usefulness of requiring Unicode and
its nearly complete lack of impact on code efficiency, Java is extremely well
suited.

If you wrote your own language without Unicode support, or with the
"Balkanized" version of it, you'd probably find pretty quickly that the "all
Unicode" approach offers significant advantages.

In the meantime, when we use Java we are stuck with all its warts as well as
its advantages. Focusing on corner issues like "it's always Unicode" or "there
are no closures" does not diminish the actual, real-world usefulness of the
language. (And some of these ideas, if sufficiently universally beneficial,
wind up in the language eventually anyway.)

- Lew
 

Oliver Wong

raddog58c said:
The data is stored in UNICODE whether you require it or not.

Well, Unicode is not a storage encoding system, or anything like that.
Unicode is primarily a mapping from characters (in the linguistic conceptual
sense, not in the C/C++ data type sense) to numbers. And you can't directly
store numbers in computers. You can store bitstreams, and thus you need an
extra step to encode from numbers to bitstreams. There are many such
encodings: ASCII, UTF-8, UTF-16, etc., some of them being lossy (e.g. ASCII).
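That split between code points and bitstreams is visible directly in Java, where a String is a sequence of characters and bytes only appear once you choose an encoding. A minimal sketch (class name mine, purely illustrative):

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a character maps to one Unicode number (its code
// point), while the byte form depends entirely on the chosen encoding.
public class CodePointDemo {
    public static void main(String[] args) {
        // Unicode assigns 'A' the number 65, regardless of storage.
        System.out.println("A".codePointAt(0)); // 65

        // The same code point becomes different bitstreams under
        // different encodings.
        byte[] utf8 = "A".getBytes(StandardCharsets.UTF_8);     // 1 byte: 0x41
        byte[] utf16 = "A".getBytes(StandardCharsets.UTF_16BE); // 2 bytes: 0x00 0x41
        System.out.println(utf8.length + " " + utf16.length);   // 1 2
    }
}
```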
I'm not
writing multinational code at this juncture. In 25+ years of
programming the number of times I've needed multinational character
sets can be counted on one hand with fingers to spare.

Well, I don't know what kind of software you write, so I can't comment
much on that. But consider how many people have requested that the
developers of WinAmp (a once-popular mp3 player) support Unicode
characters, so that WinAmp could properly display the names of my English,
French, Russian, Japanese and Korean songs. They refused to do so, stating
that 90% of the Internet is English (a figure I'm sure they just made up).
There are several problems with this argument.

First of all, internet usage in Asia is huge. Gold farming (which
essentially comes down to playing video games online for pay) is a 1 billion
dollar business in Korea alone
(http://arstechnica.com/news.ars/post/20061227-8503.html), and playing video
games online is a tiny segment of the internet usage pie chart, compared to
web browsing, e-mail or file sharing, for example. According to
http://www.internetworldstats.com/stats2.htm, North America accounts for
only 20% of the internet usage, and while Internet usage there is growing at a
rate of 100+% (i.e. doubling) over 7 years, Internet usage in the rest of
the world is growing at a rate of 200+% (i.e. tripling) over 7 years. This
last diagram really says it all: http://www.internetworldstats.com/stats.htm

Second of all, just because one is an English-only speaker doesn't mean
one wouldn't benefit from the ability to display characters outside of ASCII
but within Unicode. Another poster presented the example of being able to
display mathematical symbols. I'll present an additional example of my mp3s
again.

One of the ID3 tags for my mp3s contains what I believe to be russian
characters. I'm not sure, because I don't actually speak Russian. The artist
name can be viewed at http://en.wikipedia.org/wiki/TËЯRA and it's
very easy for an English speaker to recognize: It's a T, an E with two dots
on top, a backwards R, a forwards R, and an A. And the pronunciation "Terra"
comes intuitively. But try to load an ID3 tag with this text via an
ASCII-only mp3 player, and you'll only see gibberish.

See, I don't even speak Russian, and yet I benefit from my software
being able to display Russian characters. That's why Unicode is more than
just "supporting other countries' languages". It's about being able to
represent text that you would normally find all around you in real life on
your computer.
You might find it archaic, but I find it wasteful. It's a waste
converting into and out of a format you never use.

What formats do you think one is converting to and from? There are bits
on the hard drive or RAM, and you need to somehow semantically treat these
bits as if they represented text. From what I understand, in C, you actually
manipulate these bits almost directly, and so an algorithm (e.g. testing
whether a character is numeric) designed to work with ASCII will not work
with EBCDIC and vice versa. In Java, things are a bit more high level: You
*don't* work directly with bits. Instead, you work with characters.
Theoretically, how these characters are represented in the JVM shouldn't
matter to you (in practice, due to backwards compatibility reasons, it has
"leaked out" that the internal representation is UTF-16-like). They might
internally be stored as UTF-16, UTF-8, or some crazy undocumented internal
format. It doesn't matter, because you shouldn't be manipulating the bits
that represent those characters, you should be dealing with the characters
directly. Any algorithm (e.g. testing whether a character is numeric) will
work regardless of the encoding, because the actual encoding is (supposed to
be) abstracted away.
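A small illustration of that abstraction (class name mine): Character.isDigit inspects the character itself, so it gives the same answer no matter what bytes originally carried the text, and it even knows about digits far outside ASCII.

```java
// Illustrative only: character tests in Java operate on characters,
// not on whatever encoding the text happened to arrive in.
public class DigitDemo {
    public static void main(String[] args) {
        System.out.println(Character.isDigit('7'));      // true
        System.out.println(Character.isDigit('x'));      // false
        // Arabic-Indic digit seven -- still a digit to Java:
        System.out.println(Character.isDigit('\u0667')); // true
    }
}
```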

Now if you have a String of characters in memory, and you want to store
it on disk somehow, there are many encodings to do this, just like if you
wanted to store a binary tree on disk somehow, there are many encodings to
do this. *This* is where any "converting" might occur, though the term
"converting" is misleading: "encoding" would be a better term. You can
encode the text as ASCII, UTF-8, or some other format. And if you want to
read the bitstream from disk and convert it back to text, a decoding stage
occurs.
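In code, that boundary looks something like this (a sketch; the actual file I/O is elided and the class name is mine): encoding happens only when characters head for disk, decoding only when bytes come back.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: encoding and decoding happen at the boundary,
// never while you are manipulating the String itself.
public class RoundTrip {
    public static void main(String[] args) {
        String text = "héllo"; // 5 characters

        // Encoding: characters -> bytes (what you would write to disk).
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);

        // Decoding: bytes -> characters (what you would do reading back).
        String restored = new String(encoded, StandardCharsets.UTF_8);

        System.out.println(restored.equals(text)); // true
        // 'é' takes two bytes in UTF-8, so byte count != character count.
        System.out.println(encoded.length + " bytes for "
                + text.length() + " chars"); // 6 bytes for 5 chars
    }
}
```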

In C, there's no similar stage, because once again, there's no
abstracting the encoding away from the text. If you want to replicate C's
behaviour in Java, rather than reading in text, read in bytes. Then, you can
manipulate the bytes in any way you like, and if you think these bytes
represent text, you'll have to guess at the encoding (ASCII? EBCDIC? UTF-8?)
just like you would with C. And just like in C, an "isNumeric()" algorithm
written with the assumption of ASCII will fail for other encodings. And just
like in C, no encoding or decoding stage secretly occurs beneath the covers.
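The two styles can be put side by side (class name mine, input hard-coded for brevity): an InputStream hands you raw numbers, while a Reader with a declared encoding hands you characters.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Illustrative only: C-style byte handling vs. Java's decoded text.
public class BytesVsText {
    public static void main(String[] args) throws Exception {
        // 0xE9 is 'é' in ISO-8859-1 -- but something else in other encodings.
        byte[] raw = {(byte) 0xE9};

        // C-style: read bytes and impose meaning on them yourself.
        InputStream in = new ByteArrayInputStream(raw);
        System.out.println(in.read()); // 233 -- just a number, no text semantics

        // Java-style: declare the encoding once at the boundary;
        // from then on you work with characters, not guesses.
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(raw), StandardCharsets.ISO_8859_1);
        System.out.println((char) reader.read()); // é
    }
}
```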
Why don't you convert your data into Russian characterset. Since
you're never communicating in Russian, when you need English, swap
back. What's the big deal?

Data loss. There isn't a one-to-one correspondence between Russian
characters and English characters.
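The data loss is easy to demonstrate (class name mine): Java's getBytes substitutes '?' for every character the target charset cannot represent, and no amount of converting back will recover it.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: encoding to a smaller charset is lossy.
public class LossyDemo {
    public static void main(String[] args) {
        String mixed = "Яra"; // one Cyrillic letter, two Latin ones

        // US-ASCII has no slot for 'Я'; getBytes replaces it with '?'.
        byte[] ascii = mixed.getBytes(StandardCharsets.US_ASCII);
        String roundTripped = new String(ascii, StandardCharsets.US_ASCII);

        System.out.println(roundTripped); // ?ra -- the 'Я' is gone for good
    }
}
```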
That's what I'm saying. It's conversion to a format that I'm not
personally using. Some people need it; some don't; yet we all pay for
it.

Let's say I never use the pipe character: |, and I'd be perfectly happy
if whenever someone sent me text containing the pipe character, it would
instead get converted to some lossy gibberish. Why don't we simply invent a
new encoding scheme, more efficient than ASCII, so that I wouldn't have to
pay for this character that I don't need? Well, we certainly *could* do
that, but it'd be a lot of work to support such a small proportion of all
computer users.

Similarly, people who speak only English form such a small proportion of
all computer users. It's been a lot of work trying to support these people.
Why should we all pay for it?

- Oliver
 

Chris Uppal

raddog58c said:
Hey Chis, where can I find doc on the UOPS specification? I was
perusing the pentium data sheets (downloadable off the Intel site) and
didn't see this discussed -- what I saw were essentially the same
opcodes as they ever were.....

I don't know of much explicit documentation. The processor's internal
instruction set is just that -- internal to the processor. Of the Intel
documentation, I think the most likely place to look (for whatever info there
is -- there may not be much) would be the "IA-32 Intel® Architecture
Optimization Reference Manual".

(BTW: I don't know what the ® symbol that I've just pasted into my news client
will look like in your news client -- there's a risk that they may not agree on
which code page this message uses ;-)

The best guide I know of to the increasing, and increasingly interesting,
divergence between what the IA32 instruction-set /looks/ as if it's doing, and
what's actually happening on-chip, is a (book-length) guide by Agner Fog called
"How to optimize for the Pentium family of microprocessors". If you haven't
already come across it (and you might have since it seems to be quite
well-known), and if you are interested in such matters, then it's a fascinating
read.

I think it used to be available from:

http://www.agner.org/assem/

but doesn't seem to be any more (there are copies elsewhere on the Web). I
haven't looked yet at the replacement material at the above link -- at first
sight it may be that he's just split the old guide up into 5 parts, but I'm not
yet certain. The stuff there looks very interesting anyway, and is presumably
more up to date than my copy of his original guide.

-- chris
 

Chris Uppal

Huh. I've heard some of this before but maybe not all. Maybe I need
to finally read Bloch's "Effective Java" rather than just housing it
on a bookshelf ....

Bloch's book is quite old and so (even if it ever was a good guide to
understanding the performance of Java code -- which I don't think it was
intended to be), I doubt it is a good guide in any but the most general sense
now. (I don't know of any good, current, guide).

That's not to put you off reading Bloch (which I'm told is worthwhile, although
I've never read it myself), but it may not be the best on /this particular/
question.

-- chris
 

Chris Uppal

raddog58c said:
That's what I'm saying. It's conversion to a format that I'm not
personally using. Some people need it; some don't; yet we all pay for
it.

Everyone needs it. You may think you don't, and you may have thought that for
25+ years, but if over that time you have written any software which has been
widely used (and I assume you must have done) then you have probably created
problems for people without realising it.

I have, I'm sure, done just the same, since I also have a wide experience --
much of which was gained before I realised that text != 7-bit ASCII.

I get the impression that a lot of your background is in embedded stuff. In
that context, perhaps, text isn't so important (note, I say /text/ isn't
important, not /non-ASCII text/ isn't important). But even so I rather doubt
it. Consider the mess we all are in with trying to de-ASCII the domain-name
system, for example.

It can be hard for C programmers (and presumably others too) to realise, but
text is not the same as binary, and never has been. For one thing, the gamut
of characters needed is always greater than can be represented in such an
inflexible form as 8-bits each. (Ask yourself why ASCII has a sign for dollars
but not for cents ;-) The computer world's long lasting obsession with trying
to pretend that they /are/ the same has caused enormous headaches -- which
indeed are still ongoing...

-- chris
 

Chris Uppal

Oliver said:
Let's say I never use the pipe character: |, and I'd be perfectly
happy if whenever someone sent me text containing the pipe character, it
would instead get converted to some lossy gibberish. Why don't we simply
invent a new encoding scheme, more efficient than ASCII, so that I
wouldn't have to pay for this character that I don't need? Well, we
certainly *could* do that, but it'd be a lot of work to support such a
small proportion of all computer users.

People who have worked as programmers in England for some time will recognise
that /exact/ problem. As a Unix user and a C programmer, I kinda /needed/ '|',
but nobody else does. So systems (especially in a semi-international context)
were frequently set up so that most things worked but '|' didn't. Quite often
it was a hell of a pain tracking down which part of which system was using the
wrong code page (or similar). '|' was missing altogether. Or was mapped to
something other than 0x7C. Or was mapped correctly, but the keyboard mapping
didn't match the keycaps. Or was generated by the key marked '|' and displayed
as '|' on the terminal, but not actually stored as 0x7C, or...

The days of code pages were a GIGANTIC MESS, a tribute to short-sighted
fuckwittedness applied on a more than industrial scale. Anything which
eliminates that is good. Anything which even helps control the mess is good
(and one of Unicode's less often appreciated purposes /is/ to help control the
mess -- at least we have a shared vocabulary for talking about what each code
page contains).

-- chris
 

blmblm

Bloch's book is quite old and so (even if it ever was a good guide to
understanding the performance of Java code -- which I don't think it was
intended to be), I doubt it is a good guide in any but the most general sense
now. (I don't know of any good, current, guide).

That's not to put you off reading Bloch (which I'm told is worthwhile, although
I've never read it myself), but it may not be the best on /this particular/
question.

What particular question .... In context I guess that would be the
performance aspects of Lew's claim about "best practice", while
my comment about Bloch was more in reference to "best practice"
in general. Probably if I'd thought longer about that comment
before posting it I'd have left it out.
 

raddog58c

Bloch's book is quite old and so (even if it ever was a good guide to
understanding the performance of Java code -- which I don't think it was
intended to be), I doubt it is a good guide in any but the most general sense
now. (I don't know of any good, current, guide).

That's not to put you off reading Bloch (which I'm told is worthwhile, although
I've never read it myself), but it may not be the best on /this particular/
question.

-- chris


There's an O'Reilly Book "Java Performance Tuning" that many of my
colleagues here say is very good. It's been sitting on my desk for a
couple of days, but I've not yet had nor taken the opportunity to
peruse it.
 

Daniel Dyer

What particular question .... In context I guess that would be the
performance aspects of Lew's claim about "best practice", while
my comment about Bloch was more in reference to "best practice"
in general. Probably if I'd thought longer about that comment
before posting it I'd have left it out.

Effective Java 2nd Edition has been on its way for a while now. Sometime
in 2007 is the latest ETA.

These slides are a preview of some of the things that will be in it:

http://developers.sun.com/learning/javaoneonline/2006/coreplatform/TS-1512.pdf
http://developers.sun.com/learning/javaoneonline/2006/coreplatform/TS-1188.pdf

Dan.
 

Arne Vajhøj

raddog58c said:
There's an O'Reilly Book "Java Performance Tuning" that many of my
colleagues here say is very good. It's been sitting on my desk for a
couple of days, but I've not yet had nor taken the opportunity to
peruse it.

Shirazi's book ?

I think that is older than Bloch's.

Arne
 

Arne Vajhøj

Chris said:
Bloch's book is quite old and so (even if it ever was a good guide to
understanding the performance of Java code -- which I don't think it was
intended to be), I doubt it is a good guide in any but the most general sense
now. (I don't know of any good, current, guide).

That's not to put you off reading Bloch (which I'm told is worthwhile, although
I've never read it myself), but it may not be the best on /this particular/
question.

That is correct.

I think the main point in the book is to write nice simple robust
code not doing weird stuff to gain 0.5% in a loop.

Arne
 

John W. Kennedy

Mark said:
Character set conversion can only be avoided if you can be sure of
always working in a single specified character set. Once you have to do
a conversion of some sort, conversion to/from Unicode isn't much (if
any) more expensive than conversion between simple single byte character
sets. Even in the US I expect you are likely to see data in CP-437,
CP-850, CP-1252, and ISO-8859-1 as well as UTF-8, and UTF-16.

Don't forget IBM-037 (US EBCDIC) and, thanks to the Euro, ISO-8859-15.
 

raddog58c

Don't forget IBM-037 (US EBCDIC) and, thanks to the Euro, ISO-8859-15.


I recently dealt with IBM037 aka CP037 in a STRUTS app from hell for
IBM's OnDemand product. I was unfamiliar with the CP support in the
String class prior to this endeavor, and now that I've used it I'm
less enthused. It's simple (more or less), but obscure and very
infrequently used in this environment -- I had problems using it
(turned out to be a missing charset.jar on my workstation) and when I
tried to find someone familiar with codepage support I came up empty.

Is this commonly used by others? It was difficult finding
understandable documentation on the WWW. I had assumed I was going to
need to "add" support for it or something, but I couldn't find a clear
explanation of how to do it.
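For what it's worth, a quick way to check this up front, rather than waiting for an UnsupportedEncodingException deep inside the app, is Charset.isSupported (class name mine; whether IBM037 is present depends on the install, as the missing charset.jar showed):

```java
import java.nio.charset.Charset;

// Illustrative only: probe the runtime for a code page before relying on it.
public class CharsetCheck {
    public static void main(String[] args) {
        // UTF-8 is mandated by the spec; IBM037 lives in the extended
        // encoding set, so it may be absent from a trimmed-down install.
        System.out.println(Charset.isSupported("UTF-8"));  // true
        System.out.println(Charset.isSupported("IBM037")); // depends on the install

        // Listing everything installed can also help diagnose a
        // missing charset.jar:
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
    }
}
```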

The XLAT assembler instruction (translate) would have been an order of
magnitude easier, but I couldn't use it in this context. I could have
implemented it more easily via xlatChar = xlatTable[unxlatedChar].
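That table-lookup style translates directly into Java, for anyone who wants XLAT-like behaviour without the Charset machinery (class name and the deliberately partial table are mine, for illustration only):

```java
// Illustrative only: a 256-entry lookup table in the spirit of XLAT.
// This toy table maps just the EBCDIC digits (0xF0-0xF9) to ASCII
// and passes everything else through unchanged.
public class XlatDemo {
    private static final char[] XLAT_TABLE = new char[256];
    static {
        for (int i = 0; i < 256; i++) {
            XLAT_TABLE[i] = (char) i; // identity by default
        }
        for (int d = 0; d <= 9; d++) {
            XLAT_TABLE[0xF0 + d] = (char) ('0' + d); // EBCDIC digit -> ASCII
        }
    }

    public static char translate(byte b) {
        return XLAT_TABLE[b & 0xFF]; // mask: Java bytes are signed
    }

    public static void main(String[] args) {
        System.out.println(translate((byte) 0xF1)); // 1
    }
}
```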

In any event, after lots of perusing and debugging and looking at
other people's workstations, I found that the missing charset.jar was
my problem. The Sun java documentation said CP037 was supported as
part of the Extended Encoding Set, so the
UnsupportedEncodingExceptions really threw me.
 

Mark Thornton

raddog58c said:
In any event, after lots of perusing and debugging and looking at
other people's workstations, I found that the missing charset.jar was

Outside the US installing a comprehensive set of character sets and
locales is (I think) the default. Because some crazy people ;-) whinge
about the extra space entailed when all they need is ASCII, the default
US install leaves all of this out!

Mark Thornton
 

Arne Vajhøj

raddog58c said:
Yeah, that's it. 2nd Edition -- it's copyrighted 2003.

Is it no good?

It is OK in my opinion.

But it is doing all the tests with Java 1.1.8, 1.2.2, 1.3.1 and
1.4.0 so it is not up to date with the latest in Java performance.

Arne
 

John W. Kennedy

raddog58c said:
I recently dealt with IBM037 aka CP037 in a STRUTS app from hell for
IBM's OnDemand product. I was unfamiliar with the CP support in the
String class prior to this endeavor, and now that I've used it I'm
less enthused. It's simple (more or less), but obscure and very
infrequently used in this environment -- I had problems using it
(turned out to be a missing charset.jar on my workstation) and when I
tried to find someone familiar with codepage support I came up empty.

The Java standard only mandates US-ASCII (ISO646-US), ISO-8859-1
(ISO-LATIN-1), UTF-8, UTF-16BE, UTF-16LE, and UTF-16. The choice of what
else to supply is the responsibility of the Java implementation. Sun
Java for Windows supplies something like 150 in all, not counting
aliases. If you were not using Sun Java for Windows (or, probably, Sun
Java for Solaris), IBM037 might not have been included (although it
would obviously be included in any implementation of Java for MVS).
The XLAT assembler instruction (translate) would have been an order of
magnitude easier, but I couldn't use it in this context. I could have
implemented it more easily via xlatChar = xlatTable[unxlatedChar].

I suspect x86 Java does use XLAT, where possible.

I really don't see what's so hard about:

Charset cs037 = Charset.forName("IBM037");
...
String st = new String(bytearray, cs037);
...
byte[] newbytes = st.getBytes(cs037);

(You can even leave off creating the Charset variable, but that would
mean looking the thing up at run time over and over again, which is
obviously wasteful, not to mention that you have to code to handle an
UnsupportedEncodingException, even if it's a SNOC.)

Reader input = new InputStreamReader(
        new FileInputStream(filename), "ibm037");

is nearly as simple, where it applies. (FileReader itself has no
constructor taking a charset, so the stream has to be wrapped.)
 

mei

Oliver Wong wrote:
[...]
additionally, it also knows the exact behaviour of the user using the
program. Because it's performing the translation while the user is using it!

Wow, this machine-learning thing sounds pretty interesting.
Do you have pointers to additional documentation about that?
Do you know what the overhead of those features is?

Regards,
Mei.
 

Oliver Wong

mei said:
Oliver Wong wrote:
[...]
additionally, it also knows the exact behaviour of the user using the
program. Because it's performing the translation while the user is using
it!

Wow, this machine-learning thing sounds pretty interesting.
Do you have pointers to additional documentation about that?
Do you know what the overhead of those features is?

Calling it "machine learning" is a bit of a stretch. I don't have any
specifications handy, but if you're really interested, google for "Java
hotspot". Here's a relevant snippet from the Wikipedia article:

http://en.wikipedia.org/wiki/HotSpot
<quote>
Its name derives from the fact that as it runs Java byte-code, it
continually analyzes the program's performance for "hot spots" which are
frequently or repeatedly executed. These are then targeted for optimization,
leading to high performance execution with a minimum of overhead for less
performance-critical code. HotSpot is widely acclaimed as providing the best
performance in its class of JVM. In theory, though rarely in practice, it is
possible for adaptive optimization of a JVM to exceed hand coded C++ or
assembly language.
</quote>

"it is possible for adaptive optimization of a JVM to exceed hand coded
C++ or assembly language."

If Wikipedia says it, it must be true. ;)

- Oliver
 
