string to ascii on line feed

donald

Hi there,

I am doing some Java work and basically I get a String "/n" and I need to get
the ASCII value of it, which in this case is 10. What is the best way of
going about this?

Thanks

Donald
 
Oliver Wong

donald said:
Hi there,

I am doing some Java work and basically I get a String "/n" and I need to get
the ASCII value of it, which in this case is 10. What is the best way of
going about this?

The ASCII code of a string is not well defined, but the ASCII code of a
character is. First, convert your string to a single character (how you do
this depends on what assumptions you can make with respect to the string).
From there, in Java, you can cast a character to an integer, like this:

<code, not tested or compiled>
char myChar = '\n';
int myInt = (int)myChar;
</code, not tested or compiled>
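
A compilable version of the same idea might look like the sketch below. The class name `NewlineCode` and the helper `codeOf` are made up for illustration; the cast is the only part that matters, and it is optional because a char widens to int implicitly.

```java
// Hypothetical demo class; the names are illustrative, not from the thread.
final class NewlineCode {

    // Returns the numeric value of a char. For characters in the ASCII
    // range (0 to 127) this value is the ASCII code.
    static int codeOf(char c) {
        return (int) c; // explicit cast for clarity; char widens to int anyway
    }

    public static void main(String[] args) {
        char myChar = '\n';
        int myInt = codeOf(myChar);
        System.out.println(myInt); // prints 10
    }
}
```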

- Oliver
 
donald

I am trying to look at a string and determine whether it is a single
character and, if so, convert it to an integer. I already understand how
to convert to the integer (using byte rather than int is also better!),
but it seems that escape characters are represented as two chars within
a Java string, i.e. "\n" has length two, whereas I need to find it by
length one. A regular expression:
Pattern.matches("\n|.", "\n")
does not return true as I would expect.

Any ideas?

donald
 
Jeffrey Schwab

donald said:
I am trying to look at a string and determine whether it is a single
character and, if so, convert it to an integer. I already understand how
to convert to the integer (using byte rather than int is also better!),
but it seems that escape characters are represented as two chars within
a Java string, i.e. "\n" has length two, whereas I need to find it by
length one. A regular expression:
Pattern.matches("\n|.", "\n")
does not return true as I would expect.

"\n" has length 1. Try printing the result of "\n".length().

You can access individual characters within a string using zero-based
indexes; e.g., to get the third character in string s, use s.charAt(2).
 
Jeffrey Schwab

donald said:
I am trying to look at a string and determine whether it is a single
character and, if so, convert it to an integer. I already understand how
to convert to the integer (using byte rather than int is also better!),
but it seems that escape characters are represented as two chars within
a Java string, i.e. "\n" has length two, whereas I need to find it by
length one. A regular expression:
Pattern.matches("\n|.", "\n")
does not return true as I would expect.

"\n" has length 1. Try printing the result of "\n".length().

Access individual characters within a string using zero-based indexes;
e.g., to get the first character in string s, use s.charAt(0).

comp.lang.java.help is probably a better place for this kind of question.
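
A quick way to check the lengths is the sketch below; the class name `EscapeLength` and the helper `firstCode` are hypothetical. It contrasts the literal "\n" (one char, the line feed) with "\\n" (two chars, a backslash and an 'n'), which may be what the input actually contains if the regular expression fails to match.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the thread.
final class EscapeLength {

    // First UTF-16 code unit of a non-empty string, as an int.
    static int firstCode(String s) {
        return s.charAt(0);
    }

    public static void main(String[] args) {
        String newline = "\n";    // one char: the line feed itself
        String twoChars = "\\n";  // two chars: a backslash followed by 'n'

        System.out.println(newline.length());   // 1
        System.out.println(twoChars.length());  // 2
        System.out.println(firstCode(newline)); // 10

        // The pattern is "a real line feed, or any single character".
        System.out.println(Pattern.matches("\n|.", newline));  // true
        System.out.println(Pattern.matches("\n|.", twoChars)); // false
    }
}
```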
 
tom fredriksen

donald said:
Hi there,

I am doing some Java work and basically I get a String "/n" and I need to get
the ASCII value of it, which in this case is 10. What is the best way of
going about this?

RTMF!

Strings in Java are Unicode, so they are 16 bits wide; ASCII is 8 bit.
That means you have to use the String class's methods to retrieve the
individual character values correctly, by looping over the string and
converting each character to an integer.
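
The loop described above could be sketched roughly as follows. The class name `CharCodes` and the method `codes` are hypothetical; only `String.length()` and `String.charAt(int)` are standard API.

```java
// Hypothetical demo class; the names are illustrative, not from the thread.
final class CharCodes {

    // Returns the numeric value of every char (UTF-16 code unit) in s.
    static int[] codes(String s) {
        int[] result = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            result[i] = s.charAt(i); // char widens to int
        }
        return result;
    }

    public static void main(String[] args) {
        for (int code : codes("Hi\n")) {
            System.out.println(code); // 72, then 105, then 10
        }
    }
}
```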

/tom
 
jeanlutrin

tom fredriksen wrote:
....
Strings in Java are unicode, so they are 16 bits wide,

Strings in Java are Strings. The primitive char type is based
on Unicode 3.0 and chars in Java are hence 16 bits wide, which
is unfortunate because, since Unicode 3.1, this is not enough to
represent all Unicode code points.

ascii is 8 bit.

No.

ASCII is a seven-bit code.
 
Chris Uppal

Strings in Java are Strings. The primitive char type is based
on Unicode 3.0 and char in Java are hence 16 bits wide, which
is unfortunate since since Unicode 3.1

Small correction (just for historical interest): the Unicode standard abandoned
16-bitness no later than v 2.0.0 published in July '96.

-- chris
 
Oliver Wong

donald said:
I am trying to look at a string and determine whether it is a single
character and, if so, convert it to an integer. I already understand how
to convert to the integer (using byte rather than int is also better!),
but it seems that escape characters are represented as two chars within
a Java string, i.e. "\n" has length two, whereas I need to find it by
length one. A regular expression:
Pattern.matches("\n|.", "\n")
does not return true as I would expect.

Any ideas?

First, read
http://groups.google.ca/group/comp.lang.java.programmer/msg/3fd11f7fb586e837

Now after having read that, do you mean an in-memory string of length 2
of which the first character is the slash, and the second character is an
'n', or do you mean an in-memory string of length 1 of which the only
character is the newline character?
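
If both cases can occur, one possible way to handle them is the sketch below. The class `EscapeOrChar` and its `codeOf` helper are hypothetical, and the set of recognised escapes (n, t, r) is an assumption made for illustration, not something stated in this thread.

```java
// Hypothetical demo class; the names are illustrative, not from the thread.
final class EscapeOrChar {

    // Accepts either a single character, or a backslash followed by one of
    // n, t, r, and returns the numeric value of the character it denotes.
    static int codeOf(String s) {
        if (s.length() == 1) {
            return s.charAt(0);
        }
        if (s.length() == 2 && s.charAt(0) == '\\') {
            switch (s.charAt(1)) {
                case 'n': return '\n'; // 10
                case 't': return '\t'; // 9
                case 'r': return '\r'; // 13
                default: break;
            }
        }
        throw new IllegalArgumentException("not a single character: " + s);
    }

    public static void main(String[] args) {
        System.out.println(codeOf("\n"));  // in-memory line feed: 10
        System.out.println(codeOf("\\n")); // backslash + 'n' read as an escape: 10
    }
}
```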

- Oliver
 
tom fredriksen

tom fredriksen wrote:
....

Strings in Java are Strings. The primitive char type is based
on Unicode 3.0 and char in Java are hence 16 bits wide, which
is unfortunate since since Unicode 3.1 this is not enough to
represent all Unicode codepoints.

Not really the point, is it?

Java strings are based on Unicode, which in Java is represented as UTF-16,
so strings in Java are 16 bits wide. The fact that the underlying primitive
type is char, which is based on UTF-16, is irrelevant for this discussion.

No.

ASCII is a seven-bit code.

No, US-ASCII is 7 bit, ASCII is 8 bit. The fact that you distinguish
between ASCII from 1967 and the current definition of ASCII is
interesting only if you are using Bell's teleprinter.


I hate language lawyers:(

/tom
 
jeanlutrin

No, US-ASCII is 7 bit, ASCII is 8 bit.

That is plain wrong. Even scarier: so many semi-knowledgeable
programmers get this wrong that it *is* definitely a very common
source of endless bugs and misconceptions.

ASCII is 7 bit. Get over it.

Now, you find me a string that gives me a byte over 127 by using
the following method, will you?

final String s = ...;
final byte[] missionImpossible = s.getBytes("ASCII");

Hint: "US-ASCII" and "ASCII" are the same for Sun, as they are for
anybody familiar with this concept. And ASCII *fscking* is 7 bit.
Which is why you will *not* give me an "ASCII byte" above 127.
There's no such thing. How do you want me to explain it?
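
A minimal sketch of that challenge, assuming Java 7 or later for `StandardCharsets` (the class name `AsciiBytes` and the method `allSevenBit` are hypothetical). `String.getBytes(Charset)` replaces characters the charset cannot represent with the charset's replacement byte, which for US-ASCII is '?'.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative, not from the thread.
final class AsciiBytes {

    // True if every byte of the US-ASCII encoding of s is in 0..127.
    // Java bytes are signed, so a value "above 127" would show up as negative.
    static boolean allSevenBit(String s) {
        for (byte b : s.getBytes(StandardCharsets.US_ASCII)) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // e-acute and the euro sign have no ASCII code; '?' (63) is substituted.
        String s = "\u00e9\u20ac\n";
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(bytes)); // [63, 63, 10]
        System.out.println(allSevenBit(s));         // true
    }
}
```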

Now it is *you* who should "RTMF!" (sic) as you wrote it in your
first (false) post.

You'll also be nice and explain how come both ASCII and ISO-8859-1
are what is called "code subsets" of Unicode.

You'll also explain to me how come ASCII is a code subset of
ISO-8859-1 if ASCII is 8 bits. That should be interesting to hear
because on one side it is an accepted *fact* that ASCII is a code
subset of ISO-Latin-1. It is also an accepted *fact* that
ISO-Latin-1 is an 8-bit code. And, to the best of my knowledge,
it is also an accepted *fact* that no matter how strong the
reality distortion field you have, ISO-Latin-1 (i.e. ISO-8859-1)
is *not* a synonym for ASCII.

Before coming up with logical fallacies I urge you to use a search
engine and look for topics on this issue.

I am right and you're plain wrong. My assertions are based on
facts, so you'll have a very hard time arguing with me on this topic.

The fact that you distinguish
between ascii from 1967 and the current definition of ascii is
interesting only if you are using Bell's teleprinter.

There's no such thing as "the current definition of ASCII".
You may be confusing ASCII (7 bit) with the much less common
"extended ASCII". Extended ASCII is *definitely NOT* what most
people are referring to when they're referring to ASCII. While ASCII
is very common, extended ASCII is not. Most character sets are
ASCII supersets (ISO-8859-1 and Unicode to name two very
common ones). Most character sets are *not* "extended ASCII"
supersets.

That said, the fact that sadly some programmers, just like you,
think that there's such a thing as a "current definition of ascii" has
created numerous problems and incompatibilities in many
applications and countless misleading docs, and continues to help the
spread of that misconception by filling blogs and Usenet's archives
with such blatantly wrong claims.

Now, will you persist in insisting that, in your words:

"ASCII is 8 bit" ?

I hate language lawyers:(

I hate people making blatantly false claims, spreading
misconceptions, filling Usenet's archives with junk and, most
importantly, refusing to admit their errors in spite of
undeniable evidence.

:(
 
tom fredriksen

I do not need to respond to people behaving rudely, please learn some
manners and netiquette before talking online. This is a discussion group
not an abuse forum!

But, I will apologise for the "language lawyer" statement, I was
influenced by private matters which should not affect other people.

/tom

 
Oliver Wong

I'm not disagreeing with *most* of what you wrote; just two minor
nitpicks, and an open statement at the end.

There's no such thing as "the current definition of ASCII".

According to Wikipedia, http://en.wikipedia.org/wiki/Ascii

<quote>
The American Standards Association (ASA, later to become ANSI) first
published ASCII as a standard in 1963. ASCII-1963 lacked the lowercase
letters, and had an up-arrow instead of the caret and a left-arrow instead
of the underscore. The 1967 version added the lowercase letters, changed the
names of a few control characters and moved the two controls ACK and ESC
from the lowercase letters area into the control codes area.

ASCII was subsequently updated and published as ANSI X3.4-1968, ANSI
X3.4-1977, and finally, ANSI X3.4-1986
</quote>

So while it may be pedantic, it would not be incorrect or meaningless to
ask, "Which version of ASCII do you mean?"

While ASCII
is very common, extended ASCII is not.

I believe MS-DOS (I forget which versions) uses extended ASCII, so it
couldn't have been that uncommon (the MS-QBasic program, for example, made
heavy use of characters 176 to 218).

Now, will you persist in insisting that, in your words:
"ASCII is 8 bit" ?

The term "ASCII" in the sentence "ASCII is 8 bit" in this context might
refer to multiple things (even if we disregard all versions of ASCII prior
to the ANSI X3.4-1986 standard), one of which might be "The encoding Java
uses when we ask for the 'ASCII' encoding."

Conceptually, we have a string in memory, and we wish to store that
string to disk, using a specific encoding. In our case, the 'ASCII'
encoding. Now when we say "Encoding FOO is n bits", what we usually mean is
either "the encoding uses n bits per character to represent a given string"
or the less restrictive "*on average*, the encoding uses n bits per
character to represent a given string". In this sense, UTF-16 can be said to
be "16 bits" even though certain characters take 32 bits to encode. It's
imprecise (arguably flat out wrong), but you "know what they mean" when they
say it.

Now if we had an encoding which was said to be "7 bits", then the
encoding of a 16 character string should be 112 bits. An encoding which is
said to be "8 bits" would use 128 bits to encode that same 16 character
string.

So when you encode a 16-character string in Java using the "ASCII"
encoding, does it result in a bitstream of length 112 or 128? I would guess
128.
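
The guess is easy to check with a sketch like the one below, assuming Java 7 or later for `StandardCharsets`; the class name `EncodedSize` and the method `asciiBits` are hypothetical. It counts the bytes the encoder produces and multiplies by 8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative, not from the thread.
final class EncodedSize {

    // Number of bits occupied by the US-ASCII encoding of s.
    static int asciiBits(String s) {
        return s.getBytes(StandardCharsets.US_ASCII).length * 8;
    }

    public static void main(String[] args) {
        String sixteen = "0123456789abcdef";    // 16 characters
        System.out.println(asciiBits(sixteen)); // 128
    }
}
```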

I think one problem here is that ASCII conflates the concepts of
numbering characters and encoding them. There's a clear distinction between
those concepts with Unicode and, say, UTF-8. Unicode merely assigns numbers
to each character, and UTF-8 assigns a mapping between numbers and
bitstreams.

When ASCII is used as a character-numbering scheme, there are 128
character-number mappings, and ASCII is a "closed" system, where no new
characters can be added to it, so it might make sense to actually say that
this character-number mapping is inherently 7 bits (contrast this with
Unicode, where more characters may be added in the future, and so the system
does not inherently have a bit size).

When ASCII is used as an encoding, to convert to bitstream, it seems
most implementations use 8 bits per character. So in that sense, it would
seem that "ASCII", the number-to-bitstream mapping system, is 8 bits.

- Oliver
 
tom fredriksen

Oliver said:
I believe MS-DOS (I forget which versions) uses extended ASCII, so it
couldn't have been that uncommon (the MS-QBasic program, for example,
made heavy use of characters 176 to 218).

MS-DOS since 1989 (I think 2.x or 3.x) has been using ASCII and code
paging (also known as extended ASCII) to support national characters in
at least Europe, where the code page maps onto the last 128 values of the
byte.

The thing is, most widely used character encodings today use US-ASCII as
their foundation and then extend it with either 1, 9 or 25 bits.

/tom
 
John O'Conner

tom fredriksen wrote:
...

Strings in Java are Strings. The primitive char type is based
on Unicode 3.0 and char in Java are hence 16 bits wide, which
is unfortunate since since Unicode 3.1 this is not enough to
represent all Unicode codepoints.

As of 1.5 (Tiger), Java supports the Unicode 4.0 standard. Also, several
classes, including String, have been updated to handle the fact that a
"character" can now be 1 or 2 char values. The char type now represents
a Unicode code unit in UTF-16. UTF-16 encodes Unicode code points
(0x0000 through 0x10FFFF) as one or two 16-bit code units.
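
To illustrate the difference between char values and code points, here is a small sketch using methods added in Java 5 (`Character.toChars`, `String.codePointCount`, `String.codePointAt`). The class name `CodeUnits` and the helper `codePoints` are hypothetical.

```java
// Hypothetical demo class; the names are illustrative, not from the thread.
final class CodeUnits {

    // Number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the 16-bit range,
        // so UTF-16 needs two code units (a surrogate pair) for it.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());       // 2 char values
        System.out.println(codePoints(clef));    // 1 code point
        System.out.println(clef.codePointAt(0)); // 119070
        System.out.println("\n".codePointAt(0)); // 10
    }
}
```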

For slightly more information regarding Strings and their length, you
can read my blog entry on this topic:
http://weblogs.java.net/blog/joconner/archive/2005/08/how_long_is_you.html

Regards,
John O'Conner
 
jeanlutrin

John O'Conner wrote:
....
As of 1.5 (Tiger), Java supports the Unicode 4.0 standard. Also, several
classes, including String, have been updated to handle the fact that a
"character" can now be 1 or 2 char values. The char type now represents
a Unicode code unit in UTF-16. UTF-16 encodes Unicode code points
(0x0000 through 0x10FFFF) as one or two 16-bit code units.

For slightly more information regarding Strings and their length, you
can read my blog entry on this topic:
http://weblogs.java.net/blog/joconner/archive/2005/08/how_long_is_you.html

Thanks John,

I'm fully aware of that. It's exactly because a character can now
need more than one char value that I wrote that it was unfortunate :)

See you soon on c.l.j.p.,

Jean
 
