Regular Expressions

Discussion in 'Java' started by Markos Charatzas, Feb 5, 2004.

  1. Hi all,

    I'm trying to parse the following expression but I'm having difficulties
    understanding the whole "parse a String" theory.

    The string starts like this
    XXX XXX 00-00-00000000000:0000:00XXXX X000 X

    and then continues 'n' times

    either in the same form as before
    XXX XXX 00-00-00000000000:0000:00XXXX X000 X

    or
    00-00-00000000000:0000:00XXXX X000 X


    Where 0 is a digit and X is a character.

    I have this idea of checking the first 10 bytes of each string to see
    whether or not they represent a character.
    If yes then I link the current 'XXX XXX ' with the remaining string,
    If not then I link the last 'XXX XXX ' with the remaining string.


    but I'm having trouble implementing it :(

    Thanks in advance for your responses.
    Markos Charatzas, Feb 5, 2004
    #1

  2. Markos Charatzas

    nos Guest

    "Markos Charatzas" <> wrote in message
    news:...
    > Hi all,
    >
    > I'm trying to parse the following expression but i'm having difficulties
    > understanding the whole "parse a String" theory.
    >
    > The string starts like this
    > XXX XXX 00-00-00000000000:0000:00XXXX X000 X
    >
    > and then continues 'n' times
    >
    > either in the same form as before
    > XXX XXX 00-00-00000000000:0000:00XXXX X000 X
    >
    > or
    > 00-00-00000000000:0000:00XXXX X000 X
    >
    >
    > Where 0 a digit and X a character.
    >
    > I have this idea of checking the first 10 bytes of each string to see
    > whether or not they represent a character.
    > If yes then I link the current 'XXX XXX ' with the remaining string,
    > If not then I link the last 'XXX XXX ' with the remaining string.
    >
    >
    > but i've having trouble implementing it :(
    >
    > Thanx in advance for ur responses.


    Perhaps I am incorrect, but are not Strings comprised
    of characters?
    Can you provide a concrete example?
    nos, Feb 5, 2004
    #2

  3. Markos Charatzas

    Chris Smith Guest

    Markos Charatzas wrote:
    > I'm trying to parse the following expression but i'm having difficulties
    > understanding the whole "parse a String" theory.


    Okay. Part of your confusion may come from a confusion about the nature
    of character strings in Java. Let's clear that one up first:

    > I have this idea of checking the first 10 bytes of each string to see
    > whether or not they represent a character.


    This makes absolutely no sense. I don't know what you mean by "byte"
    and "character", but here is the general take on those:

    1. A "character" is a single component of a string. There are many
    possible characters; in Java, a char is 16 bits wide, so a character can
    be any of 65,536 (64K) different Unicode values. These include letters
    and digits and punctuation from a variety of worldwide languages, plus
    some control characters, math symbols, and a lot of other stuff.

    2. A "byte" is an eight-bit binary value.

    3. A "string" is a sequence of characters. Strings have no particular
    connection to bytes, though, and it makes no sense at all to talk about
    the first ten bytes of a string. Strings simply don't contain bytes;
    they contain characters.

    4. Characters and bytes are related by something called a character
    encoding. There are many different character encodings (easily hundreds
    of them), and a very common mistake is to assume the one you're familiar
    with -- often Windows CP1252 or ISO 8859-1 -- is the *only* possible
    encoding. Strings don't have an encoding, but whenever you write them
    to a binary form (such as a file or network stream), you are writing
    them using some specific encoding.
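    To make point 4 concrete, here is a minimal sketch (the class and method names are made up for illustration, and it assumes a JDK recent enough to have java.nio.charset.StandardCharsets). The same String has one length in chars but a different length in bytes under each encoding:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library: it only wraps String.getBytes.
final class EncodingDemo {
    /** Number of bytes the text occupies once written out in the given encoding. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    /** Number of chars in the String itself, independent of any encoding. */
    static int charLength(String text) {
        return text.length();
    }
}
```

    For a four-letter word ending in an accented e, charLength is 4 while encodedLength is 4 in ISO 8859-1, 5 in UTF-8 and 8 in UTF-16BE, which is why "the first ten bytes of a string" has no fixed meaning.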

    Now, on to your problem:

    > The string starts like this
    > XXX XXX 00-00-00000000000:0000:00XXXX X000 X
    >
    > and then continues 'n' times
    >
    > either in the same form as before
    > XXX XXX 00-00-00000000000:0000:00XXXX X000 X
    >
    > or
    > 00-00-00000000000:0000:00XXXX X000 X
    >
    >
    > Where 0 a digit and X a character.
    >
    > I have this idea of checking the first 10 bytes of each string to see
    > whether or not they represent a character.
    > If yes then I link the current 'XXX XXX ' with the remaining string,
    > If not then I link the last 'XXX XXX ' with the remaining string.
    >
    >
    > but i've having trouble implementing it :(


    Have you got anything at all to show us? Since the title of your post
    is "Regular Expressions", should I assume that you want to use regular
    expressions to implement this? What do you mean by "X [is] a
    character"? That it's a letter (and if so, in what language -- English
    only, or is it okay if it's a letter in the current locale, whatever
    that may be)? Or could it be a digit or punctuation mark or even a
    control character?

    One thing I'll say is that this looks a lot more like a lexing problem
    than a true parsing problem. Regular expressions are, therefore, an
    appropriate tool for solving it.

    --
    www.designacourse.com
    The Easiest Way to Train Anyone... Anywhere.

    Chris Smith - Lead Software Developer/Technical Trainer
    MindIQ Corporation
    Chris Smith, Feb 6, 2004
    #3
  4. Yeap,

    Sorry about the confusion! :)

    I was a bit over my head when I wrote it, having spent more than 2 hours
    trying to figure it out.

    When I mentioned 'X as character' '0 as digit' I really meant X being
    [a-zA-Z] and 0 [0-9].

    Also, by saying '10 bytes of a String' I meant the first 5 characters,
    since 1 char is 2 bytes in Java.

    I do have Regular Expressions in mind, because I believe they're the solution
    to my problem.

    I thought about it again and I'm wondering whether it makes sense to
    look for the complete 'XXX XXX 'expression and match it to the
    trailing characters till another 'XXX XXX ' comes along.
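    That idea can be sketched roughly as follows. This is only an illustration under assumptions: the class name is invented, the digit runs are matched with \d+ because the exact field widths are not certain from the sample, and records are assumed to follow each other directly or separated by whitespace. The header group is optional; when it is missing, the last header seen is reused.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical lexer for the "XXX XXX 00-00-...:0000:00XXXX X000 X" format.
final class RecordLexer {
    // Group 1: optional "XXX XXX" header. Group 2: the record body.
    private static final Pattern RECORD = Pattern.compile(
        "(?:([a-zA-Z]{3} [a-zA-Z]{3}) )?"
        + "(\\d+-\\d+-\\d+:\\d+:\\d+[a-zA-Z]{4} [a-zA-Z]\\d{3} [a-zA-Z])");

    /** Returns every record prefixed with the header that applies to it. */
    static List<String> lex(String input) {
        List<String> records = new ArrayList<>();
        String lastHeader = null;
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                lastHeader = m.group(1);   // a new header starts here
            }
            if (lastHeader == null) {
                throw new IllegalArgumentException("first record has no header");
            }
            records.add(lastHeader + " " + m.group(2));
        }
        return records;
    }
}
```

    Calling RecordLexer.lex on the whole input returns one entry per record, each carrying its own or the inherited 'XXX XXX' prefix.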

    Thanks for your time reading this.




    Markos Charatzas, Feb 6, 2004
    #4
  5. Ok, I managed to find this REGEX to do the trick.

    [A-Z\s]{10}(\d{1}.{37}){1,}

    Thanks all of you for trying to help!
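    For anyone reusing this from Java source, a small sketch of how the expression might be written as a string literal (the class and constant names are invented for illustration). The backslashes have to be doubled, \d{1} is the same as \d, and {1,} is the same as +. Note that [A-Z\s] accepts only upper-case letters and whitespace, so lower-case headers would not match.

```java
import java.util.regex.Pattern;

// Hypothetical holder for the regex posted above.
final class PostedRegex {
    // The expression exactly as posted, with backslashes doubled for Java.
    static final Pattern AS_POSTED =
        Pattern.compile("[A-Z\\s]{10}(\\d{1}.{37}){1,}");

    // An equivalent, shorter spelling of the same expression.
    static final Pattern SIMPLIFIED =
        Pattern.compile("[A-Z\\s]{10}(\\d.{37})+");

    /** True if the whole input (not just a part of it) matches the pattern. */
    static boolean matchesWhole(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }
}
```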



    Markos Charatzas, Feb 6, 2004
    #5
  6. Markos Charatzas

    Dale King Guest

    "Chris Smith" <> wrote in message
    news:4.net...
    > Markos Charatzas wrote:
    > > I'm trying to parse the following expression but i'm having difficulties
    > > understanding the whole "parse a String" theory.

    >
    > Okay. Part of your confusion may come from a confusion about the nature
    > of character strings in Java. Let's clear that one up first:
    >
    > > I have this idea of checking the first 10 bytes of each string to see
    > > whether or not they represent a character.

    >
    > This makes absolutely no sense. I don't know what you mean by "byte"
    > and "character", but here is the general take on those:
    >
    > 1. A "character" is a single component of a string. There are many
    > possible characters; in Java, a char is 16 bits wide, so a character can
    > be any of 65,536 (64K) different Unicode values. These include letters
    > and digits and punctuation from a variety of worldwide languages, plus
    > some control characters, math symbols, and a lot of other stuff.



    And in JDK1.5 it has gotten slightly more complex, since it now supports
    Unicode 4.0 and surrogates.
    --
    Dale King
    Dale King, Feb 6, 2004
    #6
  7. Markos Charatzas

    skeptic Guest

    "Dale King" <kingd[at]tmicha[dot]net> wrote in message news:<>...
    > > 1. A "character" is a single component of a string. There are many
    > > possible characters; in Java, a char is 16 bits wide, so a character can
    > > be any of 65,536 (64K) different Unicode values. These include letters
    > > and digits and punctuation from a variety of worldwide languages, plus
    > > some control characters, math symbols, and a lot of other stuff.

    >
    >
    > And in JDK1.5 it has gotten slightly more complex, since it now supports
    > Unicode 4.0 and surrogates.


    Hello Dale!

    Just curious, has the 'char' type been widened (e.g. to 4 bytes)?
    If not, how do they implement the charAt(i)?

    Regards
    skeptic, Feb 7, 2004
    #7
  8. Thomas Schodt, Feb 8, 2004
    #8
  9. Markos Charatzas

    Dale King Guest

    "skeptic" <> wrote in message
    news:...
    > "Dale King" <kingd[at]tmicha[dot]net> wrote in message news:<>...
    > > > 1. A "character" is a single component of a string. There are many
    > > > possible characters; in Java, a char is 16 bits wide, so a character
    > > > can be any of 65,536 (64K) different Unicode values. These include
    > > > letters and digits and punctuation from a variety of worldwide
    > > > languages, plus some control characters, math symbols, and a lot of
    > > > other stuff.
    > >
    > > And in JDK1.5 it has gotten slightly more complex, since it now supports
    > > Unicode 4.0 and surrogates.
    >
    > Just curious, has the 'char' type been widened (e.g. to 4 bytes)?
    > If not, how do they implement the charAt(i)?



    No, it still is 16 bits. Basically String and Character arrays are now
    encoded in UTF-16 as opposed to UCS-2. To handle characters outside the BMP
    requires the use of surrogates. They now distinguish between code points
    (the Unicode value) and code units (Java char which is either a symbol from
    BMP or a surrogate).
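    A small sketch of what that means in practice (the class and method names are invented for illustration; Character.toChars is one of the methods added in 1.5). It shows the UTF-16 code units a single code point turns into:

```java
// Hypothetical demo class: prints the UTF-16 code units of one code point.
final class SurrogateDemo {
    /** Hex form of the one or two Java chars that encode the code point. */
    static String utf16Hex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }
}
```

    A BMP code point such as 0x41 comes back as a single unit, while a supplementary one such as 0x20008 comes back as a high surrogate followed by a low surrogate.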

    The best way to see what changes is to view the docs for Character (which
    Thomas provided a link to) and also for String and search for "1.5" and see
    the methods and values added since 1.5.
    --
    Dale King
    Dale King, Feb 9, 2004
    #9
  10. Markos Charatzas

    skeptic Guest

    "Dale King" <kingd[at]tmicha[dot]net> wrote in message news:<>...
    ................
    > > > And in JDK1.5 it has gotten slightly more complex, since it now supports
    > > > Unicode 4.0 and surrogates.

    > >
    > > Just curious, has the 'char' type been widened (e.g. to 4 bytes)?
    > > If not, how do they implement the charAt(i)?

    >
    >
    > No, it still is 16 bits. Basically String and Character arrays are now
    > encoded in UTF-16 as opposed to UCS-2. To handle characters outside the BMP
    > requires the use of surrogates. They now distinguish between code points
    > (the Unicode value) and code units (Java char which is either a symbol from
    > BMP or a surrogate).
    >
    > The best way to see what changes is to view the docs for Character (which
    > Thomas provided a link to) and also for String and search for "1.5" and see
    > the methods and values added since 1.5.


    Hi Dale!
    I'm familiar with the basics of Unicode. Let me emphasize the point of
    the question.
    If the data inside a String are kept as UTF16-encoded char array, then
    getting the i-th char is not as simple as return _data[i], hence slow.
    The use of int[] solves it, but adds to memory hogginess.
    What was their choice?

    Regards
    skeptic, Feb 10, 2004
    #10
  11. skeptic wrote:

    > If the data inside a String are kept as UTF16-encoded char array, then
    > getting the i-th char is not as simple as return _data[i], hence slow.


    You are assuming it won't just return two surrogates?
    Thomas Schodt, Feb 10, 2004
    #11
  12. Markos Charatzas

    skeptic Guest

    Thomas Schodt <"news04jan"@\"xenoc.demon.co.uk\"> wrote in message news:<c0b1vn$9iu$1$>...
    > skeptic wrote:
    >
    > > If the data inside a String are kept as UTF16-encoded char array, then
    > > getting the i-th char is not as simple as return _data[i], hence slow.

    >
    > You are assuming it won't just return two surrogates?


    No, the problem is that one would have to count all the previous surrogates
    for each charAt() call.
    Some smart indexing scheme is possible, but still would be rather slow.
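    To illustrate the cost being described: a sketch of "give me the n-th code point" on top of a UTF-16 char array (the class and method names are invented; it assumes a JDK whose String has offsetByCodePoints, which released versions of 1.5 and later do). The lookup has to walk the string from the start, while charAt(i) stays a plain array access.

```java
// Hypothetical helper showing code-point indexing over UTF-16 storage.
final class CodePointIndexing {
    /**
     * Returns the n-th code point (0-based). offsetByCodePoints scans from
     * the start and skips over surrogate pairs, so this is linear in n.
     */
    static int nthCodePoint(String s, int n) {
        int unitIndex = s.offsetByCodePoints(0, n);
        return s.codePointAt(unitIndex);
    }
}
```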

    Regards
    skeptic, Feb 11, 2004
    #12
  13. skeptic wrote:

    > The problem is that one would have to count all the previous surrogates
    > for each charAt() call.
    > Some smart indexing scheme is possible, but still would be rather slow.


    Or just return two surrogates.


    I'm saying that for

    String u="A \uD840\uDC08 F \uD840\uDC08 K";

    which contains 9 Unicode "tokens" (code points)

    u.length() returns 11 and

    u.charAt(0) returns 'A'
    u.charAt(1) returns ' '
    u.charAt(2) returns '\uD840'
    u.charAt(3) returns '\uDC08'
    u.charAt(4) returns ' '
    u.charAt(5) returns 'F'
    u.charAt(6) returns ' '
    u.charAt(7) returns '\uD840'
    u.charAt(8) returns '\uDC08'
    u.charAt(9) returns ' '
    u.charAt(10) returns 'K'

    To get the scalar 21-bit (int) values of
    the two Unicode 4.0 supplementary codepoints you have to use
    u.codePointAt(2)
    and
    u.codePointAt(7)

    I don't know what u.codePointAt(3) and u.codePointAt(8) would do.
    Like I said earlier, try it...
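    A minimal sketch for trying it (the class name is made up; the string is the one above). Per the String.codePointAt documentation, an index that lands on a low surrogate just returns that surrogate's own value:

```java
// Hypothetical demo class wrapping the example string from this post.
final class CodePointAtDemo {
    static final String U = "A \uD840\uDC08 F \uD840\uDC08 K";

    /** Number of Java chars (UTF-16 code units). */
    static int codeUnits() {
        return U.length();
    }

    /** Number of Unicode code points. */
    static int codePoints() {
        return U.codePointCount(0, U.length());
    }

    /** Code point starting at the given code unit index. */
    static int at(int index) {
        return U.codePointAt(index);
    }
}
```

    Here at(2) and at(7) give 0x20008, while at(3) and at(8) give 0xDC08, the low surrogate by itself.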
    Thomas Schodt, Feb 11, 2004
    #13
  14. Markos Charatzas

    Carl Howells Guest

    Thomas Schodt wrote:
    > skeptic wrote:
    >
    >> The problem is that one would have to count all the previous surrogates
    >> for each charAt() call.
    >> Some smart indexing scheme is possible, but still would be rather slow.

    >
    >
    > Or just return two surrogates.


    You seem to be intentionally missing the point. skeptic's point is that
    charAt() will no longer be able to be a simple index lookup, assuming
    that String objects still use a char [] as their internal datatype.

    Which means that one of the following will happen: charAt() will be much
    slower now than it used to be, String will use more memory than it used
    to (if it used an int [] internally, for instance), or some more
    complicated clever approach will have to be used for internal storage
    and/or the charAt method.
    Carl Howells, Feb 11, 2004
    #14
  15. Markos Charatzas

    Dale King Guest

    "skeptic" <> wrote in message
    news:...
    > "Dale King" <kingd[at]tmicha[dot]net> wrote in message news:<>...
    > ...............
    > > > > And in JDK1.5 it has gotten slightly more complex, since it now
    > > > > supports Unicode 4.0 and surrogates.
    > > >
    > > > Just curious, has the 'char' type been widened (e.g. to 4 bytes)?
    > > > If not, how do they implement the charAt(i)?
    > >
    > > No, it still is 16 bits. Basically String and Character arrays are now
    > > encoded in UTF-16 as opposed to UCS-2. To handle characters outside the
    > > BMP requires the use of surrogates. They now distinguish between code
    > > points (the Unicode value) and code units (Java char which is either a
    > > symbol from BMP or a surrogate).
    > >
    > > The best way to see what changes is to view the docs for Character
    > > (which Thomas provided a link to) and also for String and search for
    > > "1.5" and see the methods and values added since 1.5.
    >
    > Hi Dale!
    > I'm familiar with the basics of Unicode. Let me emphasize the point of
    > the question.
    > If the data inside a String are kept as UTF16-encoded char array, then
    > getting the i-th char is not as simple as return _data[i], hence slow.
    > The use of int[] solves it, but adds to memory hogginess.
    > What was their choice?



    I agree that you can come up with some operations that are faster using an
    int[] array. But I don't think those operations are nearly as common as you
    think. It is not that often that you actually need to just randomly access
    the contents such that you need to access the ith code point (the full 32
    bit value).

    Think about where that i value is supposedly coming from. How often, given a
    string, do you want to go to the 1000th character? Most of the time you want
    to find an index into the string (e.g. by doing a search) and then get the
    characters around that point. That doesn't require the ability to index into
    the string by number of code points (they don't even provide an API for
    doing this). It is perfectly doable by using code unit indexes.

    So for example given a code unit index into a string this code would extract
    the next 5 codepoints:

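    int[] points = new int[ 5 ];

    for( int i = 0; i < 5; i++ )
    {
        // Read the full code point that starts at this code unit index.
        int point = myString.codePointAt( index );
        points[ i ] = point;
        // Advance by 1 or 2 chars, depending on whether it was a surrogate pair.
        index += Character.charCount( point );
    }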

    I'm sure you can come up with some reasonable cases where you might need the
    functionality you describe, but I think it is rare enough that a little
    extra time wins the trade-off with twice as much memory for every single
    string.
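    For anyone who wants to run the loop above, here it is wrapped as a self-contained method. This is only a sketch: the class name, method name and count parameter are invented, and it throws an IndexOutOfBoundsException if fewer than count code points remain after the starting index.

```java
// Hypothetical wrapper around the loop above.
final class NextCodePoints {
    /** Extracts the next count code points, starting at a code unit index. */
    static int[] next(String s, int index, int count) {
        int[] points = new int[count];
        for (int i = 0; i < count; i++) {
            int point = s.codePointAt(index);
            points[i] = point;
            // One char for a BMP code point, two for a surrogate pair.
            index += Character.charCount(point);
        }
        return points;
    }
}
```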
    --
    Dale King
    Dale King, Feb 11, 2004
    #15
  16. Carl Howells wrote:

    > Thomas Schodt wrote:
    >> Or just return two surrogates.

    >
    >
    > You seem to be intentionally missing the point.


    o_O

    > skeptic's point is that
    > charAt() will no longer be able to be a simple index lookup, assuming
    > that String objects still use a char [] as their internal datatype.


    My point is that charAt() is *still* a simple index lookup.

    Any Unicode 4.0 supplementary codepoints in Strings are stored as
    two char values (surrogates).

    This means that Strings can potentially display as few as half as many
    codepoints as String.length() reports.
    For Strings containing Unicode 4.0 supplementary codepoints the index
    you must pass to charAt() no longer corresponds to the offset of the
    codepoint in the visual representation of the String.

    You can use codePointAt() to get the 21-bit int value of codepoints
    in a String. When codePointAt() is called with the index of the first
    surrogate of a Unicode 4.0 supplementary codepoint it returns the
    21-bit int value of the entire codepoint (occupying the bytes at
    index and at index+1 in the String). When codePointAt() is called with
    the index of a "regular" Unicode codepoint it returns the 16-bit int
    value of the codepoint, numerically equivalent to the value charAt()
    would return.

    I'll let someone else try what happens if you give codePointAt() the
    index of the second surrogate of a Unicode 4.0 supplementary codepoint.


    > charAt() will be much
    > slower now than it used to be,


    Nope.

    > String will use more memory than it used to


    Nope.
    Well, for Unicode 4.0 supplementary codepoints, yes.
    Since these are 21-bit values and would not fit in a char.

    > or some more
    > complicated clever approach will have to be used for internal storage


    Yes. Surrogate pairs.

    > or some more
    > complicated clever approach will have to be used for
    > the charAt() method.


    Nope.
    Thomas Schodt, Feb 11, 2004
    #16
  17. Thomas Schodt wrote:

    > You can use codePointAt() to get the 21-bit int value of codepoints
    > in a String. When codePointAt() is called with the index of the first
    > surrogate of a Unicode 4.0 supplementary codepoint it returns the
    > 21-bit int value of the entire codepoint (occupying the bytes at
    > index and at index+1 in the String)


    That should be

    (occupying the chars at index and at index+1 in the String)
    Thomas Schodt, Feb 11, 2004
    #17
  18. Markos Charatzas

    Dale King Guest

    "Thomas Schodt" <"news04jan"@\"xenoc.demon.co.uk\"> wrote in message
    news:c0e6m2$jb0$1$...
    >
    > I'll let someone else try what happens if you give codePointAt() the
    > index of the second surrogate of a Unicode 4.0 supplementary codepoint.



    It will return the int value of that surrogate. Basically if an unpaired
    surrogate is found then it returns the surrogate that it did find. I just
    submitted a bug report yesterday to do with this. How would you detect that
    the value you got back was a surrogate? There are Character.isHighSurrogate
    and Character.isLowSurrogate, but they only take char, not int.
    --
    Dale King
    Dale King, Feb 12, 2004
    #18
  19. Dale King wrote:

    > "Thomas Schodt" <"news04jan"@\"xenoc.demon.co.uk\"> wrote in message
    > news:c0e6m2$jb0$1$...
    >
    >> I'll let someone else try what happens if you give codePointAt() the
    >> index of the second surrogate of a Unicode 4.0 supplementary codepoint.
    >
    > It will return the int value of that surrogate. Basically if an unpaired
    > surrogate is found then it returns the surrogate that it did find. I just
    > submitted a bug report yesterday to do with this. How would you detect that
    > the value you got back was a surrogate? There is Character.isHighSurrogate()
    > and Character.isLowSurrogate(), but they only take char not int.


    int val = s.codePointAt(i);
    if ((val&0xffff) != val) {} // supplementary (int) codepoint
    else if (Character.isLowSurrogate((char)val)) {} // 2nd surrogate
    else if (Character.isHighSurrogate((char)val)) {} // 1st surrogate
    else {} // regular codepoint

    or

    int val = s.codePointAt(i);
    if (Character.getType(val) == Character.SURROGATE) {}
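    A runnable variant of the same idea, as a sketch only (the class name, method name and returned labels are invented for illustration). It avoids hard-coded numeric ranges by using Character.isSupplementaryCodePoint, which does take an int, and only casts to char once the value is known to fit in 16 bits:

```java
// Hypothetical classifier for the value returned by codePointAt.
final class CodePointKind {
    static String classify(String s, int i) {
        int val = s.codePointAt(i);
        if (Character.isSupplementaryCodePoint(val)) {
            return "supplementary";
        }
        // Not supplementary, so val fits in a char and the cast is lossless.
        char unit = (char) val;
        if (Character.isHighSurrogate(unit)) {
            return "high surrogate";
        }
        if (Character.isLowSurrogate(unit)) {
            return "low surrogate";
        }
        return "BMP";
    }
}
```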
    Thomas Schodt, Feb 13, 2004
    #19
  20. Markos Charatzas

    Dale King Guest

    "Thomas Schodt" <"news04jan"@\"xenoc.demon.co.uk\"> wrote in message
    news:c0hs50$1ap$1$...
    > Dale King wrote:
    >
    > > "Thomas Schodt" <"news04jan"@\"xenoc.demon.co.uk\"> wrote in message
    > > news:c0e6m2$jb0$1$...
    > >
    > > > I'll let someone else try what happens if you give codePointAt() the
    > > > index of the second surrogate of a Unicode 4.0 supplementary codepoint.
    > >
    > > It will return the int value of that surrogate. Basically if an unpaired
    > > surrogate is found then it returns the surrogate that it did find. I
    > > just submitted a bug report yesterday to do with this. How would you
    > > detect that the value you got back was a surrogate? There is
    > > Character.isHighSurrogate() and Character.isLowSurrogate(), but they
    > > only take char not int.

    >
    > int val = s.codePointAt(i);
    > if ((val&0xffff) != val) {} // supplementary (int) codepoint
    > else if (Character.isLowSurrogate((char)val)) {} // 2nd surrogate
    > else if (Character.isHighSurrogate((char)val)) {} // 1st surrogate
    > else {} // regular codepoint


    Which relies too much on knowing the numeric values. I could have just
    compared against MIN_SURROGATE and MAX_SURROGATE, but I shouldn't have to do
    that.

    > int val = s.codePointAt(i);
    > if (Character.getType(val) == Character.SURROGATE) {}


    Yes, I mentioned this one in relation to the bug and Brian Beck is going to
    discuss the whole issue with the expert group.

    As I wrote to him today, I'm thinking that the real problem here is that
    the codePointAt method is returning the surrogate value as an int. Perhaps it
    would be better served by returning something like -1 to indicate an error.
    If you then want the erroneous surrogate value then you can get it using
    charAt, which will give you the correctly typed code unit.

    --
    Dale King
    Dale King, Feb 13, 2004
    #20