Regular Expressions

  • Thread starter Markos Charatzas

Markos Charatzas

Hi all,

I'm trying to parse the following expression, but I'm having difficulty
understanding the whole "parse a String" theory.

The string starts like this
XXX XXX 00-00-00000000000:0000:00XXXX X000 X

and then continues 'n' times

either in the same form as before
XXX XXX 00-00-00000000000:0000:00XXXX X000 X

or
00-00-00000000000:0000:00XXXX X000 X


Where 0 is a digit and X is a character.

I have this idea of checking the first 10 bytes of each string to see
whether or not they represent a character.
If yes then I link the current 'XXX XXX ' with the remaining string,
If not then I link the last 'XXX XXX ' with the remaining string.


but I'm having trouble implementing it :(

Thanks in advance for your responses.
 

nos

Perhaps I am incorrect, but are not Strings comprised
of characters?
Can you provide a concrete example?
 

Chris Smith

Markos said:
I'm trying to parse the following expression but i'm having difficulties
understanding the whole "parse a String" theory.

Okay. Part of your confusion may come from a confusion about the nature
of character strings in Java. Let's clear that one up first:
I have this idea of checking the first 10 bytes of each string to see
whether or not they represent a character.

This makes absolutely no sense. I don't know what you mean by "byte"
and "character", but here is the general take on those:

1. A "character" is a single component of a string. There are many
possible characters; in Java, a character can be any of roughly 64 thousand
(65,536) different 16-bit Unicode values. These include letters and digits
and punctuation from a variety of worldwide languages, plus some control
characters, math symbols, and a lot of other stuff.

2. A "byte" is an eight-bit binary value.

3. A "string" is a sequence of characters. Strings have no particular
connection to bytes, though, and it makes no sense at all to talk about
the first ten bytes of a string. Strings simply don't contain bytes;
they contain characters.

4. Characters and bytes are related by something called a character
encoding. There are many different character encodings (easily hundreds
of them), and a very common mistake is to assume the one you're familiar
with -- often Windows CP1252 or ISO 8859-1 -- is the *only* possible
encoding. Strings don't have an encoding, but whenever you write them
to a binary form (such as a file or network stream), you are writing
them using some specific encoding.
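
As a rough illustration of point 4 (a sketch added for clarity, not part of the original reply): the same five-character string turns into a different number of bytes depending on the encoding chosen when it is written out. The class and method names are made up for this example, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Number of bytes produced when the text is written out in each encoding.
    static int utf8Bytes(String s)   { return s.getBytes(StandardCharsets.UTF_8).length; }
    static int latin1Bytes(String s) { return s.getBytes(StandardCharsets.ISO_8859_1).length; }
    static int utf16Bytes(String s)  { return s.getBytes(StandardCharsets.UTF_16BE).length; }

    public static void main(String[] args) {
        String s = "na\u00EFve"; // five characters, one of them outside ASCII
        System.out.println(s.length());      // 5 characters
        System.out.println(utf8Bytes(s));    // 6 bytes
        System.out.println(latin1Bytes(s));  // 5 bytes
        System.out.println(utf16Bytes(s));   // 10 bytes
    }
}
```

The string itself never changes; only the byte form does, which is why "the first ten bytes of a string" has no fixed meaning until an encoding is named.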

Now, on to your problem:
Have you got anything at all to show us? Since the title of your post
is "Regular Expressions", should I assume that you want to use regular
expressions to implement this? What do you mean by "X [is] a
character"? That it's a letter (and if so, in what language -- English
only, or is it okay if it's a letter in the current locale, whatever
that may be)? Or could it be a digit or punctuation mark or even a
control character?

One thing I'll say is that this looks a lot more like a lexing problem
than a true parsing problem. Regular expressions are, therefore, an
appropriate tool for solving it.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 

Markos Charatzas

Yeap,

Sorry about the confusion! :)

I was in a bit over my head when I wrote it, having spent more than 2 hours
trying to figure it out.

When I mentioned 'X as character' and '0 as digit' I really meant X being
[a-zA-Z] and 0 being [0-9].

Also, by saying '10 bytes of a String' I meant the first 5 characters,
since 1 char is 2 bytes in Java.

I do have Regular Expressions in mind, because I believe they are the solution
to my problem.

I thought about it again and I'm wondering whether it makes sense to
look for the complete 'XXX XXX ' expression and match it to the
trailing characters until another 'XXX XXX ' comes along.

Thanks for your time reading this.
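
One way the idea above might be sketched (an illustration under assumptions, not code from the thread): treat the 'XXX XXX ' header as optional in front of each record and remember the last header seen. The class name, the method name, and the exact field widths are hypothetical; the long digit run is matched loosely with `\d+` because its width in the post is hard to read reliably.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeaderCarry {
    // Group 1: optional "XXX XXX" header. Group 2: one record.
    private static final Pattern ENTRY = Pattern.compile(
        "(?:([A-Za-z]{3} [A-Za-z]{3}) )?"
        + "(\\d{2}-\\d{2}-\\d+:\\d{4}:\\d{2}[A-Za-z]{4} [A-Za-z]\\d{3} [A-Za-z])");

    // Returns every record prefixed with the most recent header.
    static List<String> link(String input) {
        List<String> out = new ArrayList<>();
        String lastHeader = null;
        Matcher m = ENTRY.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                lastHeader = m.group(1);   // a new header replaces the old one
            }
            if (lastHeader == null) {
                continue;                  // record before any header: skipped here
            }
            out.add(lastHeader + " " + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "ABC DEF 12-34-12345678901:1234:56WXYZ Q789 R "
                      + "98-76-10987654321:4321:65ABCD E123 F";
        for (String line : link(sample)) {
            System.out.println(line);
        }
    }
}
```

Skipping a record that appears before any header is a choice made for this sketch; real input might call for an error instead.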

Markos Charatzas

Ok, I managed to find this REGEX to do the trick.

[A-Z\s]{10}(\d{1}.{37}){1,}

Thanks all of you for trying to help!
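
For readers who want to try the expression, here is a hedged sketch of how it could be applied with java.util.regex. It assumes a block is exactly 10 header characters followed by fixed 38-character records, which is what the pattern implies; the class and method names are invented for the example. Note that a repeated capturing group only keeps its last iteration, so the records are sliced out by position once the whole block has matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RecordSplitter {
    // The pattern from the post: 10 header characters, then one or more
    // 38-character records that each start with a digit.
    static final Pattern BLOCK = Pattern.compile("[A-Z\\s]{10}(\\d{1}.{37}){1,}");
    static final int HEADER_LEN = 10;
    static final int RECORD_LEN = 38;

    // Returns the records of a block, or an empty list if the block does not match.
    static List<String> records(String block) {
        List<String> out = new ArrayList<>();
        if (!BLOCK.matcher(block).matches()) {
            return out;
        }
        for (int i = HEADER_LEN; i + RECORD_LEN <= block.length(); i += RECORD_LEN) {
            out.add(block.substring(i, i + RECORD_LEN));
        }
        return out;
    }

    public static void main(String[] args) {
        // Synthetic input: a 10-character header and two 38-character records.
        String filler = new String(new char[37]).replace('\0', 'x');
        String block = "ABC DEF   " + "1" + filler + "2" + filler;
        System.out.println(records(block).size());
    }
}
```

Because `.` does not match line terminators by default, this only works when a block sits on a single line.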

Dale King

Chris Smith said:
Okay. Part of your confusion may come from a confusion about the nature
of character strings in Java. Let's clear that one up first:


This makes absolutely no sense. I don't know what you mean by "byte"
and "character", but here is the general take on those:

1. A "character" is a single component of a string. There are many
possible characters; in Java, a character can be any of roughly 64 thousand
(65,536) different 16-bit Unicode values. These include letters and digits
and punctuation from a variety of worldwide languages, plus some control
characters, math symbols, and a lot of other stuff.


And in JDK1.5 it has gotten slightly more complex, since it now supports
Unicode 4.0 and surrogates.
 

skeptic

Dale King said:
And in JDK1.5 it has gotten slightly more complex, since it now supports
Unicode 4.0 and surrogates.

Hello Dale!

Just curious, has the 'char' type been widened (e.g. to 4 bytes)?
If not, how do they implement the charAt(i)?

Regards
 

Dale King

skeptic said:
And in JDK1.5 it has gotten slightly more complex, since it now supports
Unicode 4.0 and surrogates.

Just curious, has the 'char' type been widened (e.g. to 4 bytes)?
If not, how do they implement the charAt(i)?


No, it still is 16 bits. Basically String and Character arrays are now
encoded in UTF-16 as opposed to UCS-2. To handle characters outside the BMP
requires the use of surrogates. They now distinguish between code points
(the Unicode value) and code units (Java char which is either a symbol from
BMP or a surrogate).

The best way to see what changes is to view the docs for Character (which
Thomas provided a link to) and also for String and search for "1.5" and see
the methods and values added since 1.5.
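
A small sketch of the code unit versus code point distinction described above (added for illustration; the class name and helper names are made up). It uses only methods that exist on String and Character since 1.5.

```java
class CodeUnitDemo {
    // 'A' followed by one supplementary character stored as a surrogate pair.
    static final String S = "A\uD840\uDC08";

    static int codeUnits(String s)  { return s.length(); }
    static int codePoints(String s) { return s.codePointCount(0, s.length()); }

    public static void main(String[] args) {
        System.out.println(codeUnits(S));   // 3 chars (UTF-16 code units)
        System.out.println(codePoints(S));  // 2 code points
        // charAt is still a plain index into the code units.
        System.out.println(Integer.toHexString(S.charAt(1)));      // d840
        // codePointAt combines the surrogate pair starting at that index.
        System.out.println(Integer.toHexString(S.codePointAt(1))); // 20008
    }
}
```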
 

skeptic

Hi Dale!
I'm familiar with the basics of Unicode. Let me emphasize the point of
the question.
If the data inside a String are kept as a UTF-16-encoded char array, then
getting the i-th char is not as simple as return _data[i], hence slow.
The use of int[] solves it, but adds to memory hogginess.
What was their choice?

Regards
 

Thomas Schodt

skeptic said:
If the data inside a String are kept as a UTF-16-encoded char array, then
getting the i-th char is not as simple as return _data[i], hence slow.


You are assuming it won't just return two surrogates?
 

skeptic

Thomas Schodt said:
skeptic said:
If the data inside a String are kept as a UTF-16-encoded char array, then
getting the i-th char is not as simple as return _data[i], hence slow.


You are assuming it won't just return two surrogates?


No, the problem is that one would have to count all the previous surrogates,
for each charAt() call.
Some smart indexing scheme is possible, but it would still be rather slow.

Regards
 

Thomas Schodt

skeptic said:
The problem is that one would have to count all the previous surrogates,
for each charAt() call.
Some smart indexing scheme is possible, but it would still be rather slow.

Or just return two surrogates.


I'm saying that for

String u="A \uD840\uDC08 F \uD840\uDC08 K";

which contains 9 Unicode "tokens" (code points)

u.length() returns 11 and

u.charAt(0) returns 'A'
u.charAt(1) returns ' '
u.charAt(2) returns '\uD840'
u.charAt(3) returns '\uDC08'
u.charAt(4) returns ' '
u.charAt(5) returns 'F'
u.charAt(6) returns ' '
u.charAt(7) returns '\uD840'
u.charAt(8) returns '\uDC08'
u.charAt(9) returns ' '
u.charAt(10) returns 'K'

To get the scalar 21-bit (int) values of
the two Unicode 4.0 supplementary codepoints you have to use
u.codePointAt(2)
and
u.codePointAt(7)

I don't know what u.codePointAt(3) and u.codePointAt(8) would do.
Like I said earlier, try it...
 

Carl Howells

Thomas said:
Or just return two surrogates.

You seem to be intentionally missing the point. skeptic's point is that
charAt() will no longer be able to be a simple index lookup, assuming
that String objects still use a char [] as their internal datatype.

Which means that one of the following will happen: charAt() will be much
slower now than it used to be, String will use more memory than it used
to (if it used an int [] internally, for instance), or some more
complicated clever approach will have to be used for internal storage
and/or the charAt method.
 

Dale King

skeptic said:
Hi Dale!
I'm familiar with the basics of Unicode. Let me emphasize the point of
the question.
If the data inside a String are kept as a UTF-16-encoded char array, then
getting the i-th char is not as simple as return _data[i], hence slow.
The use of int[] solves it, but adds to memory hogginess.
What was their choice?



I agree that you can come up with some operations that are faster using an
int[] array. But I don't think those operations are nearly as common as you
think. It is not that often that you actually need to randomly access
the contents such that you need the ith code point (the full 32-bit
value).

Think about where that i value is supposedly coming from. How often, given a
string, do you want to go to the 1000th character? Most of the time you want
to find an index into the string (e.g. by doing a search) and then get the
characters around that point. That doesn't require the ability to index into
the string by number of code points (they don't even provide an API for
doing this). It is perfectly doable by using code unit indexes.

So for example given a code unit index into a string this code would extract
the next 5 codepoints:

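int[] points = new int[ 5 ];

for( int i = 0; i < 5; i++ )
{
    // Read the code point that starts at the current code unit index.
    int point = myString.codePointAt( index );
    points[ i ] = point;
    // Advance by one or two chars, depending on the code point.
    index += Character.charCount( point );
}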

I'm sure you can come up with some reasonable cases where you might need the
functionality you describe, but I think it is rare enough that a little
extra time wins the trade-off with twice as much memory for every single
string.
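
A self-contained variant of the loop above, offered as a sketch rather than as the poster's code: it stops at the end of the string instead of assuming five code points are available. The class and method names are hypothetical. The released Java 5 API also includes String.offsetByCodePoints, which converts a code point count into a code unit index by walking the string, so indexing by code point is possible but costs linear time.

```java
import java.util.Arrays;

class CodePointWindow {
    // Collects up to n code points starting at the given code unit index.
    static int[] next(String s, int index, int n) {
        int[] buf = new int[n];
        int count = 0;
        while (count < n && index < s.length()) {
            int cp = s.codePointAt(index);
            buf[count++] = cp;
            index += Character.charCount(cp); // 1 for BMP, 2 for a surrogate pair
        }
        return Arrays.copyOf(buf, count);
    }

    public static void main(String[] args) {
        String u = "A \uD840\uDC08 F \uD840\uDC08 K";
        System.out.println(next(u, 2, 5).length);
        // Code unit index reached after skipping three code points from the start.
        System.out.println(u.offsetByCodePoints(0, 3));
    }
}
```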
 

Thomas Schodt

Carl said:
Thomas said:
Or just return two surrogates.


You seem to be intentionally missing the point.
o_O

skeptic's point is that
charAt() will no longer be able to be a simple index lookup, assuming
that String objects still use a char [] as their internal datatype.

My point is that charAt() is *still* a simple index lookup.

Any Unicode 4.0 supplementary code points in Strings are stored as
two char values (surrogates).

This means that Strings can potentially display as few as half as many
code points as String.length() reports.
For Strings containing Unicode 4.0 supplementary code points, the index
you must pass to charAt() no longer corresponds to the offset of the
code point in the visual representation of the String.

You can use codePointAt() to get the 21-bit int value of code points
in a String. When codePointAt() is called with the index of the first
surrogate of a Unicode 4.0 supplementary code point, it returns the
21-bit int value of the entire code point (occupying the bytes at
index and at index+1 in the String). When codePointAt() is called with
the index of a "regular" Unicode code point, it returns the 16-bit int
value of the code point, numerically equivalent to the value charAt()
would return.

I'll let someone else try what happens if you give codePointAt() the
index of the second surrogate of a Unicode 4.0 supplementary code point.

charAt() will be much
slower now than it used to be,

Nope.

String will use more memory than it used to

Nope.
Well, for Unicode 4.0 supplementary codepoints, yes.
Since these are 21-bit values and would not fit in a char.

or some more
complicated clever approach will have to be used for internal storage

Yes. Surrogate pairs.

or some more
complicated clever approach will have to be used for
the charAt() method.

Nope.
 

Thomas Schodt

Thomas said:
You can use codePointAt() to get the 21-bit int value of code points
in a String. When codePointAt() is called with the index of the first
surrogate of a Unicode 4.0 supplementary code point, it returns the
21-bit int value of the entire code point (occupying the bytes at
index and at index+1 in the String)

That should be

(occupying the chars at index and at index+1 in the String)
 

Dale King

Thomas Schodt said:
I'll let someone else try what happens if you give codePointAt() the
index of the second surrogate of a Unicode 4.0 supplementary code point.


It will return the int value of that surrogate. Basically, if an unpaired
surrogate is found, then it returns the surrogate that it did find. I just
submitted a bug report yesterday to do with this. How would you detect that
the value you got back was a surrogate? There are Character.isHighSurrogate
and Character.isLowSurrogate, but they only take a char, not an int.
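
To make the behaviour concrete, here is a short sketch (illustrative only; the class and the helper `isLoneSurrogate` are invented for this example). It checks the int returned by codePointAt against the surrogate range using the Character.MIN_SURROGATE and Character.MAX_SURROGATE constants.

```java
class SurrogateCheck {
    // True when a value returned by codePointAt is really a lone surrogate code unit.
    static boolean isLoneSurrogate(int codePoint) {
        return codePoint >= Character.MIN_SURROGATE
            && codePoint <= Character.MAX_SURROGATE;
    }

    public static void main(String[] args) {
        String u = "A \uD840\uDC08 F";
        int whole = u.codePointAt(2); // index of the first surrogate: full code point
        int half  = u.codePointAt(3); // index of the second surrogate: just that unit
        System.out.println(Integer.toHexString(whole) + " " + isLoneSurrogate(whole));
        System.out.println(Integer.toHexString(half) + " " + isLoneSurrogate(half));
    }
}
```

Starting a scan in the middle of a pair is exactly the case where the second call applies, so code that jumps to arbitrary indexes may want a check of this kind.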
 

Thomas Schodt

Dale said:
It will return the int value of that surrogate. Basically, if an unpaired
surrogate is found, then it returns the surrogate that it did find. I just
submitted a bug report yesterday to do with this. How would you detect that
the value you got back was a surrogate? There are Character.isHighSurrogate()
and Character.isLowSurrogate(), but they only take a char, not an int.

int val = s.codePointAt(i);
if ((val&0xffff) != val) {} // supplementary (int) codepoint
else if (Character.isLowSurrogate((char)val)) {} // 2nd surrogate
else if (Character.isHighSurrogate((char)val)) {} // 1st surrogate
else {} // regular codepoint

or

int val = s.codePointAt(i);
if (Character.getType(val) == Character.SURROGATE) {}
 

Dale King

Thomas Schodt said:
int val = s.codePointAt(i);
if ((val&0xffff) != val) {} // supplementary (int) codepoint
else if (Character.isLowSurrogate((char)val)) {} // 2nd surrogate
else if (Character.isHighSurrogate((char)val)) {} // 1st surrogate
else {} // regular codepoint

Which relies too much on knowing the numeric values. I could have just
compared against MIN_SURROGATE and MAX_SURROGATE, but I shouldn't have to do
that.

int val = s.codePointAt(i);
if (Character.getType(val) == Character.SURROGATE) {}

Yes, I mentioned this one in relation to the bug and Brian Beck is going to
discuss the whole issue with the expert group.

As I wrote to him today, I'm thinking that the real problem here is that
the codePointAt method is returning the surrogate value as an int. Perhaps it
would be better served by returning something like -1 to indicate an error.
If you then want the erroneous surrogate value, you can get it using
charAt, which will give you the correctly typed code unit.
 