Regular Expressions

Markos Charatzas · Feb 5, 2004

Hi all,

I'm trying to parse the following expression, but I'm having difficulty
understanding the whole "parse a String" theory.

The string starts like this
XXX XXX 00-00-00000000000:0000:00XXXX X000 X

and then continues 'n' times

either in the same form as before
XXX XXX 00-00-00000000000:0000:00XXXX X000 X

or
00-00-00000000000:0000:00XXXX X000 X

where 0 is a digit and X is a character.

I have this idea of checking the first 10 bytes of each string to see
whether or not they represent a character.
If yes, then I link the current 'XXX XXX ' with the remaining string.
If not, then I link the last 'XXX XXX ' with the remaining string.

But I'm having trouble implementing it.

Thanks in advance for your responses.

nos · Feb 5, 2004

Perhaps I am incorrect, but are not Strings comprised
of characters?
Can you provide a concrete example?

Chris Smith · Feb 6, 2004

Markos said:
I'm trying to parse the following expression but i'm having difficulties
understanding the whole "parse a String" theory.

Okay. Part of your confusion may come from a confusion about the nature
of character strings in Java. Let's clear that one up first:

I have this idea of checking the first 10 bytes of each string to see
whether or not they represent a character.

This makes absolutely no sense. I don't know what you mean by "byte"
and "character", but here is the general take on those:

1. A "character" is a single component of a string. There are many
possible characters; in Java, a character can be any of 65,536 (64K)
different 16-bit Unicode values. These include letters and digits
and punctuation from a variety of worldwide languages, plus some control
characters, math symbols, and a lot of other stuff.

2. A "byte" is an eight-bit binary value.

3. A "string" is a sequence of characters. Strings have no particular
connection to bytes, though, and it makes no sense at all to talk about
the first ten bytes of a string. Strings simply don't contain bytes;
they contain characters.

4. Characters and bytes are related by something called a character
encoding. There are many different character encodings (easily hundreds
of them), and a very common mistake is to assume the one you're familiar
with -- often Windows CP1252 or ISO 8859-1 -- is the *only* possible
encoding. Strings don't have an encoding, but whenever you write them
to a binary form (such as a file or network stream), you are writing
them using some specific encoding.
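
To make point 4 concrete, here is a minimal sketch. The class and method names are made up for illustration, and StandardCharsets needs Java 7 or later, which postdates this thread. The same four-character string turns into a different number of bytes depending on the encoding chosen:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not from the thread.
class EncodingDemo {
    // Number of bytes the string occupies once written in the given encoding.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    // Convenience wrappers so callers need no charset imports.
    static int latin1Length(String s) { return encodedLength(s, StandardCharsets.ISO_8859_1); }
    static int utf8Length(String s)   { return encodedLength(s, StandardCharsets.UTF_8); }
    static int utf16Length(String s)  { return encodedLength(s, StandardCharsets.UTF_16BE); }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // four characters; the last one is e-acute
        System.out.println(s.length());      // 4 characters
        System.out.println(latin1Length(s)); // 4 bytes
        System.out.println(utf8Length(s));   // 5 bytes
        System.out.println(utf16Length(s));  // 8 bytes
    }
}
```

The String itself never changes here; only the byte form produced at the moment of encoding does.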

Now, on to your problem:

Have you got anything at all to show us? Since the title of your post
is "Regular Expressions", should I assume that you want to use regular
expressions to implement this? What do you mean by "X [is] a
character"? That it's a letter (and if so, in what language -- English
only, or is it okay if it's a letter in the current locale, whatever
that may be)? Or could it be a digit or punctuation mark or even a
control character?

One thing I'll say is that this looks a lot more like a lexing problem
than a true parsing problem. Regular expressions are, therefore, an
appropriate tool for solving it.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation

Markos Charatzas · Feb 6, 2004

Yeap,

Sorry about the confusion!

I was in a bit over my head when I wrote it, having spent more than 2 hours
trying to figure it out.

When I mentioned 'X as character' and '0 as digit', I really meant X being
[a-zA-Z] and 0 being [0-9].

Also, by saying '10 bytes of a String' I meant the first 5 characters,
since 1 char is 2 bytes in Java.

I do have Regular Expressions in mind, because I believe they are the
solution to my problem.

I thought about it again and I'm wondering whether it makes sense to
look for the complete 'XXX XXX ' expression and match it to the
trailing characters until another 'XXX XXX ' comes along.

Thanks for your time reading this.

Markos Charatzas · Feb 6, 2004

Ok, I managed to find this REGEX to do the trick.

[A-Z\s]{10}(\d{1}.{37}){1,}

Thanks all of you for trying to help!
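
For readers who want to see the idea from earlier in the thread spelled out (reuse the last 'XXX XXX ' prefix when a record arrives without one), here is a rough Java sketch. It is not the regex above. It assumes the layout exactly as printed in the first post, with single spaces, X as [A-Za-z], 0 as [0-9], and a long digit run of any length; the real data may be spaced differently. The class name, method name, and sample input are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the field layout is an assumption taken from the post.
class RecordLexer {
    // Group 1: optional "XXX XXX " prefix. Group 2: the record body.
    private static final Pattern RECORD = Pattern.compile(
        "([A-Za-z]{3} [A-Za-z]{3} )?"
        + "(\\d{2}-\\d{2}-\\d+:\\d{4}:\\d{2}[A-Za-z]{4} [A-Za-z]\\d{3} [A-Za-z])");

    // Returns every record found, each joined with the most recent prefix.
    static List<String> lex(String input) {
        List<String> records = new ArrayList<>();
        String prefix = "";
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                prefix = m.group(1); // a new prefix replaces the previous one
            }
            records.add(prefix + m.group(2));
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "ABC DEF 01-02-00000000003:0405:06GHIJ K789 L\n"
                     + "11-12-00000000013:1415:16MNOP Q171 R\n"
                     + "STU VWX 21-22-00000000023:2425:26YZAB C272 D";
        for (String record : lex(input)) {
            System.out.println(record);
        }
    }
}
```

With the sample input, the second record has no prefix of its own, so it is emitted with "ABC DEF " in front; the third record switches to "STU VWX ".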

Dale King · Feb 6, 2004

Chris Smith said:
Okay. Part of your confusion may come from a confusion about the nature
of character strings in Java. Let's clear that one up first:

This makes absolutely no sense. I don't know what you mean by "byte"
and "character", but here is the general take on those:

1. A "character" is a single component of a string. There are many
possible characters; in Java, a character can be any of 65,536 (64K)
different 16-bit Unicode values. These include letters and digits
and punctuation from a variety of worldwide languages, plus some control
characters, math symbols, and a lot of other stuff.

And in JDK1.5 it has gotten slightly more complex, since it now supports
Unicode 4.0 and surrogates.

skeptic · Feb 7, 2004

Dale King said:
And in JDK1.5 it has gotten slightly more complex, since it now supports
Unicode 4.0 and surrogates.

Hello Dale!

Just curious, has the 'char' type been widened (e.g. to 4 bytes)?
If not, how do they implement the charAt(i)?

Regards

Thomas Schodt · Feb 8, 2004

skeptic said:
Just curious, has the 'char' type been widened (e.g. to 4 bytes)?
http://makeashorterlink.com/?P37821657

If not, how do they implement the charAt(i)?

Try it.

Dale King · Feb 9, 2004

skeptic said:
Just curious, has the 'char' type been widened (e.g. to 4 bytes)?
If not, how do they implement the charAt(i)?

No, it is still 16 bits. Basically, Strings and char arrays are now
encoded in UTF-16 as opposed to UCS-2. Handling characters outside the
Basic Multilingual Plane (BMP) requires the use of surrogates. They now
distinguish between code points (the Unicode value) and code units (a Java
char, which is either a symbol from the BMP or a surrogate).

The best way to see what changes is to view the docs for Character (which
Thomas provided a link to) and also for String and search for "1.5" and see
the methods and values added since 1.5.
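
The code point versus code unit distinction can be seen in a few lines. This is only an illustrative sketch (the class and method names are made up); it relies on String.codePointCount and Character.toChars, both part of the 1.5 additions:

```java
// Hypothetical demo class; needs Java 5 or later.
class CodeUnitsDemo {
    // Number of 16-bit code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmpOnly = "AB";
        String withSupplementary = "A\uD840\uDC08"; // 'A' followed by U+20008

        // BMP-only text: one char per code point.
        System.out.println(codeUnits(bmpOnly) + " " + codePoints(bmpOnly)); // 2 2

        // A supplementary character takes two chars (a surrogate pair).
        System.out.println(codeUnits(withSupplementary) + " "
                + codePoints(withSupplementary)); // 3 2
        System.out.println(Character.toChars(0x20008).length); // 2
    }
}
```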

skeptic · Feb 10, 2004

Hi Dale!
I'm familiar with the basics of Unicode. Let me emphasize the point of
the question.
If the data inside a String are kept as a UTF-16-encoded char array, then
getting the i-th char is not as simple as return _data[i], hence slow.
Using an int[] solves that, but adds to memory hogginess.
What was their choice?

Regards

Thomas Schodt · Feb 10, 2004

skeptic said:
If the data inside a String are kept as a UTF-16-encoded char array, then
getting the i-th char is not as simple as return _data[i], hence slow.

You are assuming it won't just return two surrogates?

skeptic · Feb 11, 2004

Thomas Schodt said:
You are assuming it won't just return two surrogates?

No, the problem is that one would have to count all the preceding surrogates
for each charAt() call.
Some smart indexing scheme is possible, but it would still be rather slow.

Regards

Thomas Schodt · Feb 11, 2004

skeptic said:
The problem is that one would have to count all the preceding surrogates
for each charAt() call.
Some smart indexing scheme is possible, but it would still be rather slow.

Or just return two surrogates.

I'm saying that for

String u="A \uD840\uDC08 F \uD840\uDC08 K";

which contains 9 Unicode "tokens" (code points)

u.length() returns 11 and

u.charAt(0) returns 'A'
u.charAt(1) returns ' '
u.charAt(2) returns '\uD840'
u.charAt(3) returns '\uDC08'
u.charAt(4) returns ' '
u.charAt(5) returns 'F'
u.charAt(6) returns ' '
u.charAt(7) returns '\uD840'
u.charAt(8) returns '\uDC08'
u.charAt(9) returns ' '
u.charAt(10) returns 'K'

To get the scalar 21-bit (int) values of
the two Unicode 4.0 supplementary codepoints you have to use
u.codePointAt(2)
and
u.codePointAt(7)

I don't know what u.codePointAt(3) and u.codePointAt(8) would do.
Like I said earlier, try it...
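
For anyone taking up the "try it" suggestion, a minimal sketch along these lines should do (the class name is invented; it needs Java 5 or later). It uses the same string as above and prints values in hex:

```java
// Hypothetical demo class built around the example string from this post.
class SurrogateIndexDemo {
    static final String U = "A \uD840\uDC08 F \uD840\uDC08 K";

    public static void main(String[] args) {
        System.out.println(U.length());                            // 11 chars
        System.out.println(U.codePointCount(0, U.length()));       // 9 code points
        System.out.println(Integer.toHexString(U.charAt(2)));      // d840
        System.out.println(Integer.toHexString(U.codePointAt(2))); // 20008
        // Index 3 is the second half of the pair: the low surrogate
        // itself comes back, widened to an int.
        System.out.println(Integer.toHexString(U.codePointAt(3))); // dc08
    }
}
```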

Carl Howells · Feb 11, 2004

Thomas said:
Or just return two surrogates.

You seem to be intentionally missing the point. skeptic's point is that
charAt() will no longer be able to be a simple index lookup, assuming
that String objects still use a char [] as their internal datatype.

Which means that one of the following will happen: charAt() will be much
slower now than it used to be, String will use more memory than it used
to (if it used an int [] internally, for instance), or some more
complicated clever approach will have to be used for internal storage
and/or the charAt method.

Dale King · Feb 11, 2004

skeptic said:
Hi Dale!
I'm familiar with the basics of Unicode. Let me emphasize the point of
the question.
If the data inside a String are kept as a UTF-16-encoded char array, then
getting the i-th char is not as simple as return _data[i], hence slow.
Using an int[] solves that, but adds to memory hogginess.
What was their choice?

I agree that you can come up with some operations that are faster using an
int[] array. But I don't think those operations are nearly as common as you
think. It is not that often that you actually need to just randomly access
the contents such that you need to access the ith code point (the full 32
bit value).

Think about where that i value is supposedly coming from. How often, given a
string, do you want to go to the 1000th character? Most of the time you want
to find an index into the string (e.g. by doing a search) and then get the
characters around that point. That doesn't require the ability to index into
the string by number of code points (they don't even provide an API for
doing this). It is perfectly doable using code unit indexes.

So for example given a code unit index into a string this code would extract
the next 5 codepoints:

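int[] points = new int[ 5 ];

for( int i = 0; i < 5; i++ )
{
    // read the code point that starts at this code unit index
    int point = myString.codePointAt( index );
    points[ i ] = point;
    // advance by 1 or 2 chars, depending on the code point
    index += Character.charCount( point );
}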

I'm sure you can come up with some reasonable cases where you might need the
functionality you describe, but I think it is rare enough that a little
extra time wins the trade-off with twice as much memory for every single
string.
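
For the rarer case where the i-th code point really is wanted, the String class in released Java versions (5 and later) has offsetByCodePoints, which converts a count of code points into a char index. The sketch below is only an illustration with made-up names; as far as I know the method has to step through the string from the starting index, so it is a sequential lookup rather than a constant-time one.

```java
// Hypothetical helper class; needs Java 5 or later.
class NthCodePoint {
    // Returns the n-th code point (0-based) of s.
    static int nthCodePoint(String s, int n) {
        // Translate "n code points from the start" into a char index.
        int index = s.offsetByCodePoints(0, n);
        return s.codePointAt(index);
    }

    public static void main(String[] args) {
        String s = "A \uD840\uDC08 F"; // 6 chars, 5 code points
        System.out.println(Integer.toHexString(nthCodePoint(s, 2))); // 20008
        System.out.println((char) nthCodePoint(s, 4));               // F
        // charAt(4) is the space after the pair, not the 'F'.
        System.out.println(s.charAt(4) == ' ');                      // true
    }
}
```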

Thomas Schodt · Feb 11, 2004

Carl said:
You seem to be intentionally missing the point.

skeptic's point is that
charAt() will no longer be able to be a simple index lookup, assuming
that String objects still use a char [] as their internal datatype.

My point is that charAt() is *still* a simple index lookup.

Any Unicode 4.0 supplementary code points in Strings are stored as
two char values (surrogates).

This means that Strings can potentially display as few as half as many
code points as String.length() reports.
For Strings containing Unicode 4.0 supplementary code points, the index
you must pass to charAt() no longer corresponds to the offset of the
code point in the visual representation of the String.

You can use codePointAt() to get the 21-bit int value of code points
in a String. When codePointAt() is called with the index of the first
surrogate of a Unicode 4.0 supplementary code point, it returns the
21-bit int value of the entire code point (occupying the bytes at
index and at index+1 in the String). When codePointAt() is called with
the index of a "regular" Unicode code point, it returns the 16-bit int
value of the code point, numerically equivalent to the value charAt()
would return.

I'll let someone else try what happens if you give codePointAt() the
index of the second surrogate of a Unicode 4.0 supplementary code point.

charAt() will be much
slower now than it used to be,
Nope.

String will use more memory than it used to

Nope.
Well, for Unicode 4.0 supplementary codepoints, yes.
Since these are 21-bit values and would not fit in a char.

or some more
complicated clever approach will have to be used for internal storage

Yes. Surrogate pairs.

or some more
complicated clever approach will have to be used for
the charAt() method.

Nope.

Thomas Schodt · Feb 11, 2004

Thomas said:
You can use codePointAt() to get the 21-bit int value of code points
in a String. When codePointAt() is called with the index of the first
surrogate of a Unicode 4.0 supplementary code point, it returns the
21-bit int value of the entire code point (occupying the bytes at
index and at index+1 in the String)

That should be

(occupying the chars at index and at index+1 in the String)

Dale King · Feb 12, 2004

Thomas Schodt said:
I'll let someone else try what happens if you give codePointAt() the
index of the second surrogate of a Unicode 4.0 supplementary code point.

It will return the int value of that surrogate. Basically, if an unpaired
surrogate is found, then it returns the surrogate that it did find. I just
submitted a bug report about this yesterday. How would you detect that
the value you got back was a surrogate? There are Character.isHighSurrogate
and Character.isLowSurrogate, but they only take a char, not an int.

Thomas Schodt · Feb 13, 2004

Dale said:
It will return the int value of that surrogate. Basically, if an unpaired
surrogate is found, then it returns the surrogate that it did find. I just
submitted a bug report about this yesterday. How would you detect that
the value you got back was a surrogate? There are Character.isHighSurrogate()
and Character.isLowSurrogate(), but they only take a char, not an int.

int val = s.codePointAt(i);
if ((val&0xffff) != val) {} // supplementary (int) codepoint
else if (Character.isLowSurrogate((char)val)) {} // 2nd surrogate
else if (Character.isHighSurrogate((char)val)) {} // 1st surrogate
else {} // regular codepoint

or

int val = s.codePointAt(i);
if (Character.getType(val) == Character.SURROGATE) {}
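
Put together as something runnable, the two checks might look like the sketch below. The class name, method name, and labels are invented for illustration; Character.isSupplementaryCodePoint and the int overload of Character.getType are both part of the 1.5 API.

```java
// Hypothetical helper that labels whatever codePointAt() returns.
class CodePointKind {
    static String classify(String s, int i) {
        int val = s.codePointAt(i);
        if (Character.isSupplementaryCodePoint(val)) {
            return "supplementary"; // a full surrogate pair was decoded
        }
        if (Character.getType(val) == Character.SURROGATE) {
            // A lone half came back; the cast is safe because val fits in a char.
            return Character.isHighSurrogate((char) val)
                    ? "high surrogate"
                    : "low surrogate";
        }
        return "BMP"; // an ordinary 16-bit code point
    }

    public static void main(String[] args) {
        String s = "A\uD840\uDC08";
        System.out.println(classify(s, 0));        // BMP
        System.out.println(classify(s, 1));        // supplementary
        System.out.println(classify(s, 2));        // low surrogate
        System.out.println(classify("\uD840", 0)); // high surrogate
    }
}
```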

Dale King · Feb 13, 2004

Thomas Schodt said:
int val = s.codePointAt(i);
if ((val&0xffff) != val) {} // supplementary (int) codepoint
else if (Character.isLowSurrogate((char)val)) {} // 2nd surrogate
else if (Character.isHighSurrogate((char)val)) {} // 1st surrogate
else {} // regular codepoint

Which relies too much on knowing the numeric values. I could have just
compared against MIN_SURROGATE and MAX_SURROGATE, but I shouldn't have to do
that.

int val = s.codePointAt(i);
if (Character.getType(val) == Character.SURROGATE) {}

Yes, I mentioned this one in relation to the bug and Brian Beck is going to
discuss the whole issue with the expert group.

As I wrote to him today, I think the real problem here is that the
codePointAt method returns the surrogate value as an int. Perhaps it
would be better served by returning something like -1 to indicate an error.
If you then want the erroneous surrogate value, you can get it using
charAt, which will give you the correctly typed code unit.
