How to split(//) with respect to digraphs?

Pavel Smerk

One more question:

In Czech, c followed by h is considered (for sorting etc.) to be one
character/grapheme, ch. I need to split a string into single characters
while respecting this absurd convention.

In Perl I can write

split /(?<=(?![Cc][Hh]).)/, $string

and it works fine.

Unfortunately, Ruby does not support this "zero-width positive
look-behind assertion", so the question is: how can one efficiently split
the string in Ruby?

Thanks,

P.
 
Justin Collins

Does this work?

irb(main):001:0> "czech".split(/([Cc][Hh])|/)
=> ["c", "z", "e", "ch"]
irb(main):002:0> "check czech".split(/([Cc][Hh])|/)
=> ["", "ch", "e", "c", "k", " ", "c", "z", "e", "ch"]
irb(main):003:0> "cHeck czeCh".split(/([Cc][Hh])|/)
=> ["", "cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]

-Justin
 
Paul Battley

Or use scan:

str.scan(/(?:ch)|./i)

You might still have a problem with other characters, though,
depending on the encoding and normalisation.
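
To illustrate the normalisation caveat, here is a minimal sketch. It assumes a modern Ruby (2.2 or later) with UTF-8 strings, where String#unicode_normalize is available; the helper name czech_chars is made up for this example. A letter written as a base letter plus a combining accent would otherwise come out of scan as two separate elements.

```ruby
# Hypothetical helper: split a string into Czech "characters",
# treating ch (in any letter case) as a single unit.
# Assumes UTF-8 input and Ruby >= 2.2 for String#unicode_normalize.
def czech_chars(str)
  # Compose base letters and combining marks (NFC) first, so that "."
  # sees one code point per accented letter where a composed form exists.
  str.unicode_normalize(:nfc).scan(/ch|./i)
end

decomposed = "chle\u0301b"          # e followed by a combining acute accent
p decomposed.scan(/ch|./i).length   # 5: the accent is an element of its own
p czech_chars(decomposed).length    # 4: the accent is merged into the e
```

Normalising to NFC only helps where a precomposed code point exists, so this is a partial remedy rather than a general one.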

Paul.
 
Pavel Smerk

Stupid question. :) One should not insist on word-for-word translation
when rewriting some code from Perl to Ruby. :)

One solution is, e.g., scan(/[cC][hH]|./):

irb(main):001:0> "cHeck czeCh".scan(/[cC][hH]|./)
=> ["cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]
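
A small sketch wrapping the scan approach in a method (the name split_digraphs is made up for illustration). One thing to be aware of: "." does not match a newline unless the regexp carries the /m flag, so without it any newlines in the input would be silently dropped from the result.

```ruby
# Hypothetical helper wrapping the scan approach shown above.
# /i makes "ch" case-insensitive; /m lets "." match "\n" as well,
# so no character of the input is lost.
def split_digraphs(str)
  str.scan(/ch|./im)
end

p split_digraphs("cHeck czeCh")
# => ["cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]
p split_digraphs("ch\nc")
# => ["ch", "\n", "c"]
```

Joining the returned array gives back the original string, which is a cheap sanity check for this kind of tokenizer.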

Justin said:
Does this work?

irb(main):001:0> "czech".split(/([Cc][Hh])|/)
=> ["c", "z", "e", "ch"]
irb(main):002:0> "check czech".split(/([Cc][Hh])|/)
=> ["", "ch", "e", "c", "k", " ", "c", "z", "e", "ch"]
irb(main):003:0> "cHeck czeCh".split(/([Cc][Hh])|/)
=> ["", "cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]

The scan version is slightly better, as it never returns an empty string.
Thanks anyway, of course.

But where can one find this feature of the split in the documentation?
http://www.rubycentral.com/ref/ref_c_string.html#split does not mention
split returns not only delimited substrings, but also successful groups
from the match of the regexp.

Regards,

P.
 
Pavel Smerk

Paul said:
Or use scan:

str.scan(/(?:ch)|./i)

Yes, scan occurred to me in the meantime too. Why the (?:)?
str.scan(/ch|./i) does exactly the same, doesn't it?

Thank you,

P.
 
Paul Battley

Pavel said:
Yes, scan occurred to me in the meantime too. Why the (?:)?
str.scan(/ch|./i) does exactly the same, doesn't it?

Yeah, there's no need for the (?: ... ). I started off thinking it was
more complicated than it was, and forgot to take that out. I really
need a regexp refactoring tool.
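
A quick check of that equivalence, as a sketch: alternation has the lowest precedence in a regexp, so the non-capturing group around ch changes nothing. A capturing group, on the other hand, would change what scan returns.

```ruby
str = "cHeck czeCh"

with_group    = str.scan(/(?:ch)|./i)
without_group = str.scan(/ch|./i)
p with_group == without_group   # => true

# With a capturing group, scan returns one array per match holding
# the group's content (nil where the group did not take part).
p "chz".scan(/(ch)|./i)         # => [["ch"], [nil]]
```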

Paul.
 
Justin Collins

Pavel said:
But where can one find this feature of the split in the documentation?
http://www.rubycentral.com/ref/ref_c_string.html#split does not
mention split returns not only delimited substrings, but also
successful groups from the match of the regexp.

Regards,

P.

As far as I can see, it's not in the documentation. I found it by
accident. But, yes, the scan method is better. :)

-Justin
 
Dave Howell

Justin said:
As far as I can see, it's not in the documentation. I found it by
accident. But, yes, the scan method is better. :)

Oh, my gosh. If only you'd posted this little tidbit two days ago, I'd
have saved a couple hours of code-wrangling.

For sorting purposes, I needed to turn something like
(e-mail address removed)
into
(e-mail address removed)-and

I started with str.split(/[.]|@/), but then I'd lose where the @ went.
I tried turning it into
["one-and", ".", "two", "@", "three", ".", "net"]
so I could .reverse that, but without positive look-behind, I couldn't
find any way to detect the break *after* the dot except with \w, which
would also trigger after the hyphen.

After hours of work, I ended up with something that was not only long
and confusing, involving .collect and an inner search loop and other
stuff, but when I brought it back up to check it for this email
message, I discovered that it didn't even actually work correctly.

And all along, all I needed to do was change
str.split(/[.]|@/).reverse.join
into
str.split(/([.]|@)/).reverse.join

Dang. And thanks! :)
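
For illustration, a sketch of that one-character fix. The sample string is a stand-in rebuilt from the array shown above, not the original (removed) address, and the method name reverse_parts is made up.

```ruby
# Hypothetical helper: reverse the order of the parts of a string
# delimited by "." or "@", keeping each delimiter between its neighbours.
def reverse_parts(str)
  # The capture group makes split return the delimiters as well.
  str.split(/([.]|@)/).reverse.join
end

sample = "one-and.two@three.net"   # stand-in value, see the note above

p sample.split(/[.]|@/)     # => ["one-and", "two", "three", "net"]
p sample.split(/([.]|@)/)   # => ["one-and", ".", "two", "@", "three", ".", "net"]
p reverse_parts(sample)     # => "net.three@two.one-and"
```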
 
Morton Goldberg

Pavel said:
But where can one find this feature of the split in the documentation?
http://www.rubycentral.com/ref/ref_c_string.html#split does not mention
split returns not only delimited substrings, but also successful groups
from the match of the regexp.

In Dave Thomas' Pickaxe book. Under String#split he writes:

"If pattern is a Regexp, str is divided where the pattern matches.
Whenever the pattern matches a zero-length string, str is split into
individual characters. If pattern includes groups, these groups will be
included in the returned values."

Then he gives the following example:

"a@1bb@2ccc".split(/@(\d)/) => ["a", "1", "bb", "2", "ccc"]

Regards, Morton
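
The quoted behaviour is easy to check directly. A short sketch contrasting a capturing group with a non-capturing one (the second call is added here for comparison and is not taken from the book):

```ruby
s = "a@1bb@2ccc"

# Capturing group: the captured digits appear in the result.
p s.split(/@(\d)/)     # => ["a", "1", "bb", "2", "ccc"]

# Non-capturing group: the whole delimiter is discarded.
p s.split(/@(?:\d)/)   # => ["a", "bb", "ccc"]
```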
 
