How to split(//) with respect to bigraphs?

Pavel Smerk · Aug 2, 2006

And one more question:

In Czech, c followed by h is treated (for sorting etc.) as one
character/grapheme, ch. I need to split a string into single characters
while respecting this absurd convention.

In Perl I can write

split /(?<=(?![Cc][Hh]).)/, $string

and it works fine.

Unfortunately, Ruby does not implement/support this "zero-width positive
look-behind assertion", so the question is: how can one efficiently split
the string in Ruby?

Thanks,

P.

Justin Collins · Aug 2, 2006

Does this work?

irb(main):001:0> "czech".split(/([Cc][Hh])|/)
=> ["c", "z", "e", "ch"]
irb(main):002:0> "check czech".split(/([Cc][Hh])|/)
=> ["", "ch", "e", "c", "k", " ", "c", "z", "e", "ch"]
irb(main):003:0> "cHeck czeCh".split(/([Cc][Hh])|/)
=> ["", "cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]

-Justin

Paul Battley · Aug 2, 2006

Or use scan:

str.scan(/(?:ch)|./i)

You might still have a problem with other characters, though,
depending on the encoding and normalisation.

Paul.
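
The caveat about encoding and normalisation can be illustrated with a minimal sketch. It assumes a modern Ruby (2.2 or later, where String#unicode_normalize exists) and UTF-8 strings, neither of which applied when this thread was written. The method name czech_graphemes is made up for illustration.

```ruby
# Hypothetical helper, not from the thread: split a UTF-8 string into
# "characters" in which c followed by h counts as one unit.
# Assumes Ruby 2.2+ (String#unicode_normalize) and UTF-8 input.
def czech_graphemes(str)
  # NFC composes a base letter and a combining mark (e + U+0301) into a
  # single code point, so that "." does not separate them.
  # /m lets "." match newlines too; /i accepts ch, Ch, cH and CH.
  str.unicode_normalize(:nfc).scan(/ch|./mi)
end

decomposed = "chle\u0301b"         # "chleb" with e + combining acute accent
p decomposed.scan(/ch|./i).length  # => 5, the accent becomes its own piece
p czech_graphemes(decomposed)      # four pieces: "ch", "l", e-acute, "b"
```

Normalising first is only one option; whether it is enough depends on which characters the input can contain.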

Pavel Smerk · Aug 2, 2006

Stupid question of mine.

One should not insist on a word-for-word translation
when porting code from Perl to Ruby.

One solution is, e.g., scan(/[cC][hH]|./):

irb(main):001:0> "cHeck czeCh".scan(/[cC][hH]|./)
=> ["cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]

The scan version is slightly better than the split one, as it never
returns an empty string. Thanks anyway, of course.

But where is this feature of split documented?
http://www.rubycentral.com/ref/ref_c_string.html#split does not mention
that split returns not only the delimited substrings, but also the groups
captured by the regexp match.

Regards,

P.
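
The difference between the split and scan results discussed above can be seen side by side. This is only a sketch that reuses the patterns and the sample string from the thread; the variable names are made up.

```ruby
str = "check czech"

# split with a capturing group returns the captured "ch" delimiters as
# well, plus an empty string when the input starts with a delimiter.
via_split = str.split(/([Cc][Hh])|/)

# scan returns only the matches, so no empty string can appear.
via_scan = str.scan(/[Cc][Hh]|./)

p via_split.first                         # => ""
p via_scan.first                          # => "ch"
p via_split.reject(&:empty?) == via_scan  # => true
```

Dropping the empty strings afterwards makes the split version usable, but scan avoids the extra step.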

Pavel Smerk · Aug 2, 2006

Paul said:
Or use scan:

str.scan(/(?:ch)|./i)

Yes, using scan occurred to me in the meantime too. But why the (?: ... )?
str.scan(/ch|./i) does exactly the same, doesn't it?

Thank you,

P.

Paul Battley · Aug 2, 2006

Yes, using scan occurred to me in the meantime too. But why the (?: ... )?
str.scan(/ch|./i) does exactly the same, doesn't it?

Yeah, there's no need for the (?: ... ). I started off thinking it was
more complicated than it was, and forgot to take that out. I really
need a regexp refactoring tool.

Paul.
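
A short sketch of why the (?: ... ) is redundant here, and of the one case where a group around ch would change the result. The variable names are made up for illustration.

```ruby
str = "cHeck czeCh"

# "|" has the lowest precedence in a regexp, so grouping one whole
# alternative changes nothing: both patterns match the same pieces.
with_group    = str.scan(/(?:ch)|./i)
without_group = str.scan(/ch|./i)
p with_group == without_group  # => true

# A capturing group is a different matter: scan then returns one array
# of captures per match, with nil where the group took no part.
p "czech".scan(/(ch)|./i)      # => [[nil], [nil], [nil], ["ch"]]
```

So the non-capturing form is harmless, while a plain (ch) would silently turn the result into nested arrays.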

Justin Collins · Aug 2, 2006

Pavel said:
But where can one find this feature of the split in the documentation?
http://www.rubycentral.com/ref/ref_c_string.html#split does not
mention split returns not only delimited substrings, but also
successful groups from the match of the regexp.

As far as I can see, it's not in the documentation. I found it by
accident. But, yes, the scan method is better.

-Justin

Dave Howell · Aug 2, 2006

As far as I can see, it's not in the documentation. I found it by
accident. But, yes, the scan method is better.

Oh, my gosh. If only you'd posted this little tidbit two days ago, I'd
have saved a couple hours of code-wrangling.

For sorting purposes, I needed to turn something like
(e-mail address removed)
into
(e-mail address removed)-and

I started with str.split(/[.]|@/), but then I'd lose where the @ went.
I tried turning it into
["one-and", ".", "two", "@", "three", ".", "net"]
so I could .reverse that, but without positive look-behind, I couldn't
find any way to detect the break *after* the dot except with \w, which
would also trigger after the hyphen.

After hours of work, I ended up with something that was not only long
and confusing, involving .collect and an inner search loop and other
stuff, but when I brought it back up to check it for this email
message, I discovered that it didn't even actually work correctly.

And all along, all I needed to do was change
str.split(/[.]|@/).reverse.join
into
str.split(/([.]|@)/).reverse.join

Dang. And thanks!
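
The one-character change described above can be tried on a sample string. The address below is hypothetical and chosen only for illustration; the real one was removed from the post.

```ruby
# Hypothetical address, not the one from the original post.
address = "first.last@example.org"

# Without a group the delimiters are dropped, so "." and "@" can no
# longer be told apart after the split.
p address.split(/[.]|@/)  # => ["first", "last", "example", "org"]

# With a capturing group the delimiters come back as elements of their
# own, so reversing and joining keeps each one between the right parts.
parts = address.split(/([.]|@)/)
p parts               # => ["first", ".", "last", "@", "example", ".", "org"]
p parts.reverse.join  # => "org.example@last.first"
```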

Morton Goldberg · Aug 2, 2006

But where can one find this feature of the split in the
documentation? http://www.rubycentral.com/ref/ref_c_string.html#split
does not mention split returns not only delimited substrings, but also
successful groups from the match of the regexp.

In Dave Thomas' Pickaxe book. Under String#split he writes:

"If pattern is a Regexp, str is divided where the pattern matches.
Whenever the pattern matches a zero-length string, str is split into
individual characters. If pattern includes groups, these groups will be
included in the returned values."

Then he gives the following example:

"a@1bb@2ccc".split(/@(\d)/) => ["a", "1", "bb", "2", "ccc"]

Regards, Morton
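
The quoted rule about groups can be checked with three variations of the book's example. Only the middle pattern is the one quoted above; the other two are added here for comparison.

```ruby
str = "a@1bb@2ccc"

# No group: the whole delimiter, digit included, is discarded.
p str.split(/@\d/)      # => ["a", "bb", "ccc"]

# Capturing group: the captured digit is included in the result.
p str.split(/@(\d)/)    # => ["a", "1", "bb", "2", "ccc"]

# Non-capturing group: behaves like the pattern without any group.
p str.split(/@(?:\d)/)  # => ["a", "bb", "ccc"]
```

Writing (?: ... ) is therefore the way to group inside a split pattern without adding the group's text to the returned array.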
