String splitting and inclusion

Stuart Clarke

Hi all,

I am having trouble working out some logic for my problem. I basically
have a long string (320 characters) and I want to split it into smaller
strings no longer than 50 characters in length. At present I have the
following regex:

data = "big long string"

puts data.scan(/{50}/)

This nicely breaks up the string however there are a few problems with
it, including:

It only outputs 50-character chunks, so when it gets to the end and
only 20 characters remain, it leaves them out of the output (it outputs
six 50-character strings and ignores the remaining 20).

This regex also splits up words, which is something I don't want. I want
a script to count to 50 and, when it gets there, go backwards to find
some whitespace and split at that point, so that no word is broken up.
As a result, a number of substrings of various sizes will be created,
none longer than 50 characters.

I hope this makes sense. To summarise, I want to break up a string into
chunks of at most 50 characters without breaking up words.

Thanks in advance

Stuart
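
For reference, the approach described above (count to 50, then step back to the nearest whitespace) can be sketched directly with String#rindex. This is a minimal sketch and not code from the thread: wrap_at_whitespace is a hypothetical helper name, and hard-cutting a word that is longer than the limit is an assumption about what should happen in that case.

```ruby
# Hypothetical helper (not from the thread): split text into chunks of at
# most `width` characters, cutting at the last whitespace that fits.
def wrap_at_whitespace(text, width = 50)
  chunks = []
  rest = text.strip
  until rest.empty?
    if rest.length <= width
      chunks << rest
      break
    end
    # Search backwards from index `width` for the nearest whitespace.
    cut = rest.rindex(/\s/, width)
    # Assumption: if there is no whitespace to cut at, the word is longer
    # than `width`, so cut it hard at the limit.
    cut = width if cut.nil? || cut.zero?
    chunks << rest[0...cut].rstrip
    rest = rest[cut..-1].lstrip
  end
  chunks
end

puts wrap_at_whitespace("a bad day in the office today, " * 3, 20)
```

Each chunk is stripped, so the whitespace at a cut point is dropped rather than carried into the next chunk.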
 
Robert Dober

s = "a bad day in the office today, " * 3
puts "Attention, some backtracking here:"
puts s.scan( /.{,20}\b/ )
puts "I cannot come up with a non backtracking solution right now :("

HTH
Robert

--
Toutes les grandes personnes ont d’abord été des enfants, mais peu
d’entre elles s’en souviennent.

All adults have been children first, but not many remember.

[Antoine de Saint-Exupéry]
 
7stud --

Stuart said:
Hi all,

I am having trouble working out some logic for my problem. I basically
have a long string (320 characters) and I want to split into smaller
strings no longer than 50 characters in length. At present I have the
following regex:

data = "big long string"

puts data.scan(/{50}/)

error: invalid regular expression; there's no previous pattern, to which
'{' would define cardinality
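
The error arises because {50} has nothing to repeat. Assuming the intended pattern was /.{50}/ (an assumption; the post as shown has no dot), the sketch below shows why the last 20 characters go missing and how a {1,50} range keeps them. The 320-character sample string is made up.

```ruby
data = "x" * 320   # made-up stand-in for the 320-character string

exact  = data.scan(/.{50}/)     # only full 50-character matches
ranged = data.scan(/.{1,50}/)   # up to 50 characters, so the tail is kept

puts exact.length         # 6
puts ranged.length        # 7
puts ranged.last.length   # 20
```

This only addresses the dropped remainder; it still cuts through words.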


data =<<ENDOFSTRING
Hello world. Hello moon.
Goodbye world. Goodbye moon.

Hello world. Hello moon.
Goodbye world. Goodbye moon.
The end.
ENDOFSTRING

chunks = []
curr_chunk = []
curr_length = 0

data.scan(/.+?\b/m) do |word|
  wlen = word.length

  if curr_length + wlen <= 50
    curr_chunk << word
    curr_length += wlen
  else
    chunks << curr_chunk.join()
    curr_chunk = [word]
    curr_length = wlen
  end
end

if curr_chunk.length > 0
  chunks << curr_chunk.join()
end

p chunks

chunks.each do |chunk|
  puts chunk.length
end
 
7stud --

--output:--
["Hello world. Hello moon.\nGoodbye world. Goodbye ", "moon.\n\nHello
world. Hello moon.\nGoodbye world. ", "Goodbye moon.\nThe end"]
49
48
21

Hmmm...I'm having a problem getting the ending period while using the
word boundary in the regex. I guess that's because there is no start of
a word after the ending period for the regex to match. \s works:

data.scan(/.+?\s/m) do |word|
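
Note that the \s version relies on the text ending in whitespace, which the heredoc does because of its final newline. A possible variation, sketched here as an assumption rather than as part of the original post, tokenises with /\S+\s*/ so the last word is kept either way. chunk_words is a hypothetical name, and keeping an over-long word whole is also an assumption.

```ruby
# Hypothetical helper: accumulate "word plus trailing whitespace" tokens
# into chunks of at most `limit` characters.
def chunk_words(text, limit = 50)
  chunks = []
  current = "".dup
  text.scan(/\S+\s*/) do |token|
    # Assumption: a token longer than `limit` is kept whole on its own.
    if current.empty? || current.length + token.length <= limit
      current << token
    else
      chunks << current
      current = token.dup
    end
  end
  chunks << current unless current.empty?
  chunks
end

sample = "Hello world. Hello moon.\nGoodbye world. Goodbye moon.\nThe end."
p chunk_words(sample)
```

As in the code above, trailing whitespace counts towards the chunk length.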
 
Harry Kakueki

str = "I did not test this completely so you may need to make some
adjustments to this, but give it a try. This cuts on twenty instead of
fifty characters."

(str.length/20).times do
  arr = str.split(//)
  ess = arr.zip((0...arr.length).to_a)
  tee = ess.reverse.detect{|y| y[0] == " " and y[1] <= 20}
  p str.slice!(0..tee[1]).strip
end
p str



Harry
 
David A. Black

Hi --

Try this. I don't guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)


David

--
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Now available: The Well-Grounded Rubyist (http://manning.com/black2)
Training! Intro to Ruby, with Black & Kastner, September 14-17
(More info: http://rubyurl.com/vmzN)
 
Robert Dober

Hmm, my \b at the end of my solution might have been a problem in some
edge cases; however, I would suggest the usage of \z instead of $ and
the m switch. I fail to see why you put a \b at the beginning, David;
would you mind explaining?

In Ruby 1.9 (or Oniguruma that is) the negative lookahead assertion
might lead to the most elegant solution:

/.{,50}(?!\B)/

BTW it seems that {n,m} does not have a "non greedy" and "possessive"
variant, or did I miss it?

Cheers
Robert
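
To illustrate the anchors discussed above: in Ruby, $ matches at the end of every line, \z matches only at the very end of the string, and the m switch only changes whether . matches a newline. A small sketch with a made-up two-line string:

```ruby
text = "first line\nsecond line"   # made-up sample

p text.scan(/\w+$/)    # ["line", "line"]  ($ also matches before a newline)
p text.scan(/\w+\z/)   # ["line"]          (\z matches only at the end of the string)

p text[/first.*second/]    # nil, because . stops at the newline
p text[/first.*second/m]   # "first line\nsecond"
```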
 
Robert Dober

/.{,50}(?!\B)/
Nahh that leaves us with spaces at the beginning of the line, of
course we could do
scan(...).map( &:lstrip ) but that hurts my regex pride ;)

This seems to work (but does not really):

s = "Some words are made of letters! Some are not!"
puts s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ )

Replace the puts with p and you will see trailing whitespace now :(.

This is a little bastard of a problem indeed. Simplest I could come up
with so far:

s = "Some words are made of letters! Some are not!"
p s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ ).map( &:strip )

HTH
Robert

BTW it seems that {n,m} does not have a "non greedy" and "possessive"
variant, or did I miss it?
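Yes I did, at least for the non-greedy one: {n,m}? is there, sorry. As far
as I can tell, {n,m}+ is not possessive in Ruby's engine, though; only ?+,
*+ and ++ are.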
Cheers
Robert
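
A quick way to see the greedy and non-greedy forms of {n,m} side by side, using a made-up sample string. The lookahead for whitespace or end of string is an assumption added for the demonstration, not a pattern from the thread.

```ruby
text = "one two three four"   # made-up sample

greedy = text[/.{1,10}(?=\s|\z)/]    # takes as much as it can, then backs off
lazy   = text[/.{1,10}?(?=\s|\z)/]   # takes as little as it can

p greedy   # "one two"
p lazy     # "one"
```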


 
Robert Dober

s = "Some words are made of letters! Some are not!"
p s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ ).map( &:strip )

s.split( /(.{,10}\S)\s/ ).reject( &:empty? )
 
Robert Dober

s.split( /(.{,10}\S)\s/ ).reject( &:empty? )

good enough? Certainly not :(
s.split( /(.{,10}\S)\s+/ ).reject( &:empty? )

Robert
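
Another pattern sometimes used for this kind of wrapping anchors each chunk on non-whitespace at both ends, so no strip or reject step is needed. It is offered as a sketch, not as the thread's solution: the width of 11 is chosen to match .{,10}\S above, and leaving words longer than the width whole (the \S+ fallback) is an assumption.

```ruby
s = "Some words are made of letters! Some are not!"
width = 11

# A chunk starts and ends on non-whitespace and must be followed by
# whitespace or the end of the string; otherwise take one whole word.
pattern = /\S.{0,#{width - 2}}\S(?=\s|\z)|\S+/

p s.scan(pattern)
# ["Some words", "are made of", "letters!", "Some are", "not!"]
```

Without the m switch, . does not match a newline, so a chunk never spans two lines of the input.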

 
David A. Black

Hi --

The idea was to start every scan at a \b. It's definitely not an
all-purpose solution to the problem anyway. For one thing, it doesn't
handle words of more than 50 characters -- which probably doesn't
matter, unless you're using it with a number less than 50:
=> ["this ", "is a ", "", " and ", "i ", "", " to ", "split", " it ",
"up ", "into ", "", " ", "", ""]

Without the first \b you get:

["this ", "is a ", "", "tring", " and ", "i ", "", "ntend", " to ",
"split", " it ", "up ", "into ", "", "ittle", " ", "", "rings", ""]

So... further tweaking required :)
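
One possible direction for that tweaking, sketched here and not taken from the thread, is to step away from a regex-only solution: split on whitespace, hard-split any word longer than the limit, and pack the pieces greedily. wrap_words is a hypothetical helper name, the sample text is made up, and collapsing runs of whitespace to a single space is an assumption.

```ruby
# Hypothetical helper: wrap text to `width`, breaking only words that
# cannot fit on a line by themselves.
def wrap_words(text, width)
  lines = []
  text.split.each do |word|
    # Words longer than `width` are cut into pieces of at most `width`.
    word.scan(/.{1,#{width}}/).each do |piece|
      if !lines.empty? && lines.last.length + 1 + piece.length <= width
        lines.last << " " << piece   # piece fits on the current line
      else
        lines << piece.dup           # start a new line
      end
    end
  end
  lines
end

p wrap_words("an extraordinarily long word", 5)
```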


David

 
7stud --

Robert said:
I fail to see why you put a \b at the beginning David,
would you mind to explain?

Yes. Please explain that. Also please explain why you don't have
{1,50}?

Or, will you claim the 5th under the robustness disclaimer?
 
David A. Black

I won't claim anything. Feel free to experiment with the code, which
I've already said repeatedly isn't a full solution, and see what you
come up with.


David

 
Robert Dober

Indeed this is very tricky, I had some doubts about your leading \b
example, but I experimented with lots of solutions and they were
covering it up. Thx for explaining. Unless the OP says what he really
wants, I shall stop so as not to make too much noise; e.g. there is the
issue of more than one space and, of course, punctuation.
R.
 
