[QUIZ] Editing Text (#145)

R

Ruby Quiz

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion. Please reply to the original quiz message,
if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Eric Mahurin

Have you ever wondered how a text buffer might be represented in a text editor
or word processor? A simple string to represent text buffer isn't efficient
enough because inserting (i.e. typing) and deleting (backspace) in the middle
would result in moving all of the text to the end for each operation. A data
structure that can efficiently insert and delete in the middle is needed.

The task is to implement data structures for efficiently editing text. One
common data structure is a gap buffer:

http://en.wikipedia.org/wiki/Gap_buffer

Other options to consider include: ropes (see quiz #137), linked lists, simple
strings, or a multi-level combination of data structures (i.e. for lines vs.
characters in a line). There are many data structures that may work efficiently
with simple editing operations, but not all of those data structures will work
well for more complex functionality.

All of the basic operations occur around a text cursor. The minimal operations
on/at the cursor should be:

* Insert a character before or after the cursor.
* Delete a character before or after the cursor and return the
deleted character.
* Move the cursor left or right (crossing line boundaries if
necessary) and return the character or nil at the beginning
or end of the buffer.
* Move the cursor up or down a line and return nil/false only if a
line boundary could not be crossed. The cursor may be placed in
the most natural column for the data structure.

Additional useful operations that you might find in a typical text editor can
also be added:

* Get current line and column numbers
* Copy some amount of text before or after the cursor and return
this buffer.
* Display some context around the cursor.
* Cut some amount of text before or after the cursor and return
this buffer.
* Paste a copy/cut buffer before or after the cursor.
* Insert a file.
* Write to a file.
* Goto a specific line or column.
* Goto the begin/end of the line/buffer.
* Copy or cut to a specific line/column.
* Filter some text through a ruby block.
* Search (and possibly replace) using a regular expression.
* Undo/redo.

Major bonus points for the following where gap buffers probably won't work:

* Only store changes to a file.
* Handle multiple efficient cursors in a text buffer.

Although the focus is on data structures and making the ruby DSL equivalent to
unix "ed" or DOS "edlin", a GUI could be added to make a full-screen/window text
editor.

Here is a benchmark for testing that needs the minimal implementation
(#insert_before, #insert_after, #delete_before, #delete_after, #left, #right,
#up, #down):

# edit_test.rb
# usage: ruby -r klass.rb edit_test.rb <iter> \
# [<constructor> [<lines> <columns>] ...]

require 'benchmark'
require 'test/unit/assertions'
include Test::Unit::Assertions

# char = byte pre 1.9, each_char already defined in 1.9
unless "".respond_to?:)each_char)
class String;alias_method:)each_char, :each_byte);end
end

iterations = ARGV.shift.to_i

while cursor = ARGV.shift
nlines = (ARGV.shift || 10000).to_i
ncolumns = (ARGV.shift || 100).to_i
n = nlines*ncolumns
chars = (?a..?z).to_a
line = (0...ncolumns).inject("") { |line, i|
line << chars[i%chars.length]
}
line[-1] = ?\n

iterations.times { eval(cursor).instance_eval {
Benchmark.benchmark(
"#{cursor}: #{nlines}x#{ncolumns}\n",16,nil,"total"
) { |b|

total = b.report("insert_before") {
nlines.times { line.each_char { |ch| insert_before(ch) } }
}
i = 0
total += b.report("left") { i += 1 while left }
assert_equal(n, i)
i = 0
total += b.report("right") { i += 1 while right }
assert_equal(n, i)
i = 0
total += b.report("up") { i += 1 while up }
assert_equal(nlines, i)
i = 0
total += b.report("down") { i += 1 while down }
assert_equal(nlines, i)
total += b.report("insert_after") {
nlines.times { line.each_char { |ch| insert_after(ch) } }
}
i = 0
total += b.report("delete_before") {
i += 1 while delete_before
}
assert_equal(n, i)
i = 0
total += b.report("delete_after") {
i += 1 while delete_after
}
assert_equal(n, i)

[total]

}
} }
end
 
T

Todd Burch

James said:
All of the basic operations occur around a text cursor. The minimal
operations
on/at the cursor should be:

James, thanks for putting all these challenges together, and thanks to
the contributors too.

Just a terminology thing - the proper term is "caret", not a "text
cursor".

Todd
 
T

Todd Burch

James said:
On Oct 26, 2007, at 8:15 AM, Todd Burch wrote:


While I probably would have called it a caret as well, the term
doesn't seem completely out of line:

Not out of line in the least. Without resorting to cheesey emoticons
(wink wink, smiley, grin, et al), mine was a passing comment. I am sure
no one got confused from your description.

Again, thanks.
 
J

James Edward Gray II

Not out of line in the least. Without resorting to cheesey emoticons
(wink wink, smiley, grin, et al), mine was a passing comment. I am
sure
no one got confused from your description.

Just to be clear, I didn't write that quiz. Eric Mahurin did.

James Edward Gray II
 
R

Rick DeNatale

While I probably would have called it a caret as well, the term
doesn't seem completely out of line:

http://en.wikipedia.org/wiki/Cursor

http://en.wikipedia.org/wiki/Cursor_(computers)

Actually, I'd say that caret is a UI/View related term (its the name
of the ^ character, traditionally used in blue-pencil editing to mark
a place to insert test.

Cursor is kind of both a view and a model term, and I think we're
talking about a model in this quiz.

That said, in general, when I've looked at similar code, the analogous
concept usually used is a selection, which represents a contiguous run
of zero or more characters within the buffer. The quiz seems to be
using a degenerate version of this where the length is constrained to
be zero.
 
J

Joel VanderWerf

Todd said:
James, thanks for putting all these challenges together, and thanks to
the contributors too.

Just a terminology thing - the proper term is "caret", not a "text
cursor".

Todd

When the Macintosh first came out, the little blinking vertical bar in
text editors was called the insertion point or caret, but, because of
its shape, it was also called the "stick".

It doesn't matter whether you see it as a caret or a stick, as long as
it provides sufficient motivation for the Quiz. ;) ;) ;)
 
E

Eric Mahurin

Actually, I'd say that caret is a UI/View related term (its the name
of the ^ character, traditionally used in blue-pencil editing to mark
a place to insert test.

Cursor is kind of both a view and a model term, and I think we're
talking about a model in this quiz.

Correct. I've also seen the term cursor being used with databases
(pointer to what row is being operated on). It has also been used to
mean the same as what rubyists call an "external iterator" (ruby
mainly uses "internal iterators" or what others might call visitors).

Before mice came on the scene the symbol representing the text
insertion point was also called a "cursor". I didn't realize that
"cursor" is now mainly used for the mouse now and the text cursor has
been demoted to "caret". I usually just say "mouse pointer" and "text
cursor". According to how the term "cursor" is used in data
structures, I think the text cursor/caret deserves to use this term
more than a mouse pointer/cursor does.
That said, in general, when I've looked at similar code, the analogous
concept usually used is a selection, which represents a contiguous run
of zero or more characters within the buffer. The quiz seems to be
using a degenerate version of this where the length is constrained to
be zero.

Don't let the test in this quiz stop you. I just provided a benchmark
for the simple stuff. Feel free to implement other text editor
operations in addition to the simple operations that the benchmark
uses. And don't let the benchmark restrict your API. If the model
that the benchmark assumes doesn't match your model (i.e. a selection
in your case?), change the benchmark test to what you need. Also,
achieving the best absolute performance on this benchmark isn't
necessarily the most important, but you should get reasonable big-O
performance. I'd like to see what data structures some of you come up
with and how far you can take them in terms of functionality. I think
there are lots of solutions with a variety of data structures.

Have fun!

Eric
 
E

Eric Mahurin

A simple string to represent text buffer isn't efficient
enough because inserting (i.e. typing) and deleting (backspace) in the middle
would result in moving all of the text to the end for each operation.

Here is an inefficient string-based solution:

class StringCaret
# cursor/caret is between char i-1 and i
def initialize(data="", i=0)
@data = data
@i = i
end
def insert_before(ch)
@data[@i,0] = ("" << ch)
@i += 1
end
def insert_after(ch)
@data[@i,0] = ("" << ch)
end
def delete_before
@i.nonzero? and @data.slice!(@i-=1)
end
def delete_after
@data.slice!(@i)
end
def left
@i.nonzero? and @i-=1
end
def right
@i<@data.length and @i+=1
end
def up
while @i.nonzero?
@i -= 1
break(true) if @data[@i]==?\n
end
end
def down
while @i<@data.length
break(@i+=1) if @data[@i]==?\n
@i += 1
end
end
end


The benchmark results look like this on my machine:
ruby -r string.rb test.rb 2 StringCaret.new 1000 100 StringCaret.new 2000 100
StringCaret.new: 1000x100
...
total 7.800000 0.030000 7.830000 ( 7.851117)
StringCaret.new: 2000x100
...
total 26.810000 0.100000 26.910000 ( 27.015445)

The run-time almost quadrupled when the data-size doubled, so this is O(n**2).

An O(n) solution with this minimal functionality is doable in about
the same number of lines as above. But, you have to use another data
structure...

Eric
 
P

Philip Gatt

Now that you've submitted a reference implementation, it looks much
more interesting to try to optimize. I bet you'll get a better
response now.

A simple string to represent text buffer isn't efficient
enough because inserting (i.e. typing) and deleting (backspace) in
the middle
would result in moving all of the text to the end for each operation.

Here is an inefficient string-based solution:

class StringCaret
# cursor/caret is between char i-1 and i
def initialize(data="", i=0)
@data = data
@i = i
end
def insert_before(ch)
@data[@i,0] = ("" << ch)
@i += 1
end
def insert_after(ch)
@data[@i,0] = ("" << ch)
end
def delete_before
@i.nonzero? and @data.slice!(@i-=1)
end
def delete_after
@data.slice!(@i)
end
def left
@i.nonzero? and @i-=1
end
def right
@i<@data.length and @i+=1
end
def up
while @i.nonzero?
@i -= 1
break(true) if @data[@i]==?\n
end
end
def down
while @i<@data.length
break(@i+=1) if @data[@i]==?\n
@i += 1
end
end
end


The benchmark results look like this on my machine:
ruby -r string.rb test.rb 2 StringCaret.new 1000 100
StringCaret.new 2000 100
StringCaret.new: 1000x100
...
total 7.800000 0.030000 7.830000 ( 7.851117)
StringCaret.new: 2000x100
...
total 26.810000 0.100000 26.910000 ( 27.015445)

The run-time almost quadrupled when the data-size doubled, so this
is O(n**2).

An O(n) solution with this minimal functionality is doable in about
the same number of lines as above. But, you have to use another data
structure...

Eric
 
E

Eric Mahurin

Now that you've submitted a reference implementation, it looks much
more interesting to try to optimize. I bet you'll get a better
response now.

Yes, I probably should have put something like this in the quiz to begin with.
 
E

Eric Mahurin

The task is to implement data structures for efficiently editing text. One
common data structure is a gap buffer:

http://en.wikipedia.org/wiki/Gap_buffer

Other options to consider include: ropes (see quiz #137), linked lists, simple
strings, or a multi-level combination of data structures (i.e. for lines vs.
characters in a line). There are many data structures that may work efficiently
with simple editing operations, but not all of those data structures will work
well for more complex functionality.

Since I haven't seen any solutions yet, I'd thought I'd elaborate on
some possible solutions I was thinking of:

* conventional gap buffer implemented with a string. The string would
have 3 parts (a couple indices might be used for the markers): before
the cursor, the "gap" (containing garbage), and after the cursor.
When you have no more space in the gap, copy the data to a new string
using a larger gap. Think about how the gap should be sized. I
believe most text editors use a gap buffer.

* gap buffer implemented with two strings. Represent the text before
the cursor with one string and the text after the cursor with another
string that is reversed. All operations around the cursor correspond
to operations at the ends of these strings - O(1). Unlike the
conventional gap buffer, you don't have to deal with reallocation (gap
filing up). Instead let ruby do its string reallocation when a
string's capacity fills up. The memory efficiency might not be as
good as the conventional gap buffer, though.

* gap buffer with deferred gap movement. The "gap" is really only
useful when making edits (insertions/deletions). For simple cursor
movements, you don't have to "move" the gap immediately. The cursor
position could be changed independently of the gap. But, you'll have
to synchronize the gap to the cursor position once an edit is needed.
Random access cursor movement can also be done in O(1) time, but
you'll still have to pay for it if you immediately make an edit at
that new position.

* use a rope implementation (quiz #137). You could start with the
inefficient string solution I gave earlier and operate on a rope
instead (making new ones if using an immutable implementation). It
may be difficult to make the solution as fast as the gap buffers for
simple stuff (likely slower by O(log(n))), but ropes will probably be
more powerful for complex functionality. I believe ropes were
invented for text editing (something called Coral I believe).

* linked lists. The text could be represented as a linked list of
characters. The cursor position might correspond to one of the linked
list nodes or links. All operations around the cursor should be O(1).
An advantage over gap buffers is that you can efficiently implement
multiple cursors (possibly from different editing panes or even
different users). The downside is the memory usage is higher and
random access might be higher.

* use a two level data structure. One data structure could be used
for lines and another for the characters within each line. All of the
above data structures apply to both. For managing lines, you'd need
to use arrays instead of strings in the data structure since the
element is a line instead of a character. Using a two-level data
structure will give better performance for line operations (i.e.
up/down).

Enjoy!

Eric
 
B

Bill Kelly

From: "Eric Mahurin said:
Since I haven't seen any solutions yet, I'd thought I'd elaborate on
some possible solutions I was thinking of: [...]
* linked lists. The text could be represented as a linked list of
characters. The cursor position might correspond to one of the linked
list nodes or links. All operations around the cursor should be O(1).
An advantage over gap buffers is that you can efficiently implement
multiple cursors (possibly from different editing panes or even
different users). The downside is the memory usage is higher and
random access might be higher.

I was hoping to use this quiz to play around with mmap, but I don't
think I'll have time this week. :(

For fun, I was going to read the file-to-be-edited into a memory
mapped file, breaking the data up into VM-page sized blocks (4K, on
my system I think.)

The blocks would be linked together, and a given block would not
be required to be full. Thus, similar to the gap buffer technique,
if an insert occurred in a full block, it could be split into two
or more partially full blocks. (Adjacent partially full blocks
could be coalesced in such a way as to prevent the accumulation of
lots of partially full blocks, wasting space. (Haven't fully
thought this through, but it doesn't seem like it would be
problematic, unless there's a snag of some sort I'm overlooking.))

Then, I wanted to play with a B-tree to support fast random access
into the block list. I didn't quite figure this out, but it seems
there should be a way to construct the B-tree such that one could
index into it using the desired file offset, and wind up at the
appropriate block. Seems like I would need to store the byte
count of how much data is held by a block in the B-tree leaf nodes,
and as inserts & deletes occurred these sizes would change, and
would propagate up to the parent, to the root of the B-tree. In
other words, each B-tree node would know the byte count of the
amount of data held by its children.

With a 4K page size, each B-tree node could have lots of children,
and thus the tree would tend to be very shallow. However because
I think I'd want to store the child node sizes in a relative way,
it would preclude doing a binary search at any given node level to
find a desired offset; one would have to sum the sizes at a given
node level to locate the child node one was interested in. . . .
Still the trade-offs seem probably reasonable because most
relative cursor movement could occur without searching the tree
(the blocks are linked together). So the B-tree would be used
just for random-access jumps through the file.

Alternatively, it might be neat to instead construct the B-tree to
organize the file by line offsets. Random access by line number
would probably make more sense in a text editor than random access
by byte offset anyway. :) :)

Oh well - anyway, thanks for the quiz. Even though I'll not
likely find time to try implementing the above, it's been fun to
think about!


Regards,

Bill
 
R

Robert Dober

From: "Eric Mahurin said:
Since I haven't seen any solutions yet, I'd thought I'd elaborate on
some possible solutions I was thinking of: [...]
* linked lists. The text could be represented as a linked list of
characters. The cursor position might correspond to one of the linked
list nodes or links. All operations around the cursor should be O(1).
An advantage over gap buffers is that you can efficiently implement
multiple cursors (possibly from different editing panes or even
different users). The downside is the memory usage is higher and
random access might be higher.

I was hoping to use this quiz to play around with mmap, but I don't
think I'll have time this week. :(
Exactly the same here, maybe this is one of the occasions to
prolongate for a week?
I wanted to finish what I began with ropes, and I have a long WE ahead ;).

Robert
 
J

James Edward Gray II

From: "Eric Mahurin said:
Since I haven't seen any solutions yet, I'd thought I'd elaborate on
some possible solutions I was thinking of: [...]
* linked lists. The text could be represented as a linked list of
characters. The cursor position might correspond to one of the
linked
list nodes or links. All operations around the cursor should be O
(1).
An advantage over gap buffers is that you can efficiently implement
multiple cursors (possibly from different editing panes or even
different users). The downside is the memory usage is higher and
random access might be higher.

I was hoping to use this quiz to play around with mmap, but I don't
think I'll have time this week. :(
Exactly the same here, maybe this is one of the occasions to
prolongate for a week?
I wanted to finish what I began with ropes, and I have a long WE ahead

I'm willing to extend this quiz a week if that's what people want.
In my experience that doesn't generally fetch more solutions, but I
too am strapped for time and would like to try it. All in favor?

James Edward Gray II
 
E

Eric Mahurin

I'm willing to extend this quiz a week if that's what people want.
In my experience that doesn't generally fetch more solutions, but I
too am strapped for time and would like to try it. All in favor?

I'm in favor. Another quiz during rubyconf may not be a good idea
anyways. I was hoping to at least see some simple solutions right
now. A minimal solution can be done in about 35 lines of code (most
of those lines are class/def/end).

Eric
 
J

James Edward Gray II

I'm in favor.

Three votes is good enough for me. We'll add a week for this quiz.

Robert, you are now honor bound to solve it. ;)

James Edward Gray II
 
R

Robert Dober

Three votes is good enough for me. We'll add a week for this quiz.

Robert, you are now honor bound to solve it. ;)
Not much to lose;) but I here the message ;)
R.
 
E

Eric Mahurin


Here is a strange solution of mine.

I start by partially implementing a string class that works equally
well on the front and the back (double ended queue string). My
array.c patch from a couple years ago did something like this, but the
data structure below is a bit more memory efficient since it can share
a single "gap" and recirculates unused space better. Basically, the
implementation below is a circular buffer that reallocates when it is
full. It would be great if the standard String and Array were
symmetrical like this. It would increase the ways to use simple
strings and arrays.

Next I use this double ended queue string to implement something like
a gap buffer. You can think of the gap as being at the ends of of
this double-ended string. Text before the cursor is at the end of the
string and text before the cursor is at the beginning.

The performance is O(n), but not nearly as fast as some a simple
solution (5X slower). If String was symmetical, I would use that and
it would be a different story.

If I have time, I may try this DQString in a linked-list data-structure...

dqstring.rb:


# characters can be in either of these ranges:
# gapped : @begin...length, 0...@end
# contiguous : @begin...@end
# inheriting String only for memory efficiency
# inherited methods won't necessarily make sense

class DQString < String
def initialize(data="")
@begin = 0
@end = data.length
super(data)
end
def inspect
"#<DQString: begin=#{@begin}, end=#{@end}, #{super}>"
end
def length
if (len = @end-@begin)>=0
len
else
super+len
end
end
def << (ch)
if @end==size
if @begin>1
# use the front, making a gap in the middle
self[0] = ch
@end = 1
self
else
@end += 1
# let ruby realloc when needed
super(ch)
end
else
self[@end] = ch
@end += 1
if @end==@begin
# double the size when the gap becomes empty
@begin += size
self.replace(self+self)
else
self
end
end
end
def >> (ch)
if @begin.zero?
@begin = size-1
if @end<@begin
# use the back, making a gap in the middle
self
else
# double the size to make room at the front
@end += size
self.replace(self+self)
end
else
@begin -= 1
if @begin==@end
# double the size when the gap becomes empty
@begin += size
self.replace(self+self)
else
self
end
end
ensure
self[@begin] = ch
end
def pop
if @end==@begin
nil
else
@end -= 1
ch = self[@end]
if (len = @end-@begin)>=0
# remove excess trailing space if too much
self.slice!(-len,len) if size-@end>(len<<1)
elsif @end.zero?
len = (@end=size)-@begin
if @begin>(len<<1)
# remove excess leading space
self.slice!(0, len)
@begin -= len
@end -= len
end
end
ch
end
end
def shift
if @begin==@end
nil
else
ch = self[@begin]
@begin += 1
if (len = @end-@begin)>=0
if @begin>(len<<1)
# remove excess leading space
self.slice!(0, len)
@begin -= len
@end -= len
end
elsif @begin==size
# remove excess trailing space if too much
self.slice!(-@end,@end) if (@end<<1)+len<0
@begin = 0
end
ch
end
end
end


dqcursor.rb:


require 'dqstring'

class DQCursor
def initialize
@data = DQString.new
@nafter = 0
end
def insert_before(ch)
@data << ch
end
def insert_after(ch)
@nafter += 1
@data >> ch
end
def delete_before
@data.length>@nafter and @data.pop
end
def delete_after
@nafter.nonzero? and (@nafter-=1; @data.shift)
end
def left
@data.length>@nafter and (@nafter+=1; @data >> @data.pop)
end
def right
@nafter.nonzero? and (@nafter-=1; @data << @data.shift)
end
def up
nbefore = @data.length-@nafter
while nbefore.nonzero?
nbefore -= 1
@data >> ([email protected])
return(true) if ch==?\n
end
ensure
@[email protected]
end
def down
while @nafter.nonzero?
@nafter -= 1
@data << ([email protected])
return(true) if ch==?\n
end
end
end
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,906
Latest member
SkinfixSkintag

Latest Threads

Top