Strings vs arrays

L

Luke Worth

Hi.

I am wanting to parse large files (of source code) by reading byte by
byte and converting certain patterns into different strings. I can think
of a couple of ways to do this (starting with the source code in a
String):
1. String#split(//) -> array, then use Array#shift (maybe slow to
convert to array?)
2. Keep a counter and use String#[] (which seems unrubyish to me)

Which of these would be preferable? Does anyone know of any better ways?
 
A

Austin Ziegler

I am wanting to parse large files (of source code) by reading byte by
byte and converting certain patterns into different strings. I can think
of a couple of ways to do this (starting with the source code in a
String):
1. String#split(//) -> array, then use Array#shift (maybe slow to
convert to array?)
2. Keep a counter and use String#[] (which seems unrubyish to me)

Regex. You know what these patterns are in advance.

-austin
--=20
Austin Ziegler * (e-mail address removed)
* Alternate: (e-mail address removed)
 
R

Robert Klemme

Luke Worth said:
Hi.

I am wanting to parse large files (of source code) by reading byte by
byte and converting certain patterns into different strings. I can think
of a couple of ways to do this (starting with the source code in a
String):
1. String#split(//) -> array, then use Array#shift (maybe slow to
convert to array?)
2. Keep a counter and use String#[] (which seems unrubyish to me)

Which of these would be preferable? Does anyone know of any better ways?

I'd do none of them. If you know that your patterns stay on single lines
and don't extend to following lines I'd read the file line by line and use
String#gsub. Something along the lines of

while ( line = gets )
line.gsub!( /pattern/, 'replacement' )
# or line.gsub!( /pattern/ ) {|match| create replacement }
puts line
end

If patterns extend lines I'd first try to slurp the whole file into mem and
then use gsub (possibly with option /m for multiline). Only if that does
not work (because of file size for example) I'd resort to more complicated
solutions like the ones you described.

Kind regards

robert
 
D

Daniel Brockman

Luke Worth said:
Hi.

I am wanting to parse large files (of source code) by reading byte by
byte and converting certain patterns into different strings. I can think
of a couple of ways to do this (starting with the source code in a
String):
1. String#split(//) -> array, then use Array#shift (maybe slow to
convert to array?)
2. Keep a counter and use String#[] (which seems unrubyish to me)

Which of these would be preferable? Does anyone know of any better ways?

Why not use String#shift, which is O(1)?
 
L

Luke Worth

Why not use String#shift, which is O(1)?
Because it doesn't exist :)

irb(main):001:0> "aoeu".shift
NoMethodError: undefined method `shift' for "aoeu":String
from (irb):1

To the people suggesting regex, sorry I did think of that. However it
won't work because of the problem: I want to convert all instances of
dot-space into a single space, and all instances of space-dot into a
single dot. All dots not followed or preceded by spaces (excluding
newline) must be turned into newlines.
For example, how would you solve the following:
"bla.h+aoeu.++test.\n+.hello" (where + represents a space)
i think it's hard.

Thanks for your help everyone, I think I've figured out how to do it
with a queue structure though.
 
J

Jim Weirich

Because it doesn't exist :)

irb(main):001:0> "aoeu".shift
NoMethodError: undefined method `shift' for "aoeu":String
from (irb):1

To the people suggesting regex, sorry I did think of that. However it
won't work because of the problem: I want to convert all instances of
dot-space into a single space, and all instances of space-dot into a
single dot. All dots not followed or preceded by spaces (excluding
newline) must be turned into newlines.
For example, how would you solve the following:
"bla.h+aoeu.++test.\n+.hello" (where + represents a space)
i think it's hard.

Will this work? (I used + instead of space as in your example)

str = "bla.h+aoeu.++test.\n+.hello"
p str.gsub(/\.\+|\+\.|\./) { |s|
case s
when '+.' then '.'
when '.+' then '+'
else "\n"
end
}
==> "bla\nh+aoeu++test\n\n.hello"
 
L

Luke Worth

Will this work? (I used + instead of space as in your example)

str = "bla.h+aoeu.++test.\n+.hello"
p str.gsub(/\.\+|\+\.|\./) { |s|
case s
when '+.' then '.'
when '.+' then '+'
else "\n"
end
}
==> "bla\nh+aoeu++test\n\n.hello"

Hey that's cool, I didn't know you could do that.
I think i'll stick with my queue solution though. The thing i failed to
mention is that i want to tokenize across spaces not preceded or
followed by a dot. I'm also forming it into a tree structure at the same
time....
Thanks for the gsub tip though!
 
D

Devin Mullins

Luke said:
Hey that's cool, I didn't know you could do that.
I think i'll stick with my queue solution though. The thing i failed to
mention is that i want to tokenize across spaces not preceded or
followed by a dot. I'm also forming it into a tree structure at the same
time....
Thanks for the gsub tip though!
Consider StringScanner.

Devin
 
D

Daniel Brockman

Luke Worth said:
Because it doesn't exist :)

Oh, that's as good a reason as any. :)

Who said there were no subtle differences between String and Array?
I would like that person to explain this logic.
 
E

Eric Mahurin

--- Daniel Brockman said:
=20
Oh, that's as good a reason as any. :)
=20
Who said there were no subtle differences between String and
Array?
I would like that person to explain this logic.

I agree. We should have shift for String. But, more
importantly, slice!(0) and slice!(0,m) should be O(1) like
Array#shift is. But, I've discussed this before in a previous
thread.

I do have an implementation written in ruby for Array/String
where all insertions/deletions at the front of are O(1), but
this doesn't show its advantage until you get to about 50K
elements or so. This is because the ruby interpret time still
dominates over the fast C copy until that many elements. But,
after that point it is quite obvious.



=09
____________________________________________________
Sell on Yahoo! Auctions =96 no fees. Bid on great items. =20
http://auctions.yahoo.com/
 
D

David Brady

Daniel said:
Why not use String#shift, which is O(1)?
EDIT: Daniel later redacted this to Array#shift. My question remains,
however.

Many times when I hear an answer, I am more interested in how the answer
was obtained than the answer itself, so that I can go find the answers
to related questions myself. This is a perfect example of this.

How do you know Array#shift is O(1)? I'm not challenging you; I'm
asking because I really want to know how to find this out. Is it
documented somewhere? Did you profile it? Did you read the source code?

-dB
 
L

Levin Alexander

Eric Mahurin said:
I agree. We should have shift for String.

Disagree. What wold the semantics of String#shift be?
What would it shift? Bytes? Characters? Words? Lines?

There is no clear notion of an "Element" in a String.

You can always use split or implement shift
yourself if you need it for a particular application.

-Levin
 
D

Daniel Brockman

Eric Mahurin said:
Daniel Brockman said:
Who said there were no subtle differences between String and Array?
I would like that person to explain [why there is no String#shift].

I agree. We should have shift for String. But, more importantly,
slice!(0) and slice!(0,m) should be O(1) like Array#shift is.
But, I've discussed this before in a previous thread.

Yes, I agree with this as well.
I do have an implementation written in ruby for Array/String
where all insertions/deletions at the front of are O(1),

I would like to see this implementation. Particularly the Ruby
implementation of an O(1) String#shift.
 
D

Daniel Brockman

David Brady said:
EDIT: Daniel later redacted this to Array#shift. My question
remains, however.

Well, not really. My answer is unaffected, however. :)
Many times when I hear an answer, I am more interested in how the
answer was obtained than the answer itself, so that I can go find
the answers to related questions myself. This is a perfect example
of this.

I'm sorry. Had there not been a thread about this particular subject
only a week ago, I would have explained.
How do you know Array#shift is O(1)? I'm not challenging you; I'm
asking because I really want to know how to find this out.

I appreciate you making this clear. :)
Is it documented somewhere? Did you profile it? Did you read the
source code?

I did not profile it, and I don't think it's documented --- although I
would agree that it should be.

I know because Eric Mahurin posted about this a week ago in


and a quick glance at the source confirms it:

VALUE
rb_ary_shift(ary)
VALUE ary;
{
VALUE top;

rb_ary_modify_check(ary);
if (RARRAY(ary)->len == 0) return Qnil;
top = RARRAY(ary)->ptr[0];
ary_make_shared(ary);
RARRAY(ary)->ptr++; /* shift ptr */
RARRAY(ary)->len--;

return top;
}

Note also that Array#unshift is amortized O(1); it allocates new
memory in chunks (allocating one additional byte would be insane).

Eric's thread was about similar calls, such as Array#slice!(0, n),
having needlessly large O(n) complexity.
 
D

Daniel Brockman

Levin Alexander said:
Disagree. What wold the semantics of String#shift be?
What would it shift? Bytes? Characters?

Whatever String#[] and all the other String methods index, of course.

Please --- don't be silly.

I still can't believe String#each does this by default.
There is no clear notion of an "Element" in a String.

If this is true, then we have a serious problem. Before doing much
anything about its API, we need to decide whether String is a byte
array or a character array. (Presumably, matz & co. already have.)
You can always use split or implement shift
yourself if you need it for a particular application.

Implementing an O(1) String#shift seems non-trivial to me.
(Eric said he has done it, but did not yet post the code.)

Anyway, this is irrelevant since, after all, you can implement most
of the Ruby standard library yourself ``if you need it.''
 
D

Devin Mullins

Daniel said:
Whatever String#[] and all the other String methods index, of course.
Depending on the parameter you pass, #[] can return a String or an Integer.
If this is true, then we have a serious problem. Before doing much
anything about its API, we need to decide whether String is a byte
array or a character array. (Presumably, matz & co. already have.)
There's no such thing as a character in Ruby. (See any discussion on
Unicode, etc. in Ruby.) Strings are Objects (stored in C as char*s, I'd
guess). Call the right methods on them, and you can get an Integer
representing the byte value at a given position ("Hello"[0]), or another
String object representing some manipulation of the String
("Hello"[0..0]). Those are your only means of inspection. That was just
a long-winded way of saying "this is true."

Anything I didn't reply to, I probably agree with. Since the String
methods don't have a consistent notion of an "element," it doesn't seem
it would hurt to choose whichever notion we want for a potential #shift
method.

Devin
 
L

Levin Alexander

Devin Mullins said:
There's no such thing as a character in Ruby. (See any discussion on =20
Unicode, etc. in Ruby.)

What about Regular Expressions?

str =3D "=E4"
str.scan(/./).length #=3D> 2
$KCODE =3D 'u'
str.scan(/./).length #=3D> 1
Anything I didn't reply to, I probably agree with. Since the String =20
methods don't have a consistent notion of an "element," it doesn't seem= =20
it would hurt to choose whichever notion we want for a potential #shift= =20
method.

The possibilities would be:

(a)
str =3D "foo"
c =3D str.shift #=3D> 102
c.class #=3D> Fixnum
str #=3D> "oo"

(b)
str =3D "foo"
c =3D str.shift #=3D> "f"
c.class #=3D> String
str #=3D> "oo"

(c)
# make String#shift act like split(/./).shift

(a) would probably most consistent.

-Levin
 
D

Daniel Brockman

Devin Mullins said:
Daniel said:
Whatever String#[] and all the other String methods index, of course.

Depending on the parameter you pass, #[] can return a String or an
an Integer.
Correct.
If this is true, then we have a serious problem. Before doing much
anything about its API, we need to decide whether String is a byte
array or a character array. (Presumably, matz & co. already have.)

There's no such thing as a character in Ruby.

Not currently, no.
(See any discussion on Unicode, etc. in Ruby.)

Indeed, you have managed to poke my interest, and I'm going to look
into this further.
Strings are Objects (stored in C as char*s, I'd guess).
Obviously.

Call the right methods on them, and you can get an Integer
representing the byte value at a given position ("Hello"[0]),

Yes, but is this subject to change? If so, I guess String is destined
to become a character array. If not, a byte array.
or another String object representing some manipulation of the
String ("Hello"[0..0]).

I'm not talking about subranges.
Those are your only means of inspection. That was just a
long-winded way of saying "this is true."

Anything I didn't reply to, I probably agree with. Since the String
methods don't have a consistent notion of an "element," it doesn't
seem it would hurt to choose whichever notion we want for a
potential #shift method.

No, it does not. If the semantics of String are normalized in a
future version, the issue of String#shift will be one among many.
 
D

David A. Black

Hi --

Daniel said:
Whatever String#[] and all the other String methods index, of course.
Depending on the parameter you pass, #[] can return a String or an Integer.
If this is true, then we have a serious problem. Before doing much
anything about its API, we need to decide whether String is a byte
array or a character array. (Presumably, matz & co. already have.)

My understanding (sneaking in a reply to Daniel's post in this reply
:) was that the conceptual and design decision was that Strings are
not arrays, and are therefore not obliged or constrained to have an
Array-like API (any more than arrays are obliged to have a String-like
API).
There's no such thing as a character in Ruby. (See any discussion on Unicode,
etc. in Ruby.) Strings are Objects (stored in C as char*s, I'd guess). Call
the right methods on them, and you can get an Integer representing the byte
value at a given position ("Hello"[0]), or another String object representing
some manipulation of the String ("Hello"[0..0]). Those are your only means of
inspection. That was just a long-winded way of saying "this is true."

Anything I didn't reply to, I probably agree with. Since the String methods
don't have a consistent notion of an "element," it doesn't seem it would hurt
to choose whichever notion we want for a potential #shift method.

That's true only if there's an imperative to have a String instance
method called "shift". I don't think there is. Maybe a left
chop/chomp operation would be a good idea, but I think it should be
called lchop, which would be consistent with other string method
naming (rather than with array method naming).


David
 
D

Daniel Brockman

Wow, some mixup --- Daniel, David, Devin and Levin. :)

Levin> There is no clear notion of an "Element" in a String.

Daniel> If this is true, then we have a serious problem. Before
Daniel> doing much anything about its API, we need to decide whether
Daniel> String is a byte array or a character array. (Presumably,
Daniel> matz & co. already have.)

Black> My understanding was that the conceptual and design decision
Black> was that Strings are not arrays, and are therefore not
Black> obliged or constrained to have an Array-like API (any more
Black> than arrays are obliged to have a String-like API).

To me, the term _array_ means ``sequence of elements stored in a
contiguous chunk of memory,'' just like the term _list_ means
``sequence of elements stored in a linked list of cells.''

As long as strings are implemented as arrays, I reserve the right to
refer to them as such, regardless of whether String < Array.
(Of course, I won't do it needlessly since it is confusing.)

In Haskell, strings are not arrays, but lists. In many languages,
strings are immutable, which deemphasises their nature as arrays.
In Ruby, however, strings are mutable arrays (of bytes?). So far I
have not been able to understand the desire to obfuscate this fact.

What's the harm of admitting that strings are arrays? What's the harm
of making them at least _quack_ alike?

Devin> Since the String methods don't have a consistent notion of an
Devin> "element," it doesn't seem it would hurt to choose whichever
Devin> notion we want for a potential #shift method.

Black> That's true only if there's an imperative to have a String
Black> instance method called "shift". I don't think there is.

Why not? This thread started with an example of the need for one.
Of course you can use string.slice!(0), but the same goes for arrays.

The terms `shift' and `unshift' are general and well-understood.
I can't see any reason why they shouldn't be applied to strings.

Black> Maybe a left chop/chomp operation would be a good idea, but I
Black> think it should be called lchop, which would be consistent
Black> with other string method naming (rather than with array
Black> method naming).

I fail to see the point in that. String#chop is meant to chop off
end-of-line characters; String#lchop wouldn't be. So using that name
would be _inconsistent_ with other string method naming.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,013
Latest member
KatriceSwa

Latest Threads

Top