Strings vs arrays

Discussion in 'Ruby' started by Luke Worth, Jul 9, 2005.

  1. Luke Worth

    Luke Worth Guest

    Hi.

    I am wanting to parse large files (of source code) by reading byte by
    byte and converting certain patterns into different strings. I can think
    of a couple of ways to do this (starting with the source code in a
    String):
    1. String#split(//) -> array, then use Array#shift (maybe slow to
    convert to array?)
    2. Keep a counter and use String#[] (which seems unrubyish to me)

    Which of these would be preferable? Does anyone know of any better ways?
    --
    Luke Worth
    Luke Worth, Jul 9, 2005
    #1
    1. Advertising

  2. On 7/9/05, Luke Worth <> wrote:
    > I am wanting to parse large files (of source code) by reading byte by
    > byte and converting certain patterns into different strings. I can think
    > of a couple of ways to do this (starting with the source code in a
    > String):
    > 1. String#split(//) -> array, then use Array#shift (maybe slow to
    > convert to array?)
    > 2. Keep a counter and use String#[] (which seems unrubyish to me)


    Regex. You know what these patterns are in advance.

    -austin
    --=20
    Austin Ziegler *
    * Alternate:
    Austin Ziegler, Jul 9, 2005
    #2
    1. Advertising

  3. "Luke Worth" <> schrieb im Newsbeitrag
    news:...
    > Hi.
    >
    > I am wanting to parse large files (of source code) by reading byte by
    > byte and converting certain patterns into different strings. I can think
    > of a couple of ways to do this (starting with the source code in a
    > String):
    > 1. String#split(//) -> array, then use Array#shift (maybe slow to
    > convert to array?)
    > 2. Keep a counter and use String#[] (which seems unrubyish to me)
    >
    > Which of these would be preferable? Does anyone know of any better ways?


    I'd do none of them. If you know that your patterns stay on single lines
    and don't extend to following lines I'd read the file line by line and use
    String#gsub. Something along the lines of

    while ( line = gets )
    line.gsub!( /pattern/, 'replacement' )
    # or line.gsub!( /pattern/ ) {|match| create replacement }
    puts line
    end

    If patterns extend lines I'd first try to slurp the whole file into mem and
    then use gsub (possibly with option /m for multiline). Only if that does
    not work (because of file size for example) I'd resort to more complicated
    solutions like the ones you described.

    Kind regards

    robert
    Robert Klemme, Jul 9, 2005
    #3
  4. Luke Worth <> writes:

    > Hi.
    >
    > I am wanting to parse large files (of source code) by reading byte by
    > byte and converting certain patterns into different strings. I can think
    > of a couple of ways to do this (starting with the source code in a
    > String):
    > 1. String#split(//) -> array, then use Array#shift (maybe slow to
    > convert to array?)
    > 2. Keep a counter and use String#[] (which seems unrubyish to me)
    >
    > Which of these would be preferable? Does anyone know of any better ways?


    Why not use String#shift, which is O(1)?

    --
    Daniel Brockman <>
    Daniel Brockman, Jul 9, 2005
    #4
  5. Luke Worth

    Luke Worth Guest

    On Sun, 2005-07-10 at 00:06 +0900, Daniel Brockman wrote:
    > Why not use String#shift, which is O(1)?

    Because it doesn't exist :)

    irb(main):001:0> "aoeu".shift
    NoMethodError: undefined method `shift' for "aoeu":String
    from (irb):1

    To the people suggesting regex, sorry I did think of that. However it
    won't work because of the problem: I want to convert all instances of
    dot-space into a single space, and all instances of space-dot into a
    single dot. All dots not followed or preceded by spaces (excluding
    newline) must be turned into newlines.
    For example, how would you solve the following:
    "bla.h+aoeu.++test.\n+.hello" (where + represents a space)
    i think it's hard.

    Thanks for your help everyone, I think I've figured out how to do it
    with a queue structure though.
    --
    Luke Worth
    Luke Worth, Jul 9, 2005
    #5
  6. Luke Worth

    Jim Weirich Guest

    On Saturday 09 July 2005 11:25 am, Luke Worth wrote:
    > On Sun, 2005-07-10 at 00:06 +0900, Daniel Brockman wrote:
    > > Why not use String#shift, which is O(1)?

    >
    > Because it doesn't exist :)
    >
    > irb(main):001:0> "aoeu".shift
    > NoMethodError: undefined method `shift' for "aoeu":String
    > from (irb):1
    >
    > To the people suggesting regex, sorry I did think of that. However it
    > won't work because of the problem: I want to convert all instances of
    > dot-space into a single space, and all instances of space-dot into a
    > single dot. All dots not followed or preceded by spaces (excluding
    > newline) must be turned into newlines.
    > For example, how would you solve the following:
    > "bla.h+aoeu.++test.\n+.hello" (where + represents a space)
    > i think it's hard.


    Will this work? (I used + instead of space as in your example)

    str = "bla.h+aoeu.++test.\n+.hello"
    p str.gsub(/\.\+|\+\.|\./) { |s|
    case s
    when '+.' then '.'
    when '.+' then '+'
    else "\n"
    end
    }
    ==> "bla\nh+aoeu++test\n\n.hello"

    --
    -- Jim Weirich http://onestepback.org
    -----------------------------------------------------------------
    "Beware of bugs in the above code; I have only proved it correct,
    not tried it." -- Donald Knuth (in a memo to Peter van Emde Boas)
    Jim Weirich, Jul 9, 2005
    #6
  7. Luke Worth

    Luke Worth Guest

    On Sun, 2005-07-10 at 00:47 +0900, Jim Weirich wrote:
    > Will this work? (I used + instead of space as in your example)
    >
    > str = "bla.h+aoeu.++test.\n+.hello"
    > p str.gsub(/\.\+|\+\.|\./) { |s|
    > case s
    > when '+.' then '.'
    > when '.+' then '+'
    > else "\n"
    > end
    > }
    > ==> "bla\nh+aoeu++test\n\n.hello"


    Hey that's cool, I didn't know you could do that.
    I think i'll stick with my queue solution though. The thing i failed to
    mention is that i want to tokenize across spaces not preceded or
    followed by a dot. I'm also forming it into a tree structure at the same
    time....
    Thanks for the gsub tip though!
    --
    Luke Worth
    Luke Worth, Jul 9, 2005
    #7
  8. Luke Worth wrote:

    >Hey that's cool, I didn't know you could do that.
    >I think i'll stick with my queue solution though. The thing i failed to
    >mention is that i want to tokenize across spaces not preceded or
    >followed by a dot. I'm also forming it into a tree structure at the same
    >time....
    >Thanks for the gsub tip though!
    >
    >

    Consider StringScanner.

    Devin
    Devin Mullins, Jul 9, 2005
    #8
  9. Luke Worth <> writes:

    > On Sun, 2005-07-10 at 00:06 +0900, Daniel Brockman wrote:
    >
    >> Why not use String#shift, which is O(1)?

    >
    > Because it doesn't exist :)


    Oh, that's as good a reason as any. :)

    Who said there were no subtle differences between String and Array?
    I would like that person to explain this logic.

    --
    Daniel Brockman <>
    Daniel Brockman, Jul 9, 2005
    #9
  10. Luke Worth

    Eric Mahurin Guest

    --- Daniel Brockman <> wrote:

    > Luke Worth <> writes:
    >=20
    > > On Sun, 2005-07-10 at 00:06 +0900, Daniel Brockman wrote:
    > >
    > >> Why not use String#shift, which is O(1)?

    > >
    > > Because it doesn't exist :)

    >=20
    > Oh, that's as good a reason as any. :)
    >=20
    > Who said there were no subtle differences between String and
    > Array?
    > I would like that person to explain this logic.


    I agree. We should have shift for String. But, more
    importantly, slice!(0) and slice!(0,m) should be O(1) like
    Array#shift is. But, I've discussed this before in a previous
    thread.

    I do have an implementation written in ruby for Array/String
    where all insertions/deletions at the front of are O(1), but
    this doesn't show its advantage until you get to about 50K
    elements or so. This is because the ruby interpret time still
    dominates over the fast C copy until that many elements. But,
    after that point it is quite obvious.



    =09
    ____________________________________________________
    Sell on Yahoo! Auctions =96 no fees. Bid on great items. =20
    http://auctions.yahoo.com/
    Eric Mahurin, Jul 9, 2005
    #10
  11. Luke Worth

    David Brady Guest

    Daniel Brockman wrote:

    >Why not use String#shift, which is O(1)?
    >

    EDIT: Daniel later redacted this to Array#shift. My question remains,
    however.

    Many times when I hear an answer, I am more interested in how the answer
    was obtained than the answer itself, so that I can go find the answers
    to related questions myself. This is a perfect example of this.

    How do you know Array#shift is O(1)? I'm not challenging you; I'm
    asking because I really want to know how to find this out. Is it
    documented somewhere? Did you profile it? Did you read the source code?

    -dB

    --
    David Brady

    I'm having a really surreal day... OR AM I?
    David Brady, Jul 9, 2005
    #11
  12. Eric Mahurin <> wrote:

    > I agree. We should have shift for String.


    Disagree. What wold the semantics of String#shift be?
    What would it shift? Bytes? Characters? Words? Lines?

    There is no clear notion of an "Element" in a String.

    You can always use split or implement shift
    yourself if you need it for a particular application.

    -Levin
    Levin Alexander, Jul 9, 2005
    #12
  13. Eric Mahurin <> writes:

    > Daniel Brockman <> wrote:
    >
    >> Who said there were no subtle differences between String and Array?
    >> I would like that person to explain [why there is no String#shift].

    >
    > I agree. We should have shift for String. But, more importantly,
    > slice!(0) and slice!(0,m) should be O(1) like Array#shift is.
    > But, I've discussed this before in a previous thread.


    Yes, I agree with this as well.

    > I do have an implementation written in ruby for Array/String
    > where all insertions/deletions at the front of are O(1),


    I would like to see this implementation. Particularly the Ruby
    implementation of an O(1) String#shift.

    > but this doesn't show its advantage until you get to about 50K
    > elements or so. This is because the ruby interpret time still
    > dominates over the fast C copy until that many elements.
    > But, after that point it is quite obvious.


    --
    Daniel Brockman <>

    So really, we all have to ask ourselves:
    Am I waiting for RMS to do this? --TTN.
    Daniel Brockman, Jul 9, 2005
    #13
  14. David Brady <> writes:

    > Daniel Brockman wrote:
    >
    >> Why not use String#shift, which is O(1)?

    >
    > EDIT: Daniel later redacted this to Array#shift. My question
    > remains, however.


    Well, not really. My answer is unaffected, however. :)

    > Many times when I hear an answer, I am more interested in how the
    > answer was obtained than the answer itself, so that I can go find
    > the answers to related questions myself. This is a perfect example
    > of this.


    I'm sorry. Had there not been a thread about this particular subject
    only a week ago, I would have explained.

    > How do you know Array#shift is O(1)? I'm not challenging you; I'm
    > asking because I really want to know how to find this out.


    I appreciate you making this clear. :)

    > Is it documented somewhere? Did you profile it? Did you read the
    > source code?


    I did not profile it, and I don't think it's documented --- although I
    would agree that it should be.

    I know because Eric Mahurin posted about this a week ago in

    <>,

    and a quick glance at the source confirms it:

    VALUE
    rb_ary_shift(ary)
    VALUE ary;
    {
    VALUE top;

    rb_ary_modify_check(ary);
    if (RARRAY(ary)->len == 0) return Qnil;
    top = RARRAY(ary)->ptr[0];
    ary_make_shared(ary);
    RARRAY(ary)->ptr++; /* shift ptr */
    RARRAY(ary)->len--;

    return top;
    }

    Note also that Array#unshift is amortized O(1); it allocates new
    memory in chunks (allocating one additional byte would be insane).

    Eric's thread was about similar calls, such as Array#slice!(0, n),
    having needlessly large O(n) complexity.

    --
    Daniel Brockman <>

    So really, we all have to ask ourselves:
    Am I waiting for RMS to do this? --TTN.
    Daniel Brockman, Jul 9, 2005
    #14
  15. "Levin Alexander" <> writes:

    > Eric Mahurin <> wrote:
    >
    >> I agree. We should have shift for String.

    >
    > Disagree. What wold the semantics of String#shift be?
    > What would it shift? Bytes? Characters?


    Whatever String#[] and all the other String methods index, of course.

    > Words?


    Please --- don't be silly.

    > Lines?


    I still can't believe String#each does this by default.

    > There is no clear notion of an "Element" in a String.


    If this is true, then we have a serious problem. Before doing much
    anything about its API, we need to decide whether String is a byte
    array or a character array. (Presumably, matz & co. already have.)

    > You can always use split or implement shift
    > yourself if you need it for a particular application.


    Implementing an O(1) String#shift seems non-trivial to me.
    (Eric said he has done it, but did not yet post the code.)

    Anyway, this is irrelevant since, after all, you can implement most
    of the Ruby standard library yourself ``if you need it.''

    --
    Daniel Brockman <>

    So really, we all have to ask ourselves:
    Am I waiting for RMS to do this? --TTN.
    Daniel Brockman, Jul 9, 2005
    #15
  16. Daniel Brockman wrote:

    >Whatever String#[] and all the other String methods index, of course.
    >
    >

    Depending on the parameter you pass, #[] can return a String or an Integer.

    >>There is no clear notion of an "Element" in a String.
    >>
    >>

    >If this is true, then we have a serious problem. Before doing much
    >anything about its API, we need to decide whether String is a byte
    >array or a character array. (Presumably, matz & co. already have.)
    >
    >

    There's no such thing as a character in Ruby. (See any discussion on
    Unicode, etc. in Ruby.) Strings are Objects (stored in C as char*s, I'd
    guess). Call the right methods on them, and you can get an Integer
    representing the byte value at a given position ("Hello"[0]), or another
    String object representing some manipulation of the String
    ("Hello"[0..0]). Those are your only means of inspection. That was just
    a long-winded way of saying "this is true."

    Anything I didn't reply to, I probably agree with. Since the String
    methods don't have a consistent notion of an "element," it doesn't seem
    it would hurt to choose whichever notion we want for a potential #shift
    method.

    Devin
    Devin Mullins, Jul 9, 2005
    #16
  17. Devin Mullins <> wrote:

    >>> There is no clear notion of an "Element" in a String.
    >>>

    >> If this is true, then we have a serious problem. Before doing much
    >> anything about its API, we need to decide whether String is a byte
    >> array or a character array. (Presumably, matz & co. already have.)
    >>

    > There's no such thing as a character in Ruby. (See any discussion on =20
    > Unicode, etc. in Ruby.)


    What about Regular Expressions?

    str =3D "=E4"
    str.scan(/./).length #=3D> 2
    $KCODE =3D 'u'
    str.scan(/./).length #=3D> 1

    > Anything I didn't reply to, I probably agree with. Since the String =20
    > methods don't have a consistent notion of an "element," it doesn't seem=

    =20
    > it would hurt to choose whichever notion we want for a potential #shift=

    =20
    > method.


    The possibilities would be:

    (a)
    str =3D "foo"
    c =3D str.shift #=3D> 102
    c.class #=3D> Fixnum
    str #=3D> "oo"

    (b)
    str =3D "foo"
    c =3D str.shift #=3D> "f"
    c.class #=3D> String
    str #=3D> "oo"

    (c)
    # make String#shift act like split(/./).shift

    (a) would probably most consistent.

    -Levin
    Levin Alexander, Jul 9, 2005
    #17
  18. Devin Mullins <> writes:

    > Daniel Brockman wrote:
    >
    >> Whatever String#[] and all the other String methods index, of course.

    >
    > Depending on the parameter you pass, #[] can return a String or an
    > an Integer.


    Correct.

    >>> There is no clear notion of an "Element" in a String.

    >>
    >> If this is true, then we have a serious problem. Before doing much
    >> anything about its API, we need to decide whether String is a byte
    >> array or a character array. (Presumably, matz & co. already have.)

    >
    > There's no such thing as a character in Ruby.


    Not currently, no.

    > (See any discussion on Unicode, etc. in Ruby.)


    Indeed, you have managed to poke my interest, and I'm going to look
    into this further.

    > Strings are Objects (stored in C as char*s, I'd guess).


    Obviously.

    > Call the right methods on them, and you can get an Integer
    > representing the byte value at a given position ("Hello"[0]),


    Yes, but is this subject to change? If so, I guess String is destined
    to become a character array. If not, a byte array.

    > or another String object representing some manipulation of the
    > String ("Hello"[0..0]).


    I'm not talking about subranges.

    > Those are your only means of inspection. That was just a
    > long-winded way of saying "this is true."
    >
    > Anything I didn't reply to, I probably agree with. Since the String
    > methods don't have a consistent notion of an "element," it doesn't
    > seem it would hurt to choose whichever notion we want for a
    > potential #shift method.


    No, it does not. If the semantics of String are normalized in a
    future version, the issue of String#shift will be one among many.

    --
    Daniel Brockman <>

    So really, we all have to ask ourselves:
    Am I waiting for RMS to do this? --TTN.
    Daniel Brockman, Jul 9, 2005
    #18
  19. Hi --

    On Sun, 10 Jul 2005, Devin Mullins wrote:

    > Daniel Brockman wrote:
    >
    >> Whatever String#[] and all the other String methods index, of course.
    >>

    > Depending on the parameter you pass, #[] can return a String or an Integer.
    >
    >>> There is no clear notion of an "Element" in a String.
    >>>

    >> If this is true, then we have a serious problem. Before doing much
    >> anything about its API, we need to decide whether String is a byte
    >> array or a character array. (Presumably, matz & co. already have.)


    My understanding (sneaking in a reply to Daniel's post in this reply
    :) was that the conceptual and design decision was that Strings are
    not arrays, and are therefore not obliged or constrained to have an
    Array-like API (any more than arrays are obliged to have a String-like
    API).

    > There's no such thing as a character in Ruby. (See any discussion on Unicode,
    > etc. in Ruby.) Strings are Objects (stored in C as char*s, I'd guess). Call
    > the right methods on them, and you can get an Integer representing the byte
    > value at a given position ("Hello"[0]), or another String object representing
    > some manipulation of the String ("Hello"[0..0]). Those are your only means of
    > inspection. That was just a long-winded way of saying "this is true."
    >
    > Anything I didn't reply to, I probably agree with. Since the String methods
    > don't have a consistent notion of an "element," it doesn't seem it would hurt
    > to choose whichever notion we want for a potential #shift method.


    That's true only if there's an imperative to have a String instance
    method called "shift". I don't think there is. Maybe a left
    chop/chomp operation would be a good idea, but I think it should be
    called lchop, which would be consistent with other string method
    naming (rather than with array method naming).


    David

    --
    David A. Black
    David A. Black, Jul 9, 2005
    #19
  20. Wow, some mixup --- Daniel, David, Devin and Levin. :)

    Levin> There is no clear notion of an "Element" in a String.

    Daniel> If this is true, then we have a serious problem. Before
    Daniel> doing much anything about its API, we need to decide whether
    Daniel> String is a byte array or a character array. (Presumably,
    Daniel> matz & co. already have.)

    Black> My understanding was that the conceptual and design decision
    Black> was that Strings are not arrays, and are therefore not
    Black> obliged or constrained to have an Array-like API (any more
    Black> than arrays are obliged to have a String-like API).

    To me, the term _array_ means ``sequence of elements stored in a
    contiguous chunk of memory,'' just like the term _list_ means
    ``sequence of elements stored in a linked list of cells.''

    As long as strings are implemented as arrays, I reserve the right to
    refer to them as such, regardless of whether String < Array.
    (Of course, I won't do it needlessly since it is confusing.)

    In Haskell, strings are not arrays, but lists. In many languages,
    strings are immutable, which deemphasises their nature as arrays.
    In Ruby, however, strings are mutable arrays (of bytes?). So far I
    have not been able to understand the desire to obfuscate this fact.

    What's the harm of admitting that strings are arrays? What's the harm
    of making them at least _quack_ alike?

    Devin> Since the String methods don't have a consistent notion of an
    Devin> "element," it doesn't seem it would hurt to choose whichever
    Devin> notion we want for a potential #shift method.

    Black> That's true only if there's an imperative to have a String
    Black> instance method called "shift". I don't think there is.

    Why not? This thread started with an example of the need for one.
    Of course you can use string.slice!(0), but the same goes for arrays.

    The terms `shift' and `unshift' are general and well-understood.
    I can't see any reason why they shouldn't be applied to strings.

    Black> Maybe a left chop/chomp operation would be a good idea, but I
    Black> think it should be called lchop, which would be consistent
    Black> with other string method naming (rather than with array
    Black> method naming).

    I fail to see the point in that. String#chop is meant to chop off
    end-of-line characters; String#lchop wouldn't be. So using that name
    would be _inconsistent_ with other string method naming.

    --
    Daniel Brockman <>

    So really, we all have to ask ourselves:
    Am I waiting for RMS to do this? --TTN.
    Daniel Brockman, Jul 9, 2005
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Alexandra Stehman
    Replies:
    5
    Views:
    30,607
    Chris Smith
    Jun 17, 2004
  2. Bill Reyn
    Replies:
    3
    Views:
    2,253
    Bob Hairgrove
    Jun 22, 2004
  3. Mantorok Redgormor

    initializing arrays of arrays

    Mantorok Redgormor, Sep 10, 2003, in forum: C Programming
    Replies:
    4
    Views:
    566
  4. Ben

    Strings, Strings and Damned Strings

    Ben, Jun 22, 2006, in forum: C Programming
    Replies:
    14
    Views:
    757
    Malcolm
    Jun 24, 2006
  5. Philipp
    Replies:
    21
    Views:
    1,127
    Philipp
    Jan 20, 2009
Loading...

Share This Page