Random Access using IO#pos in code blocks

Discussion in 'Ruby' started by Arun Kumar, Apr 28, 2009.

  1. Arun Kumar

    Arun Kumar Guest

    Hello everyone,
    I'm 20 days new to Ruby, please forgive if I make any mistakes. I'm
    on a project where I'm indexing certain words in a text document. So
    I'm also storing the file position where the word occurs. But the
    Problem is:
    The IO#pos points to the end of the file all the while... Below is
    the code I'm working on:

    File.open(file_name) do |f|
      f.readlines("\r\n\r\n").each do |para|
        para.scan(/\b\w+\b/).each do |word|
          word = word.downcase.stem
          # excludes empty and frequent words
          if (!stoplist.include? word) && (!word.empty?)
            # freq is a hash that stores an array containing index,
            # position of word (THE PROBLEM)..
            unless freq.has_key?(word)
              freq[word] = [1,f.pos,file_name]
            else
              freq[word].to_a[0] += 1
              freq[word].to_a<< f.pos << file_name
            end

            unless wfreq.has_key?(word)
              wfreq[word] = [1,f.pos,file_name]
            else
              wfreq[word].to_a[0] += 1
              wfreq[word].to_a<< f.pos << file_name
            end
          end
        end
      end
    end


    File.open(file_name+".yaml","w"){|f| YAML.dump(freq,f)}

    Also it would be great if someone told me the replacement for the
    deprecated 'to_a' method used above :)

    Any help is greatly appreciated


    ---------------



    --
    || श्री जानकीरघुनाथोविजयते ||
     
    Arun Kumar, Apr 28, 2009
    #1

  2. On 28.04.2009 22:32, Arun Kumar wrote:
    > Hello everyone,
    > I'm 20 days new to Ruby, please forgive if I make any mistakes. I'm
    > on a project where I'm indexing certain words in a text document. So
    > I'm also storing the file position where the word occurs. But the
    > Problem is:
    > The IO#pos points to the end of the file all the while... Below is
    > the code I'm working on:
    >
    > File.open(file_name) do |f|
    >
    > f.readlines("\r\n\r\n").each do |para|


    The reason is in the line above.
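To spell that out: readlines consumes the whole stream before the block runs, so pos already sits at the end of the data for every word. A minimal sketch, assuming a StringIO as a stand-in for the real file (the sample text and variable names are made up for illustration):

```ruby
require "stringio"

# StringIO stands in for the real file so the sketch is self-contained.
io = StringIO.new("first para\r\n\r\nsecond para\r\n\r\n")

paras = io.readlines("\r\n\r\n")   # reads everything up to EOF in one call
pos_after_readlines = io.pos       # already at the end of the data

# Reading one paragraph at a time keeps pos in step with the loop.
io.rewind
starts = []
until io.eof?
  starts << io.pos                 # offset where the next paragraph begins
  io.gets("\r\n\r\n")
end
```

Reading with gets and a separator is only one possible way to keep pos meaningful; tracking offsets by hand after readlines works too.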

    > para.scan(/\b\w+\b/).each do |word|
    >
    > word = word.downcase.stem
    > if (!stoplist.include? word) && (!word.empty?) #excludes empty
    > and frequent words
    >
    > unless freq.has_key?(word)
    > freq[word] = [1,f.pos,file_name] # freq is a hash, that
    > stores an array containing index, position of word (THE PROBLEM)..
    > else
    > freq[word].to_a[0] += 1
    > freq[word].to_a<< f.pos << file_name
    > end
    >
    > unless wfreq.has_key?(word)
    > wfreq[word] = [1,f.pos,file_name]
    > else
    > wfreq[word].to_a[0] += 1
    > wfreq[word].to_a<< f.pos << file_name
    > end
    >
    > end
    > end
    > end
    >
    >
    > File.open(file_name+".yaml","w"){|f| YAML.dump(freq,f)}
    >
    > Also it would be great if someone told me the replacement for the
    > deprecated 'to_a' method used above :)


    Why do you convert an Array into an Array?
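As far as I know, the deprecation warning in Ruby 1.8 concerns the default Object#to_a, not Array#to_a, and on a value that is already an Array the call is simply redundant. A small sketch (the entry contents are hypothetical):

```ruby
entry = [1, 42, "doc.txt"]        # hypothetical index entry
same  = entry.to_a.equal?(entry)  # Array#to_a returns the receiver itself

# Kernel#Array wraps non-arrays and passes arrays through unchanged.
wrapped   = Array("doc.txt")
unchanged = Array(entry)
```

So freq[word].to_a[0] can be written as freq[word][0], and Array(x) is the usual choice when a value may or may not be an Array.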

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Apr 29, 2009
    #2

  3. Arun Kumar wrote:

    > unless freq.has_key?(word)
    > freq[word] = [1,f.pos,file_name] # freq is a hash, that
    > stores an array containing index, position of word (THE PROBLEM)..
    > else
    > freq[word].to_a[0] += 1
    > freq[word].to_a<< f.pos << file_name
    > end


    BTW, you can replace all that by:

    freq[word] ||= [0]
    freq[word][0] += 1
    freq[word] << f.pos << file_name

    As for the pos, since you've already slurped in the data you'll need to
    remember where you are within your buffer. Your outer loop could become
    something like this:

    para_pos = 0
    f.readlines("\r\n\r\n").each do |para|
      ...
      para_pos += para.size  # para still includes its trailing "\r\n\r\n"
    end
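A runnable sketch of that bookkeeping, assuming a StringIO in place of the file and made-up sample text. Since readlines keeps the separator at the end of each paragraph, the paragraph's own length is the full distance to the next one; bytesize is used here on the assumption that byte offsets into the file are wanted:

```ruby
require "stringio"

SEP = "\r\n\r\n"
io  = StringIO.new("alpha beta#{SEP}gamma#{SEP}delta")

para_pos    = 0
para_starts = []
io.readlines(SEP).each do |para|
  para_starts << para_pos
  # Each para still ends with SEP (except possibly the last one),
  # so its byte length already covers the separator.
  para_pos += para.bytesize
end
```

After the loop para_pos equals the total size of the data, which is a handy sanity check.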

    Unfortunately, I don't think String#scan will give you offsets into the
    strings found.

    In ruby 1.8 you can write this:

    pos = 0
    while md = /\b\w+\b/.match(para[pos..-1])
      word = md[0]
      puts "Match #{word} at #{para_pos+pos+md.begin(0)}"
      pos += md.end(0)
      ...
    end

    In ruby 1.9 (but not 1.8.6/1.8.7), Regexp#match takes a start pos, so
    you could optimise it to this:

    pos = 0
    while md = /\b\w+\b/.match(para, pos)
      word = md[0]
      puts "Match #{word} at #{para_pos+md.begin(0)}"
      pos = md.end(0)
      ...
    end

    However in ruby 1.9 the offsets used will be in terms of number of
    characters, not number of bytes. It would be up to you to convert this
    back into byte offsets into the file, if that's what you're after.
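One way to do that conversion, sketched for Ruby 1.9 or later with a made-up sample string: take the text before the match and ask for its size in bytes.

```ruby
# "\u00EF" is a two-byte character in UTF-8, so the two offsets differ.
para = "na\u00EFve text"
md   = /text/.match(para)

char_offset = md.begin(0)            # counts characters before the match
byte_offset = md.pre_match.bytesize  # counts bytes before the match
```

Note that pre_match builds a new string, so for very large paragraphs this costs an allocation per match.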
    --
    Posted via http://www.ruby-forum.com/.
     
    Brian Candler, Apr 29, 2009
    #3
  4. 2009/4/29 Brian Candler <>:
    > Unfortunately, I don't think string#scan will give you offsets into the
    > strings found.
    >
    > In ruby 1.8 you can write this:
    >
    >  pos = 0
    >  while md = /\b\w+\b/.match(para[pos..-1])
    >    word = md[0]
    >    puts "Match #{word} at #{para_pos+pos+md.begin(0)}"
    >    pos += md.end(0)
    >    ...
    >  end
    >
    > In ruby 1.9 (but not 1.8.6/1.8.7), Regexp.match takes a start pos, so
    > you could optimise it to this:
    >
    >  pos = 0
    >  while md = /\b\w+\b/.match(para, pos)
    >    word = md[0]
    >    puts "Match #{word} at #{para_pos+md.begin(0)}"
    >    pos = md.end(0)
    >    ...
    >  end


    String#scan is likely faster than manually matching portions with
    #match. In both versions of Ruby you can do this to get the
    /character/ offset:

    irb(main):001:0> s=%{foo bar baz}
    => "foo bar baz"
    irb(main):002:0> s.scan(/\w+/) { p $`.length }
    0
    4
    8
    => "foo bar baz"

    > However in ruby 1.9 the offsets used will be in terms of number of
    > characters, not number of bytes. It would be up to you to convert this
    > back into byte offsets into the file, if that's what you're after.


    This is an important point to remember!

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Apr 29, 2009
    #4
  5. Arun Kumar

    Arun Kumar Guest

    Thank you so much for the kind responses! I'm pleased to be part of such a
    kind community :)

     
    Arun Kumar, Apr 29, 2009
    #5
  6. Arun Kumar

    Arun Kumar Guest

    > > unless freq.has_key?(word)
    > > freq[word] = [1,f.pos,file_name] # freq is a hash, that
    > > stores an array containing index, position of word (THE PROBLEM)..
    > > else
    > > freq[word].to_a[0] += 1
    > > freq[word].to_a<< f.pos << file_name
    > > end

    >
    > BTW, you can replace all that by:
    >
    > freq[word] ||= [0]
    > freq[word][0] += 1
    > freq[word] << f.pos << file_name
    >
    >

    I've tried doing it but since 'freq' is a hash it gives the following error:

    preprocessor.rb:32:in `calc_frequency_word_list': undefined method `[]='
    for 0:Fixnum (NoMethodError)
    from copy of preprocessor.rb:25:in `scan'
    from copy of preprocessor.rb:25:in `calc_frequency_word_list'
    from copy of preprocessor.rb:23:in `each'
    from copy of preprocessor.rb:23:in `calc_frequency_word_list'
    from copy of preprocessor.rb:61
     
    Arun Kumar, Apr 29, 2009
    #6
  7. Arun Kumar wrote:
    >> freq[word] ||= [0]
    >> freq[word][0] += 1
    >> freq[word] << f.pos << file_name
    >>
    >>

    > I've tried doing it but since 'freq' is a hash it gives the following
    > error:


    Show your actual code. The following code works just fine:

    freq = {}
    %w{foo bar baz bar}.each do |word|
    freq[word] ||= [0]
    freq[word][0] += 1
    freq[word] << "pos" << "name"
    end
    puts freq.inspect

    The error suggests that you have initialized freq[word] to 0, not to
    [0].

    Or perhaps you set freq = Hash.new(0), which is wrong in this case,
    because the default element needs to be [0] not 0.
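A small sketch of how Hash.new(0) leads to that NoMethodError (the key "foo" is just an example). The default value 0 is truthy, so ||= never stores the array, and the later element assignment is sent to the Integer:

```ruby
freq = Hash.new(0)        # default value is the Integer 0, not an Array
freq["foo"] ||= [0]       # no assignment happens: the default 0 is truthy

error = begin
  freq["foo"][0] += 1     # 0[0] reads bit 0, then 0.[]=(0, 1) is attempted
  nil
rescue NoMethodError => e
  e
end
```

The rescued error names `[]=` as the missing method, matching the trace above (recent Rubies report Integer where 1.8 said Fixnum).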

    An alternative is to auto-initialize each hash element like this:

    freq = Hash.new { |h,k| h[k] = [0] }
    %w{foo bar baz bar}.each do |word|
    freq[word][0] += 1
    freq[word] << "pos" << "name"
    end
    puts freq.inspect
    --
    Posted via http://www.ruby-forum.com/.
     
    Brian Candler, Apr 29, 2009
    #7
  8. Arun Kumar

    Arun Kumar Guest

    >
    >
    > Or perhaps you set freq = Hash.new(0), which is wrong in this case,
    > because the default element needs to be [0] not 0.
    >
    > An alternative is to auto-initialize each hash element like this:
    >
    > freq = Hash.new { |h,k| h[k] = [0] }
    > %w{foo bar baz bar}.each do |word|
    > freq[word][0] += 1
    > freq[word] << "pos" << "name"
    > end
    > puts freq.inspect

    Exactly the mistake I had done! So silly of me! Thank you SO MUCH :)


     
    Arun Kumar, Apr 29, 2009
    #8
  9. 2009/4/29 Arun Kumar <>:
    >>
    >>
    >> Or perhaps you set freq = Hash.new(0), which is wrong in this case,
    >> because the default element needs to be [0] not 0.
    >>
    >> An alternative is to auto-initialize each hash element like this:
    >>
    >> freq = Hash.new { |h,k| h[k] = [0] }
    >> %w{foo bar baz bar}.each do |word|
    >>  freq[word][0] += 1
    >>  freq[word] << "pos" << "name"
    >> end
    >> puts freq.inspect


    This is a typical case where I would introduce a separate class or
    even multiple classes because it makes life so much more readable.

    WordPosition = Struct.new :file, :pos

    WordStats = Struct.new :word, :positions do
      def count; positions.size; end
    end

    freq = Hash.new {|h,word| h[word.freeze] = WordStats.new(word, [])}
    ...
    freq[word].positions << WordPosition.new(file_name, pos)
    ...
    ...

    Then you can do

    freq.sort_by {|w,stat| stat.count}
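For reference, a self-contained sketch of the Struct approach run over a tiny sample. The text, the file name "sample.txt", and the use of scan with $~ to get positions are assumptions added for illustration, not part of the original suggestion:

```ruby
WordPosition = Struct.new(:file, :pos)

WordStats = Struct.new(:word, :positions) do
  # The count is derived from the stored positions instead of kept separately.
  def count
    positions.size
  end
end

freq = Hash.new { |h, word| h[word] = WordStats.new(word, []) }

text = "foo bar baz bar"
text.scan(/\w+/) do |word|
  # $~ holds the MatchData of the current match inside the scan block.
  freq[word].positions << WordPosition.new("sample.txt", $~.begin(0))
end

# Most frequent words first.
ranked = freq.sort_by { |_word, stat| -stat.count }
```

Because count is computed from positions, the counter and the position list cannot drift apart the way the [count, pos, file, ...] arrays could.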

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Apr 29, 2009
    #9
  10. Robert Klemme wrote:
    > String#scan is likely faster than manually matching portions with
    > #match. In both versions of Ruby you can do this to get the
    > /character/ offset:
    >
    > irb(main):001:0> s=%{foo bar baz}
    > => "foo bar baz"
    > irb(main):002:0> s.scan(/\w+/) { p $`.length }
    > 0
    > 4
    > 8
    > => "foo bar baz"


    Well, my guess is that would be *less* efficient for large paragraphs,
    since $` forces allocation of a new string containing all the text from
    the start to the current point. But that reminds me, there is a global
    variable containing a MatchData object: $~

    So you can write:

    irb(main):001:0> s=%{foo bar baz}
    => "foo bar baz"
    irb(main):002:0> s.scan(/\w+/) { p $~.begin(0) }
    0
    4
    8
    => "foo bar baz"
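The same idea outside irb, as a short sketch that collects each word together with its character offset. Regexp.last_match is the spelled-out equivalent of $~; the sample string is the one from the session above:

```ruby
s = "foo bar baz"

hits = []
s.scan(/\w+/) do |word|
  md = Regexp.last_match          # same object as $~ for the current match
  hits << [word, md.begin(0)]     # character offset of the match within s
end
```

Adding the paragraph's starting offset to each value would give a position relative to the whole file, with the usual caveat about characters versus bytes.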

    Regards,

    Brian.
    --
    Posted via http://www.ruby-forum.com/.
     
    Brian Candler, Apr 30, 2009
    #10
  11. Arun Kumar

    Arun Kumar Guest

    Thank you so much for all the responses :)

     
    Arun Kumar, Apr 30, 2009
    #11
  12. 2009/4/30 Brian Candler <>:
    > Robert Klemme wrote:
    >> String#scan is likely faster than manually matching portions with
    >> #match. In both versions of Ruby you can do this to get the
    >> /character/ offset:
    >>
    >> irb(main):001:0> s=%{foo bar baz}
    >> => "foo bar baz"
    >> irb(main):002:0> s.scan(/\w+/) { p $`.length }
    >> 0
    >> 4
    >> 8
    >> => "foo bar baz"

    >
    > Well, my guess is that would be *less* efficient for large paragraphs,
    > since $` forces allocation of a new string containing all the text from
    > the start to the current point.


    Last time I checked the actual string buffer was shared so the
    overhead is just a single instance. I do have to admit though that I
    do not know when the object is allocated (i.e. at time of match or
    when referencing $`).

    > But that reminds me, there is a global
    > variable containing a MatchData object: $~
    >
    > So you can write:
    >
    > irb(main):001:0> s=%{foo bar baz}
    > => "foo bar baz"
    > irb(main):002:0> s.scan(/\w+/) { p $~.begin(0) }
    > 0
    > 4
    > 8
    > => "foo bar baz"


    Also a good variant! (Btw, MatchData might be even more heavyweight
    than a sub string.)

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Apr 30, 2009
    #12
  13. Robert Klemme wrote:
    > I do have to admit though that I
    > do not know when the object is allocated (i.e. at time of match or
    > when referencing $`).


    Experiment suggests the MatchData is created immediately on the match,
    and the string is instantiated lazily from that. This makes sense; it
    would be very inefficient to allocate strings for $`, $1, $2, $3, ... $'
    when maybe none of them will be used. But the MatchData object has the
    original string plus all the offsets.

    def count(klass)
    c = 0
    ObjectSpace.each_object(klass) { c += 1 }
    c
    end

    str = " foo bar baz "

    c1 = [count(MatchData), count(String)]

    str =~ /(\w+)/

    c2 = [count(MatchData), count(String)]

    x = $~

    c3 = [count(MatchData), count(String)]

    y = $`

    c4 = [count(MatchData), count(String)]

    puts [c1,c2,c3,c4].inspect
    # [[0, 188], [1, 188], [1, 188], [1, 189]]
    --
    Posted via http://www.ruby-forum.com/.
     
    Brian Candler, Apr 30, 2009
    #13
  14. 2009/4/30 Brian Candler <>:
    > Robert Klemme wrote:
    >> I do have to admit though that I
    >> do not know when the object is allocated (i.e. at time of match or
    >> when referencing $`).

    >
    > Experiment suggests the MatchData is created immediately on the match,
    > and the string is instantiated lazily from that. This makes sense; it
    > would be very inefficient to allocate strings for $`, $1, $2, $3, ... $'
    > when maybe none of them will be used. But the MatchData object has the
    > original string plus all the offsets.


    Ah, good to know! Thanks for the experimenting!

    "Tune in next week when you'll hear Dr. Brian say: what's this fuse for?"
    ;-)

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Apr 30, 2009
    #14