File position and buffers

Cee Joe · Apr 27, 2011

Hi all,

In a bit of a rut. Have a file with a lot of text. I want to separate
the text in this file into entries. Each entry that I would be separating
would be done so using IO.pos, and when that cursor reaches a certain
character in the file, it will ideally place all the content before that
character into a buffer. Then the cursor will continue reading until it
hits that same character again and put that content into a buffer, so on
and so forth. (Character I'll be reading would be a greater than symbol)

Would I use a do iterator or use a while loop with a gets method? Or
readlines perhaps?

File:

>entry 1
rubyrubyrubyrubyrubyrubyrubyruby
(newline here which I don't want)
>entry 2
rubyrubyrubyrubyrubyrubyrubyruby

Entry1 and entry2 will be in separate buffers which I would be able to
access again.

buffer1 = >entry 1
rubyrubyrubyrubyrubyrubyrubyruby

buffer2 = >entry 2
rubyrubyrubyrubyrubyrubyrubyruby

PS. The file is huge, so I don't want to read it into memory. What is
the best way to approach this? Any suggestions or comments would be
helpful. Thanks!

Jesús Gabriel y Galán · Apr 27, 2011

You could use foreach checking if each line starts with '>'. If it doesn't
you accumulate in a buffer; if it does you do something with the current
buffer and start a new one.

Jesus
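
A minimal sketch of that suggestion, assuming every entry starts with a line beginning with '>' and that each entry is handed off as soon as it is complete. The method name each_entry and the sample data are made up for illustration, and StringIO stands in for the real file.

```ruby
require 'stringio'

# Hypothetical helper: yields one complete entry at a time, so only the
# entry currently being built is held in memory.
def each_entry(io)
  buffer = nil
  io.each_line do |line|
    if line.start_with?('>')
      yield buffer if buffer   # hand over the finished entry
      buffer = line.dup        # the header line starts a new buffer
    elsif buffer
      buffer << line.chomp     # body lines are appended without their newline
    end
  end
  yield buffer if buffer       # the last entry has no '>' after it
end

sample = StringIO.new(">entry 1\nruby\nruby\n\n>entry 2\nruby\n")
each_entry(sample) { |entry| p entry }
```

For a real file the same method could be called inside File.open(path) { |f| ... }, since it only relies on each_line.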

jake kaiden · Apr 27, 2011

hi Cee -

this may well be WAY too simple for your needs, but it seems to me you
could do something like this:

(0text.txt is a file with 7 lines that say rubyrubyrubyetc.)

f = "0text.txt"
file = File.open(f)
buffer = []
bufferindex = 0

file.each_line{|line|
buffer[bufferindex] = line.chomp
bufferindex += 1
}

p buffer[0]
p buffer[1]
p buffer[2]
#etc...

of course you could also set a maximum number of lines per buffer:

f = "0text.txt"
file = File.open(f)
buffer = Hash.new{|hash, key| hash[key] = []}
bufferkey = 0
maxbuflength = 3

file.each_line{|line|
if buffer[bufferkey].length == maxbuflength
bufferkey +=1
buffer[bufferkey] << line.chomp
else
buffer[bufferkey] << line.chomp
end
}

p buffer[0]
p buffer[1]
p buffer[2]

if the file's extremely long i guess you'd want to write a method to
dump the buffers at some point too.

maybe this is dumb, i hope not!
cheers,

-j

7stud -- · Apr 28, 2011

Cee Joe wrote in post #995381:

Hi all,

In a bit of a rut. Have a file with a lot of text. I want to separate
the text in this file into entries. Each entry that I would be separating
would be done so using IO.pos and when that cursor reaches a certain
character in the file, it will ideally place all the content before that
character into a buffer. Then the cursor will continue reading until it
hits that same character again and put that content into a buffer, so on
and so forth. (Character I'll be reading would be a greater than symbol)

There is absolutely no reason to use pos() to read that file.

Would I use a do iterator or use a while loop with a gets method? Or
readlines perhaps?

File:
rubyrubyrubyrubyrubyrubyrubyruby
(newline here which I don't want)

chomp() removes one newline, if present, at the end of a string.
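
A quick illustration of that behaviour, using made-up strings:

```ruby
with_one   = "rubyrubyruby\n".chomp   # the single trailing newline is removed
with_two   = "rubyrubyruby\n\n".chomp # only one of the two newlines is removed
with_none  = "rubyrubyruby".chomp     # no trailing newline, so nothing changes

p with_one, with_two, with_none
```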

PS. The file is huge, so I don't want to read it into memory. What is
the best way to approach this? Any suggestions or comments would be
helpful. Thanks!

Well, then you have to tell us what you want to do with the segments of
the file. If you store each chunk in a variable, then you will have
read the whole file into memory.

You say your file looks like this:

entry 1 <---WHAT'S AT THE END OF THIS LINE??

rubyrubyrubyrubyruby <---WHAT'S AT THE END OF THIS LINE??
(newline here which I don't want)

Those look like newlines. Are you saying that your data is organized
into paragraphs, i.e. separated by two newlines? Like this:

>entry1\n
rubyrubyruby\n
\n
>entry2\n
rubyrubyruby\n
\n
>entry3

A paragraph is defined as two consecutive newlines between lines. Note
that in ruby the default line separator is one newline. But you can
change that to two newlines--or any other string:

require 'stringio'

str =<<ENDOFSTRING
>entry1
11111111111

>entry2
22222222222

>entry3
33333333333
ENDOFSTRING

input = StringIO.new(str)
$/ = "\n\n"

input.each do |para|
p para.sub(/\n+ \z/xms, "")
end

--output:--
">entry1\n11111111111"
">entry2\n22222222222"
">entry3\n33333333333"

7stud -- · Apr 28, 2011

This shows the output better:

e = input.enum_for(:each) #You can do this for a File too.

e.each_slice(2) do |buffer1, buffer2|
puts "buffer1: #{buffer1.inspect}"
puts "buffer2: #{buffer2.inspect}"
puts "-" * 10
end

--output:--
buffer1: ">entry1\n11111111111\n\n"
buffer2: ">entry2\n22222222222\n\n"
----------
buffer1: ">entry3\n33333333333\n"
buffer2: nil
----------

Before doing the sub() on buffer2, you will have to check if it's nil:

if buffer2.nil?
#don't do a sub()
else
#do the sub()
end

Robert Klemme · Apr 28, 2011

Hi all,

In a bit of a rut. Have a file with a lot of text. I want to separate
the text in this file into entries. Each entry that I would be separating
would be done so using IO.pos and when that cursor reaches a certain
character in the file, it will ideally place all the content before that
character into a buffer. Then the cursor will continue reading until it
hits that same character again and put that content into a buffer, so on
and so forth. (Character I'll be reading would be a greater than symbol)

Would I use a do iterator or use a while loop with a gets method? Or
readlines perhaps?

File:
rubyrubyrubyrubyrubyrubyrubyruby
(newline here which I don't want)
rubyrubyrubyrubyrubyrubyrubyruby

Entry1 and entry2 will be in separate buffers which I would be able to
access again.

buffer1 = >entry 1
rubyrubyrubyrubyrubyrubyrubyruby

buffer2 = >entry 2
rubyrubyrubyrubyrubyrubyrubyruby

PS. The file is huge, so I don't want to read it into memory. What is
the best way to approach this? Any suggestions or comments would be
helpful. Thanks!

One of the simplest approaches is to use Ruby's ability to use
arbitrary record delimiters:

File.foreach file_name, ">" do |chunk|
chunk.chomp! ">"
chunk.gsub! /\r\n?|\n/, '' # remove line terminators
# if you need the leading ">":
# chunk[0,0] = ">"
p chunk
end

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Cee Joe · Apr 28, 2011

Thanks guys for your helpful comments. I will be more descriptive. I am
an intern and my mentor wants me to use the IO.pos to read the
characters of the file until the character reaches the ">" symbol. So
upon the cursor reaching the ">" symbol(which is the start of a new
entry), he wants me to place that previous entry in a buffer. Here is
the actual test file I am working with:

>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA\n
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG\n
\n
>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA\n
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG\n
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG\n
\n
>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA\n
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAG\n
TTAGTCGCTGACGCATGCACG\n
\n

7stud, you are right there are two consecutive newlines which I failed
to mention. This should be the output of a buffer for one entry:

>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA\n
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG <-- no "\n"
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG <-- no "\n"

Notice how the newlines are gone. So with the exception of the header in
each entry, the newlines should be gone and be placed in a buffer. I am
lost on how to use the IO.pos and a file iterator to make sure each
respective entry goes into a buffer without the file being indexed into
memory.

Thanks in advance, I'm new to the language and trying to wrap my head
around it.

7stud -- · Apr 28, 2011

You still have not told us what you are supposed to do with the stuff
you read in?? You can read a file line by line and print out each line
as you go and the maximum amount of memory used will be one line's
worth. However, if you are supposed to store all the lines in an
array, then you will read the whole file into memory.

Thanks guys for your helpful comments. I will be more
descriptive. I am an intern and my mentor wants me to
use the IO.pos to read the characters of the file
until the character reaches the ">" symbol.

What problems is that giving you? You can create a loop, read the
character at pos(i), then increment i, and do what Jesús Gabriel y Galán
suggested.


7stud -- · Apr 28, 2011

Robert K. wrote in post #995478:

One of the simplest approaches is to use Ruby's ability to use
arbitrary record delimiters:

File.foreach file_name, ">" do |chunk|
chunk.chomp! ">"
chunk.gsub! /\r\n?|\n/, '' # remove line terminators

Cee Joe, are you reading the file in binary mode or text mode?

7stud -- · Apr 28, 2011

7stud -- wrote in post #995589:

Cee Joe, are you reading the file in binary mode or
text mode?

If you don't know, then show us the line in your code where you open the
file.

Cee Joe · Apr 28, 2011

7stud -- wrote in post #995581:

You still have not told us what you are supposed to do with the stuff
you read in?? You can read a file line by line and print out each line
as you go and the maximum amount of memory used will be one line's
worth. However, if you are supposed to store all the lines in an
array, then you will read the whole file into memory.

I am extracting text from each entry I read in, something I have figured
out already. I want to read the file line by line and just store each
entry into a buffer when it reaches the ">" symbol. Then extract
specific info from it later. The entry lengths all vary, as there are long
and short lengths. File is in text mode.

What problems is that giving you? You can create a loop, read the
character at pos(i), then increment i, and do what Jesús Gabriel y Galán
suggested.

Could you show me a simple example or refer me to a link?


Cee Joe · Apr 28, 2011

7stud -- wrote in post #995596:

7stud -- wrote in post #995589:

If you don't know, then show us the line in your code where you open the
file.

f = File.open("test.fasta", "r")

Where test.fasta contains the entries i posted earlier..

7stud -- · Apr 29, 2011

Cee Joe wrote in post #995597:

my mentor wants me to use the IO.pos to read the
characters of the file until the character reaches the ">" symbol.

IO.pos() does not read in data, so you are going to have to ask your
mentor what he means. You should also ask your mentor if this is a
lesson in how not to do things. If he doesn't reply in the affirmative,
then you should find a new mentor.

I am extracting text from each entry I read in, something I have figured
out already. I want to read the file line by line and just store each
entry into a buffer when it reaches the ">" symbol. Then extract
specific info from it later.

You told us you were not supposed to read the whole file into memory.
If you store every line in an array, then you will have read the whole
file into memory. Once again, you are not being clear on what you want
to do with the data. You need to tell us which of the following you
want to do:

1) Store every entry in an array, and "extract specific info from it
later".

2) Read one entry, do something to the entry, then discard it and read
in the next entry.

The entry lengths all vary as there long
and short lengths. File is in text mode.

Ok.

You could use each_byte to read the file char by char (that assumes your
file contains all ascii characters), then when you find a '>', seek()
back to the start of the file, and use IO.sysread() to read:

old_pos = 0
pos() - old_pos

number of characters. Then do something like:

old_pos = pos()

and keep doing that. But, you will be reading every entry twice, which
is stupid.
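
A rough sketch of what such a pos()-driven loop could look like, for illustration only. It assumes '>' never appears anywhere except at the start of an entry, that the data is plain ASCII, and it uses read(1) and read(n) rather than each_byte and sysread to keep the example short. The method name each_entry_by_pos is made up, and StringIO stands in for the file. As noted above, every entry gets read twice.

```ruby
require 'stringio'

# Hypothetical helper: remembers the offset of each '>' via pos, then seeks
# back and re-reads the bytes between two markers as one entry.
def each_entry_by_pos(io, marker = '>')
  start = nil                          # offset where the current entry began
  until io.eof?
    here = io.pos                      # offset of the character about to be read
    next unless io.read(1) == marker
    if start
      io.seek(start)                   # go back to the start of the entry
      yield io.read(here - start)      # second read of the same bytes
      io.seek(here + 1)                # resume just after the marker
    end
    start = here
  end
  return unless start
  finish = io.pos                      # end of file closes the last entry
  io.seek(start)
  yield io.read(finish - start)
end

each_entry_by_pos(StringIO.new(">a\nX\n\n>b\nY\n")) { |entry| p entry }
```

Reading one character per call is slow on a large file, which is part of why the separator-based approaches elsewhere in the thread are simpler.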


Cee Joe · Apr 29, 2011

2) Read one entry, do something to the entry, then discard it and read
in the next entry.

This is what I want to do. Read one entry, extract information from it,
then read next entry. He says using an array will take up a lot of
memory so he said use a buffer.

But, you will end up reading every entry twice, which
is stupid. The easiest way to read in the file and prepare each entry
is to set the input separator to "\n\n", then use each() to read in a
paragraph, then use split("\n") to split each entry into lines, then add
back a \n to the first line.

Also, are you aware that this:

GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG <-- no "\n"
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG <-- no "\n"

is equivalent to:

GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATGCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG

Yes I am aware of that - I just put "no \n" for emphasis. Regarding the
pos(), I think he said to use it as a guide to help with the detection
of each ">" . Thanks for being patient and helping out.

7stud -- · Apr 29, 2011

If you don't have to use pos(), then see my first post.

jake kaiden · Apr 29, 2011

hi Cee -

copying the text you posted above into the file "0text.txt" and
running this:

f = "0text.txt"
file = File.open(f)
buffer = []
bufferindex = 0

file.each_line(sep=">"){|line|
buffer[bufferindex] = line.chomp
bufferindex += 1
}

p buffer[0]
p buffer[1]
p buffer[2]
p buffer[3]

i get this as output:

#=> ">"
#=> "gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895),
mRNA\\n\nAGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG\\n\n\\n\n>"
#=> "gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895),
mRNA\\n\nGTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG\\n\nCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG\\n\n\\n\n>"
#=> "gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895),
mRNA\\n\nCGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAG\\n\nTTAGTCGCTGACGCATGCACG\\n\n\\n"

does this work for you? you could easily write ways to deal with,
dump, and reset the buffers when they fill up. you can of course also
clean up all the "\n"'s...

i agree with 7stud that using #.pos and #.gets seems like a long walk
off a short pier. i'm pretty green myself, and there are probably
better ways to iterate through the file, but #.each_line(sep=">") works
just fine, and doesn't eat up memory.

- j

Cee Joe · Apr 29, 2011

7stud -- wrote in post #995683:

If you don't have to use pos(), then see my first post. At some point,
you might ask him why he thinks that pos() would be of any help at all!

Thanks jake and 7stud for replying. I tried this in irb for your first
post:
=> "\n\n"

Before doing the sub() on buffer2, you will have to check if it's nil:

if buffer2.nil?
#don't do a sub()
else
#do the sub()
end

Output:
">gi|329299107|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895),
mRNA\nAGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG\n\n"
">gi|329299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895),
mRNA\nGTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG\nCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG\n\n"
">gi|329299107|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895),
mRNA\nCGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAG\nTTAGTCGCTGACGCATGCACG\n"
nil
Done
=> nil

It still returns nil, am I doing what you suggested wrong?

7stud -- · Apr 29, 2011

The first thing everyone in this thread needs to realize is that '>' is
not the separator you want to look for. That's because you don't care
what character marks the beginning of every entry, rather you care what
character marks the end of every entry. The end of every entry is
marked by the string "\n\n", so you should use that as your input line
terminator. Remember, ruby uses "\n" for the input line separator by
default, which means that when you read a file using IO#each, ruby reads
lines--where the end of a line is marked by a newline. However, you can
change the input line separator to the string "\n\n" (or any other
string):

$/ = "\n\n"

Once you have an entry, then you just need to do a little housekeeping
and remove some "\n" characters.

require 'stringio'

str =<<ENDOFSTRING
>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG

>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG

>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAG
TTAGTCGCTGACGCATGCACG

ENDOFSTRING

input = StringIO.new(str) #Now input is just like a File

input.each(sep = "\n\n") do |para|
buffer = ''

lines = para.split("\n")
buffer << lines.shift << "\n"
lines.each do |line|
buffer << line
end

puts buffer
puts "-" * 20
end

p $/

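--output:--
>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG
--------------------
>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATGCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG
--------------------
>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAGTTAGTCGCTGACGCATGCACG
--------------------
"\n"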

Note that passing the new input line separator as an argument to
each() never touches the global input line separator $/ (the final
p $/ still prints "\n"), so nothing has to be restored once the
block has finished--which is a good thing.
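
A small check of that point, as a sketch with made-up data: the separator is passed as an argument, and the global $/ is compared before and after.

```ruby
require 'stringio'

before = $/                                  # normally "\n"

# Read "paragraphs" by passing the separator as an argument.
paragraphs = StringIO.new("a\nb\n\nc\n").each_line("\n\n").to_a

p paragraphs
p $/ == before                               # the global separator is unchanged
```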

Cee Joe · Apr 29, 2011

7stud -- wrote in post #995821:

I suggest that people never use irb because it has too many quirks.

The first thing you need to realize is that '>' is
not the separator you want to look for. That is the second bit of
erroneous advice your mentor gave you. That's because you don't care
what character marks the beginning of every entry, rather you care what
character marks the end of every entry. The end of every entry in your
file is marked by the string "\n\n", so you should use that as your
input line terminator. Remember, ruby uses "\n" for the input line
separator by default, which means that when you read a file using
IO#each, ruby reads lines--where the end of a line is marked by a
newline.

I understand the logic, it makes sense. What if the file looked like
this, where there is one newline separating the entries?

>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG
>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG
>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAG
TTAGTCGCTGACGCATGCACG

Would an if-else (regarding "\n" and "\n\n") do the trick? I wanted to
write my code so that it would handle both scenarios. Or maybe:

case
when "\n\n"
<code>
when "\n"
<code>
end

something to that effect? Suggestions?

jake kaiden · Apr 29, 2011

hi Cee -

hmm, i'm getting a bit confused as to what exactly you're trying to do
- but if you want to load all this stuff into a buffer without the
newlines, and regardless of how many newlines you have between each
entry (assuming that an "entry" is something that starts with ">") - i
don't see why this wouldn't work:

f = "0text.txt"
file = File.open(f)
buffer = []
bufferindex = 0

file.each(sep = ">"){|line|
buffer[bufferindex] = line
bufferindex += 1
}

## here you would do something more interesting
buffer.collect{|line|
line = line.delete("\n")
p ">#{line}"
}

which will return...
">>"
">gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895),
mRNAAGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG>"
">gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895),
mRNAGTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATGCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG>"
">gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895),
mRNACGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAGTTAGTCGCTGACGCATGCACG"

...whether you have 0 or 100,000 newlines between each entry. is this
not what you're looking for?

-j
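
A variation on the same idea that processes one entry at a time instead of collecting everything in an array, sketched under the assumption that '>' only ever appears at the start of an entry. The method name each_fasta_entry and the sample strings are invented for illustration, and StringIO stands in for the file. Because the split happens on '>', it does not matter whether entries are separated by one newline or two.

```ruby
require 'stringio'

# Hypothetical helper: yields the header line (with its '>') and the
# sequence with all line breaks removed, one entry at a time.
def each_fasta_entry(io)
  io.each_line('>') do |chunk|
    chunk = chunk.chomp('>')                # drop the next entry's marker
    next if chunk.strip.empty?              # skip the empty chunk before the first '>'
    header, *body = chunk.split(/\r?\n/)    # first line is the header
    yield ">#{header}", body.join           # blank lines vanish in the join
  end
end

one_newline  = ">h1\nAAA\nCCC\n>h2\nGGG\n"
two_newlines = ">h1\nAAA\nCCC\n\n>h2\nGGG\n"

[one_newline, two_newlines].each do |text|
  each_fasta_entry(StringIO.new(text)) do |header, sequence|
    puts header
    puts sequence
  end
end
```

Only the chunk for the current entry is alive at any point, so memory use follows the longest entry rather than the size of the file.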
