Help with arrays of strings

Jon Smirl

I only have a passing acquaintance with Python and I need to modify some
existing code. This code is going to get called with 10GB of data so it
needs to be fairly fast.

http://cvs2svn.tigris.org/ is code for converting a CVS repository to
Subversion. I'm working on changing it to convert from CVS to git.

The existing Python RCS parser provides me with the CVS deltas as
strings. I need to get these deltas into an array of lines so that I can
apply the diff commands that add/delete lines (like 10 d20, etc). What is
the most efficient way to do this? The data structure needs to be
able to apply the diffs efficiently too.

The strings have embedded @'s doubled as an escape sequence. Is there an
efficient way to convert these back to single @'s?

After each diff is applied I need to convert the array of lines back into
a string, generate a sha-1 over it and then compress it with zlib and
finally write it to disk.

The 10GB of data is Mozilla CVS when fully expanded.

Thanks for any tips on how to do this.

Jon Smirl
(e-mail address removed)
 
Simon Forman

Splitting a string into a list (array) of lines is easy enough. If you
want to discard the line endings,

lines = s.splitlines()

or, if you want to keep them,

lines = s.splitlines(True)

Replacing substrings in a string is also easy,

s = s.replace('@@', '@')

For efficiency, you'll probably want to do the replacement first, then
split:

lines = s.replace('@@', '@').splitlines()
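Here's a rough sketch of that unescape-then-split step on a made-up sample (the name raw and its contents are only for illustration, they don't come from cvs2svn). It keeps the line endings so the text can be rebuilt exactly later. One thing to watch: splitlines() also treats "\r" and "\r\n" as line ends, so check that this matches how the deltas count lines.

```python
# Hypothetical sample: RCS-style text in which every literal "@" is doubled.
raw = "user@@example.com\nsecond line\nlast @@ line\n"

# Unescape once on the whole string, then split, keeping the line endings
# so the text can be rebuilt exactly with "".join().
lines = raw.replace("@@", "@").splitlines(True)

print(lines)

# Joining the pieces gives back the unescaped text unchanged.
rebuilt = "".join(lines)
```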


Once you've got your list of lines, Python's awesome list manipulation
should make applying diffs very easy. For instance, to replace lines
3 to 7 (counting from zero) you could assign a list (containing the
replacement lines) to a "slice" of the list of lines:

lines[3:8] = replacement_lines

Here replacement_lines can be longer or shorter than the slice it
replaces; the list grows or shrinks to fit. There's a lot more to this,
so read up on Python's lists.
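To make the slice idea concrete for add/delete commands, here is a minimal sketch. It assumes the delta has already been parsed into tuples; that tuple layout and the name apply_edits are my own invention, not what the cvs2svn parser hands you. It also assumes the line numbers are 1-based, refer to the text before any edit, and come in ascending order, so it applies them from the bottom up to keep earlier line numbers valid.

```python
def apply_edits(lines, edits):
    """Return a new list with add/delete edits applied.

    `edits` is a hypothetical, already-parsed form of the delta:
    ("d", start, count, None) deletes `count` lines beginning at line `start`;
    ("a", start, count, new_lines) inserts `new_lines` after line `start`.
    Line numbers are 1-based and refer to the text before any edit,
    and the edits are assumed to be in ascending line order.
    """
    result = list(lines)
    # Work from the bottom up so earlier line numbers stay valid.
    for op, start, count, new_lines in reversed(edits):
        if op == "d":
            del result[start - 1:start - 1 + count]
        elif op == "a":
            result[start:start] = new_lines
        else:
            raise ValueError("unknown edit command: %r" % (op,))
    return result


old = ["one\n", "two\n", "three\n", "four\n"]
edits = [
    ("d", 2, 2, None),               # drop "two" and "three"
    ("a", 3, 1, ["replacement\n"]),  # add one line after old line 3
]
new = apply_edits(old, edits)
print(new)
```

Each slice operation copies part of the list, so for very large files with many edits it may be worth measuring this against building the new list in a single pass.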


To convert the list back into one string use the join() method; if you
kept the line endings,

s = "".join(lines)

or if you threw them away,

s = "\n".join(lines)

Python has standard modules for SHA-1 digests (sha, or hashlib in
Python 2.5 and later) and for zlib compression (zlib). See
http://docs.python.org/lib/lib.html
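Putting the last steps together, a sketch along these lines might do. It assumes the lines are byte strings that still carry their line endings; the function name store_revision and the output file name are made up for the example. It computes a plain SHA-1 over the joined text and writes the zlib-compressed bytes to disk.

```python
import hashlib
import os
import tempfile
import zlib


def store_revision(lines, path):
    """Join the lines, hash them, and write the compressed text to `path`."""
    data = b"".join(lines)                   # lines are assumed to keep their endings
    digest = hashlib.sha1(data).hexdigest()  # plain SHA-1 over the joined text
    with open(path, "wb") as f:
        f.write(zlib.compress(data))
    return digest


# Hypothetical sample data and output location.
lines = [b"first line\n", b"second line\n"]
path = os.path.join(tempfile.mkdtemp(), "revision.z")
digest = store_revision(lines, path)
print(digest)
```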

HTH, enjoy,
~Simon
 
Jon Smirl

Thanks for the pointers, that should be enough to get me started. I had
started off in the wrong direction looking for arrays instead of lists.

I'll code this up and give it a try. Hopefully it can run through the 10GB of
data in a few hours and not take days.

Jon Smirl
(e-mail address removed)
 
