Help with arrays of strings

Discussion in 'Python' started by Jon Smirl, Aug 1, 2006.

  1. Jon Smirl

    Jon Smirl Guest

    I only have a passing acquaintance with Python and I need to modify some
    existing code. This code is going to get called with 10GB of data so it
    needs to be fairly fast.

    http://cvs2svn.tigris.org/ is code for converting a CVS repository to
    Subversion. I'm working on changing it to convert from CVS to git.

    The existing Python RCS parser provides me with the CVS deltas as
    strings. I need to get these deltas into an array of lines so that I can
    apply the diff commands that add/delete lines (like 10 d20, etc). What is
    the most efficient way to do this? The data structure needs to be
    able to apply the diffs efficiently too.

    The strings have embedded @'s doubled as an escape sequence. Is there an
    efficient way to convert these back to single @'s?

    After each diff is applied I need to convert the array of lines back into
    a string, generate a sha-1 over it and then compress it with zlib and
    finally write it to disk.

    The 10GB of data is Mozilla CVS when fully expanded.

    Thanks for any tips on how to do this.

    Jon Smirl
    Jon Smirl, Aug 1, 2006
    #1

  2. Simon Forman

    Simon Forman Guest

    Jon Smirl wrote:
    > I only have a passing acquaintance with Python and I need to modify some
    > existing code. This code is going to get called with 10GB of data so it
    > needs to be fairly fast.
    >
    > http://cvs2svn.tigris.org/ is code for converting a CVS repository to
    > Subversion. I'm working on changing it to convert from CVS to git.
    >
    > The existing Python RCS parser provides me with the CVS deltas as
    > strings. I need to get these deltas into an array of lines so that I can
    > apply the diff commands that add/delete lines (like 10 d20, etc). What is
    > the most efficient way to do this? The data structure needs to be
    > able to apply the diffs efficiently too.
    >
    > The strings have embedded @'s doubled as an escape sequence. Is there an
    > efficient way to convert these back to single @'s?
    >
    > After each diff is applied I need to convert the array of lines back into
    > a string, generate a sha-1 over it and then compress it with zlib and
    > finally write it to disk.
    >
    > The 10GB of data is Mozilla CVS when fully expanded.
    >
    > Thanks for any tips on how to do this.
    >
    > Jon Smirl
    >


    Splitting a string into a list (array) of lines is easy enough, if you
    want to discard the line endings,

    lines = s.splitlines()

    or, if you want to keep them,

    lines = s.splitlines(True)

    Replacing substrings in a string is also easy,

    s = s.replace('@@', '@')

    For efficiency, you'll probably want to do the replacement first, then
    split:

    lines = s.replace('@@', '@').splitlines()
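    As a small illustration of the one-liner above, here is a sketch that
    wraps it in a helper and keeps the line endings. The name
    delta_to_lines and the sample text are made up for the example.

```python
def delta_to_lines(text):
    """Undo the doubled-@ escaping, then split into lines.

    Passing True to splitlines() keeps the line endings, so the
    original text can be rebuilt later with "".join(lines).
    """
    return text.replace('@@', '@').splitlines(True)


# Hypothetical sample input with one escaped @ in it.
lines = delta_to_lines("user@@example.com\nsecond line\n")
```

    One thing to watch: on text strings splitlines() also breaks on
    "\r" and a few other separator characters, not only "\n". If only
    "\n" should count as a line ending, splitting on "\n" explicitly may
    be the safer choice.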


    Once you've got your list of lines, Python's list manipulation
    should make applying diffs very easy. For instance, to replace lines
    3 to 7 (counting from zero) you could assign a list of replacement
    lines to a "slice" of the list of lines:

    lines[3:8] = replacement_lines

    The slice end is exclusive, so 3:8 covers indices 3 through 7, and
    replacement_lines may be longer or shorter than the slice it replaces.
    There's a lot more to this; read up on Python's lists.
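
    To make the slice idea concrete for the diff step, here is a rough
    sketch. It assumes the RCS edit-script format, in which "aN M" adds
    the next M script lines after line N and "dN M" deletes M lines
    starting at line N, with 1-based line numbers that refer to the text
    before any edits and commands given in ascending order. The function
    name apply_rcs_delta is invented for the example; it is not part of
    cvs2svn, and it does no validation beyond rejecting unknown commands.

```python
def apply_rcs_delta(lines, delta_lines):
    """Apply an RCS-style edit script to a list of lines (a sketch).

    lines       -- the current text as a list of lines
    delta_lines -- the edit script, also as a list of lines
    Returns a new list; the input list is left untouched.
    """
    result = list(lines)
    offset = 0  # shift between original line numbers and positions in result
    i = 0
    while i < len(delta_lines):
        command = delta_lines[i].strip()
        i += 1
        kind = command[0]
        start, count = (int(n) for n in command[1:].split())
        if kind == 'd':
            # Delete `count` lines beginning at original line `start`.
            pos = start - 1 + offset
            del result[pos:pos + count]
            offset -= count
        elif kind == 'a':
            # Insert the next `count` script lines after original line `start`.
            pos = start + offset
            result[pos:pos] = delta_lines[i:i + count]
            i += count
            offset += count
        else:
            raise ValueError("unknown command: %r" % command)
    return result


original = ["one\n", "two\n", "three\n", "four\n"]
script = ["d2 1\n", "a2 1\n", "TWO\n", "a4 2\n", "five\n", "six\n"]
patched = apply_rcs_delta(original, script)
```

    Each slice assignment or deletion shifts the tail of the list, so a
    script with many commands against a very long file may be faster if
    the result is built up in a single pass instead. For typical source
    files the version above is probably fine.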


    To convert the list back into one string use the join() method; if you
    kept the line endings,

    s = "".join(lines)

    or if you threw them away,

    s = "\n".join(lines)

    Python has standard modules for SHA-1 digests (sha) and zlib
    compression (zlib). See http://docs.python.org/lib/lib.html
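
    The join, hash, compress and write steps could be put together
    roughly as follows. This is only a sketch: it uses hashlib, which in
    later Python versions provides SHA-1 in place of the sha module, and
    it assumes the lines are byte strings with their line endings kept.
    The function name store_revision and the choice to name the file
    after the digest are assumptions made for the example.

```python
import hashlib
import os
import tempfile
import zlib


def store_revision(lines, directory):
    """Join lines, hash the result, and write it zlib-compressed.

    Returns (hex digest, path of the file written).
    """
    data = b"".join(lines)                    # lines kept their endings
    digest = hashlib.sha1(data).hexdigest()   # SHA-1 over the full text
    path = os.path.join(directory, digest)    # hypothetical naming scheme
    with open(path, "wb") as f:
        f.write(zlib.compress(data))
    return digest, path


# Round trip in a throwaway directory to check nothing is lost.
with tempfile.TemporaryDirectory() as tmp:
    digest, path = store_revision([b"hello\n", b"world\n"], tmp)
    with open(path, "rb") as f:
        restored = zlib.decompress(f.read())
```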

    HTH, enjoy,
    ~Simon
    Simon Forman, Aug 1, 2006
    #2

  3. Jon Smirl

    Jon Smirl Guest

    On Mon, 31 Jul 2006 18:33:34 -0700, Simon Forman wrote:
    > Splitting a string into a list (array) of lines is easy enough, if you
    > want to discard the line endings,


    Thanks for the pointers, that should be enough to get me started. I had
    started off in the wrong direction looking for arrays instead of lists.

    I'll code this up and give it a try. Hopefully it can run through the 10GB of
    data in a few hours and not take days.

    Jon Smirl
    Jon Smirl, Aug 1, 2006
    #3
