cPickle alternative?

Discussion in 'Python' started by Drochom, Aug 15, 2003.

  1. Drochom


    Hello,
    I have a huge problem with loading a very simple structure into memory.
    It is a list of tuples; it takes 6 MB and consists of 100000 elements.

        import cPickle

        plik = open("mealy","r")
        mealy = cPickle.load(plik)
        plik.close()

    This takes about 30 seconds!
    How can I accelerate it?

    Thanks in adv.
     
    Drochom, Aug 15, 2003
    #1

  2. Drochom wrote:

    > Hello,
    > I have a huge problem with loading very simple structure into memory
    > it is a list of tuples, it has 6MB and consists of 100000 elements
    >
    > import cPickle
    >
    > plik = open("mealy","r")
    > mealy = cPickle.load(plik)
    > plik.close()
    >
    > this takes about 30 seconds!
    > How can I accelerate it?
    >
    > Thanks in adv.

    What protocol did you pickle your data with? The default (protocol 0,
    ASCII text) is the slowest. I suggest you upgrade to Python 2.3 and
    save your data with the new protocol 2 -- it's likely to be fastest.


    Alex
     
    Alex Martelli, Aug 15, 2003
    #2
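    Alex's advice translates directly to current Python, where the cPickle module
    has been folded into pickle. A minimal sketch (the data and filename are
    illustrative stand-ins, not the poster's real automaton):

    ```python
    import os
    import pickle
    import tempfile

    # Toy stand-in for the real list of tuples.
    data = [(('k', 5, 0),), (('*', 0, 0),), (('a', 4, 0), ('o', 2, 0))]

    path = os.path.join(tempfile.mkdtemp(), "mealy.pkl")  # illustrative filename

    # Binary mode is required for the binary protocols; protocol 2 (new in
    # Python 2.3) is far more compact and faster to load than ASCII protocol 0.
    with open(path, "wb") as f:
        pickle.dump(data, f, protocol=2)

    with open(path, "rb") as f:
        restored = pickle.load(f)

    assert restored == data
    ```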

  3. Hi,

    I have no idea! I used a similar scheme the other day and made some
    benchmarks (I *like* benchmarks!)

    About 6 MB took 4 seconds dumping as well as loading on an 800 MHz P3 laptop.
    When using binary mode it went down to about 1.5 seconds (and the space to 2 MB).

    This is o.k., because I generally have problems being faster than 1 MB/sec
    with my 2" drive, processor and Python ;-)

    Python 2.3 seems to have an even more effective "protocol mode 2".

    Maybe your structures are *very* complex??

    Kindly
    Michael P



    "Drochom" <> schrieb im Newsbeitrag
    news:bhiqlg$9qj$...
    > Hello,
    > I have a huge problem with loading very simple structure into memory
    > it is a list of tuples, it has 6MB and consists of 100000 elements
    >
    > import cPickle
    >
    > plik = open("mealy","r")
    > mealy = cPickle.load(plik)
    > plik.close()
    >
    > this takes about 30 seconds!
    > How can I accelerate it?
    >
    > Thanks in adv.
     
    Michael Peuser, Aug 15, 2003
    #3
  4. Drochom wrote:
    >> What protocol did you pickle your data with? The default (protocol 0,
    >> ASCII text) is the slowest. I suggest you upgrade to Python 2.3 and
    >> save your data with the new protocol 2 -- it's likely to be fastest.
    >>
    >> Alex

    Thanks:)
    I'm using the default protocol. I'm not sure if I can upgrade so simply,
    because I'm using many modules for Py2.2.


    Then use protocol 1 instead -- that has been the binary pickle protocol
    for a long time, and works perfectly on Python 2.2.x :)
    (and it's much faster than protocol 0 -- the text protocol)

    --Irmen
     
    Irmen de Jong, Aug 15, 2003
    #4
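    The size gap between the text and binary protocols is easy to see in current
    Python (where cPickle lives on as pickle); a sketch with randomly generated
    data shaped like the poster's tuples:

    ```python
    import pickle
    from random import randrange

    # Data shaped like the poster's: tuples of (char, int, int).
    data = [(chr(randrange(33, 120)), randrange(1, 100000), randrange(1, 3))
            for _ in range(1000)]

    p0 = pickle.dumps(data, protocol=0)  # ASCII text protocol
    p1 = pickle.dumps(data, protocol=1)  # old binary protocol, back to Python 2.2

    # The binary encoding is noticeably more compact for this kind of data,
    # and both round-trip the data unchanged.
    assert len(p1) < len(p0)
    assert pickle.loads(p1) == data
    ```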
  5. Drochom wrote:
    > Thanks for help:)
    > Here is simple example:
    > frankly speaking it's a graph with 100000 nodes:
    > STRUCTURE:
    > [(('k', 5, 0),), (('*', 0, 0),), (('t', 1, 1),), (('o', 2, 0),), (('t', 3,
    > 0),), (('a', 4, 0), ('o', 2, 0))]


    Perhaps this matches your spec:

    from random import randrange
    import pickle, cPickle, time

    source = [(chr(randrange(33, 127)), randrange(100000), randrange(i+50))
              for i in range(100000)]

    def timed(module, flag, name='file.tmp'):
        start = time.time()
        dest = file(name, 'wb')
        module.dump(source, dest, flag)
        dest.close()
        mid = time.time()
        dest = file(name, 'rb')
        result = module.load(dest)
        dest.close()
        stop = time.time()
        assert source == result
        return mid-start, stop-mid

    On 2.2:
    timed(pickle, 0): (7.8, 5.5)
    timed(pickle, 1): (9.5, 6.2)
    timed(cPickle, 0): (0.41, 4.9)
    timed(cPickle, 1): (0.15, .53)

    On 2.3:
    timed(pickle, 0): (6.2, 5.3)
    timed(pickle, 1): (6.6, 5.4)
    timed(pickle, 2): (6.5, 3.9)

    timed(cPickle, 0): (6.2, 5.3)
    timed(cPickle, 1): (.88, .69)
    timed(cPickle, 2): (.80, .67)

    (Not tightly controlled -- I'd guess 1.5 digits)

    -Scott David Daniels
     
    Scott David Daniels, Aug 15, 2003
    #5
  6. Drochom


    "Michael Peuser" <> wrote in message
    news:bhj56t$1d8$03$-online.com...
    > o.k - I modified my test program - let it run on your machine.
    > It took 1.5 seconds - I made it 1 million records to get to 2 MByte.
    > Kindly
    > Michael
    > ------------------
    > import cPickle as Pickle
    > from time import clock
    >
    > # generate 1.000.000 records
    > r=[(('k', 5, 0),), (('*', 0, 0),), (('t', 1, 1),), (('o', 2, 0),),
    >    (('t', 3, 0),), (('a', 4, 0), ('o', 2, 0))]
    >
    > x=[]
    >
    > for i in xrange(1000000):
    >     x.append(r)
    >
    > print len(x), "records"
    >
    > t0=clock()
    > f=open ("test","w")
    > Pickle.dump(x,f,1)
    > f.close()
    > print "out=", clock()-t0
    >
    > t0=clock()
    > f=open ("test")
    > x=Pickle.load(f)
    > f.close()
    > print "in=", clock()-t0
    > ---------------------


    Hi, I'm really grateful for your help.
    I've modified your code a bit - check your times and tell me what they are.

    TRY THIS:

    import cPickle as Pickle
    from time import clock
    from random import randrange


    x=[]

    for i in xrange(20000):
        c = []
        for j in xrange(randrange(2,25)):
            c.append((chr(randrange(33,120)),randrange(1,100000),randrange(1,3)))
        c = tuple(c)
        x.append(c)
        if i%1000==0: print i # it will help you to survive the waiting...
    print len(x), "records"

    t0=clock()
    f=open ("test","w")
    Pickle.dump(x,f,0)
    f.close()
    print "out=", clock()-t0


    t0=clock()
    f=open ("test")
    x=Pickle.load(f)
    f.close()
    print "in=", clock()-t0

    Thanks once again:)
     
    Drochom, Aug 15, 2003
    #6
  7. Drochom


    Hello,

    > If speed is important, you may want to do different things depending on e.g.
    > what is in those tuples, and whether they are all the same length, etc. E.g.,
    > if they were all fixed length tuples of integers, you could do hugely better
    > than store the data as a list of tuples.

    Those tuples have different lengths indeed.

    > You could store the whole thing in a mmap image, with a length-prefixed pickle
    > string in the front representing index info.

    If I only knew how to do it...:)

    > Find a way to avoid doing it? Or doing much of it?
    > What are your access needs once the data is accessible?

    My structure stores a finite state automaton for a Polish dictionary (a lexicon,
    to be more precise) and it should be loaded once, but fast!

    Thx
    Regards,
    Przemo Drochomirecki
     
    Drochom, Aug 15, 2003
    #7
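    The length-prefixed pickle header Bengt suggests above could be laid out
    like this (a sketch under assumed conventions - the index contents, filename,
    and 4-byte prefix are all illustrative choices, not the poster's format; the
    body after the header is what would be mapped with the mmap module):

    ```python
    import os
    import pickle
    import struct
    import tempfile

    # Hypothetical index info describing the rest of the file.
    index = {"n_nodes": 100000, "record_format": "<cII"}

    path = os.path.join(tempfile.mkdtemp(), "automaton.bin")

    payload = pickle.dumps(index, protocol=2)
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(payload)))  # 4-byte little-endian length prefix
        f.write(payload)                          # pickled index info
        f.write(b"fixed-width records go here")   # placeholder for the mmap'able body

    with open(path, "rb") as f:
        (size,) = struct.unpack("<I", f.read(4))  # read the prefix first...
        recovered = pickle.loads(f.read(size))    # ...then exactly that many bytes

    assert recovered == index
    ```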
  8. Drochom


    I forgot to explain why I use tuples instead of lists:
    I was squeezing a lexicon => minimization of the automaton => using a
    dictionary => using hashable objects => using tuples (lists aren't hashable).


    Regards,
    Przemo Drochomirecki
     
    Drochom, Aug 15, 2003
    #8
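    That hashability chain can be checked directly (a sketch; the state tuple is
    an illustrative node, not real automaton data):

    ```python
    # A node's transitions as a tuple of tuples: hashable, so it can key a dict.
    state = (('a', 4, 0), ('o', 2, 0))
    registry = {state: 0}            # fine: tuples are hashable
    assert state in registry

    # The same data as a list raises TypeError, because lists are mutable
    # and therefore unhashable.
    try:
        registry[[('a', 4, 0), ('o', 2, 0)]] = 1
        raised = False
    except TypeError:
        raised = True
    assert raised
    ```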
  9. Drochom



    > Perhaps this matches your spec:
    [snip - code and timings quoted in full above]

    Hello, and Thanks, your code was extremely helpful:)

    Regards
    Przemo Drochomirecki
     
    Drochom, Aug 15, 2003
    #9
  10. On Sat, 16 Aug 2003 00:41:42 +0200, "Drochom" <> wrote:

    > Hello,
    [snip]
    > My structure stores a finite state automaton with a Polish dictionary (lexicon
    > to be more precise) and it should be loaded once but fast!

    I wonder how much space it would take to store the complete Polish word list
    with one entry each in a Python dictionary. 300k words of 6-7 characters avg?
    Say 2 MB plus the dict hash stuff. I bet it would be fast.

    Is that in effect what you are doing, except sort of like a regex state machine
    to match words character by character?

    Regards,
    Bengt Richter
     
    Bengt Richter, Aug 16, 2003
    #10
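    Bengt's idea of a plain hashed word list is easy to sketch (toy words standing
    in for the real lexicon): membership tests on a dict or set are average-case
    O(1), with no character-by-character automaton walk.

    ```python
    # Hypothetical Polish words standing in for the 300k-entry lexicon.
    lexicon = {"kot", "okno", "tato"}

    # Hashed membership test: one lookup, regardless of word length.
    assert "kot" in lexicon
    assert "pies" not in lexicon
    ```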
  11. Hi Drochom,

    (1) Your dataset seems to break the binary cPickle mode ;-) (I tried it with
    the "new pickle" in 2.3 - same result: "EOF error" when loading back...) Maybe
    there is someone interested in fixing this ....


    (2) I ran your code and - as you noticed - it takes some time to *generate*
    the data structure. To be fair, pickle has to do the same, so it cannot be
    *significantly* faster!!!
    The size of the file was 5.5 MB

    (3) Timings (2.2):
    Generation of data: 18 secs
    Dumping: 3.2 secs
    Loading: 19.4 secs

    (4) I couldn't refrain from running it under 2.3
    Generation of data: 8.5 secs !!!!
    Dumping: 6.4 secs !!!!
    Loading: 5.7 secs


    So your program might really improve when changing to 2.3 - and if
    anyone can fix the cPickle bug, protocol mode 2 will be even more efficient.

    Kindly
    Michael

    "Drochom" <> schrieb im Newsbeitrag
    news:bhjn6v$pi8$...
    [....]
    > TRY THIS:
    [snip - code quoted in full above]
    > Thanks once again:)
     
    Michael Peuser, Aug 16, 2003
    #11
  12. Drochom wrote:

    >> import cPickle
    >>
    >> plik = open("mealy","r")
    >> mealy = cPickle.load(plik)
    >> plik.close()
    >
    > this takes about 30 seconds!
    > How can I accelerate it?


    Perhaps it's worth looking into PyTables:

    <http://pytables.sourceforge.net/doc/PyCon.html#section4>


    Cheers,

    // Klaus

    --
    ><> unselfish actions pay back better
     
    Klaus Alexander Seistrup, Aug 16, 2003
    #12
  13. Tim Evans

    "Michael Peuser" <> writes:

    > Hi Drochem,
    >
    > (1) Your dataset seems to break the binary cPickle mode ;-) (I tried it with
    > the "new Pickle" in 2.3 - same result: "EOF error" when loading back...) May
    > be there is someone interested in fixing this ....

    [snip]
    > > f=open ("test","w")

    [snip]
    > > f=open ("test")

    [snip]

    Note that on windows, you must open binary files using binary mode
    when reading and writing them, like so:

    f = open('test', 'wb')
    f = open('test', 'rb')
    ^^^^

    If you don't do this, binary data will be corrupted by the automatic
    conversion of '\n' to '\r\n' by win32. This is very likely what is
    causing the above error.

    --
    Tim Evans
     
    Tim Evans, Aug 16, 2003
    #13
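    Tim's point can be demonstrated directly: a binary pickle stream routinely
    contains raw newline bytes, which is exactly what text-mode translation on
    Windows would mangle. A sketch in modern Python (where cPickle is merged
    into pickle; the data is illustrative):

    ```python
    import pickle

    blob = pickle.dumps([('x', 10, 1)], protocol=1)

    # The integer 10 is encoded as the BININT1 opcode followed by the byte
    # 0x0a - a literal newline. A file opened in text mode on Windows would
    # rewrite it as b'\r\n' and corrupt the stream, producing an EOF error
    # on load, like the one reported above.
    assert b'\n' in blob
    assert pickle.loads(blob) == [('x', 10, 1)]
    ```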
  14. So stupid of me :-(((

    Now here are the benchmarks I got from Drochom's dataset. I think it should
    suffice to use the binary mode of 2.2. (I checked the 2.3 data on a different
    disk the other day - that made them not comparable!! I now use the same disk
    for the tests.)

    Timings (2.2.2):
    Generation of data: 18 secs
    Dumping: 3 secs
    Loading: 18.5 secs
    Filesize: 5.5 MB

    Binary dump: 2.4
    Binary load: 3
    Filesize: 2.8 MB

    2.3
    Generation of data: 9 secs
    Dumping: 2.4
    Loading: 2.8


    Binary dump: 1
    Binary load: 1.9
    Filesize: 2.8 MB

    Mode 2 dump: 0.9
    Mode 2 load: 1.7
    Filesize: 2.6 MB

    The much faster time for generating the data in 2.3 could be due to an
    improved random generator(?). That had always been quite slow..

    Kindly
    Michael P



    "Tim Evans" <> schrieb im Newsbeitrag
    news:...
    [snip - quoted in full above]
     
    Michael Peuser, Aug 17, 2003
    #14
