Re: Writing huge Sets() to disk

Discussion in 'Python' started by Martin MOKREJŠ, Jan 10, 2005.

  1. Tim Peters wrote:
    > [Martin MOKREJŠ]
    >
    >> just imagine, you want to compare how many words are in English, German,
    >>Czech, and Polish dictionaries. You collect words from every language and record
    >>them in a dict or Set, as you wish.

    >
    >
    > Call the set of all English words E; G, C, and P similarly.
    >
    >
    >> Once you have those Set's or dict's for those 4 languages, you ask
    >>for common words

    >
    >
    > This Python expression then gives the set of words common to all 4:
    >
    > E & G & C & P
    >
    >
    >>and for those unique to Polish.

    >
    >
    > P - E - G - C
    >
    > is a reasonably efficient way to compute that.


    Nice, is it equivalent to common / unique methods of Sets?
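
    For reference, the operators map directly onto set methods: `&` is
    intersection() and `-` is difference(). A minimal sketch with toy word
    lists (the lists are placeholders, not real dictionaries):

        E = set(["dog", "cat", "pies"])      # English
        G = set(["hund", "katze", "pies"])   # German
        C = set(["pes", "kocka", "pies"])    # Czech
        P = set(["pies", "kot"])             # Polish

        common = E & G & C & P       # == E.intersection(G).intersection(C).intersection(P)
        only_polish = P - E - G - C  # == P.difference(E).difference(G).difference(C)

        print(common)       # words present in all four lists
        print(only_polish)  # words found only in the Polish list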

    >
    >
    >>I have no estimates
    >>of real-world numbers, but we might be in range of 1E6 or 1E8?
    >>I believe in any case, huge.

    >
    >
    > No matter how large, it's utterly tiny compared to the number of
    > character strings that *aren't* words in any of these languages.
    > English has a lot of words, but nobody estimates it at over 2 million
    > (including scientific jargon, like names for chemical compounds):
    >
    > http://www.worldwidewords.org/articles/howmany.htm


    As I've said, what I actually analyze is something other than languages.
    However, it can be described with the example of words in different languages.

    But nevertheless, imagine 1E6 words of size 15. That's maybe 1.5GB of raw
    data. Will sets be appropriate, do you think?

    >>My concern is actually purely scientific, not really related to analysis
    >>of these 4 languages, but I believe it describes my intent quite well.
    >>
    >> I wanted to be able to get a list of words NOT found in, say, Polish,
    >>and therefore wanted to have a list of all theoretically existing words.
    >>In principle, I can drop this idea of having an ideal, theoretical lexicon.
    >>But I have to store those real-world dictionaries to the hard drive anyway.

    >
    >
    > Real-word dictionaries shouldn't be a problem. I recommend you store
    > each as a plain text file, one word per line. Then, e.g., to convert
    > that into a set of words, do
    >
    > f = open('EnglishWords.txt')
    > set_of_English_words = set(f)


    I'm aware I can't keep set_of_English_words in memory.

    > f.close()
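
    (Note that set(f) keeps the trailing newline on each word; a minimal
    variant that strips it, assuming one word per line and an illustrative
    file name:)

        f = open('EnglishWords.txt')
        set_of_English_words = set(line.strip() for line in f)  # drop the '\n'
        f.close()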


    M.
     
    Martin MOKREJŠ, Jan 10, 2005
    #1

  2. Martin MOKREJŠ wrote:


    > But nevertheless, imagine 1E6 words of size 15. That's maybe 1.5GB of raw
    > data. Will sets be appropriate, do you think?


    You started out with 20E20, then cut back to 1E15 keys;
    now it is down to one million, but you claim that these
    will take 1.5 GB.

    On my system storing 1 million words of length 15
    as keys of a python dictionary is around 75MB.
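
    (A rough sketch of how such a number can be checked, assuming a Unix-like
    system where resource.getrusage() reports peak memory use; exact figures
    will vary by platform and Python version:)

        import random
        import resource
        import string

        # Build one million random 15-character "words" as dict keys.
        d = {}
        for _ in range(10**6):
            word = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
            d[word] = 0

        # ru_maxrss is kilobytes on Linux, bytes on macOS.
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)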

    I.
     
    Istvan Albert, Jan 10, 2005
    #2

  3. Istvan Albert wrote:
    > Martin MOKREJŠ wrote:
    >
    >
    >> But nevertheless, imagine 1E6 words of size 15. That's maybe 1.5GB of raw
    >> data. Will sets be appropriate, do you think?

    >
    >
    > You started out with 20E20, then cut back to 1E15 keys;
    > now it is down to one million, but you claim that these
    > will take 1.5 GB.


    I gave up the theoretical approach. Practically, I might need to store
    up to maybe those 1E15 keys.

    So are you saying 1 million words are better stored in a dictionary than
    in a set, with your own function to get out those unique or common
    words?

    >
    > On my system storing 1 million words of length 15
    > as keys of a python dictionary is around 75MB.


    Fine, that's what I wanted to hear. How do you improve the algorithm?
    Do you delay indexing to the very latest moment or do you let your
    computer index 999 999 times just for fun?

    >
    > I.
     
    Martin MOKREJŠ, Jan 10, 2005
    #3
  4. Martin MOKREJŠ wrote:

    > Istvan Albert wrote:


    > So are you saying 1 million words are better stored in a dictionary than
    > in a set, with your own function to get out those unique or common
    > words?


    I have said nothing even remotely like that.

    > Fine, that's what I wanted to hear. How do you improve the algorithm?
    > Do you delay indexing to the very latest moment or do you let your
    > computer index 999 999 times just for fun?


    I think that you need to first understand how dictionaries work.
    The time needed to insert a key is independent of
    the number of values in the dictionary.

    Istvan.
     
    Istvan Albert, Jan 10, 2005
    #4
  5. Martin MOKREJŠ

    Tim Peters Guest

    [Martin MOKREJŠ]
    > ...
    >
    > I gave up the theoretical approach. Practically, I might need to store
    > up to maybe those 1E15 keys.


    We should work on our multiplication skills here <wink>. You don't
    have enough disk space to store 1E15 keys. If your keys were just one
    byte each, you would need to have 4 thousand disks of 250GB each to
    store 1E15 keys. How much disk space do you actually have? I'm
    betting you have no more than one 250GB disk.
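
    (The arithmetic behind that estimate, spelled out:)

        keys = 1e15           # number of keys
        bytes_per_key = 1     # the most optimistic case
        disk = 250e9          # one 250GB disk, in bytes
        print(keys * bytes_per_key / disk)   # -> 4000.0 disks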

    ....

    [Istvan Albert]
    >> On my system storing 1 million words of length 15
    >> as keys of a python dictionary is around 75MB.


    > Fine, that's what I wanted to hear. How do you improve the algorithm?
    > Do you delay indexing to the very latest moment or do you let your
    > computer index 999 999 times just for fun?


    It remains wholly unclear to me what "the algorithm" you want might
    be. As I mentioned before, if you store keys in sorted text files,
    you can do intersection and difference very efficiently just by using
    the Unix `comm` utility.
     
    Tim Peters, Jan 10, 2005
    #5
  6. Tim Peters wrote:
    > [Martin MOKREJŠ]
    >
    >>...
    >>
    >>I gave up the theoretical approach. Practically, I might need to store
    >>up to maybe those 1E15 keys.

    >
    >
    > We should work on our multiplication skills here <wink>. You don't
    > have enough disk space to store 1E15 keys. If your keys were just one
    > byte each, you would need to have 4 thousand disks of 250GB each to
    > store 1E15 keys. How much disk space do you actually have? I'm
    > betting you have no more than one 250GB disk.


    I can get about 700GB on RAID5, but this doesn't help here of course. :(
    I definitely appreciate your comments; I somehow forgot to make sure
    I can store it. I was mainly distracted by the fact that I might hit
    almost the same size anyway, if too few words are used in reality.
    Therefore, when looking for those words not found in such a dictionary,
    I'd be close to the theoretical, maximal size, say on the order of E15
    or E14.


    > ...
    >
    > [Istvan Albert]
    >
    >>>On my system storing 1 million words of length 15
    >>>as keys of a python dictionary is around 75MB.

    >
    >
    >>Fine, that's what I wanted to hear. How do you improve the algorithm?
    >>Do you delay indexing to the very latest moment or do you let your
    >>computer index 999 999 times just for fun?

    >
    >
    > It remains wholly unclear to me what "the algorithm" you want might


    My intent is to do some analysis in protein science. But it can really be
    thought of as analysing those 4 different languages.

    > be. As I mentioned before, if you store keys in sorted text files,
    > you can do intersection and difference very efficiently just by using
    > the Unix `comm` utility.


    Now I got your point. I understand that comm(1) is written in C, but it still
    has to scan file1 once and file2 n times, where n is the number of lines
    in file1, right? Without any index ... I'll consider it, actually will test it,
    thanks!


    I was really hoping I'd get an answer on how to alter the indexes for dictionaries
    in Python.

    You convinced me not to even try to construct the theoretical dictionary,
    as it will take ages just to create. Even if I'd manage, I couldn't
    save it (the theoretical one, and possibly not even the dict(theoret) - dict(real)
    result).

    Still, before I give up on the whole project, once more:

    I'll parse some text files, isolate separate words and add them to
    either a Set(), list, dict, or flatfile line by line.

    Depending on the above, I'll compare them and look for those unique
    to some "language". I need to keep track of frequencies used
    in every such language, so the dict approach would be the best.
    The number stored as a value would be a float ^H^H^H^H^H^H Decimal()
    type - a very small number.

    Once more, I expect to have between E4 or E5 and E8??? words
    stored in 20 dictionaries (remember, words of sizes in the range 1-20?).
    Each of those 20 dictionaries should be, I believe, indexed just once.
    The indexing method should know that all entries in a given file are of the same
    size, i.e. 5 chars, 15 chars, 20 chars etc.
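
    (One way to read the "20 dictionaries" idea, as a rough sketch: words
    grouped by length, each group its own dict of counts; the helper name is
    made up:)

        def group_by_length(words):
            """Return {word_length: {word: count}} for an iterable of words."""
            by_length = {}
            for word in words:
                bucket = by_length.setdefault(len(word), {})
                bucket[word] = bucket.get(word, 0) + 1
            return by_length

        tables = group_by_length(["A", "AA", "AI", "AIK", "FGR"])
        print(sorted(tables.keys()))   # -> [1, 2, 3]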

    I already did implement the logic to walk through those 20 different
    dictionaries from language a and from language b and find out those
    unique to a or common to both of them. Why I went to ask on this list
    was to make sure I took the right approach. Sets seemed to be the better
    solution for the simple comparison (common, unique). To keep track of those
    very small frequencies, I have to have dictionaries anyway. I'll say
    it again: how can I improve the speed of dictionary indexing?
    It doesn't matter here if I have 10E4 keys in a dictionary or
    10E8 keys in a dictionary.

    Or I wanted to hear: go for Sets(); having a set with 1E8 keys
    might take 1/10 of the size you need for a dictionary ... but
    you cannot dump them efficiently to disk. Using shelve will
    cost you maybe 2x the size of the dictionary approach and will
    also be slower than writing a dictionary directly.
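
    (For comparison, a minimal sketch of dumping such a dict to disk with the
    standard shelve module; whether it really costs ~2x or is slower is
    something to measure rather than assume, and the file name is illustrative:)

        import shelve

        counts = {"AA": 2, "AIK": 1, "FGR": 3}    # toy data

        db = shelve.open('counts.shelve')         # persistent, dict-like, on disk
        db.update(counts)
        db.close()

        db = shelve.open('counts.shelve')
        print(db["FGR"])                          # -> 3
        db.close()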

    Something along those lines. I really appreciate your input!
    Martin
     
    Martin MOKREJŠ, Jan 10, 2005
    #6
  7. Martin MOKREJŠ

    Tim Peters Guest

    Re: Writing huge Sets() to disk

    [Tim Peters]
    >> As I mentioned before, if you store keys in sorted text files,
    >> you can do intersection and difference very efficiently just by using
    >> the Unix `comm` utility.


    [Martin MOKREJŠ]
    > Now I got your point. I understand that comm(1) is written in C, but it still
    > has to scan file1 once and file2 n times, where n is the number of lines
    > in file1, right? Without any index ... I'll consider it, actually will test it,
    > thanks!


    `comm` is much more efficient than that. Note that the input files
    have to be sorted. Then, in a single pass over both files (not 2
    passes, not 3, not n, just 1), it can compute any subset of these
    three (do `man comm`):

    1. The words common to both files.
    2. The words unique to "the first" file.
    3. The words unique to "the second" file.

    It's essentially just doing a merge on two sorted lists, and how it
    works should be obvious if you give it some thought. It takes time
    proportional to the sum of the lengths of the input files, and nothing
    *can* be faster than that.
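
    (A pure-Python sketch of that single-pass merge, assuming both input files
    are already sorted with one word per line; this mimics what comm does, it
    is not comm's actual implementation:)

        def merge_compare(path_a, path_b):
            """One pass over two sorted word files: (only in a, only in b, common)."""
            only_a, only_b, common = [], [], []
            fa = open(path_a)
            fb = open(path_b)
            a = fa.readline()
            b = fb.readline()
            while a and b:
                if a < b:
                    only_a.append(a.strip())
                    a = fa.readline()
                elif a > b:
                    only_b.append(b.strip())
                    b = fb.readline()
                else:
                    common.append(a.strip())
                    a = fa.readline()
                    b = fb.readline()
            while a:                          # leftovers unique to file a
                only_a.append(a.strip())
                a = fa.readline()
            while b:                          # leftovers unique to file b
                only_b.append(b.strip())
                b = fb.readline()
            fa.close()
            fb.close()
            return only_a, only_b, common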

    > I was really hoping I'd get an answer on how to alter the indexes for dictionaries
    > in Python.


    Sorry, I don't have a guess for what that might mean.

    > You convinced me not to even try to construct the theoretical dictionary,
    > as it will take ages just to create. Even if I'd manage, I couldn't
    > save it (the theoretical one, and possibly not even the dict(theoret) - dict(real)
    > result).


    Worse, it didn't sound like a good approach even if you could save it.

    > Still, before I give up on the whole project, once more:
    >
    > I'll parse some text files, isolate separate words and add them to
    > either a Set(), list, dict, or flatfile line by line.
    >
    > Depending on the above, I'll compare them and look for those unique
    > to some "language". I need to keep track of frequencies used
    > in every such language,


    Frequencies of what? "A language" here can contain some words
    multiple times each?

    > so the dict approach would be the best. The number stored as a value would
    > be a float ^H^H^H^H^H^H Decimal() type - a very small number.


    Integers are the natural type for counting, and require less memory
    than floats or decimals.
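
    (A small sketch of that suggestion: count with plain integers while
    parsing, and divide only at the end if relative frequencies are needed;
    the text is just an example:)

        text = "the cat sat on the mat the end"

        counts = {}
        for word in text.split():
            counts[word] = counts.get(word, 0) + 1    # integer counts while parsing

        total = sum(counts.values())
        freq = dict((w, n / float(total)) for w, n in counts.items())
        print(counts["the"], freq["the"])             # -> 3 0.375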

    > Once more, I expect to have between E4 or E5 and E8??? words
    > stored in 20 dictionaries (remember, words of sizes in the range 1-20?).
    > Each of those 20 dictionaries should be, I believe, indexed just once.
    > The indexing method should know that all entries in a given file are of the same
    > size, i.e. 5 chars, 15 chars, 20 chars etc.


    I think you're making this more complicated than it needs to be.

    > I already did implement the logic to walk through those 20 different
    > dictionaries from language a and from language b and find out those
    > unique to a or common to both of them. Why I went to ask on this list
    > was to make sure I took the right approach. Sets seemed to be the better
    > solution for the simple comparison (common, unique). To keep track of those
    > very small frequencies, I have to have dictionaries anyway. I'll say
    > it again: how can I improve the speed of dictionary indexing?
    > It doesn't matter here if I have 10E4 keys in a dictionary or
    > 10E8 keys in a dictionary.


    What reason do you have for believing that the speed of dictionary
    indexing is worth any bother at all to speed up? Dict lookup is
    generally considered to be extremely fast already. If you're just
    *guessing* that indexing is the bottleneck, you're probably guessing
    wrong -- run a profiler to find out where the real bottleneck is.
     
    Tim Peters, Jan 10, 2005
    #7
  8. Re: Writing huge Sets() to disk

    Tim Peters wrote:
    > [Tim Peters]
    >
    >>>As I mentioned before, if you store keys in sorted text files,
    >>>you can do intersection and difference very efficiently just by using
    >>>the Unix `comm` utility.

    >
    >
    > [Martin MOKREJŠ]
    >
    >>Now I got your point. I understand that comm(1) is written in C, but it still
    >>has to scan file1 once and file2 n times, where n is the number of lines
    >>in file1, right? Without any index ... I'll consider it, actually will test it,
    >>thanks!

    >
    >
    > `comm` is much more efficient than that. Note that the input files
    > have to be sorted. Then, in a single pass over both files (not 2
    > passes, not 3, not n, just 1), it can compute any subset of these
    > three (do `man comm`):
    >
    > 1. The words common to both files.
    > 2. The words unique to "the first" file.
    > 3. The words unique to "the second" file.


    I actually read the manpage before replying to the list.

    > It's essentially just doing a merge on two sorted lists, and how it
    > works should be obvious if you give it some thought. It takes time
    > proportional to the sum of the lengths of the input files, and nothing
    > *can* be faster than that.


    Well, I think I understand now, because "Scott David Daniels" wrote
    probably the very same logic in Python code in this thread.

    >>I was really hoping I'd get an answer on how to alter the indexes for dictionaries
    >>in Python.

    >
    >
    > Sorry, I don't have a guess for what that might mean.


    I'm not an expert; MySQL, for example, gives you the ability to index, say,
    the first 10 characters of a text column, which is typically a varchar.
    Just out of curiosity I'd like to know how to do it in Python.

    When importing data from a flatfile into a MySQL table, there's an
    option to delay indexing to the very last moment, when all keys are
    loaded (it doesn't make sense to re-create the index after each new
    row is added to the table). I believe it's exactly the same waste of cpu/io
    in this case - when I create a dictionary and fill it with data,
    I want to create the index afterwards, not after every key/value pair
    is recorded.
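
    (Nothing built into Python dicts does a MySQL-style "index the first 10
    characters"; if it were really wanted, it would have to be built by hand,
    e.g. as an extra dict from prefixes to words. A rough sketch with an
    invented helper name:)

        def build_prefix_index(words, length=10):
            """Map each prefix of `length` characters to the words sharing it."""
            index = {}
            for word in words:
                index.setdefault(word[:length], []).append(word)
            return index

        idx = build_prefix_index(["AAAAABBBBBCC", "AAAAABBBBBZZ", "KIGMAINGFRA"])
        print(idx["AAAAABBBBB"])   # -> ['AAAAABBBBBCC', 'AAAAABBBBBZZ']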

    >>You convinced me not to even try to construct the theoretical dictionary,
    >>as it will take ages just to create. Even if I'd manage, I couldn't
    >>save it (the theoretical one, and possibly not even the dict(theoret) - dict(real)
    >>result).

    >
    >
    > Worse, it didn't sound like a good approach even if you could save it.
    >
    >
    >>Still, before I give up on the whole project, once more:
    >>
    >>I'll parse some text files, isolate separate words and add them to
    >>either a Set(), list, dict, or flatfile line by line.
    >>
    >>Depending on the above, I'll compare them and look for those unique
    >>to some "language". I need to keep track of frequencies used
    >>in every such language,

    >
    >
    > Frequencies of what? "A language" here can contain some words
    > multiple times each?


    Yes, to compute the frequency of each word used in say "language A",
    I'll count the number of occurrences and then compute its frequency.
    The frequency should be recorded as a value in the dictionary,
    where the keys are unique and represent the word itself (or its hash
    as recommended already).

    Once more, the dictionary will contain every word only once; it will really
    be a unique key.

    >>so the dict approach would be the best. The number stored as a value would
    >>be a float ^H^H^H^H^H^H Decimal() type - a very small number.

    >
    >
    > Integers are the natural type for counting, and require less memory
    > than floats or decimals.


    I hoped I was clear ... when I said I'll compute the frequency of those words.
    The algorithm to compute it will be subject to change during development. ;)


    >>Once more, I expect to have between E4 or E5 and E8??? words
    >>stored in 20 dictionaries (remember, words of sizes in the range 1-20?).
    >>Each of those 20 dictionaries should be, I believe, indexed just once.
    >>The indexing method should know that all entries in a given file are of the same
    >>size, i.e. 5 chars, 15 chars, 20 chars etc.

    >
    >
    > I think you're making this more complicated than it needs to be.


    I hope the algorithm can save some logic. For example, it can turn off
    locking support, index only part of the key, etc.

    >
    >
    >>I already did implement the logic to walk through those 20 different
    >>dictionaries from language a and from language b and find out those
    >>unique to a or common to both of them. Why I went to ask on this list
    >>was to make sure I took the right approach. Sets seemed to be the better
    >>solution for the simple comparison (common, unique). To keep track of those
    >>very small frequencies, I have to have dictionaries anyway. I'll say
    >>it again: how can I improve the speed of dictionary indexing?
    >>It doesn't matter here if I have 10E4 keys in a dictionary or
    >>10E8 keys in a dictionary.

    >
    >
    > What reason do you have for believing that the speed of dictionary
    > indexing is worth any bother at all to speed up? Dict lookup is
    > generally considered to be extremely fast already. If you're just
    > *guessing* that indexing is the bottleneck, you're probably guessing
    > wrong -- run a profiler to find out where the real bottleneck is.


    For example, looking up a key with its value only once in the whole for loop
    tells me I don't need an index. Yes, I'll do this 4 times for those 4 languages,
    but I still think it's faster to live without an index when I can sort the
    records. I think I like the sorted text file approach, and the dictionary
    approach without an index would be almost the same, especially if I manage
    to tell the db layer not to move the cursor randomly but just to walk down the
    pre-sorted data.

    As an extra, I could record those frequencies within a dictionary.
    Actually, I need them; that's why I do all this work.

    The idea of using Sets() was just to find the unique/common words more
    quickly because of the possible speed gain. But I don't see a way to solve
    this without dictionaries.
     
    Martin MOKREJŠ, Jan 11, 2005
    #8
  9. Re: Writing huge Sets() to disk

    Martin MOKREJŠ wrote:

    > Tim Peters wrote:
    >

    ....

    >>> I was really hoping I'd get an answer on how to alter the indexes for
    >>> dictionaries in Python.

    >>
    >>
    >>
    >> Sorry, I don't have a guess for what that might mean.

    >
    >
    > I'm not an expert; MySQL, for example, gives you the ability to index, say,
    > the first 10 characters of a text column, which is typically a varchar.
    > Just out of curiosity I'd like to know how to do it in Python.
    >
    > When importing data from a flatfile into a MySQL table, there's an
    > option to delay indexing to the very last moment, when all keys are
    > loaded (it doesn't make sense to re-create the index after each new
    > row is added to the table). I believe it's exactly the same waste of cpu/io
    > in this case - when I create a dictionary and fill it with data,
    > I want to create the index afterwards, not after every key/value pair
    > is recorded.


    Okay, you seem to be missing this key idea:

    A hash-table (dict) is approximately the same level of abstraction
    as a btree index.

    Your MySQL "index" is likely implemented as a btree. A hash-table could
    just as readily be used to implement the index. When you insert into
    either of these structures (btree or hash), you are not creating an
    "index" separate from the structure, the structure *is* an "index" of
    the type you are thinking about. These structures are both efficient
    representations that map from a data-value to some other data-value.
    Hashes with a good hash function (such as Python's dicts) are basically
    O(1) (worst case O(n), as Tim notes), while Btrees (such as common
    database indices) are O(log(n)) (or something of that type, basically it
    grows much more slowly than n).

    >>> Once more, I expect to have between E4 or E5 and E8??? words
    >>> stored in 20 dictionaries (remember, words of sizes in the range 1-20?).
    >>> Each of those 20 dictionaries should be, I believe, indexed just once.
    >>> The indexing method should know that all entries in a given file are of the same
    >>> size, i.e. 5 chars, 15 chars, 20 chars etc.

    >>
    >>
    >>
    >> I think you're making this more complicated than it needs to be.

    >
    >
    > I hope the algorithm can save some logic. For example, it can turn off
    > locking support, index only part of the key, etc.


    I'd tend to agree with Tim. You're making this all far too complex in
    what appears to be the absence of any real need. There's a maxim in
    computer programming that you avoid, wherever possible, what is called
    "premature optimisation". You are here trying to optimise away a
    bottleneck that doesn't exist (indexing overhead, and locking support
    are basically nil for a dictionary). It is almost a certainty that
    these are not *real* bottlenecks in your code (what with not existing),
    so your time would be better spent writing a "dumb" working version of
    the code and profiling it to see where the *real* bottlenecks are.

    > For example, looking up a key with its value only once in the whole for loop
    > tells me I don't need an index. Yes, I'll do this 4 times for those 4 languages,
    > but I still think it's faster to live without an index when I can sort the
    > records. I think I like the sorted text file approach, and the dictionary
    > approach without an index would be almost the same, especially if I manage
    > to tell the db layer not to move the cursor randomly but just to walk down the
    > pre-sorted data.


    Again, you don't really have a cursor with a dictionary (and since it's
    randomly ordered, a cursor wouldn't mean anything). A *btree* has an
    order, but not a dictionary. You could construct a sorted list in
    memory that would approximate what I *think* you're thinking of as a
    dictionary-without-an-index.
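
    (A sketch of that "sorted list in memory" idea, using the standard bisect
    module: O(log n) lookups, and the data stays walkable in sorted order;
    toy data only:)

        import bisect

        words = sorted(["IG", "AA", "RI", "FG", "MA"])   # keep the list sorted

        def contains(sorted_words, word):
            """Binary search: O(log n) membership test on a sorted list."""
            i = bisect.bisect_left(sorted_words, word)
            return i < len(sorted_words) and sorted_words[i] == word

        print(contains(words, "FG"))   # -> True
        print(contains(words, "ZZ"))   # -> False
        print(words)                   # still ordered, so it can be walked sequentially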

    Good luck,
    Mike

    ________________________________________________
    Mike C. Fletcher
    Designer, VR Plumber, Coder
    http://www.vrplumber.com
    http://blog.vrplumber.com
     
    Mike C. Fletcher, Jan 11, 2005
    #9
  10. Tim Peters wrote:
    > [Martin MOKREJŠ]
    >
    >>...
    >>
    >>I gave up the theoretical approach. Practically, I might need to store
    >>up to maybe those 1E15 keys.

    >
    >
    > We should work on our multiplication skills here <wink>. You don't
    > have enough disk space to store 1E15 keys. If your keys were just one
    > byte each, you would need to have 4 thousand disks of 250GB each to
    > store 1E15 keys. How much disk space do you actually have? I'm
    > betting you have no more than one 250GB disk.
    >
    > ...
    >
    > [Istvan Albert]
    >
    >>>On my system storing 1 million words of length 15
    >>>as keys of a python dictionary is around 75MB.

    >
    >
    >>Fine, that's what I wanted to hear. How do you improve the algorithm?
    >>Do you delay indexing to the very latest moment or do you let your
    >>computer index 999 999 times just for fun?

    >
    >
    > It remains wholly unclear to me what "the algorithm" you want might
    > be. As I mentioned before, if you store keys in sorted text files,
    > you can do intersection and difference very efficiently just by using
    > the Unix `comm` utility.


    This comm(1) approach doesn't work for me. It somehow fails to detect
    common entries when the offset is too big.

    file 1:

    A
    F
    G
    I
    K
    M
    N
    R
    V
    AA
    AI
    FG
    FR
    GF
    GI
    GR
    IG
    IK
    IN
    IV
    KI
    MA
    NG
    RA
    RI
    VF
    AIK
    FGR
    FRA
    GFG
    GIN
    GRI
    IGI
    IGR
    IKI
    ING
    IVF
    KIG
    MAI
    NGF
    RAA
    RIG


    file 2:

    W
    W
    W
    W
    W
    W
    W
    W
    W
    W
    AA
    AI
    FG
    FR
    GF
    GI
    GR
    IG
    IK
    IN
    IV
    KI
    MA
    NG
    RA
    RI
    VF
    AAAAA
    AAAAA
    AAAAA
    AAAAA
    AAAAA
    AAAAA
    AAAAA
    AAAAA
    AAAAA
    AAAAA
    AAAAA
    AAAAA
     
    Martin MOKREJŠ, Jan 14, 2005
    #10
  11. Martin MOKREJŠ

    Tim Peters Guest

    [Martin MOKREJŠ]
    > This comm(1) approach doesn't work for me. It somehow fails to
    > detect common entries when the offset is too big.
    >
    > file 1:
    >
    > A
    > F
    > G
    > I
    > K
    > M
    > N
    > R
    > V
    > AA
    > AI
    > FG
    > FR
    > GF
    > GI
    > GR
    > IG
    > IK
    > IN
    > IV
    > KI
    > MA
    > NG
    > RA
    > RI
    > VF
    > AIK
    > FGR
    > FRA
    > GFG
    > GIN
    > GRI
    > IGI
    > IGR
    > IKI
    > ING
    > IVF
    > KIG
    > MAI
    > NGF
    > RAA
    > RIG
    >
    > file 2:
    >
    > W
    > W
    > W
    > W
    > W
    > W
    > W
    > W
    > W
    > W
    > AA
    > AI
    > FG
    > FR
    > GF
    > GI
    > GR
    > IG
    > IK
    > IN
    > IV
    > KI
    > MA
    > NG
    > RA
    > RI
    > VF
    > AAAAA
    > AAAAA
    > AAAAA
    > AAAAA
    > AAAAA
    > AAAAA
    > AAAAA
    > AAAAA
    > AAAAA
    > AAAAA
    > AAAAA
    > AAAAA


    I'll repeat:

    >> As I mentioned before, if you store keys in sorted text files ...


    Those files aren't in sorted order, so of course `comm` can't do
    anything useful with them. Do `man sort`; sorting is not optional
    here.
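
    (A sketch of driving sort and comm from Python with the standard
    subprocess module; the file names are illustrative, and `comm -12`
    prints only the lines common to both inputs:)

        import subprocess

        # comm requires sorted input, so sort both word lists first.
        subprocess.check_call(['sort', '-o', 'file1.sorted', 'file1'])
        subprocess.check_call(['sort', '-o', 'file2.sorted', 'file2'])

        # -12 suppresses columns 1 and 2, leaving only the common lines.
        out = subprocess.check_output(['comm', '-12', 'file1.sorted', 'file2.sorted'])
        print(out.decode().split())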
     
    Tim Peters, Jan 14, 2005
    #11
  12. Tim Peters wrote:
    > [Martin MOKREJŠ]
    >
    >>This comm(1) approach doesn't work for me. It somehow fails to
    >>detect common entries when the offset is too big.


    [...]
    > I'll repeat:
    >
    >
    >>>As I mentioned before, if you store keys in sorted text files ...

    >
    >
    > Those files aren't in sorted order, so of course `comm` can't do
    > anything useful with them. Do `man sort`; sorting is not optional
    > here.


    I did read the manpage, but somehow it seems I did not execute sort(1)
    from within my Python code, so the input was unsorted and I did not realize
    it yet. Thanks!
    m.
     
    Martin MOKREJŠ, Jan 14, 2005
    #12
