Set like feature

Hari Pulapaka

Hi,

I have a list of space delimited strings ending in a newline.
Eg: a = ['a sfds sdf s df 34 ew\n', 'df sdf s f s ssf\n']

Now inside each row, I have a space delimited list of fields.

Now I want to compare the fields in each row of the array and see which
fields do not match.

Think of it as a 2-dimensional array of size m x n, and comparing
each element on a column-by-column basis.

I am using python2.2 so no sets. Can anyone think of an efficient way
to do this?

Thanks,

Hari
 
Mitja

If I understand the problem correctly, splitting the lines up and sorting
them before comparison _is_ much better than a naive approach, though I
don't know if that's what's best.
 
Alex Martelli

Hari Pulapaka said:
I have a list of space delimited strings ending in a newline.
Eg: a = ['a sfds sdf s df 34 ew\n', 'df sdf s f s ssf\n']

Now inside each row, I have a space delimited list of fields.

Now I want to compare the fields in each row of the array and see which
fields do not match.

Think of it as a 2-dimensional array of size m x n, and comparing
each element on a column-by-column basis.

I am using python2.2 so no sets. Can anyone think of an efficient way
to do this?

Do you want to compare corresponding fields? That's the only way I can
read that 'column by column basis', and thus I don't see what sets could
possibly have to do with it.

Do you want to compare each row with every other row? I also note in
your example that the number of fields in each row appears to be
variable, so how do you want to deal with 'missing' fields?

Too many unanswered questions, I guess. But for some specified set of
answers to those questions, you might do...:

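import sys

def compare_fields(i, j, base, other):
    # xrange(sys.maxint) supplies the column index k (2.2 has no enumerate);
    # zip stops at the shorter row, so extra trailing fields are ignored
    for k, f1, f2 in zip(xrange(sys.maxint), base, other):
        if f1 != f2:
            print 'DIFF', i, j, k, repr(f1), repr(f2)

def lots_of_compares(list_of_strings):
    # split each row into its whitespace-delimited fields
    list_of_lists_of_fields = [row.split() for row in list_of_strings]
    num_rows = len(list_of_lists_of_fields)
    for i in xrange(num_rows):
        base_row = list_of_lists_of_fields[i]
        # compare row i only with the rows after it, so each pair is seen once
        for j in xrange(i+1, num_rows):
            compare_fields(i, j, base_row, list_of_lists_of_fields[j])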

You can do better with enumerate, itertools and other things which 2.2
didn't have, but sets wouldn't help. Now, I hope this clarifies the
many unanswered questions which your 'specs' leave open, so you can work
out exactly what you want.
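
For readers on a current Python, the same pairwise, column-by-column comparison might look roughly like the sketch below. It assumes Python 3 (so enumerate and itertools.combinations are available), returns the differences instead of printing them, and, like the 2.2 version above, ignores trailing fields when one row is shorter. The name column_diffs is made up for illustration.

```python
from itertools import combinations

def column_diffs(lines):
    """Return (i, j, k, f1, f2) for every pair of rows i < j whose
    fields at column k differ. Extra trailing fields are ignored."""
    rows = [line.split() for line in lines]
    diffs = []
    # combinations(..., 2) yields each unordered pair of rows exactly once
    for (i, base), (j, other) in combinations(enumerate(rows), 2):
        # zip stops at the shorter of the two rows
        for k, (f1, f2) in enumerate(zip(base, other)):
            if f1 != f2:
                diffs.append((i, j, k, f1, f2))
    return diffs

a = ['a sfds sdf s df 34 ew\n', 'df sdf s f s ssf\n']
for diff in column_diffs(a):
    print('DIFF', *diff)
```

Returning a list rather than printing is only a design choice of this sketch; it makes the result easier to filter or count afterwards.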

And, btw: upgrade to 2.4. Sets or no sets, the performance enhancement
by itself will be vastly sufficient to repay whatever inconvenience you
think the upgrade might cause.


Alex
 
Hari Pulapaka

Alex said:
Do you want to compare corresponding fields? That's the only way I can
read that 'column by column basis', and thus I don't see what sets could
possibly have to do with it.

Do you want to compare each row with every other row? I also note in
your example that the number of fields in each row appears to be
variable, so how do you want to deal with 'missing' fields?

I want to compare every element in each row with the element in the
remaining rows having the same column position. The rows need not have
the same number of elements, in which case I have to do some more
thinking :)
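
One possible way to handle rows of unequal length, sketched here for a modern Python 3 rather than 2.2, is to pad the shorter row with a placeholder so that a missing field counts as a mismatch. The helper name diff_columns and the MISSING sentinel are assumptions for illustration, not anything proposed in this thread.

```python
from itertools import zip_longest

MISSING = object()  # hypothetical sentinel marking an absent field

def diff_columns(row1, row2):
    """Return the column indices where two rows of fields differ.
    A column present in only one row is reported as a difference."""
    pairs = zip_longest(row1, row2, fillvalue=MISSING)
    return [k for k, (f1, f2) in enumerate(pairs) if f1 != f2]

print(diff_columns('a b c d'.split(), 'a x c'.split()))
```

Whether a missing field should count as a difference at all is one of the open questions in the thread; ignoring it instead would just mean using plain zip.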

I was thinking of making each row of the array as a set and then
comparing each row of the array with the compare function being the set
intersection operation.

You have pretty much captured what I was thinking, and my solution is
also similar to what you showed.

Thanks for your help.

Alex said:
And, btw: upgrade to 2.4. Sets or no sets, the performance enhancement
by itself will be vastly sufficient to repay whatever inconvenience you
think the upgrade might cause.

Not in my hands.

- Hari
 
Alex Martelli

Hari Pulapaka said:
I want to compare every element in each row with the element in the
remaining rows having the same column position. The rows need not have
the same number of elements, in which case I have to do some more
thinking :)

I was thinking of making each row of the array as a set and then
comparing each row of the array with the compare function being the set
intersection operation.

Sets have no order, so that just wouldn't work the way you state it.
Rows 'a b' and 'b a' would appear identical, so the "having the same
column position" condition would not be respected.

You could maybe use a set(enumerate(therow.split())) -- but intersecting
such sets would be of dubious utility. Maybe you mean symmetric
difference (union minus intersection), but even then you'd still have to
proceed in order to investigate which item of that difference comes from
which of the two rows (assuming you do care -- hard to tell from here).
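
To make the idea concrete, here is a small sketch for a Python with built-in sets (written for Python 3). It shows that plain sets cannot tell 'a b c' from 'b a c', whereas sets of (position, field) pairs can, and that the symmetric difference still has to be sorted and matched back to a row afterwards. positional_set is a hypothetical helper name.

```python
def positional_set(row):
    """Turn a space-delimited row into a set of (column, field) pairs."""
    return set(enumerate(row.split()))

r1, r2 = 'a b c', 'b a c'

# Plain sets ignore order, so these two rows look identical.
print(set(r1.split()) == set(r2.split()))

# Position-tagged sets keep the column information.
s1, s2 = positional_set(r1), positional_set(r2)
diff = s1 ^ s2  # symmetric difference: pairs found in exactly one row

# Sort by column and work out which row each pair came from.
for k, field in sorted(diff):
    origin = 'row 1' if (k, field) in s1 else 'row 2'
    print(k, repr(field), origin)
```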

I believe gadfly comes with a fast C-coded extension called kjbuckets
which might help with this kind of thing (and a Python-coded
'fallback', not all that fast but easily portable, too). You might want
to investigate that, if there's a chance you could get C-coded
extensions installed on your Python 2.2 installation.


Alex
 
Alex Martelli

Mitja said:
If I understand the problem correctly, splitting the lines up and sorting
them before comparison _is_ much better than a naive approach, though I
don't know if that's what's best.

Splitting, sure. Sorting would destroy the 'column by column basis'.


Alex
 
Mitja

Alex Martelli said:
Splitting, sure. Sorting would destroy the 'column by column basis'.

I wasn't sure what the OP really wanted; I saw both the "column by column"
thing and the bit about sets, which contradict each other, so I assumed he
was after set-like behavior. [wrongly, as later posts clarified]
 
