prefix search on a large file

J

js

Hello, list.

I have a list of sentence in text files that I use to filter-out some data.
I managed the list so badly that now it's become literally a mess.

Let's say the list has a sentence below

1. "Python has been an important part of Google since the beginning,
and remains so as the system grows and evolves. "

2. "Python has been an important part of Google"

3. "important part of Google"

As you see sentence 2 is a subset of sentence 1
so I don't need to have sentence 1 on the list.
(For some reason, it's no problem to have sentence 3.
Only sentence that has the "same prefix part" is the one I want to remove)

So I decided to clean up the list.

I tried to do this simple brute-force manner, like

---------------------------------------------------------------------------
sorted_list = sorted(file('thelist'), key=len)
for line in sorted_list[:]
unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
sorted_list = list(set(sorted_list) - (unneeded))
.....
---------------------------------------------------------------------------

This is so slow and not so helpful because the list is
so big(more than 100M bytes and has about 3 million lines)
and I have more than 100 lists.

I'm not familiar with algorithms/data structure and large-scale data processing,
so any advice, suggestions and recommendations will be appreciated.

Thank you in advance.
 
J

John Machin

js said:
Hello, list.

I have a list of sentence in text files that I use to filter-out some data.
I managed the list so badly that now it's become literally a mess.

Let's say the list has a sentence below

1. "Python has been an important part of Google since the beginning,
and remains so as the system grows and evolves. "

2. "Python has been an important part of Google"

3. "important part of Google"

As you see sentence 2 is a subset of sentence 1
so I don't need to have sentence 1 on the list.

Are you 100% sure that you wnat to throw away the *longer* sentence?
(For some reason, it's no problem to have sentence 3.
Only sentence that has the "same prefix part" is the one I want to remove)

So I decided to clean up the list.

I tried to do this simple brute-force manner, like

Please don't waste time and bandwith with "like"; post the exact code
that you executed.
# won't compile, missing a colon at end
unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
sorted_list = list(set(sorted_list) - (unneeded))
# can't do set - list
# Presuming you mean - set(unneeded), but then produces an empty list
(see later).
# Naming the output "sorted_list" is misleading advertising.
# With sorted_list[:] in an inner loop, it's no wonder that it runs
slowly.
# Test your code on small samples of data and ensure that it is working
correctly before you run it on huge amounts of data.
# Be careful of end cases; note that in my solutions below, the first
or last item needs special handling.
....
---------------------------------------------------------------------------

This is so slow and not so helpful because the list is
so big(more than 100M bytes and has about 3 million lines)
and I have more than 100 lists.

Here's one way of doing it, tested to the extent shown:

C:\junk>type prefixdel.py
from pprint import pprint as pp
data = [
'foo bar baz',
'foo bar',
'foo',
'food',
'food', # duplicate
'xyzzy',
'plugh',
'xyzzy and plugh are magic',
'zzzz',
]

"""
sorted_list = sorted(file('thelist'), key=len)
for line in sorted_list[:]
unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
sorted_list = list(set(sorted_list) - set(unneeded))
"""
slist = sorted(data, key=len)
print "OP trial 1 input"; pp(slist)
for line in slist[:]:
unneeded = [line2 for line2 in slist[:] if line2.startswith(line)]
slist = list(set(slist) - set(unneeded))
print "OP trial 1 output"; pp(slist)

ilist = sorted(data)
print "sorted input"; pp(ilist)

olist = []
for i in xrange(len(ilist)-1, 0, -1):
if ilist.startswith(ilist[i-1]):
continue
olist.append(ilist)
olist.append(ilist[0])
print "output (keeping shorter)"; pp(olist)

olist = []
for i in xrange(len(ilist)-1):
if ilist[i+1].startswith(ilist):
continue
olist.append(ilist)
olist.append(ilist[-1])
print "output (keeping longer)"; pp(olist)


C:\junk>prefixdel.py
OP trial 1 input
['foo',
'food',
'food',
'zzzz',
'xyzzy',
'plugh',
'foo bar',
'foo bar baz',
'xyzzy and plugh are magic']
OP trial 1 output
[]
sorted input
['foo',
'foo bar',
'foo bar baz',
'food',
'food',
'plugh',
'xyzzy',
'xyzzy and plugh are magic',
'zzzz']
output (keeping shorter)
['zzzz', 'xyzzy', 'plugh', 'food', 'foo']
output (keeping longer)
['foo bar baz', 'food', 'plugh', 'xyzzy and plugh are magic', 'zzzz']

HTH,
John
 
J

js

Thank you for the quick reply.

Here're the exact code I executed. (including your code)

#!/usr/bin/env python
from pprint import pprint as pp

data = [
'foo bar baz',
'foo bar',
'foo',
'food',
'food', # duplicate
'xyzzy',
'plugh',
'xyzzy and plugh are magic',
'zzzz',
]

data_sorted_by_len = sorted(data, key=len)
data_sorted_by_asc = sorted(data)
print "OP trial 1 input"; pp(data)

def prefixdel_stupidly(alist):
for line in alist[:]:
unneeded = [no for no, line2 in enumerate(alist[1:]) if
line2.startswith(line) and line != line2]
adjust=1
for i in unneeded:
del alist[i+adjust]
adjust -= 1
return alist

def prefixdel_recursively(alist):
if len(alist) < 2:
return alist

unneeded = [no for no, line in enumerate(alist[1:]) if
line.startswith(alist[0])]
adjust=1
for i in unneeded:
del alist[i+adjust]
adjust -= 1

return [alist[0]] + prefixdel_recursively(alist[1:])

def prefixdel_by_john(alist):
olist = []
for i in xrange(len(alist)-1, 0, -1):
if alist.startswith(alist[i-1]):
continue
olist.append(alist)
olist.append(alist[0])
return olist

if __name__ == '__main__':
from timeit import Timer
print sorted(prefixdel_stupidly(data_sorted_by_len[:]))
print sorted(prefixdel_recursively(data_sorted_by_len[:]))
print sorted(prefixdel_by_john(data_sorted_by_asc[:]))

t = Timer("prefixdel_stupidly(data_sorted_by_len)", "from __main__
import prefixdel_stupidly, data_sorted_by_len")
print t.timeit(number=100000)
t = Timer("prefixdel_recursively(data_sorted_by_len)", "from
__main__ import prefixdel_recursively, data_sorted_by_len")
print t.timeit(number=100000)
t = Timer("prefixdel_by_john(data_sorted_by_asc)", "from __main__
import prefixdel_by_john, data_sorted_by_asc")
print t.timeit(number=100000)

The output is the following:

$ python2.5 myprefixdel.py
OP trial 1 input
['foo bar baz',
'foo bar',
'foo',
'food',
'food',
'xyzzy',
'plugh',
'xyzzy and plugh are magic',
'zzzz']
['foo', 'plugh', 'xyzzy', 'zzzz']
['foo', 'plugh', 'xyzzy', 'zzzz']
['foo', 'food', 'plugh', 'xyzzy', 'zzzz']
1.17837095261
1.21182584763
0.737352132797

Your code is much faster than ones I wrote
but the result is a little bit different.('food' shoud be removed
because 'foo' is in the list)

As you pointed out, list[:] must be a big evil but without duplicating the list,
the problem become much harder to solve to me.

Any practical solution, anyone?
 
J

js

By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.

def prefixdel_recursively2(alist):
if len(alist) < 2:
return alist

first = alist.pop(0)
unneeded = [no for no, line in enumerate(alist) if line.startswith(first)]
adjust=0
for i in unneeded:
del alist[i+adjust]
adjust -= 1

return [first] + prefixdel_recursively(alist)


process stime
prefixdel_stupidly : 11.9247150421
prefixdel_recursively : 14.6975700855
prefixdel_recursively2 : 0.408113956451
prefixdel_by_john : 7.60227012634
 
P

Peter Otten

js said:
By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.

def prefixdel_recursively2(alist):
if len(alist) < 2:
return alist

first = alist.pop(0)
unneeded = [no for no, line in enumerate(alist) if
line.startswith(first)] adjust=0
for i in unneeded:
del alist[i+adjust]
adjust -= 1

return [first] + prefixdel_recursively(alist)


process stime
prefixdel_stupidly : 11.9247150421
prefixdel_recursively : 14.6975700855
prefixdel_recursively2 : 0.408113956451
prefixdel_by_john : 7.60227012634

Those are suspicious results. Time it again with number=1, or a fresh copy
of the data for every iteration.

I also have my doubts whether sorting by length is a good idea. To take it
to the extreme: what if your data file contains an empty line?

Peter
 
S

Scott David Daniels

Try:

def leaders(sorted_list):
held = None
for phrase in held:
if held is None or not phrase.startswith(held):
held = row
yield held

print list(leaders(sorted(data)))

--Scott David Daniels
(e-mail address removed)
 
T

Tim Chase

def leaders(sorted_list):
held = None
for phrase in held:

Ummm...that

for phrase in None:

doesn't do a whole lot other than give a traceback. :*)

I suspect you mean

for phrase in sorted_list:

no?

-tkc
 
S

Scott David Daniels

Tim said:
... I suspect you mean
for phrase in sorted_list:
no?

Ooops -- renaming code before posting should get its own test.
You are absolutely correct.


--Scott David Daniels
(e-mail address removed)
 
J

js

I did the test the way you suggested. It took not so long to realize
stupid mistakes I made. Thank you.

The following is the result of test with timeit(number=10000)
using fresh copy of the list for every iteration

0.331462860107
0.19401717186
0.186257839203
0.0762069225311

I tried my recursive-function to fix up my big-messed-list.
It stops immediately because of 'RuntimeError: maximum recursion limit exceeded'

I hope this trial-and-errors getting me good place...

anyway, thank you.

js said:
By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.

def prefixdel_recursively2(alist):
if len(alist) < 2:
return alist

first = alist.pop(0)
unneeded = [no for no, line in enumerate(alist) if
line.startswith(first)] adjust=0
for i in unneeded:
del alist[i+adjust]
adjust -= 1

return [first] + prefixdel_recursively(alist)


process stime
prefixdel_stupidly : 11.9247150421
prefixdel_recursively : 14.6975700855
prefixdel_recursively2 : 0.408113956451
prefixdel_by_john : 7.60227012634

Those are suspicious results. Time it again with number=1, or a fresh copy
of the data for every iteration.

I also have my doubts whether sorting by length is a good idea. To take it
to the extreme: what if your data file contains an empty line?

Peter
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,776
Messages
2,569,602
Members
45,184
Latest member
ZNOChrista

Latest Threads

Top