Python List – Efficiently Check if an Item is in a List

listpython

I have a list of strings (words like), and, while I am parsing a text, I need to check if a word belongs to the group of words of my current list.

However, my input is pretty big (about 600 millions lines), and checking if an element belongs to a list is a O(n) operation according to the Python documentation.

My code is something like:

words_in_line = []
for word in line:
    if word in my_list:
        words_in_line.append(word)

As it takes too much time (days actually), I wanted to improve that part which is taking most of the time. I have a look at Python collections, and, more precisely, at deque. However, the only give a O(1) operation time access to the head and the tail of a list, not in the middle.

Do someone has an idea about how to do that in a better way?

Best Answer

You might consider a trie or a DAWG or a database. There are several Python implementations of the same.

Here is some relative timings for you to consider of a set vs a list:

import timeit
import random

with open('/usr/share/dict/words','r') as di:  # UNIX 250k unique word list 
    all_words_set={line.strip() for line in di}

all_words_list=list(all_words_set)    # slightly faster if this list is sorted...      

test_list=[random.choice(all_words_list) for i in range(10000)] 
test_set=set(test_list)

def set_f():
    count = 0
    for word in test_set:
        if word in all_words_set: 
           count+=1
    return count

def list_f():
    count = 0
    for word in test_list:
        if word in all_words_list: 
           count+=1
    return count

def mix_f():
    # use list for source, set for membership testing
    count = 0
    for word in test_list:
        if word in all_words_set: 
           count+=1
    return count    

print "list:", timeit.Timer(list_f).timeit(1),"secs"
print "set:", timeit.Timer(set_f).timeit(1),"secs" 
print "mixed:", timeit.Timer(mix_f).timeit(1),"secs"

Prints:

list: 47.4126560688 secs
set: 0.00277495384216 secs
mixed: 0.00166988372803 secs

ie, matching a set of 10000 words against a set of 250,000 words is 17,085 X faster than matching a list of same 10000 words in a list of the same 250,000 words. Using a list for the source and a set for membership testing is 28,392 X faster than an unsorted list alone.

For membership testing, a list is O(n) and sets and dicts are O(1) for lookups.

Conclusion: Use better data structures for 600 million lines of text!

Best Answer

Related Solutions

Related Question