[Python] Search value in list of dictionary benchmark
This is a test how fast we can search value in a very large list
of dict
by various way
Test code:
# test python search performance in a large list of dict
# ref:
# - https://stackoverflow.com/questions/8653516/python-list-of-dictionaries-search
# - https://stackoverflow.com/a/72721642/2533787
from random import randint
import timeit
from ducks import Dex
duct_index = None
def create_random_data(length=1000):
data = []
for i in range(1, length+1):
data.append({
"index": i,
"value": randint(1000, 999999999),
"value2": randint(1000, 999999999),
"value3": randint(1000, 999999999),
"value4": randint(1000, 999999999),
})
return data
def search_basic(search_value: int, search_list: list):
# normal iteration over all elements
for item in search_list:
if item["value"] == search_value:
return True
return False
def search_generator(search_value: int, search_list: list):
# use generator
for dict_ in (x for x in search_list if x["value"] == search_value):
return True
return False
def search_list(search_value: int, search_list: list):
# use list
for dict_ in [x for x in search_list if x["value"] == search_value]:
return True
return False
def search_filter(search_value: int, search_list: list):
# use filter
for dict_ in filter(lambda x: x["value"] == search_value, search_list):
return True
return False
def search_next(search_value: int, search_list: list):
return next((item for item in search_list if item["value"] == search_value), None) is not None
def search_duck(search_value: int, search_list: list):
# required install external library
global duct_index
if not duct_index:
# build index
duct_index = Dex(search_list, {"value": int})
return len(duct_index[{"value": search_value}]) > 0
if __name__ == "__main__":
data_length = 10000000
print("=" * 80)
print(f"- Data length: {data_length}")
t0 = timeit.default_timer()
search_data = create_random_data(data_length)
print("- Generate data done..., time diff: ", timeit.default_timer() - t0)
to_search = search_data[randint(0, data_length)]["value"]
print(f"- Search value: {to_search}")
t0 = timeit.default_timer()
r1 = search_basic(to_search, search_data)
t1 = timeit.default_timer()
print(f"- search_basic() done, result: {r1}, time diff: ", t1 - t0)
r2 = search_generator(to_search, search_data)
t2 = timeit.default_timer()
print(f"- search_generator() done, result: {r2}, time diff: ", t2 - t1)
r3 = search_list(to_search, search_data)
t3 = timeit.default_timer()
print(f"- search_list() done, result: {r3}, time diff: ", t3 - t2)
r4 = search_filter(to_search, search_data)
t4 = timeit.default_timer()
print(f"- search_filter() done, result: {r4}, time diff: ", t4 - t3)
r5 = search_next(to_search, search_data)
t5 = timeit.default_timer()
print(f"- search_next() done, result: {r5}, time diff: ", t5 - t4)
r6 = search_duck(to_search, search_data) # first time will required create index
t6 = timeit.default_timer()
print(f"- search_duck() 1st done, result: {r6}, time diff: ", t6 - t5)
r7 = search_duck(to_search, search_data) # very fast after index is done
t7 = timeit.default_timer()
print(f"- search_duck() 2nd done, result: {r7}, time diff: ", t7 - t6)
Test result (Windows 10, CPU AMD 5600G, Python 3.9.7):
Output 1:
- Data length: 10000000
- Generate data done..., time diff: 17.4976254
- Search value: 68645981
- search_basic() done, result: True, time diff: 0.25917069999999853
- search_generator() done, result: True, time diff: 0.25991790000000137
- search_list() done, result: True, time diff: 0.3885101999999989
- search_filter() done, result: True, time diff: 0.4272285999999994
- search_next() done, result: True, time diff: 0.2582653000000015
- search_duck() 1st done, result: True, time diff: 42.6337855
- search_duck() 2nd done, result: True, time diff: 0.0004143999999968173
Output 2:
- Data length: 10000000
- Generate data done..., time diff: 18.0281099
- Search value: 670718026
- search_basic() done, result: True, time diff: 0.1303572000000024
- search_generator() done, result: True, time diff: 0.12219189999999713
- search_list() done, result: True, time diff: 0.3952600000000004
- search_filter() done, result: True, time diff: 0.20178010000000057
- search_next() done, result: True, time diff: 0.1236163000000019
- search_duck() 1st done, result: True, time diff: 43.3433916
- search_duck() 2nd done, result: True, time diff: 0.0004522999999991839
Conclusion:
- Using
next()
or normal iteration in case we don't need repeat the search on same data set - Using external library like
ducks
will give beneficial if we need repeat the search many times on same data set