## [Python] Search value in list of dictionary benchmark

This is a test how fast we can search value in a very large `list` of `dict` by various way

Test code:

``````# test python search performance in a large list of dict
# ref:
#   - https://stackoverflow.com/questions/8653516/python-list-of-dictionaries-search
#   - https://stackoverflow.com/a/72721642/2533787

from random import randint
import timeit
from ducks import Dex

duct_index = None

def create_random_data(length=1000):
data = []
for i in range(1, length+1):
data.append({
"index": i,
"value": randint(1000, 999999999),
"value2": randint(1000, 999999999),
"value3": randint(1000, 999999999),
"value4": randint(1000, 999999999),
})
return data

def search_basic(search_value: int, search_list: list):
# normal iteration over all elements
for item in search_list:
if item["value"] == search_value:
return True
return False

def search_generator(search_value: int, search_list: list):
# use generator
for dict_ in (x for x in search_list if x["value"] == search_value):
return True
return False

def search_list(search_value: int, search_list: list):
# use list
for dict_ in [x for x in search_list if x["value"] == search_value]:
return True
return False

def search_filter(search_value: int, search_list: list):
# use filter
for dict_ in filter(lambda x: x["value"] == search_value, search_list):
return True
return False

def search_next(search_value: int, search_list: list):
return next((item for item in search_list if item["value"] == search_value), None) is not None

def search_duck(search_value: int, search_list: list):
# required install external library
global duct_index
if not duct_index:
# build index
duct_index = Dex(search_list, {"value": int})
return len(duct_index[{"value": search_value}]) > 0

if __name__ == "__main__":
data_length = 10000000
print("=" * 80)
print(f"- Data length: {data_length}")
t0 = timeit.default_timer()
search_data = create_random_data(data_length)
print("- Generate data done..., time diff: ", timeit.default_timer() - t0)
to_search = search_data[randint(0, data_length)]["value"]
print(f"- Search value: {to_search}")

t0 = timeit.default_timer()
r1 = search_basic(to_search, search_data)
t1 = timeit.default_timer()
print(f"- search_basic() done, result: {r1}, time diff: ", t1 - t0)

r2 = search_generator(to_search, search_data)
t2 = timeit.default_timer()
print(f"- search_generator() done, result: {r2}, time diff: ", t2 - t1)

r3 = search_list(to_search, search_data)
t3 = timeit.default_timer()
print(f"- search_list() done, result: {r3}, time diff: ", t3 - t2)

r4 = search_filter(to_search, search_data)
t4 = timeit.default_timer()
print(f"- search_filter() done, result: {r4}, time diff: ", t4 - t3)

r5 = search_next(to_search, search_data)
t5 = timeit.default_timer()
print(f"- search_next() done, result: {r5}, time diff: ", t5 - t4)

r6 = search_duck(to_search, search_data)    # first time will required create index
t6 = timeit.default_timer()
print(f"- search_duck() 1st done, result: {r6}, time diff: ", t6 - t5)

r7 = search_duck(to_search, search_data)    # very fast after index is done
t7 = timeit.default_timer()
print(f"- search_duck() 2nd done, result: {r7}, time diff: ", t7 - t6)
``````

Test result (Windows 10, CPU AMD 5600G, Python 3.9.7):

Output 1:

``````- Data length: 10000000
- Generate data done..., time diff:  17.4976254
- Search value: 68645981
- search_basic() done, result: True, time diff:  0.25917069999999853
- search_generator() done, result: True, time diff:  0.25991790000000137
- search_list() done, result: True, time diff:  0.3885101999999989
- search_filter() done, result: True, time diff:  0.4272285999999994
- search_next() done, result: True, time diff:  0.2582653000000015
- search_duck() 1st done, result: True, time diff:  42.6337855
- search_duck() 2nd done, result: True, time diff:  0.0004143999999968173
``````

Output 2:

``````- Data length: 10000000
- Generate data done..., time diff:  18.0281099
- Search value: 670718026
- search_basic() done, result: True, time diff:  0.1303572000000024
- search_generator() done, result: True, time diff:  0.12219189999999713
- search_list() done, result: True, time diff:  0.3952600000000004
- search_filter() done, result: True, time diff:  0.20178010000000057
- search_next() done, result: True, time diff:  0.1236163000000019
- search_duck() 1st done, result: True, time diff:  43.3433916
- search_duck() 2nd done, result: True, time diff:  0.0004522999999991839
``````

Conclusion:

• Using `next()` or normal iteration in case we don't need repeat the search on same data set
• Using external library like `ducks` will give beneficial if we need repeat the search many times on same data set
Tags:
#python