I frequently find myself working with large lists where I need to apply the same time-consuming function to each element in the list without concern for the order that these calculations are made. I’ve written a small class using Python’s multiprocessing module to help speed things up.
It will accept a list, break it up into a list of lists the size of the number of processes you want to run in parallel, and then process each of the sublists as a separate process. Finally, it will return a list containing all the results.
def __init__(self, num_processes=4):
self.num_processes = num_processes
def split_list(self, data_list):
list_of_lists = 
for i in range(0, len(data_list), self.num_processes):
def map_reduce(self, function, data_list):
split_data = self.split_list(data_list)
processes = multiprocessing.Pool(processes=self.num_processes)
results_list_of_lists = processes.map(function, split_data)
results_list = [item for sublist in results_list_of_lists for item in sublist]
To demonstrate how this class works, I’ll create a list of 20 integers from 0-19. I’ve also created a function that will square every number in a list. When I run it, I’ll pass the function (job) and the list (data). The class will then break this into a list of lists and then run the function as a separate process on each of the sublists.
return [i*i for i in num_list]
data = range(20)
p = ProcessHelper(4)
result = p.map_reduce(job, data)
So if my data originally was a list that looked like this:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
When I split it into sublists, I’ll end up with a list of 4 lists (as I’ve indicated that I want to initialise 4 processes):
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19]]
Finally, the result will give me the list of squared values that looks like this:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]
I’ll continue to build this class as I identify other handy helper methods that I could add.