Computational biology PhD researcher. Interested in science, software development, and machine learning. I write about medical research at BioSky.co and contribute content to a variety of additional publications.
I frequently find myself working with large lists where I need to apply the same time-consuming function to each element, without caring about the order in which the calculations are made. I’ve written a small class using Python’s multiprocessing module to help speed things up.
It will accept a list, break it up into sublists whose length equals the number of processes you want to run in parallel, and then hand the sublists to a pool of worker processes. Finally, it will return a single flat list containing all the results, in the same order as the input.
    import multiprocessing


    class ProcessHelper:
        def __init__(self, num_processes=4):
            self.num_processes = num_processes

        def split_list(self, data_list):
            # Slice the input into sublists of length num_processes
            # (the last sublist may be shorter).
            list_of_lists = []
            for i in range(0, len(data_list), self.num_processes):
                list_of_lists.append(data_list[i:i+self.num_processes])
            return list_of_lists

        def map_reduce(self, function, data_list):
            split_data = self.split_list(data_list)
            # Pool.map returns the results in the same order as split_data.
            processes = multiprocessing.Pool(processes=self.num_processes)
            results_list_of_lists = processes.map(function, split_data)
            processes.close()
            processes.join()
            # Flatten the list of result lists into a single list.
            results_list = [item for sublist in results_list_of_lists for item in sublist]
            return results_list
To demonstrate how this class works, I’ll create a list of 20 integers from 0-19. I’ve also created a function that will square every number in a list. When I run it, I’ll pass the function (job) and the list (data). The class will then break the list into sublists and run the function on each sublist in a pool of worker processes.
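    def job(num_list):
        # Square every number in the sublist.
        return [i*i for i in num_list]


    if __name__ == "__main__":
        # The guard is needed on platforms that start workers by spawning
        # a fresh interpreter (for example Windows).
        data = list(range(20))
        p = ProcessHelper(4)
        result = p.map_reduce(job, data)
        print(result)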
So if my data originally was a list that looked like this:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
When I split it into sublists, I’ll end up with a list of 5 lists of 4 elements each (I’ve indicated that I want to initialise 4 processes, and the sublist length matches that number):
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19]]
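The split step can be checked on its own, without starting any processes. The following is a minimal sketch: `split_by_size` is assumed to mirror the slicing used in `split_list`, while `split_into_n_chunks` is a hypothetical variant (not part of the class) that produces exactly as many sublists as there are processes.

```python
def split_by_size(data_list, chunk_size):
    """Slice data_list into consecutive sublists of at most chunk_size items."""
    return [data_list[i:i + chunk_size]
            for i in range(0, len(data_list), chunk_size)]


def split_into_n_chunks(data_list, n_chunks):
    """Hypothetical variant: return exactly n_chunks sublists of near-equal length."""
    base, extra = divmod(len(data_list), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + base + (1 if i < extra else 0)
        chunks.append(data_list[start:end])
        start = end
    return chunks


data = list(range(20))
by_size = split_by_size(data, 4)         # sublists of length 4
by_count = split_into_n_chunks(data, 4)  # exactly 4 sublists
print(len(by_size), by_size[0])
print(len(by_count), by_count[0])
```

With 20 items and a value of 4, the first approach gives 5 sublists of 4 items and the second gives 4 sublists of 5 items. Which one fits better probably depends on how evenly the work should be spread over the workers.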
Finally, the result will give me the list of squared values that looks like this:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]
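One way to sanity-check a parallel result like this is to run the same split, map and flatten steps serially and compare. This is only a sketch: `serial_map_reduce` is a hypothetical helper, with the built-in `map` standing in for `Pool.map`, and the default chunk size of 4 is an assumption taken from the example above.

```python
def job(num_list):
    # Square every number in the sublist.
    return [i * i for i in num_list]


def serial_map_reduce(function, data_list, chunk_size=4):
    """Hypothetical serial stand-in: split, apply, and flatten in one process."""
    chunks = [data_list[i:i + chunk_size]
              for i in range(0, len(data_list), chunk_size)]
    results = map(function, chunks)  # built-in map instead of Pool.map
    return [item for sublist in results for item in sublist]


expected = serial_map_reduce(job, list(range(20)))
print(expected == [i * i for i in range(20)])
```

Because `Pool.map` returns results in input order, the parallel output should match this serial output element for element.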
I’ll continue to build this class as I identify other handy helper methods that I could add.
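For comparison, `multiprocessing.Pool.map` can also do the chunking itself through its `chunksize` argument, so the worker function only has to handle a single element. The sketch below is an alternative to the class above, not a description of it; the `square` function and the chunk size of 5 are assumptions chosen for illustration.

```python
import multiprocessing


def square(n):
    # Per-element function: no list handling needed inside the worker.
    return n * n


if __name__ == "__main__":
    data = list(range(20))
    # The pool splits `data` into chunks of 5 items and sends each chunk
    # to a worker; results come back in input order.
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(square, data, chunksize=5)
    print(results)
```

The `with` block tears the pool down on exit, which is safe here because `map` has already returned all results by then.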