Data scientist at Port Jackson Partners in Sydney, Australia. My PhD was in computational biology. In my spare time I write about medical research at BioSky.co.
Before today, the only real use I’d had for regular expressions in Python was finding the first instance of a pattern. For example, if I wanted the text between the first pair of single quotation marks (in this case ‘26245730’), I would proceed like so:
import re

all_id = "'26245730': 817, '389595538': 735, '541129065': 529, '541129071': 340, '558870185': 305, '444325280': 287, '573974252': 272, '281314044': 222"

# Non-greedy match of the text between the first pair of single quotes
first_id = re.search("'(.*?)'", all_id)
print(first_id.group(1))
The arguments passed to re.search define the pattern I am looking for. The single quotation marks on either side of the parentheses mean the match must start and end with a quote, and the parentheses themselves form a capture group, which is what group(1) returns. The “.” inside the parentheses tells Python that I am happy with any character (letter, digit, punctuation, anything except a newline), and the “*” next to it means zero or more of those characters. Finally, the “?” ensures that the expression isn’t greedy. What does it mean to be greedy with a regular expression? It means that instead of stopping at the second single quotation mark, the match runs all the way from the first quotation mark to the last one! So I’d end up with practically all of my string being returned!
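To see the difference between greedy and non-greedy matching side by side, here is a minimal sketch. It uses a shortened copy of the string above, and the variable names lazy and greedy are just my own labels for illustration:

```python
import re

# Shortened version of the string from the example above
all_id = "'26245730': 817, '389595538': 735, '541129065': 529"

# Non-greedy: ".*?" stops at the earliest closing quote
lazy = re.search("'(.*?)'", all_id).group(1)

# Greedy: ".*" runs on to the last quote in the string
greedy = re.search("'(.*)'", all_id).group(1)

print(lazy)    # 26245730
print(greedy)  # 26245730': 817, '389595538': 735, '541129065
```

The only change between the two patterns is the “?”, yet the greedy version swallows everything from the first quotation mark to the last.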
Now, what if I wanted to find every single number between single quotation marks and output it as a list?
import re

all_id = "'26245730': 817, '389595538': 735, '541129065': 529, '541129071': 340, '558870185': 305, '444325280': 287, '573974252': 272, '281314044': 222"

# findall returns every non-overlapping match as a list of strings
unique_id = re.findall("'(.*?)'", all_id)
print(unique_id)
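You’ll notice how similar it is to the code example I showed before, with the exception of “findall” being used rather than “search”. Because the pattern contains a single capture group, findall returns a list holding the contents of that group for every match, so there is no need to call group(1).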
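Since the goal here is specifically numbers, a stricter pattern might be worth considering. The sketch below is one possible variation, not the only way to do it: it swaps “.*?” for “\d+” so only digits are captured, and it assumes the number after each colon is a value worth keeping alongside the ID. The names pairs and counts are hypothetical.

```python
import re

# Shortened version of the string from the example above
all_id = "'26245730': 817, '389595538': 735, '541129065': 529"

# \d+ only matches runs of digits, so non-numeric quoted text is skipped
unique_id = re.findall(r"'(\d+)'", all_id)
print(unique_id)  # ['26245730', '389595538', '541129065']

# With two capture groups, findall returns one tuple per match
pairs = re.findall(r"'(\d+)': (\d+)", all_id)

# Assumption: the number after the colon belongs to the quoted ID before it
counts = {key: int(value) for key, value in pairs}
print(counts["26245730"])  # 817
```

Note that findall still returns strings, which is why the values are passed through int() before being stored.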