NameChange

Genomics, Evolution and Medicine

NameChange.py

I have traditionally used this program to change the species codes in a Nexus-formatted tree file. The Python code is flexible enough to change sequences, IDs, or names in many of the standard genomics/phylogenetics file formats.

 

You will need a key_file: a tab-delimited file whose first column contains the names/sequences that appear in your infile and whose second column contains the names/sequences you would like to substitute in your outfile. Matching is case-insensitive and applies only to whole words made of letters, digits, and underscores.

 

Example key_file

 

Canis_familiaris	Dog
Rattus_norvegicus	Rat
Gorilla_gorilla	Gorilla
Homo_sapiens	Human
Mus_musculus	Mouse
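As a minimal sketch of how a key file like the one above can be turned into a lookup table, the snippet below parses a few example lines from an in-memory string instead of a file. The function name parse_key_lines is hypothetical and not part of NameChange.py. It mirrors the script's approach of splitting each line on the first run of whitespace and lower-casing the key, and it additionally skips blank lines, which is an assumption about the desired behaviour.

```python
def parse_key_lines(lines):
    """Build {old_name_lowercased: new_name} from key-file lines.

    Hypothetical helper for illustration; not part of NameChange.py.
    """
    mapping = {}
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # assumption: blank lines are ignored
        # Split on the first run of whitespace (a tab or spaces).
        old, new = stripped.split(None, 1)
        mapping[old.lower()] = new
    return mapping


example = "Canis_familiaris\tDog\n\nHomo_sapiens\tHuman\nMus_musculus\tMouse\n"
key_map = parse_key_lines(example.splitlines())
print(key_map["homo_sapiens"])  # Human
```

Because only the first run of whitespace is used as the separator, a replacement name in the second column may itself contain spaces.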

To use this program, copy and paste the code below into a text editor and save the file as NameChange.py.

 

To run this program on a file of interest, type:

> python NameChange.py key_file infile outfile

import re, sys

print('To use: python NameChange.py key_file infile outfile')

# Compiled regular expression: matches each run of word characters
# (letters, digits, underscores) bounded by non-word characters.
oldname = re.compile(r'\b\w+\b')

# Maps each lower-cased old name to its replacement.
d = {}

def repl_func(mo):
    # ms is the matched string.
    ms = mo.group(0)
    # Look the match up in lower case so that matching is case-insensitive.
    # Return the replacement, or the original text if there is no entry.
    return d.get(ms.lower(), ms)

with open(sys.argv[1], "r") as keyFile:
    for line in keyFile:
        if not line.strip():
            continue  # skip blank lines, which would otherwise fail to unpack
        # Remove surrounding whitespace, then split on the first run of whitespace.
        key, value = line.strip().split(None, 1)
        # Add the key and value to the dictionary.
        d[key.lower()] = value

with open(sys.argv[3], "w") as resultFile:
    with open(sys.argv[2], "r") as treeFile:
        for line in treeFile:
            # Substitute each old name with its new name.
            NewSpecies = oldname.sub(repl_func, line)
            resultFile.write(NewSpecies)
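To see what the substitution step does to a single line, here is a small self-contained sketch that applies the same regular expression to a Newick-style tree string held in memory. The mapping, the tree string, and the function name rename are made up for illustration and do not come from NameChange.py.

```python
import re

word = re.compile(r'\b\w+\b')

# Made-up mapping for illustration: lower-cased old name -> new name.
names = {"homo_sapiens": "Human", "mus_musculus": "Mouse"}


def rename(line, mapping):
    """Replace every whole word that has an entry in mapping (case-insensitive)."""
    return word.sub(lambda mo: mapping.get(mo.group(0).lower(), mo.group(0)), line)


tree = "((Homo_sapiens:0.1,MUS_MUSCULUS:0.2):0.05,Gallus_gallus:0.3);"
renamed = rename(tree, names)
print(renamed)  # ((Human:0.1,Mouse:0.2):0.05,Gallus_gallus:0.3);
```

Because matching works on whole words, a label such as Homo_sapiens_2 is left unchanged unless it has its own entry in the key file. Names without an entry, such as Gallus_gallus here, pass through untouched.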
