ExtractSequences

ExtractSequences.py

 

This program selects a subset of sequences from a FASTA-formatted file. I wrote it while working on the paper "Mitochondrial data are not suitable for resolving placental mammal phylogeny", to extract FASTA sequences for different clades within the mammal phylogeny.

 

To run this program, copy the text below into a text editor and save it as ExtractSequences.py. Then type:

 

>python ExtractSequences.py SubList FastaFile Outfile

 

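# SubList is a text file listing the species/sequence names to select, one name per line, with no spaces in a name. Each name must exactly match the text after ">" on a header line in FastaFile.
# FastaFile is the input FASTA file, and Outfile is the file the selected sequences are written to.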

print("python ExtractSequences.py SubList FastaFile Outfile")

import sys
from collections import defaultdict

ListofNames = open(sys.argv[1], 'r')
Outfile = open(sys.argv[3], 'w')

# Read the species/sequence names into a list, one name per line.
list1 = []
for name in ListofNames:
    clean = name.strip()
    list1.append(clean)

# Map each FASTA header (the text after ">") to its sequence lines.
savedSpecies = defaultdict(list)
title = ""
for each_line in open(sys.argv[2], 'r'):
    if len(each_line) == 1:
        continue  # skip blank lines
    if each_line.startswith(">"):
        title = each_line[1:].rstrip("\n")
    else:
        savedSpecies[title].append(each_line.strip())

# Write the requested records to Outfile, in the order they appear in SubList.
for stuff in list1:
    if stuff in savedSpecies:
        foundRecords = savedSpecies[stuff]
        Outfile.write(">" + stuff + "\n")
        for line in foundRecords:
            Outfile.write(line + "\n")

ListofNames.close()
Outfile.close()
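The same lookup can also be wrapped in a function so it can be called from other Python code. What follows is a minimal sketch and not part of the original script: the function name extract_sequences and the example data are made up for illustration, and it assumes Python 3. It follows the script's approach of matching each name against the full header text after ">", except that it trims whitespace around the header.

```python
from collections import defaultdict


def extract_sequences(names, fasta_lines):
    """Return FASTA text for records whose header matches a name exactly.

    Hypothetical helper for illustration; not part of ExtractSequences.py.
    names: iterable of header names, one per item.
    fasta_lines: iterable of lines from a FASTA file.
    """
    records = defaultdict(list)
    title = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            title = line[1:]  # header text after ">"
        elif title is not None:
            records[title].append(line)

    out = []
    for name in names:
        name = name.strip()
        if name in records:  # names missing from the file are skipped
            out.append(">" + name)
            out.extend(records[name])
    return "\n".join(out) + "\n" if out else ""


# Made-up example data, for illustration only.
fasta = [
    ">Homo_sapiens\n",
    "ACGTACGT\n",
    "TTGA\n",
    ">Mus_musculus\n",
    "GGCCAATT\n",
    ">Bos_taurus\n",
    "ATATAT\n",
]
wanted = ["Bos_taurus\n", "Homo_sapiens\n", "Not_in_file\n"]

result = extract_sequences(wanted, fasta)
print(result, end="")
```

As in the script, records are written in the order of the name list rather than the order of the FASTA file, and names that do not appear in the file are skipped silently.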

 
