5. Extracting sequences and creating merged tables
shenjean edited this page Jan 21, 2021 · 16 revisions
- Extract accession numbers and sequences from complete_partial_mitogenomes.fa and combine them into a tab-separated table (the header line is added in a later step):
cat complete_partial_mitogenomes.fa | awk -F "|" '{OFS="%"}{print $1,$2}' | sed "s/%$//" | sed "s/>.*%/>/" | sed "s/>.*$/&#/" | tr -d "\n" | tr ">" "\n" | tr "#" "\t" | grep -v ^$ >complete.partial.seqtable
- Format of complete.partial.seqtable:
LM993800 TATTCCGAACAAACTAGGCGGAGTACTGGCCCTTCTATTCTCTATTCTAGTCCTAATACTGGTACCAGTCCTC
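To make the individual stages of the extraction pipeline easier to follow, here is a minimal sketch that runs the same command on a small toy FASTA file. The header layout `>gb|ACCESSION|description`, the file names toy.fa and toy.seqtable, and the two records are assumptions made for illustration; the real headers in complete_partial_mitogenomes.fa may differ.

```shell
# Work in a scratch directory so nothing real is overwritten
workdir=$(mktemp -d) && cd "$workdir"

# Toy input; the ">gb|ACCESSION|description" header layout is an assumption
printf '%s\n' \
  '>gb|LM993800|Toy record A' \
  'TATTCC' \
  'GAACAA' \
  '>gb|AB000667|Toy record B' \
  'CCTCCA' >toy.fa

# Same pipeline as above, applied to the toy file:
#   awk   keeps fields 1-2 of each "|"-split line, joined by "%"
#   sed   drops the trailing "%" on sequence lines, reduces headers to
#         ">ACCESSION", then appends "#" to each header
#   tr    removes newlines, starts a new line at each ">", turns "#" into a tab
#   grep  drops the empty first line
cat toy.fa | awk -F "|" '{OFS="%"}{print $1,$2}' | sed "s/%$//" | sed "s/>.*%/>/" | sed "s/>.*$/&#/" | tr -d "\n" | tr ">" "\n" | tr "#" "\t" | grep -v ^$ >toy.seqtable

cat toy.seqtable
# LM993800<TAB>TATTCCGAACAA
# AB000667<TAB>CCTCCA
```

Note that the pipeline uses "%", "#" and ">" as temporary markers, so it relies on these characters not occurring in the accession field or in the sequence lines.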
- Sort complete.partial.seqtable by first column (accession number):
cat complete.partial.seqtable | sort -k 1 >complete.partial.seqtable.sorted
# Add header line
echo -e Accession'\t'Sequence >complete.seq.header
cat complete.seq.header complete.partial.seqtable.sorted >complete.partial.seq.tsv
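Because the later merge step relies on every row having exactly one accession and one sequence, it may be worth checking the sorted table before using it. The following is only a sketch on a toy file; toy.seqtable.sorted, bad_rows.txt and dup_accessions.txt are hypothetical names, and the toy rows deliberately contain one malformed row and one duplicated accession.

```shell
# Work in a scratch directory so nothing real is overwritten
workdir=$(mktemp -d) && cd "$workdir"

# Toy table standing in for complete.partial.seqtable.sorted (hypothetical rows)
printf 'AB000667\tCCTCCA\nLM993800\tTATTCC\nLM993800\tGAACAA\nXX000001\n' >toy.seqtable.sorted

# Report rows that do not have exactly two tab-separated fields
awk -F '\t' 'NF != 2 {print FILENAME ": line " NR ": " NF " field(s)"}' toy.seqtable.sorted >bad_rows.txt

# Report accessions that occur more than once
cut -f 1 toy.seqtable.sorted | sort | uniq -d >dup_accessions.txt

cat bad_rows.txt dup_accessions.txt
```

Both report files being empty would suggest the table is well formed; on the toy input they flag line 4 and the accession LM993800.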
- Merge accession number, gene description, taxonomic information and sequence into a tab-separated table complete.partial.ref.tsv:
paste -d "\t" complete.partial.gene.tsv complete.partial.taxtable complete.partial.seq.tsv | awk -F "\t" '{OFS="\t"}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$12}' >complete.partial.ref.tsv
- Format of complete.partial.ref.tsv with sequence truncated for readability:
Accession Gene definition taxid Superkingdom Phylum Class Order Family Genus Species Sequence
AB000667 Paralichthys olivaceus mitochondrial Cyt-b gene for cytochrome b, partial cds 8255 Eukaryota Chordata Actinopteri Pleuronectiformes Paralichthyidae Paralichthys Paralichthys olivaceus CCTCCACATCGGCCGAGGTCTATACTACGGCTCTTTTCTGTATAAAGAAACATGAAATGTTGGCGTCATCCTGCTGCTTCTCGTAATGATGACCGCCTTTGTTGGTTACGTCCTTCCCTGAGGACAAATATCATTCTGGGGTGCCACTGTCATCACCAACCTACTCTCAGCCGTACCTTATGTCGGTAACACCCTAGTACAATGGATCTGAGGCGGATTTTCTGTAGATAATGCCACACTCACCCGGTTCTTTGCATTCCACTTCC
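The paste command joins the three files line by line, so it only gives a correct table if all inputs list the same records in the same order. One way to check this, sketched here on toy files, is to compare the accession from the gene table (column 1) with the accession from the sequence table (column 11) before column 11 is dropped. The toy file names and contents are hypothetical, and the column positions assume a two-column gene table and an eight-column taxonomy table, as the header shown above suggests.

```shell
# Work in a scratch directory so nothing real is overwritten
workdir=$(mktemp -d) && cd "$workdir"

# Toy stand-ins for the three inputs (hypothetical contents, header line first)
printf 'Accession\tGene definition\nAB000667\tgene A\nLM993800\tgene B\n' >toy.gene.tsv
printf 'taxid\tSuperkingdom\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies\n8255\tk\tp\tc\to\tf\tg\ts1\n9999\tk\tp\tc\to\tf\tg\ts2\n' >toy.taxtable
printf 'Accession\tSequence\nAB000667\tCCTCCA\nXX000001\tTATTCC\n' >toy.seq.tsv

# After paste, column 1 and column 11 should hold the same accession on every row
paste -d "\t" toy.gene.tsv toy.taxtable toy.seq.tsv | awk -F "\t" '$1 != $11 {print "row " NR ": " $1 " vs " $11}' >mismatches.txt

cat mismatches.txt
# row 3: LM993800 vs XX000001
```

An empty mismatches.txt would indicate that the gene and sequence tables are aligned. The taxonomy table carries no accession column in this layout, so its alignment cannot be verified this way.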
- For the complete mitogenomes in mitogenomes/*.fa, extract accession numbers and sequences and combine them into a tab-separated table with header:
cat mitogenomes/*.fa | sed "s/>.*$/&#/" | tr -d "\n" | tr ">" "\n" | grep -v ^$ | awk -F "|" '{OFS="#"}{print $4,$7}' | sed "s/\.[0-9]#.*#/#/" | tr "#" "\t" >complete.full.seqtable
cat complete.seq.header complete.full.seqtable >complete.seq.tsv
- Merge accession number, gene description, taxonomic information and sequence into a tab-separated table complete.ref.tsv:
paste -d "\t" complete.full.gene.tsv complete.full.taxtable complete.seq.tsv | awk -F "\t" '{OFS="\t"}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$12}' >complete.ref.tsv
- Format of complete.ref.tsv with sequence truncated to improve readability:
Accession Gene definition taxid Superkingdom Phylum Class Order Family Genus Species Sequence
NC_000860 Salvelinus fontinalis, complete mitogenome 8038 Eukaryota Chordata Actinopteri Salmoniformes Salmonidae Salvelinus Salvelinus fontinalis GCTGGCGTAGCTTAATTAAAGCATAACACTGAAGCTGTTAAGATGGACCCTAAAAAGTCCCGCAGGCACAAAGGCTTGGTCCTGACTTTACTATCAGCTTTAACTGAACTTACACATGCAAGTCTCCGCACTCCTGTGAGGATGCCCTTAATCCCCTGCCCGGGGACGAGGAGCCGGCATCAGGCGCGCCCAGGCAGCCCAAGACGCCTTGCTAAGCCACACCCCCAAGGAAACTCAGCAGTGA
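If the inputs to paste differ in length, paste does not stop with an error: it pads the shorter files with empty fields. A quick comparison of line counts before merging can catch this. The sketch below uses toy files with hypothetical names and contents, one of which is deliberately one line longer.

```shell
# Work in a scratch directory so nothing real is overwritten
workdir=$(mktemp -d) && cd "$workdir"

# Toy stand-ins for the three paste inputs (hypothetical contents)
printf 'Accession\tGene definition\nNC_000860\tgene A\n' >toy.gene.tsv
printf 'taxid\tSpecies\n8038\tsp A\n' >toy.taxtable
printf 'Accession\tSequence\nNC_000860\tGCTGGC\nNC_000861\tTTTAAA\n' >toy.seq.tsv

# Record the line count of each input as "count<TAB>file"
for f in toy.gene.tsv toy.taxtable toy.seq.tsv; do
  printf '%s\t%s\n' "$(wc -l <"$f" | tr -d ' ')" "$f"
done >row_counts.txt

# Number of distinct line counts; 1 means all inputs have the same length
n_distinct=$(cut -f 1 row_counts.txt | sort -u | wc -l | tr -d ' ')
if [ "$n_distinct" -ne 1 ]; then
  echo "Row counts differ:"
  cat row_counts.txt
fi
```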
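- Combine complete.partial.ref.tsv and complete.ref.tsv into a single table mitofish.ref.tsv with one header line:
head -1 complete.partial.ref.tsv >ref.header
cat complete.partial.ref.tsv complete.ref.tsv | grep -v "^Accession" >mitofish.ref
cat ref.header mitofish.ref >mitofish.ref.tsv
rm mitofish.ref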
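The same combination can be sketched as a single awk pass that keeps the header of the first file and skips the first line of every later file. This is an alternative, not the method used above. It does not depend on the spelling or case of the header text, but it does assume that each input file starts with exactly one header line. The toy file names and rows are hypothetical.

```shell
# Work in a scratch directory so nothing real is overwritten
workdir=$(mktemp -d) && cd "$workdir"

# Toy stand-ins for the two reference tables (hypothetical, two columns only)
printf 'Accession\tSequence\nAB000667\tCCTCCA\n' >toy.partial.ref.tsv
printf 'Accession\tSequence\nNC_000860\tGCTGGC\n' >toy.full.ref.tsv

# NR == 1 keeps the very first line (header of the first file);
# FNR > 1 keeps every non-header line of each file
awk 'NR == 1 || FNR > 1' toy.partial.ref.tsv toy.full.ref.tsv >toy.ref.tsv

cat toy.ref.tsv
# Accession<TAB>Sequence
# AB000667<TAB>CCTCCA
# NC_000860<TAB>GCTGGC
```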