Ibk
March 5, 2019, 1:38pm
1
I want to add the sequence length of File_1.fa and File_2.fa as the fifth column of File_1_pos.txt and File_2_pos.txt respectively, using awk and bash. Can anyone help me? Thanks
Get sequence length of each file
File_1.fa
File_2.fa
Add the sequence length as the last (fifth) column of the matching _pos.txt file
File_1_pos.txt
File_1_pos 253 164
File_1_pos 738 827
File_2_pos.txt
File_2_pos 1494 1583
File_2_pos 1785 1874
Expected Output
File_1_pos.txt
1 File_1_pos 253 164 8126
2 File_1_pos 738 827 8126
File_2_pos.txt
1 File_2_pos 1494 1583 9655
2 File_2_pos 1785 1874 9655
I tried this, but I didn't get my expected output:
for file in *.fa; do a=`awk '/^>/ {if (seqlen){print seqlen};next; } { seqlen += length($0)}END{print seqlen}' $file` | awk -F, '{$1=++i OFS"\t" $1;}1' ${file%.*}pos.txt | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$4,a}'; done
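A note on the attempt above, offered as a sketch rather than a definitive fix. Three things in the one-liner are likely to get in the way: the assignment `a=...` is placed in a pipeline, so it runs in a subshell and its value is lost; awk cannot see a shell variable unless it is passed in, for example with `-v`; and `${file%.*}pos.txt` expands to File_1pos.txt, without the underscore. The names `tmp`, `len` and `result` and the tiny stand-in data below are hypothetical.

```shell
# Work in a scratch directory with tiny stand-in files (hypothetical data).
tmp=$(mktemp -d)
printf '>seq1\nACGTACGT\nACGT\n' > "$tmp/File_1.fa"
printf 'File_1_pos 253 164\nFile_1_pos 738 827\n' > "$tmp/File_1_pos.txt"

# Step 1: total length of all lines that are not ">" headers.
len=$(awk '!/^>/ { n += length($0) } END { print n + 0 }' "$tmp/File_1.fa")

# Step 2: pass the shell value to awk with -v, number the rows, append the length.
result=$(awk -v len="$len" 'BEGIN { OFS = "\t" } { print NR, $1, $2, $3, len }' "$tmp/File_1_pos.txt")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Doing the two steps as separate commands keeps the length in an ordinary shell variable, which is then visible to the second awk call only because of `-v len="$len"`.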
RudiC
March 5, 2019, 2:20pm
2
Please post input file samples. How do you calculate the "sequence length"? Is there one or more of them per file?
Ibk
March 5, 2019, 2:42pm
3
These are examples of the input sequences. There is just one long sequence per file. Thanks
>File_1.fa
TTGAAAGGGGGCCCGGGGGATCTCCCCCGCGGTAACTGGTCACAGTTGCCGCGGACGGAGATCATCCCCC
GGTTACCCCCTTTCGACGCGGGTACTGCGATAGTGCCACCCCAGTCCTTCCTACTCCCGACTCCCGACCC
CAACCCAGGTTCCTTGGAACAGGAACACCAATTTATTCATCCCTTGGATGCTGACTAATCAGAGGAACGT
CAGCATTTTCCGGCCCAGGCTAAGAGAAGTAGATAAGTTAGAATCTAAATTATTTATCATCCCCTTGACG
AATTCGCGTTGGAAAAGCACCTCTCACTTGCCGCTCTTCACACCCATCATTCTAATTCGGCCCCTGTGTT
>File_2.fa
GAGCCCCTTGTTGAAGTGTTTCCCTCCATCGCGACGTGGTTGGAGATCTAAGTTAACCGACTCCGACGAA
ACTACCATCATGCCTCCCCGATTATGTGATGCTTTCTGCCCTGCTGGGTGGAGCATCCTCGGGTTGAGAA
ATCTTTCTTCCTTTTACCTTGGACTCCGGTCCCCCGGTCTAAGCCGCTTGGAATAAGACAGGGTTATCTT
CACTCCTCTTCTTTTCTACTTCACAGTGTTCTATGCTGTGAAAGGGTATGTGTCGCCCCTTCCTTCTTCG
RudiC
March 5, 2019, 4:59pm
4
Try
$ wc -cl *.fa | awk '
FILENAME == "-" {sub (".fa", "", $3)
T[$3] = $2 - $1
next
}
FNR == 1 {IX = FILENAME
sub (/_[^_]*\..*$/, "", IX)
}
{print FNR, $0, T[IX] > (FILENAME ".new")
}
' - OFS="\t" fil*pos.txt
$ cf *.new
---------- file_1_pos.txt.new: ----------
1 File_1_pos 253 164 350
2 File_1_pos 738 827 350
---------- file_2_pos.txt.new: ----------
1 File_2_pos 1494 1583 280
2 File_2_pos 1785 1874 280
Copy exactly as given; then mv the ".new" files over the old ".txt" files.
EDIT: Given there are any number of .fa files, and each has a corresponding _pos.txt file, you could try
$ wc -cl *.fa |
awk '
FILENAME == "-" {if ($3 == "total") next
sub (".fa", "", $3)
T[$3] = $2 - $1
ARGV[ARGC++] = $3 "_pos.txt"
next
}
FNR == 1 {IX = FILENAME
sub (/_[^_]*\..*$/, "", IX)
}
{print FNR, $0, T[IX] > (FILENAME ".new")
}
' - OFS="\t"
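To make the arithmetic in `T[$3] = $2 - $1` concrete: `wc -cl` reports lines and bytes, so bytes minus lines is the number of characters excluding newlines. The sketch below assumes plain single-byte text with Unix line endings, and uses made-up file names and data. It also shows that a ">" header line, if the .fa file has one, is counted along with the sequence.

```shell
tmp=$(mktemp -d)
# Hypothetical stand-in files: the same 12 bases, without and with a header line.
printf 'ACGTACGT\nACGT\n' > "$tmp/plain.fa"
printf '>seq1\nACGTACGT\nACGT\n' > "$tmp/with_header.fa"

# wc prints the line count first, then the byte count; bytes - lines drops the newlines.
plain=$(wc -cl < "$tmp/plain.fa" | awk '{ print $2 - $1 }')
with_header=$(wc -cl < "$tmp/with_header.fa" | awk '{ print $2 - $1 }')

echo "plain: $plain  with header: $with_header"
rm -rf "$tmp"
```

Here the header ">seq1" adds 5 to the result, so if the real files carry a header line, the value from this method will be larger than the true sequence length by the length of that line.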
Ibk
March 7, 2019, 8:37am
5
Thank you RudiC. I have tried the code, but it did not give me the required output. The input files are separate files, as are the expected output files.
rdrtx1
March 7, 2019, 3:14pm
6
for file in *.fa
do
    awk '
    NR == FNR {
        # First file (the FASTA): skip ">" header lines and sum the remaining lines.
        # Is there more than one /^>/ record per *.fa file, not mentioned in the samples?
        if ($0 ~ /^>/) next
        seqlen += length($0)
        next
    }
    # Second file (the positions): prepend the row number, append the length.
    {print FNR, $0, seqlen}
    ' OFS="\t" "$file" "${file%.*}_pos.txt" > "t_$$"
    mv -f "t_$$" "${file%.*}_pos.txt"
done
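The loop above relies on `${file%.*}` to get from a .fa name to its _pos.txt partner. A small sketch of what that expansion produces, assuming the pairing really is by shared prefix as the file names in the first post suggest; `sample.v2.fa` is a made-up name to show that only the last extension is removed.

```shell
pairs=$(for file in File_1.fa File_2.fa sample.v2.fa
do
    # %.* removes the shortest trailing ".something", i.e. only the final extension.
    echo "$file -> ${file%.*}_pos.txt"
done)
printf '%s\n' "$pairs"
```

If the positions files were named differently, the suffix appended after the expansion would need to change to match.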
RudiC
March 7, 2019, 4:22pm
7
So - which files exist, and by what feature are the files connected / related?