Concatenating sequence length to another file

Ibk · March 5, 2019, 1:38pm

I want to add the sequence length of File_1.fa and File_2.fa as the fifth column of File_1_pos.txt and File_2_pos.txt respectively, using awk and bash. Can anyone help me? Thanks

Get sequence length of each file

File_1.fa
File_2.fa

Add the sequence length as the fifth column of the corresponding position file
File_1_pos.txt

File_1_pos       253     164
File_1_pos      738     827

File_2_pos.txt

File_2_pos      1494    1583
File_2_pos      1785    1874

Expected Output
File_1_pos.txt

1   File_1_pos       253     164    8126
2  File_1_pos      738     827    8126

File_2_pos.txt

1   File_2_pos      1494    1583    9655
2   File_2_pos      1785    1874    9655

I tried this but I didn't get my expected output:

for file in *.fa; do a=`awk '/^>/ {if (seqlen){print seqlen};next; } { seqlen += length($0)}END{print seqlen}' $file` | awk -F, '{$1=++i OFS"\t" $1;}1' ${file%.*}pos.txt | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$4,a}'; done

RudiC · March 5, 2019, 2:20pm

Please post input file samples. How do you calculate the "sequence length"? Is there one or more of them per file?

Ibk · March 5, 2019, 2:42pm

These are examples of the input sequences. There is just one long sequence per file. Thanks
>File_1.fa

TTGAAAGGGGGCCCGGGGGATCTCCCCCGCGGTAACTGGTCACAGTTGCCGCGGACGGAGATCATCCCCC
GGTTACCCCCTTTCGACGCGGGTACTGCGATAGTGCCACCCCAGTCCTTCCTACTCCCGACTCCCGACCC
CAACCCAGGTTCCTTGGAACAGGAACACCAATTTATTCATCCCTTGGATGCTGACTAATCAGAGGAACGT
CAGCATTTTCCGGCCCAGGCTAAGAGAAGTAGATAAGTTAGAATCTAAATTATTTATCATCCCCTTGACG
AATTCGCGTTGGAAAAGCACCTCTCACTTGCCGCTCTTCACACCCATCATTCTAATTCGGCCCCTGTGTT

>File_2.fa

GAGCCCCTTGTTGAAGTGTTTCCCTCCATCGCGACGTGGTTGGAGATCTAAGTTAACCGACTCCGACGAA
ACTACCATCATGCCTCCCCGATTATGTGATGCTTTCTGCCCTGCTGGGTGGAGCATCCTCGGGTTGAGAA
ATCTTTCTTCCTTTTACCTTGGACTCCGGTCCCCCGGTCTAAGCCGCTTGGAATAAGACAGGGTTATCTT
CACTCCTCTTCTTTTCTACTTCACAGTGTTCTATGCTGTGAAAGGGTATGTGTCGCCCCTTCCTTCTTCG

RudiC · March 5, 2019, 4:59pm

Try

$ wc -cl *.fa | awk '
FILENAME == "-" {sub (/\.fa$/, "", $3)
                 T[$3] = $2 - $1
                 next
                }
FNR == 1        {IX = FILENAME
                 sub (/_[^_]*\..*$/, "", IX)
                }

                {print FNR, $0, T[IX] > (FILENAME ".new")
                }
' - OFS="\t" fil*pos.txt

$ cf *.new

---------- file_1_pos.txt.new: ----------

1    File_1_pos    253     164    350
2    File_1_pos    738     827    350

---------- file_2_pos.txt.new: ----------

1    File_2_pos    1494    1583    280
2    File_2_pos    1785    1874    280

Copy exactly as given; then mv the ".new" files over the old ".txt" files
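
For reference, the T[$3] = $2 - $1 step above relies on wc always printing the line count before the byte count, whatever the option order, so bytes minus lines is the number of characters with the newlines removed. A minimal sketch of that arithmetic follows. It assumes single-byte characters, Unix line endings, a newline at the end of the last line, and no ">" header line in the file (a header would be counted as sequence). The function name seq_len_wc and the demo file are made up for illustration.

```shell
# Hypothetical helper: character count of a file, excluding newlines.
seq_len_wc() {
    # Read from stdin so wc prints no file name: the fields are lines, bytes.
    wc -lc < "$1" | awk '{print $2 - $1}'
}

# Demo on a throwaway file holding two 10-character lines.
demo=$(mktemp)
printf 'ACGTACGTAC\nGGGGCCCCTT\n' > "$demo"
seq_len_wc "$demo"    # prints 20
rm -f "$demo"
```

If the real files do start with a ">" header, that line has to be skipped explicitly, which wc alone cannot do.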

EDIT: Given there are any number of .fa files, and each has a corresponding _pos.txt file, you could try

$ wc -cl *.fa |  
awk '
FILENAME == "-" {if ($3 == "total") next
                 sub (/\.fa$/, "", $3)
                 T[$3] = $2 - $1
                 ARGV[ARGC++] = $3 "_pos.txt"
                 next
                }
FNR == 1        {IX = FILENAME
                 sub (/_[^_]*\..*$/, "", IX)
                }
                {print FNR, $0, T[IX] > (FILENAME ".new")
                }
' - OFS="\t"
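
The ARGV[ARGC++] assignment above queues further input files while awk is still reading standard input; awk opens them once the current input is exhausted. Here is a stripped-down sketch of just that mechanism. The function name list_then_read and the demo files are invented for illustration, and it assumes, as the code above does, that awk reports FILENAME as "-" for standard input and that file names contain no whitespace.

```shell
# Hypothetical helper: read file names from stdin, then let awk itself
# open each named file and echo its records.
list_then_read() {
    awk '
    FILENAME == "-" { ARGV[ARGC++] = $1; next }   # stdin line: queue it as an input file
                    { print FILENAME, FNR, $0 }   # queued file: show name, line number, record
    ' -
}

# Demo with two throwaway files.
dir=$(mktemp -d)
printf 'first\nsecond\n' > "$dir/a.txt"
printf 'third\n' > "$dir/b.txt"
printf '%s\n' "$dir/a.txt" "$dir/b.txt" | list_then_read
rm -rf "$dir"
```

Because the file list is built from what is actually read, no glob for the _pos.txt files is needed on the command line.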

Ibk · March 7, 2019, 8:37am

Thank you RudiC. I have tried the code, but it did not give me the required output. The input files are separate files, and so are the expected output files.

rdrtx1 · March 7, 2019, 3:14pm

for file in *.fa
do
  awk '
  NR==FNR {
     # First file (the .fa): skip ">" header lines, if there are any.
     # is there a record separator /^>/ in the *.fa files not mentioned in samples?
     if ($0 ~ /^>/) next

     seqlen += length($0)   # running total over all sequence lines
     next
  }
  # Second file (the _pos.txt): prepend the line number, append the length.
  {print FNR, $0, seqlen}
  ' OFS="\t" "$file" "${file%.*}_pos.txt" > "t_$$" &&
  mv -f "t_$$" "${file%.*}_pos.txt"
done
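
The loop above pairs each FASTA file with its position file purely by name, through the ${file%.*} expansion, which strips the shortest trailing ".something". Below is a small sketch of that pairing with a check that the partner file exists. The function name pos_file_for is invented here, and the _pos.txt suffix is taken from the file names in the question.

```shell
# Hypothetical helper: name of the position file that belongs to a FASTA file.
pos_file_for() {
    # ${1%.*} removes the shortest suffix matching ".*", for example ".fa"
    printf '%s\n' "${1%.*}_pos.txt"
}

pos_file_for File_1.fa    # prints File_1_pos.txt

# Skip FASTA files that have no partner instead of failing inside awk.
for fa in *.fa; do
    pos=$(pos_file_for "$fa")
    [ -f "$pos" ] || { echo "no position file for $fa" >&2; continue; }
    echo "$fa -> $pos"
done
```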

RudiC · March 7, 2019, 4:22pm

So - which files exist, and by what feature are the files connected / related?