Concatenating sequence length to another file

I want to add the sequence length of File_1.fa and File_2.fa as the fifth column of File_1_pos.txt and File_2_pos.txt respectively, using awk and bash. Can anyone help me? Thanks.

Get sequence length of each file

File_1.fa
File_2.fa 

Add the sequence length as the last column of the matching _pos.txt file:
File_1_pos.txt

File_1_pos       253     164
File_1_pos      738     827

File_2_pos.txt

File_2_pos      1494    1583
File_2_pos      1785    1874

Expected Output
File_1_pos.txt

1   File_1_pos       253     164    8126
2  File_1_pos      738     827    8126

File_2_pos.txt

1   File_2_pos      1494    1583    9655
2   File_2_pos      1785    1874    9655

I tried this, but I didn't get my expected output:

for file in *.fa; do a=`awk '/^>/ {if (seqlen){print seqlen};next; } { seqlen += length($0)}END{print seqlen}' $file` | awk -F, '{$1=++i OFS"\t" $1;}1' ${file%.*}pos.txt | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$4,a}'; done
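
One likely reason the last column comes out empty: the shell variable `a` is never handed to the final awk program, and because the assignment is itself part of a pipeline, its value is lost anyway. A minimal sketch of passing a shell value into awk with `-v` follows. The scratch directory and the names `tmpdir`, `len` and `result` are illustrative only, and 8126 is simply the value from the expected output above.

```shell
# Sketch only: a throwaway copy of the sample data from the post.
tmpdir=$(mktemp -d)
printf 'File_1_pos\t253\t164\nFile_1_pos\t738\t827\n' > "$tmpdir/File_1_pos.txt"

len=8126   # in the real loop this would be computed from the .fa file

# -v copies the shell value into the awk variable "len";
# NR supplies the running line number for the first column.
result=$(awk -v len="$len" 'BEGIN { OFS = "\t" } { print NR, $0, len }' "$tmpdir/File_1_pos.txt")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Each output line is the line number, the original line, and the length, separated by tabs.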

Please post input file samples. How do you calculate the "sequence length"? Is there one or more of them per file?

These are examples of the input sequences. There is just one long sequence per file. Thanks.
>File_1.fa

TTGAAAGGGGGCCCGGGGGATCTCCCCCGCGGTAACTGGTCACAGTTGCCGCGGACGGAGATCATCCCCC
GGTTACCCCCTTTCGACGCGGGTACTGCGATAGTGCCACCCCAGTCCTTCCTACTCCCGACTCCCGACCC
CAACCCAGGTTCCTTGGAACAGGAACACCAATTTATTCATCCCTTGGATGCTGACTAATCAGAGGAACGT
CAGCATTTTCCGGCCCAGGCTAAGAGAAGTAGATAAGTTAGAATCTAAATTATTTATCATCCCCTTGACG
AATTCGCGTTGGAAAAGCACCTCTCACTTGCCGCTCTTCACACCCATCATTCTAATTCGGCCCCTGTGTT

>File_2.fa

GAGCCCCTTGTTGAAGTGTTTCCCTCCATCGCGACGTGGTTGGAGATCTAAGTTAACCGACTCCGACGAA
ACTACCATCATGCCTCCCCGATTATGTGATGCTTTCTGCCCTGCTGGGTGGAGCATCCTCGGGTTGAGAA
ATCTTTCTTCCTTTTACCTTGGACTCCGGTCCCCCGGTCTAAGCCGCTTGGAATAAGACAGGGTTATCTT
CACTCCTCTTCTTTTCTACTTCACAGTGTTCTATGCTGTGAAAGGGTATGTGTCGCCCCTTCCTTCTTCG
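
Assuming "sequence length" means the number of bases, i.e. the total length of all lines that are not `>` headers, it could be computed roughly as below. The FASTA record here is a tiny made-up one (`demo.fa`), not the real data, and `tmpdir` and `seqlen` are illustrative names.

```shell
tmpdir=$(mktemp -d)
# Tiny made-up FASTA record: one header line, a blank line, two sequence lines.
printf '>demo\n\nACGTACGT\nACGT\n' > "$tmpdir/demo.fa"

# Skip header lines starting with ">"; add up the length of everything else.
# "+ 0" makes awk print 0 rather than an empty string for an empty file.
seqlen=$(awk '/^>/ { next } { n += length($0) } END { print n + 0 }' "$tmpdir/demo.fa")
echo "$seqlen"
rm -rf "$tmpdir"
```

Blank lines contribute nothing to the sum, so they need no special handling.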

Try

$ wc -cl *.fa | awk '
FILENAME == "-" {sub (/\.fa$/, "", $3)
                 T[$3] = $2 - $1
                 next
                }
FNR == 1        {IX = FILENAME
                 sub (/_[^_]*\..*$/, "", IX)
                }

                {print FNR, $0, T[IX] > (FILENAME ".new")
                }
' - OFS="\t" fil*pos.txt

$ cf *.new

---------- file_1_pos.txt.new: ----------

1    File_1_pos    253     164    350
2    File_1_pos    738     827    350

---------- file_2_pos.txt.new: ----------

1    File_2_pos    1494    1583    280
2    File_2_pos    1785    1874    280
 

Copy exactly as given; then mv the ".new" files over the old ".txt" files
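
The renaming step could look like the sketch below, which assumes every `.new` file should replace the file of the same name without that suffix. It works on dummy files in a scratch directory (`tmpdir` is an illustrative name) so that nothing real gets overwritten while trying it out.

```shell
tmpdir=$(mktemp -d)
echo old > "$tmpdir/file_1_pos.txt"
echo new > "$tmpdir/file_1_pos.txt.new"

for f in "$tmpdir"/*.new; do
    [ -e "$f" ] || continue      # no .new files: the glob stays literal, skip it
    mv -- "$f" "${f%.new}"       # strip the ".new" suffix, replacing the old file
done

content=$(cat "$tmpdir/file_1_pos.txt")
echo "$content"
```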

EDIT: Given there are any number of .fa files, and each has a corresponding _pos.txt file, you could try

$ wc -cl *.fa |  
awk '
FILENAME == "-" {if ($3 == "total") next
                 sub (/\.fa$/, "", $3)
                 T[$3] = $2 - $1
                 ARGV[ARGC++] = $3 "_pos.txt"
                 next
                }
FNR == 1        {IX = FILENAME
                 sub (/_[^_]*\..*$/, "", IX)
                }
                {print FNR, $0, T[IX] > (FILENAME ".new")
                }
' - OFS="\t" 

Thank you Rudic. I have tried the code, but it did not give me the required output. The input files are separate files, and so are the expected output files.

for file in *.fa;
do
  awk '
  NR==FNR {
     if ($0 ~ /^>/) next;   # skip FASTA header lines such as ">File_1.fa"

     seqlen += length($0);
     next;
  }
  {print FNR, $0, seqlen}
  ' OFS="\t" "$file" "${file%.*}_pos.txt" > t_$$

  mv -f t_$$ "${file%.*}_pos.txt"
done

So - which files exist, and by what feature are the files connected / related?