Find and modify a huge file

GDC · April 13, 2017, 9:20am

Dear Forum,

I have a rather large file with a few million lines looking like this:

head -n 11 seq.txt
>KF1.8.1
010011001011100010101110000000
>DF1.6.1
0101000010111010101011111100
>XC1.3.7
010110101011101010110000011
>GG5.1.1
0100011010111010101110001101
>HK1.2.2
010000111011101101001110001010
0101011

The lines in this file form records: a header line with the record name (starting with >) followed by the encoded information/sequence (001010...) that belongs to it. Now I need to append a code to each header according to the following file:

head -n 5 code.txt
>KF1.8.1;code=D0:B;D1:P;D2:E;D3:C;D4:H;D5:S_(1);
>DF1.6.1;code=D0:B;D1:D;D2:F;D3:C;D4:F;D5:S_(1);
>XC1.3.7;code=D0:A;D1:D;D2:E;D3:C;D4:H;D5:H;
>GG5.1.1;code=D0:A;D1:D;D2:E;D3:C;D4:F;D5:H;
>HK1.2.2;code=D0:A;D1:F;D2:F;D3:C;D4:H;D5:K_[23];

The results should look like this:

head -n 11 res.txt
>KF1.8.1;code=D0:B;D1:P;D2:E;D3:C;D4:H;D5:S_(1);
010011001011100010101110000000
>DF1.6.1;code=D0:B;D1:D;D2:F;D3:C;D4:F;D5:S_(1);
0101000010111010101011111100
>XC1.3.7;code=D0:A;D1:D;D2:E;D3:C;D4:H;D5:H;
010110101011101010110000011
>GG5.1.1;code=D0:A;D1:D;D2:E;D3:C;D4:F;D5:H;
0100011010111010101110001101
>HK1.2.2;code=D0:A;D1:F;D2:F;D3:C;D4:H;D5:K_[23];
010000111011101101001110001010
0101011

The two files (seq.txt, code.txt) are not sorted, but the number of records is identical.

I could use sed to change one record header at a time

sed 's/>KF1.8.1/>KF1.8.1;code=D0:B;D1:P;D2:E;D3:C;D4:H;D5:S_(1);/g' seq.txt

or maybe write the commands into a file and execute it

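while IFS= read -r code
do
  # The record name is everything before the first ";".
  record=$(printf '%s\n' "$code" | cut -d';' -f1)
  # Append one sed command per record; each command rereads all of seq.txt.
  echo "sed 's/$record/$code/g' seq.txt" >> all.txt
done < code.txt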

chmod a+x all.txt
./all.txt

but this might take some time. Does anybody have a faster and maybe more elegant way for me to modify the record headers?

Thanks for all your help!

Corona688 · April 13, 2017, 11:34am

Yes, editing a huge file once as opposed to editing a huge file n times for n lines would certainly be preferable!

This should work efficiently for anywhere up to millions of sequences listed in code.txt:

$ awk -F';' 'NR==FNR { A[$1]=$0 ; next } ; /^>/ && ($1 in A) { $1=A[$1] } 1' code.txt seq.txt

>KF1.8.1;code=D0:B;D1:P;D2:E;D3:C;D4:H;D5:S_(1);
010011001011100010101110000000
>DF1.6.1;code=D0:B;D1:D;D2:F;D3:C;D4:F;D5:S_(1);
0101000010111010101011111100
>XC1.3.7;code=D0:A;D1:D;D2:E;D3:C;D4:H;D5:H;
010110101011101010110000011
>GG5.1.1;code=D0:A;D1:D;D2:E;D3:C;D4:F;D5:H;
0100011010111010101110001101
>HK1.2.2;code=D0:A;D1:F;D2:F;D3:C;D4:H;D5:K_[23];
010000111011101101001110001010
0101011

$

It works because awk has associative arrays: you can do ARRAY["something"]="ABCD". NR==FNR means 'do this only for the first file listed'. So it reads the entire list into an associative array, then reads through the huge file looking for header lines, substitutes where a match exists, and prints every line.
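To see the two ingredients in isolation, here is a minimal sketch on tiny made-up files. The two-record code.txt and seq.txt are created in a scratch directory purely for illustration and are not the OP's data; the variable names nr_trace and merged are likewise just placeholders. The first awk call only prints NR and FNR, which shows why NR==FNR can hold only while the first file is being read. The second call runs the same one-liner on the sample files.

```shell
#!/bin/sh
# Hypothetical scratch data, created only for this demonstration.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' '>KF1.8.1;code=D0:B;' '>DF1.6.1;code=D0:A;' > code.txt
printf '%s\n' '>DF1.6.1' '0101' '>KF1.8.1' '0011' > seq.txt

# NR counts lines across all input files, FNR restarts at 1 for each file.
nr_trace=$(awk '{ printf "%s NR=%d FNR=%d\n", FILENAME, NR, FNR }' code.txt seq.txt)
printf '%s\n' "$nr_trace"

# Same one-liner as in the post: load code.txt into A, then rewrite the headers of seq.txt.
merged=$(awk -F';' 'NR==FNR { A[$1]=$0; next } /^>/ && ($1 in A) { $1=A[$1] } 1' code.txt seq.txt)
printf '%s\n' "$merged"
```

Note that the order of the records in the two files differs here, as it does for the OP; the lookup by name in the array is what makes sorting unnecessary.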

rovf · April 14, 2017, 1:56am

Just a side note: many shells (for instance bash and zsh) have associative arrays too. The problem is that the OP did not specify whether he wants to restrict the solution to a particular shell, as the code snippet he wrote would run in several shells.
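For comparison, a rough sketch of the same lookup using a bash associative array. It assumes bash 4 or newer (for declare -A), and the sample files and the array name header are made up for illustration. A shell read loop over millions of lines is likely to be much slower than awk, so this is meant to show the idea rather than to recommend it for the huge file.

```bash
#!/usr/bin/env bash
# Sketch only: needs bash 4 or newer for "declare -A".
# Hypothetical scratch data, created only for this demonstration.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' '>KF1.8.1;code=D0:B;' '>DF1.6.1;code=D0:A;' > code.txt
printf '%s\n' '>DF1.6.1' '0101' '>KF1.8.1' '0011' > seq.txt

declare -A header                      # associative array: record name -> full header line
while IFS= read -r line; do
  [[ -n $line ]] || continue           # skip empty lines, an empty key is not allowed
  header["${line%%;*}"]=$line          # the key is everything before the first ";"
done < code.txt

while IFS= read -r line; do
  if [[ $line == '>'* && -n ${header[$line]+set} ]]; then
    printf '%s\n' "${header[$line]}"   # known record name: print the extended header
  else
    printf '%s\n' "$line"              # sequence data or unknown header: print unchanged
  fi
done < seq.txt > res.txt
cat res.txt
```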

GDC · April 14, 2017, 2:54am

Dear Corona,

Thanks for the help and the explanation. I am not sure I understand the solution completely.

A[$1]=$0 means I read everything from the first file provided, and because I use -F ";" each line in the first file is split into fields at the semicolons.
/^>/ && ($1 in A): is this the if statement, i.e. if the line starts with a ">" sign and $1 is somewhere in the array? Why $1? Does it not refer to file two, or is everything after the ";" meant for file two?

It would be great if you could find the time to explain the awk array to me a bit more. I really appreciate your help.

RavinderSingh13 · April 14, 2017, 3:48am

Hello GDC,

Could you please go through the following and let me know if it helps you.

awk -F';' '             ##### Use ";" as the field separator.
NR==FNR {               ##### NR==FNR is TRUE only while the first file, code.txt, is being read.
  A[$1]=$0              ##### Create an array named A with index $1 and store the current line as its value.
  next                  ##### next is a built-in awk keyword: skip all following statements and read the next line.
}
/^>/ && ($1 in A) {     ##### Two conditions: the line starts with ">" and the first field of that line is an index of array A. If both are TRUE, perform the following statement.
  $1=A[$1]              ##### Set the first field to the value of array A whose index is $1.
}
1                       ##### awk works on condition-then-action rules. 1 makes the condition TRUE and no action is given, so the default action happens, which is printing the current line.
' code.txt seq.txt      ##### The input files: code.txt first, then seq.txt.
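On the question of why $1 is used for both files, a small sketch may help. It assumes two tiny sample files created in a scratch directory (not the real data), and the variable name field_report is just a placeholder. With -F';' awk splits every line of every input file the same way, so the command below prints the number of fields and the first field for each line.

```shell
#!/bin/sh
# Hypothetical scratch data, created only for this demonstration.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' '>KF1.8.1;code=D0:B;D1:P;' > code.txt
printf '%s\n' '>KF1.8.1' '0100' > seq.txt

# Show how the ";" separator splits the lines of both files.
field_report=$(awk -F';' '{ printf "%s: NF=%d first=%s\n", FILENAME, NF, $1 }' code.txt seq.txt)
printf '%s\n' "$field_report"
```

In code.txt the first field is the record name in front of the first ";". A header line in seq.txt contains no ";" at all, so the whole line is a single field and $1 is again the record name, which is why the same $1 can serve as the lookup key into A.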

Thanks,
R. Singh

GDC · April 14, 2017, 7:46am

Dear R. Singh

Yes, it does. Thanks for the help!