Search for string in column using variable: awk

genome · February 3, 2018, 5:47pm

I'd like to match a pattern in one column with awk, using an external (shell) variable, on data like this:

-9	1:751343:T:A	-9	0	T	A	0.726	-5.408837e-03	9.576603e-03	7.967536e-01	5.722312e-01
-9	1:751756:T:C	-9	0	T	C	0.727	-5.360458e-03	9.579447e-03	7.966977e-01	5.757858e-01
-9	1:752566:G:A	-9	0	G	A	0.331	6.583382e-03	8.958503e-03	7.950995e-01	4.624419e-01
-9	1:753425:T:C	-9	0	T	C	0.295	7.321481e-03	
-9	3:60197:G:A	-9	0	G	A	0.918	1.480658e-03	1.554497e-02	7.950968e-01	9.241192e-01
-9	3:60202:C:G	-9	0	C	G	0.989	-2.318091e-02	2.707507e-02	7.947803e-01	5.699114e-01
-9	22:51228888:T:G	-9	0	T	G	0.737	-7.274594e-03	1.073497e-02	7.928675e-01	4.980153e-01
-9	22:51228910:G:A	-9	0	G	A	0.791	-6.978814e-03	1.147448e-02	7.936905e-01	5.430739e-01
-9	22:51229455:G:C	-9	0	G	C	0.965	4.726587e-03	2.609153e-02	7.949339e-01	8.562523e-01
-9	22:51229491:G:A	-9	0	G	A	0.970	5.992810e-02	2.711477e-02	7.917267e-01	2.712828e-02
-9	22:51229591:A:G	-9	0	A	G	0.988	1.235893e-01	4.360370e-02	7.923663e-01	4.605606e-03
-9	22:51229717:A:T	-9	0	A	T	0.791	-4.919975e-03	1.159186e-02	7.941657e-01	6.712634e-01

The data are stored in the file more.txt.

for i in {1..22} 
do  

printf "$i\n" 

awk -v chr=$i '{

if ($2 ~ /^chr:/ )
{
print $0 
} 
}' more.txt

done

Issue:
The match pattern doesn't pick up the chr variable. It works fine if I hard-code the number, and printing chr shows the correct value.

Only the numbers 1 through 22 from the for loop are printed.

Darwin 17.3.0 Darwin Kernel Version 17.3.0: Thu Nov 9 18:09:22 PST 2017; root:xnu-4570.31.3~1/RELEASE_X86_64 x86_64
OSX 10.13.2

Don_Cragun · February 3, 2018, 6:21pm

Putting an awk variable name inside double-quotes or inside slashes in an awk script turns it into a literal string; not the name of a variable to be expanded. Although the extended regular expressions are usually written with slashes as delimiters, in reality all that awk requires is a string (a constant string between double-quotes, a constant string between slashes, or a variable containing a string).

Try:

for i in {1..22} 
do
	printf "$i\n" 

	awk -v chr=$i '
	{
		if ($2 ~ ("^" chr ":") )
		{
			print $0 
		} 
	}' more.txt
done

or, very slightly more efficiently:

for i in {1..22} 
do
	printf "$i\n" 

	awk -v ERE="^$i:" '
	{
		if ($2 ~ ERE)
		{
			print $0 
		} 
	}' more.txt
done
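
To see the difference between the two forms side by side, here is a minimal sketch. The three-row sample file and the variable names literal and dynamic are made up for illustration; only the two match expressions come from the posts above.

```shell
# Hypothetical three-row sample in the same tab-separated layout as more.txt.
sample=$(mktemp)
printf '%s\t%s\t%s\n' \
    -9 1:751343:T:A -9 \
    -9 11:200:G:A -9 \
    -9 22:51228888:T:G -9 > "$sample"

chr=1

# Between slashes, "chr" is just the three letters c, h, r, so nothing matches.
literal=$(awk -v chr="$chr" '$2 ~ /^chr:/ { print $2 }' "$sample")

# Here the regex is a string built from the variable's value: "^1:".
dynamic=$(awk -v chr="$chr" '$2 ~ ("^" chr ":") { print $2 }' "$sample")

printf 'literal: [%s]\ndynamic: [%s]\n' "$literal" "$dynamic"
rm -f "$sample"
```

The trailing ":" in the pattern is what keeps chr=1 from also matching the made-up 11:200:G:A row.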

genome · February 3, 2018, 6:23pm

Thank you.


for i in {1..22}
do

awk -v chr="^$i:" '{
if ($2 ~ chr )
{
print $0
}
}' more.txt

done

I was about to post modified code.

RudiC · February 4, 2018, 4:04am

Did you consider using awk's default behaviour (a pattern with no action prints the matching line) to condense the script to

for i in {1..22}
  do    printf "$i\n"
        awk -v chr="^$i:" '$2 ~ chr' more.txt
  done
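
A quick way to convince yourself that the pattern-only form prints the same lines as the explicit if/print version. This is only a sketch; the two-row sample file and the names short and long are assumptions for the demo.

```shell
# Hypothetical two-row sample in the same tab-separated layout as more.txt.
sample=$(mktemp)
printf '%s\t%s\t%s\n' \
    -9 3:60197:G:A -9 \
    -9 22:51228888:T:G -9 > "$sample"

# Pattern only: awk supplies the default action { print $0 }.
short=$(awk -v chr="^3:" '$2 ~ chr' "$sample")

# The same test written out with an explicit if and print.
long=$(awk -v chr="^3:" '{ if ($2 ~ chr) { print $0 } }' "$sample")

if [ "$short" = "$long" ]; then echo "same output"; else echo "different output"; fi
rm -f "$sample"
```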

EDIT: Are you aware that your script, in either version, creates 22 processes to run awk, opening and reading more.txt 22 times? That's quite expensive, resource-wise. How about one single awk invocation and one single file read for all:

awk '
        {TMP = $2
         sub (/:.*$/, "", TMP)
         BUF[TMP, ++CNT[TMP]] = $0
        }

END     {for (i=1; i<=22; i++)  {print i
                                 for (c=1; c<=CNT[i]; c++) print BUF[i, c]
                                }
        }
'  more.txt

genome · February 4, 2018, 10:00am

Oh, no print needed? Thanks. It's pretty.

rudic:

EDIT: Are you aware that your script creates 22 processes to run awk in either, opening and reading more.txt 22 times? That's quite expensive, resourcewise. How about one single awk invocation and one single file read for all:
awk '
   {TMP = $2
   sub (/:.*$/, "", TMP)
   BUF[TMP, ++CNT[TMP]] = $0
   }

END     {for (i=1; i<=22; i++)  {print i
   for (c=1; c<=CNT[i]; c++) print BUF[i, c]
   }
   }
'  more.txt

Yes, I was loading and reading the file 22 times. Sorry, I can't understand your code.
How does it work, and what logic is being implemented?

Thank you as always for your reply.

RudiC · February 4, 2018, 11:04am

awk '
        {BUF[TMP=substr($2,1,index($2,":")-1), ++CNT[TMP]] = $0                 # store the input file in memory with index based on $2 and in increasing order
        }

END     {for (i=1; i<=22; i++)  {print i                                        # create sequence No. (1 .. 22) and print it
                                 for (c=1; c<=CNT[i]; c++) print BUF[i, c]   # print the input for this sequence number - if exists - in increasing order
                                }                                               # if it does not exist, CNT defaults to zero, and loop is not entered.
        }
' more.txt
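
The same single-pass idea, spelled out one step per line, may be easier to follow. This is a sketch rather than a drop-in replacement: the four-row sample and the names part, chrom, count and line are invented for illustration, it uses split instead of substr/index, it stores only $2 to keep the output short, and it prints the chromosome in front of each row instead of a separate header line.

```shell
# Hypothetical sample, deliberately not sorted by chromosome.
sample=$(mktemp)
printf '%s\t%s\t%s\n' \
    -9 22:51228888:T:G -9 \
    -9 1:751343:T:A -9 \
    -9 3:60197:G:A -9 \
    -9 1:751756:T:C -9 > "$sample"

grouped=$(awk '
{
    split($2, part, ":")      # part[1] is the chromosome, e.g. "22"
    chrom = part[1]
    n = ++count[chrom]        # running count of rows seen for this chromosome
    line[chrom, n] = $2       # store under the pair (chromosome, row number)
}
END {
    for (i = 1; i <= 22; i++)               # chromosomes in numeric order
        for (c = 1; c <= count[i]; c++)     # count[i] is empty (0) if never seen
            print i, line[i, c]
}' "$sample")

printf '%s\n' "$grouped"
rm -f "$sample"
```

The file is read once; the ordering by chromosome happens entirely in the END block.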

genome · February 14, 2018, 1:37pm

BUF[TMP=substr($2,1,index($2,":")-1), ++CNT[TMP]]

I am totally incapable of understanding this code snippet. Before the comma I can see the substring being extracted, but what does , ++CNT[TMP] do?
Could you please help?

RudiC · February 14, 2018, 3:56pm

You are right, that is a bit intricate...
We have the BUF array, which needs to be "multidimensional", i.e. indexed by two indices. awk doesn't provide real multidimensional arrays but approximates them with a "compound" index: the subindices for the different "dimensions" are written separated by commas and are stored concatenated, joined by the value of the variable SUBSEP (cf. man awk ).
The first index is built from the beginning of $2 up to the first : using the substr function, saving the result to the TMP variable at the same time for later use. awk allows for this construct.
The second index is just a pre-incremented (++ operator in front of the variable, cf. man awk ) counter taken from the CNT array, indexed by that TMP .
So consecutive lines with identical TMP index (1, 3, and 22 in your above sample) will have an incremental / sequential integer second index.
You could demonstrate this behaviour by printing out the BUF array's indices:

for (b in BUF) print b

Please be aware that in awk , the order in which b traverses the indices of the array is not defined.
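
A runnable sketch of that check, assuming a made-up three-row input. Since SUBSEP is normally a non-printing character, the compound index is split back into its two parts before printing, and the output is piped through sort only to get a stable order for reading.

```shell
# Three hypothetical rows (two columns are enough for the demo), fed on stdin.
indices=$(printf '%s\t%s\n' -9 1:751343:T:A -9 1:751756:T:C -9 3:60197:G:A |
awk '
{
    TMP = substr($2, 1, index($2, ":") - 1)   # text before the first ":"
    BUF[TMP, ++CNT[TMP]] = $0                 # compound index: TMP SUBSEP counter
}
END {
    for (b in BUF) {
        split(b, idx, SUBSEP)                 # take the compound index apart again
        print idx[1] "," idx[2]
    }
}' | sort)

printf '%s\n' "$indices"
```

Each output line shows the chromosome and the running counter that together form one index of BUF.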