Failure using regex with awk in 'while read file' loop

pathunkathunk · April 6, 2015, 11:26pm

I have a file1.txt with several hundred thousand lines, each of which has a column 9 containing one of 60 "label" identifiers. Using a labels.txt file containing a list of labels, I'd like to extract 200 random lines from file1.txt for each of the labels in labels.txt.

Using a contrived mini-example:

$ cat file1.txt 
H	0	328	100.0	-	0	0	38D150M140D	M01433:68:000000000-AAT0D:1:1111:13371:3239;barcodelabel=c8;	OTU_1;size=17947;
H	1	325	100.0	+	0	0	150M175D	M01433:68:000000000-AAT0D:1:1105:27659:19941;barcodelabel=c12;	OTU_2;size=101;
H	4	411	99.3	+	0	0	24D150M237D	M01433:68:000000000-AAT0D:1:2107:16393:23698;barcodelabel=g10;	OTU_5;size=64;
H	2	283	98.7	+	0	0	150M133D	M01433:68:000000000-AAT0D:1:2104:21919:3018;barcodelabel=c12;	OTU_3;size=80;
H	1	277	98.5	-	0	0	15I135M142D	M01433:68:000000000-AAT0D:1:2108:12616:12185;barcodelabel=c12;	OTU_2;size=101;
H	0	295	100.0	+	0	0	14D150M131D	M01433:68:000000000-AAT0D:1:1108:4978:15986;barcodelabel=g10;	OTU_1;size=17947;
H	29	312	97.6	-	0	0	25I125M187D	M01433:68:000000000-AAT0D:1:1109:20934:22671;barcodelabel=g15;	OTU_30;size=8;
H	0	315	99.3	-	0	0	88D150M77D	M01433:68:000000000-AAT0D:1:2114:17509:23920;barcodelabel=g10;	OTU_1;size=17947;

$ cat labels.txt
c12
g10

This is what I'm trying, but it results in empty files:

$ while read file
> do
> awk '/${file}/' file1.txt | gshuf -n 200 > ${file}.txt
> done < labels.txt

Desired output: two random lines for each label in labels.txt (the lines themselves may vary, but each must contain "label=c12" or "label=g10", respectively):

$ cat c12.txt
H	1	325	100.0	+	0	0	150M175D	M01433:68:000000000-AAT0D:1:1105:27659:19941;barcodelabel=c12;	OTU_2;size=101;
H	2	283	98.7	+	0	0	150M133D	M01433:68:000000000-AAT0D:1:2104:21919:3018;barcodelabel=c12;	OTU_3;size=80;

$ cat g10.txt
H	0	295	100.0	+	0	0	14D150M131D	M01433:68:000000000-AAT0D:1:1108:4978:15986;barcodelabel=g10;	OTU_1;size=17947;
H	0	315	99.3	-	0	0	88D150M77D	M01433:68:000000000-AAT0D:1:2114:17509:23920;barcodelabel=g10;	OTU_1;size=17947;

It seems like the problem is with the " awk '/${file}/' "? I say this because I can extract lines for each label but only if I explicitly specify the label regex (in this case g10.txt also has two random lines with "label=c12" instead of g10):

$ while read file
> do
> awk '/c12/' file1.txt | gshuf -n 2 > ${file}.txt
> done < labels.txt
$ cat c12.txt 
H	1	277	98.5	-	0	0	15I135M142D	M01433:68:000000000-AAT0D:1:2108:12616:12185;barcodelabel=c12;	OTU_2;size=101;
H	1	325	100.0	+	0	0	150M175D	M01433:68:000000000-AAT0D:1:1105:27659:19941;barcodelabel=c12;	OTU_2;size=101;

Thanks for any pointers.
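
For reference, the empty files come from shell quoting: inside single quotes the shell does not expand ${file}, so awk is handed the literal characters ${file} as its pattern, not the label's value. A minimal sketch of one way around this passes the label to awk with -v and compares it as a plain substring of column 9. The row helper, the sample rows, and the shuf_cmd variable are made up for illustration and are not from the post.

```shell
# Hypothetical scratch data shaped like file1.txt: tab-separated, label in column 9.
cd "$(mktemp -d)" || exit 1
row() { printf 'H\t0\t300\t100.0\t+\t0\t0\t150M\t%s;barcodelabel=%s;\tOTU_1;size=1;\n' "$1" "$2"; }
{ row r1 c8; row r2 c12; row r3 g10; row r4 c12; row r5 c12; row r6 g10; } > file1.txt
printf 'c12\ng10\n' > labels.txt

# gshuf is the Homebrew name on macOS; GNU/Linux systems usually call it shuf.
shuf_cmd=$(command -v gshuf || command -v shuf)

while IFS= read -r label
do
    # -v copies the shell value into the awk variable "lab".
    # index() is a substring test, so "c1" cannot accidentally match "c12;".
    awk -F'\t' -v lab="$label" 'index($9, "barcodelabel=" lab ";")' file1.txt |
        "$shuf_cmd" -n 2 > "${label}.txt"
done < labels.txt
```

This still reads file1.txt once per label, which may be slow for 60 labels and a very large file.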

Akshay_Hegde · April 7, 2015, 12:33am

Try

awk 'FNR==NR{A[$1];next}$3 in A && A[$3] < limit {  print > sprintf("%s.txt",$3); A[$3]+=1 }' limit=2  labels.txt FS='[=;]' file1.txt

OR

awk 'FNR==NR{A[$1];next}{split($9,T,/[=;]/)}T[3] in A && A[T[3]] < limit{print > sprintf("%s.txt",T[3]); A[T[3]]+=1 }' limit=2 labels.txt  file1.txt

Test result

[akshay@localhost tmp]$ cat labels.txt 
c12
g10

[akshay@localhost tmp]$ cat file1.txt
H	0	328	100.0	-	0	0	38D150M140D	M01433:68:000000000-AAT0D:1:1111:13371:3239;barcodelabel=c8;	OTU_1;size=17947;
H	1	325	100.0	+	0	0	150M175D	M01433:68:000000000-AAT0D:1:1105:27659:19941;barcodelabel=c12;	OTU_2;size=101;
H	4	411	99.3	+	0	0	24D150M237D	M01433:68:000000000-AAT0D:1:2107:16393:23698;barcodelabel=g10;	OTU_5;size=64;
H	2	283	98.7	+	0	0	150M133D	M01433:68:000000000-AAT0D:1:2104:21919:3018;barcodelabel=c12;	OTU_3;size=80;
H	1	277	98.5	-	0	0	15I135M142D	M01433:68:000000000-AAT0D:1:2108:12616:12185;barcodelabel=c12;	OTU_2;size=101;
H	0	295	100.0	+	0	0	14D150M131D	M01433:68:000000000-AAT0D:1:1108:4978:15986;barcodelabel=g10;	OTU_1;size=17947;
H	29	312	97.6	-	0	0	25I125M187D	M01433:68:000000000-AAT0D:1:1109:20934:22671;barcodelabel=g15;	OTU_30;size=8;
H	0	315	99.3	-	0	0	88D150M77D	M01433:68:000000000-AAT0D:1:2114:17509:23920;barcodelabel=g10;	OTU_1;size=17947;

[akshay@localhost tmp]$ awk 'FNR==NR{A[$1];next}$3 in A && A[$3] < limit {  print > sprintf("%s.txt",$3); A[$3]+=1 }' limit=2  labels.txt FS='[=;]' file1.txt

[akshay@localhost tmp]$ cat c12.txt 
H	1	325	100.0	+	0	0	150M175D	M01433:68:000000000-AAT0D:1:1105:27659:19941;barcodelabel=c12;	OTU_2;size=101;
H	2	283	98.7	+	0	0	150M133D	M01433:68:000000000-AAT0D:1:2104:21919:3018;barcodelabel=c12;	OTU_3;size=80;

[akshay@localhost tmp]$ cat g10.txt 
H	4	411	99.3	+	0	0	24D150M237D	M01433:68:000000000-AAT0D:1:2107:16393:23698;barcodelabel=g10;	OTU_5;size=64;
H	0	295	100.0	+	0	0	14D150M131D	M01433:68:000000000-AAT0D:1:1108:4978:15986;barcodelabel=g10;	OTU_1;size=17947;

pathunkathunk · April 7, 2015, 1:08am

This successfully extracts lines with the correct label. But I am looking for only two lines for each label in labels.txt, and the lines should be randomly chosen. Can this script be piped to " gshuf -n 2"--or the equivalent--before the file is created?

What does $3 do in this code?

Don_Cragun · April 7, 2015, 3:03am

With FS='[=;]' , the 1st field on each line in your sample data is the string of characters before the 1st semicolon, the 2nd field is the string barcodelabel , and the 3rd field (referenced in awk by $3 ) is the string between the equals sign and the 2nd semicolon (which is what you're looking for to match a string read from labels.txt ).
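
The field splitting described above can be checked on a single line of the sample data. This is only a sketch: it uses -F'[=;]' on the command line, which for a single input should behave the same as the FS='[=;]' operand in the proposed command, and the fields variable is just a name chosen here.

```shell
# Split one sample line on "=" or ";" and show the 2nd and 3rd fields.
fields=$(printf 'H\t1\t325\t100.0\t+\t0\t0\t150M175D\tM01433:68:000000000-AAT0D:1:1105:27659:19941;barcodelabel=c12;\tOTU_2;size=101;\n' |
    awk -F'[=;]' '{ print "$2=" $2; print "$3=" $3 }')
printf '%s\n' "$fields"
# prints:
# $2=barcodelabel
# $3=c12
```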

pathunkathunk · April 7, 2015, 10:18am

Do I have this basically right?

{
    A[$1]; # Build array on the first column of the file
    next   # Skip all following blocks and process the next line
}
$3 in A # Check whether the value in column one of the second file is in the array
{
    # If so, print it to a file named after $3
    print > sprintf("%s.txt",$3)
}

If so, how can I adapt this code to give me only a random subset of matched lines for each array value? Is there a way to redirect the match specified by $3 in A to gshuf -n 2 before printing?

vgersh99 · April 7, 2015, 10:25am

Pipe the awk output into whatever you want to get your random output.

pathunkathunk · April 7, 2015, 10:35am

I know that I could pipe the output file and get my desired output. But I would like to accomplish it all at once if possible--find matching lines in file1.txt based on the labels in labels.txt, and print a random subset of two of these matching lines to an output file named according to the label.
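
One way this could be done in a single awk pass is reservoir sampling, keeping up to k lines per label and replacing them at random as more matches arrive. The following is a sketch under assumptions, not a tested solution for the real data: it assumes the default whitespace field splitting puts the label in column 9, and the sample rows, the row helper, and all array names are invented for illustration.

```shell
# Hypothetical scratch data shaped like file1.txt: tab-separated, label in column 9.
cd "$(mktemp -d)" || exit 1
row() { printf 'H\t0\t300\t100.0\t+\t0\t0\t150M\t%s;barcodelabel=%s;\tOTU_1;size=1;\n' "$1" "$2"; }
{ row r1 c8; row r2 c12; row r3 g10; row r4 c12; row r5 c12; row r6 g10; } > file1.txt
printf 'c12\ng10\n' > labels.txt

awk -v k=2 '
BEGIN { srand() }
FNR == NR { want[$1]; next }          # first file: remember the wanted labels
{
    split($9, T, /[=;]/)              # T[3] is the text after "barcodelabel="
    lab = T[3]
    if (!(lab in want)) next
    n = ++seen[lab]                   # matches seen so far for this label
    if (n <= k) {
        keep[lab, n] = $0             # fill the reservoir first
    } else {
        j = int(rand() * n) + 1       # uniform integer in 1..n
        if (j <= k) keep[lab, j] = $0 # replace a kept line with probability k/n
    }
}
END {
    for (lab in seen) {
        m = (seen[lab] < k) ? seen[lab] : k
        out = lab ".txt"
        for (i = 1; i <= m; i++) print keep[lab, i] > out
        close(out)
    }
}' labels.txt file1.txt
```

Setting k=200 would correspond to the real task. A label with no matching lines gets no output file in this sketch.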

RudiC · April 7, 2015, 11:07am

Do you want random lines from the input file? Then shuf the input file before feeding it into the awk proposals above.
If you always want the same lines, but presented in random order, shuf the output files later on.
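
The first option might look like the sketch below: shuffle file1.txt, then let the earlier awk one-liner keep the first two lines it meets for each label, reading the shuffled stream from standard input via the "-" operand. The sample rows, the row helper, and the shuf_cmd variable are assumptions made for illustration.

```shell
# Hypothetical scratch data shaped like file1.txt: tab-separated, label in column 9.
cd "$(mktemp -d)" || exit 1
row() { printf 'H\t0\t300\t100.0\t+\t0\t0\t150M\t%s;barcodelabel=%s;\tOTU_1;size=1;\n' "$1" "$2"; }
{ row r1 c8; row r2 c12; row r3 g10; row r4 c12; row r5 c12; row r6 g10; } > file1.txt
printf 'c12\ng10\n' > labels.txt

# gshuf is the Homebrew name on macOS; GNU/Linux systems usually call it shuf.
shuf_cmd=$(command -v gshuf || command -v shuf)

# labels.txt is read with the default FS; FS is switched to [=;] before stdin ("-") is read.
"$shuf_cmd" file1.txt |
    awk 'FNR == NR { A[$1]; next }
         $3 in A && A[$3] < limit { print > ($3 ".txt"); A[$3]++ }' \
        limit=2 labels.txt FS='[=;]' -
```

Because the input order is random, the first two matches per label are a random pair; only one pass over the shuffled data is needed.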

vgersh99 · April 7, 2015, 11:21am

pathunkathunk:

how can I adapt this code to give me only a random subset of matched lines for each array value? Is there a way to redirect the match specified by $3 in A to gshuf -n 2 before printing?

or something along these lines (without pre/post randomization) - not tested

BEGIN {
   srand()
}
function genrand(n)
{
   return (int(n*rand())+1);
}

FNR==NR {
    A[$1]; # Build array from the first column of the first file (labels.txt)
    next   # Skip all following blocks and process the next line
}
$3 in A { # With FS='[=;]', check whether the label ($3) from the second file is in the array
  a[++c]=$0
  file[c]=sprintf("%s.txt",$3)
}
END {
for(i=1;i<=2;i++) {
  n=genrand(c)
  print a[n] > file[n]
}