awk to create separate files but not include specific field in output

cmccabe · May 9, 2018, 1:44pm

I am trying to use awk to create (in this example) 3 separate text files, one for each unique id in $1 of file that starts with the pattern aa. The contents of each row are used to populate each text file, except for $1, which is not needed. It seems I am close but not quite there. Thank you :).

file (tab-delimited)

aa1110-0	12	47259533	47259533	G	A	Comment:heterozygous_snv
aa1110-1	11	23892795	23892799	G	C	Comment:heterozygous_snv
	2	7581601	7581601	T	A	Comment:heterozygous_snv
aa1110-2	1	237837422	237837422	C	TTC	Comment:substitution
	3	7583892	7583892	G	A	Comment: heterozygous snv
		19	23892788	23892799	G	-	Comment:deletion

awk

awk -F'\t' '/^aa/{                     # if line starts with aa
        if(!w)                          # if negate of w is true
           f=sprintf($1"%d.txt",++n);   # pre increment n, and set up variable f 
        w=1;                            # set variable w = 1
        print >f;                       # write record/row/line to file
        next                            # go to next line
     }
     {                                  # for which does not start with aa  
        close(f);                       # close file
        w=0                             # set w = 0 for next line with aa use newfile
     }
' file

current output is two files with each row in them, but with $1 included as well
Here is one:

aa1110-0	12	47259533	47259533	G	A	Comment:heterozygous_snv
aa1110-1	11	23892795	23892799	G	C	Comment:heterozygous_snv

awk

awk '{for(i=2;i<=NF;i++){printf "%s ", $i >> $1".txt"};printf "\n" >> $1".txt"; close($1".txt")}' file

current output is three files with no $1 in them but only one line in them.
Here is the same file as above:

12	47259533	47259533	G	A	Comment:heterozygous_snv

desired output (tab-delimited)

aa1110-0.txt
12	47259533	47259533	G	A	Comment:heterozygous_snv

aa1110-1.txt
11	23892795	23892799	G	C	Comment:heterozygous_snv
2	7581601	7581601	T	A	Comment:heterozygous_snv

aa1110-2.txt
1	237837422	237837422	C	TTC	Comment:substitution
3	7583892	7583892	G	A	Comment:heterozygous_snv
19	23892788	23892799	G	-	Comment:deletion

Chubler_XL · May 9, 2018, 10:13pm

Try this:

awk -F'\t' '
/^aa/{                             # if line starts with aa
   if(f) close(f)                  # close already open file
   f=$1 ".txt"                     # name the output file after field #1
}
f {                                # if file name created
   $1=""                           # blank field #1
   $0=substr($0, 2)                # strip blank #1 field
   print >f;                       # write record/row/line to file
}
' OFS='\t' file
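
A small aside on why both the `$1=""` and the `$0=substr($0, 2)` lines are needed: assigning to any field makes awk rebuild $0 by joining the fields with OFS, so blanking field 1 leaves a leading separator behind, and the substr call then removes it. Here is a minimal sketch of that behaviour, assuming a made-up three-field record rather than a row from the real file. The variable name `out` and the square brackets are only there to make the leading tab visible, and OFS is set with -v here, which should behave the same as the trailing OFS='\t' operand in the script above.

```shell
# Hypothetical one-record input; "out" is just a name chosen for this demo.
out=$(printf 'aa1110-0\t12\t47259533\n' |
awk -F'\t' -v OFS='\t' '{
    $1 = ""                 # assigning to a field rebuilds $0 using OFS
    print "[" $0 "]"        # the record now begins with a leading tab
    $0 = substr($0, 2)      # drop that single leading character
    print "[" $0 "]"        # the record now begins with the old field 2
}')
printf '%s\n' "$out"
```

If OFS were left at its default (a single space), the rebuilt record would be joined with spaces instead of tabs, which is why the script sets OFS explicitly.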

Don_Cragun · May 9, 2018, 10:22pm

The following seems to do what I think you want. It assumes that you don't want extra whitespace characters added to the ends of your output lines, that you want <tab>-delimited output from your <tab>-delimited input, and that you just want the contents of field 1 with .txt appended as the filename for each output file (with no sequence numbering added to the filenames):

awk '
BEGIN {	FS = OFS = "\t"
}
/^aa/ {	if(f != "")
		close(f)
	f = $1 ".txt"
}
{	for(i = 2; i <= NF; i++)
		printf("%s%s", $i, (i == NF) ? ORS : OFS) > f
}' file

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .
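
For anyone who wants to try the approach end to end, here is a rough, self-contained sketch. It makes several assumptions that are not from the thread: the directory name `awk_split_demo`, the file name `sample.tsv`, and the `dir` variable are all made up for the demo, and the sample rows are modelled on the posted input with every continuation row starting with exactly one tab (an empty field 1). It also adds a small guard (`f != ""`) so that any rows appearing before the first aa line are skipped rather than written to an unnamed file.

```shell
# Sketch only: "awk_split_demo" and "sample.tsv" are hypothetical names.
dir=awk_split_demo
mkdir -p "$dir"

# Sample rows modelled on the thread's input; continuation rows begin
# with a single tab, so their first field is empty.
{
  printf 'aa1110-0\t12\t47259533\t47259533\tG\tA\tComment:heterozygous_snv\n'
  printf 'aa1110-1\t11\t23892795\t23892799\tG\tC\tComment:heterozygous_snv\n'
  printf '\t2\t7581601\t7581601\tT\tA\tComment:heterozygous_snv\n'
  printf 'aa1110-2\t1\t237837422\t237837422\tC\tTTC\tComment:substitution\n'
  printf '\t3\t7583892\t7583892\tG\tA\tComment:heterozygous_snv\n'
  printf '\t19\t23892788\t23892799\tG\t-\tComment:deletion\n'
} > "$dir/sample.tsv"

awk -v dir="$dir" '
BEGIN { FS = OFS = "\t" }
/^aa/ {                          # a new id starts a new output file
    if (f != "") close(f)        # close the previous output file, if any
    f = dir "/" $1 ".txt"        # file name is field 1 plus .txt
}
f != "" {                        # skip rows seen before the first id
    for (i = 2; i <= NF; i++)    # print fields 2..NF, tab-separated
        printf("%s%s", $i, (i == NF) ? ORS : OFS) > f
}' "$dir/sample.tsv"

# Show how many rows ended up in each output file.
wc -l "$dir"/aa*.txt
```

One design point to be aware of: because each file is opened with > and closed when the next id appears, an id that shows up again later in the input would overwrite its earlier file. If that can happen in your data, switching to >> (and removing old output files first) is one possible workaround.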

cmccabe · May 10, 2018, 7:33am

Thank you both very much :).