Print . in blank fields to prevent fields from shifting

The code below, kindly provided by @Don Cragun, works great and produces the current output shown further down. Because some of the printed fields can be blank, the fields after them are shifted. I cannot seem to add a . to the blank fields as in the desired output. Basically, if there is nothing in the field, print . and otherwise print what the script matches. Thank you :).

script

for file in /home/cmccabe/Desktop/concordance/comparison/update/*.txt ; do
    file1=${file##*/}    # Strip off directory
    getprefix=${file1%%_*.txt}
    file1=$(printf '%s\n' "/home/cmccabe/Desktop/concordance/reference/files/${file1%%_*.txt}_"*.txt) # look for matching file
    if [[ -f "$file1" ]]
    then
          awk '
BEGIN {FS = OFS = "\t"
}
NR == 1 {
outfile = FILENAME
}
FNR == NR {
o[i[++ic] = $1 OFS $2 OFS $3] = $0
}
{if($2 OFS $4 OFS $5 in o)
o[$2 OFS $4 OFS $5] = $1 OFS $2 OFS $4 OFS $5 OFS $6 OFS $7 OFS $8 OFS $9 OFS $10 OFS $11 OFS $12 OFS $13 OFS $14 OFS $15 OFS $16 OFS $17 OFS $18 OFS $19           }   
END {for(j = 1; j <= ic; j++)
print o[i[j]] > outfile
}' $file $file1
   fi
done

current output

Missing in IDP but found in Reference:                                         
2    166848646    G    A    exonic    SCN1A    68    13    16;20    0;0    17;15    0;0    0;0    0;0        c.[5139C>T]+[=]    52.94        Not low     found 
12    52200340    A    C    exonic    SCN8A    4129    28.3    1560;1672    413;453    0;0    0;0    0;2    31;0        c.[5070A>C]+[=]    20.97        Not low     Not found 
13    77570076    -    A    exonic    CLN5    2762    26.6    2060;702    0;0    0;0    0;0    2050;696    0;0        c.526_527insA    99.42    TP    Not low     Not found 
7    148106478    -    GT    intronic    CNTNAP2    4051    28.5    0;1    0;0    0;0    2220;1829    1085;887    0;1    rs60451214    c.3716-5_3716-4insGT    48.68        Not low     Not found 
9    138678036    TGCCC    -    intronic    KCNT1    834    23.1    0;0    0;0    0;31    0;1    0;0    0;802    rs141359570    c.3178-7_3178-3delTGCCC    96.16        Not low     Not found 
7    148106476    -    TT    intronic    CNTNAP2    4052    28.8    0;0    5;0    0;0    2221;1826    1081;884    0;0    rs61232377    c.3716-7_3716-6insTT    48.49        Not low     Not found 
2    166245425    C    T    exonic    SCN2A    49    12.6    0;0    13;9    0;0    18;9    0;0    0;0        c.[5109C>T]+[=]    55.1        Not low     found 

desired output

Missing in IDP but found in Reference:                                                                             
CHR    POS    REF    ALT    FUNC    GENE    COVERAGE    PHRED    A[#F,#R]    C[#F,#R]    G[#F,#R]    T[#F,#R]    INS[#F,#R]    DEL[#F,#R]    SNP    MUT    FREQ    SANGER    REGION    TVC 
2    166848646    G    A    exonic    SCN1A    68    13    16;20    0;0    17;15    0;0    0;0    0;0    .    c.[5139C>T]+[=]    52.94    .    Not low     found 
12    52200340    A    C    exonic    SCN8A    4129    28.3    1560;1672    413;453    0;0    0;0    0;2    31;0    .    c.[5070A>C]+[=]    20.97    .    Not low     Not found 
13    77570076    -    A    exonic    CLN5    2762    26.6    2060;702    0;0    0;0    0;0    2050;696    0;0    .    c.526_527insA    99.42    TP    Not low     Not found 
7    148106478    -    GT    intronic    CNTNAP2    4051    28.5    0;1    0;0    0;0    2220;1829    1085;887    0;1    rs60451214    c.3716-5_3716-4insGT    48.68    .    Not low     Not found 
9    138678036    TGCCC    -    intronic    KCNT1    834    23.1    0;0    0;0    0;31    0;1    0;0    0;802    rs141359570    c.3178-7_3178-3delTGCCC    96.16    .    Not low     Not found 
7    148106476    -    TT    intronic    CNTNAP2    4052    28.8    0;0    5;0    0;0    2221;1826    1081;884    0;0    rs61232377    c.3716-7_3716-6insTT    48.49    .    Not low     Not found 
2    166245425    C    T    exonic    SCN2A    49    12.6    0;0    13;9    0;0    18;9    0;0    0;0    .    c.[5109C>T]+[=]    55.1    .    Not low     found 

Hello cmccabe,

Since you haven't shown us a sample Input_file, I haven't tested this. Could you please run the following and let us know how it goes?

for file in /home/cmccabe/Desktop/concordance/comparison/update/*.txt ; do
    file1=${file##*/}    # Strip off directory
    getprefix=${file1%%_*.txt}
    file1=$(printf '%s\n' "/home/cmccabe/Desktop/concordance/reference/files/${file1%%_*.txt}_"*.txt) # look for matching file
    if [[ -f "$file1" ]]
    then
          awk '
BEGIN {FS = OFS = "\t"
}
NR == 1 {
outfile = FILENAME
}
FNR == NR {
o[i[++ic] = $1 OFS $2 OFS $3] = $0
}
{for(i=1;i<=19;i++)
{if(!$i){$i="."}}
}
{if($2 OFS $4 OFS $5 in o)
o[$2 OFS $4 OFS $5] = $1 OFS $2 OFS $4 OFS $5 OFS $6 OFS $7 OFS $8 OFS $9 OFS $10 OFS $11 OFS $12 OFS $13 OFS $14 OFS $15 OFS $16 OFS $17 OFS $18 OFS $19           }   
END {for(j = 1; j <= ic; j++)
print o[i[j]] > outfile
}' $file $file1
   fi
done

The lines I added are the for loop just before the if statement.

Thanks,
R. Singh


Hi Ravinder,
I don't remember which thread the code shown in post #1 was addressing so I don't have any sample input either and I haven't tested your code. Note, however, that the code:

if(!$i){$i="."}

will not only change field # i to a <period> if the field is empty, it will also change it to a period if the field contains a numeric string that evaluates to zero (e.g., 0, 0.000, and 0e+10). For cases like this, the following would be safer:

{if($i == "")$i = "."}
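
To make the difference between the two tests concrete, here is a minimal sketch. The sample record and the helper name fill_demo are made up for illustration and are not taken from the files in this thread.

```shell
# fill_demo is a hypothetical helper: it applies one of the two tests to a
# single tab-separated record holding a zero, an empty field, and a letter.
fill_demo() {
    printf '0\t\tA\n' |
    awk -v mode="$1" 'BEGIN { FS = OFS = "\t" }
    {
        for (k = 1; k <= NF; k++) {
            if (mode == "truthy" && !$k)      $k = "."   # also replaces 0
            if (mode == "empty"  && $k == "") $k = "."   # only empty fields
        }
        print
    }'
}

fill_demo truthy   # the zero in field 1 is replaced as well
fill_demo empty    # the zero in field 1 is kept
```

With the !$k test the zero is lost along with the empty field; with the $k == "" test only the empty field becomes a period.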

Sorry about that I was leaving for a weekend trip. Anyway here are the files:

file that is updated:

Missing in IDP but found in Reference:
2	166848646	G	A	exonic	SCN1A	68	13	16;20	0;0	17;15	0;0	0;0	0;0	c.[5139C>T]+[=]	52.94	Not low
12	52200340	A	C	exonic	SCN8A	4129	28.3	1560;1672	413;453	0;0	0;0	0;2	31;0	c.[5070A>C]+[=]	20.97	Not low
13	77570076	-	A	exonic	CLN5	2762	26.6	2060;702	0;0	0;0	0;0	2050;696	0;0	c.526_527insA	99.42	TP	Not low
7	148106478	-	GT	intronic	CNTNAP2	4051	28.5	0;1	0;0	0;0	2220;1829	1085;887	0;1	rs60451214	c.3716-5_3716-4insGT	48.68	Not low
9	138678036	TGCCC	-	intronic	KCNT1	834	23.1	0;0	0;0	0;31	0;1	0;0	0;802	rs141359570	c.3178-7_3178-3delTGCCC	96.16	Not low
7	148106476	-	TT	intronic	CNTNAP2	4052	28.8	0;0	5;0	0;0	2221;1826	1081;884	0;0	rs61232377	c.3716-7_3716-6insTT	48.49	Not low
2	166245425	C	T	exonic	SCN2A	49	12.6	0;0	13;9	0;0	18;9	0;0	0;0	c.[5109C>T]+[=]	55.1	Not low

current output:

Missing in IDP but found in Reference: has no . so fields shift when blank
CHR	POS	REF	ALT	FUNC	GENE	COVERAGE	PHRED	A[#F,#R]	C[#F,#R]	G[#F,#R]	T[#F,#R]	INS[#F,#R]	DEL[#F,#R]	SNP	MUT	FREQ	SANGER	REGION
 TVC 
2	166848646	G	A	exonic	SCN1A	68	13	16;20	0;0	17;15	0;0	0;0	0;0	c.[5139C>T]+[=]	52.94	Not low	 found
12	52200340	A	C	exonic	SCN8A	4129	28.3	1560;1672	413;453	0;0	0;0	0;2	31;0	c.[5070A>C]+[=]	20.97	Not low	 Not found
13	77570076	-	A	exonic	CLN5	2762	26.6	2060;702	0;0	0;0	0;0	2050;696	0;0	c.526_527insA	99.42	TP	Not low	 Not found
7	148106478	-	GT	intronic	CNTNAP2	4051	28.5	0;1	0;0	0;0	2220;1829	1085;887	0;1	rs60451214	c.3716-5_3716-4insGT	48.68	Not low	 Not found
9	138678036	TGCCC	-	intronic	KCNT1	834	23.1	0;0	0;0	0;31	0;1	0;0	0;802	rs141359570	c.3178-7_3178-3delTGCCC	96.16	Not low	 Not found
7	148106476	-	TT	intronic	CNTNAP2	4052	28.8	0;0	5;0	0;0	2221;1826	1081;884	0;0	rs61232377	c.3716-7_3716-6insTT	48.49	Not low	 Not found
2	166245425	C	T	exonic	SCN2A	49	12.6	0;0	13;9	0;0	18;9	0;0	0;0	c.[5109C>T]+[=]	55.1	Not low	 found

desired output: tab-delimited with . if the field is blank

Missing in IDP but found in Reference:			
CHR	POS	REF	ALT	FUNC	GENE	COVERAGE	PHRED	"A[#F,#R]"	"C[#F,#R]"	"G[#F,#R]"	"T[#F,#R]"	"INS[#F,#R]"	"DEL[#F,#R]"	SNP	MUT	FREQ	SANGER	REGION	TVC
2	166848646	G	A	exonic	SCN1A	68	13	16;20	0;0	17;15	0;0	0;0	0;0	.	c.[5139C>T]+[=]	52.94	.	Not low	 found
12	52200340	A	C	exonic	SCN8A	4129	28.3	1560;1672	413;453	0;0	0;0	0;2	31;0	.	c.[5070A>C]+[=]	20.97	.	Not low	 Not found
13	77570076	-	A	exonic	CLN5	2762	26.6	2060;702	0;0	0;0	0;0	2050;696	0;0	.	c.526_527insA	99.42	TP	Not low	 Not found
7	148106478	-	GT	intronic	CNTNAP2	4051	28.5	0;1	0;0	0;0	2220;1829	1085;887	0;1	rs60451214	c.3716-5_3716-4insGT	48.68	.	Not low	 Not found
9	138678036	TGCCC	-	intronic	KCNT1	834	23.1	0;0	0;0	0;31	0;1	0;0	0;802	rs141359570	c.3178-7_3178-3delTGCCC	96.16	.	Not low	 Not found
7	148106476	-	TT	intronic	CNTNAP2	4052	28.8	0;0	5;0	0;0	2221;1826	1081;884	0;0	rs61232377	c.3716-7_3716-6insTT	48.49	.	Not low	 Not found
2	166245425	C	T	exonic	SCN2A	49	12.6	0;0	13;9	0;0	18;9	0;0	0;0	.	c.[5109C>T]+[=]	55.1	.	Not low	 found

I updated the code with:

{for(i=1;i<=19;i++)
{if($i == "")$i = "."}
}

but that seemed to remove the last 6 fields from the output. Thank you :).

Hello cmccabe,

Could you please confirm that your Input_file is TAB-delimited? The example you posted doesn't seem to have TABs as delimiters.

Thanks,
R. Singh


Hi RavinderSingh13,

The input file is space-delimited but the output is tab-delimited. Thank you :).

Hello cmccabe,

If I am not wrong, the code was written for a TAB-delimited Input_file only: as you can see, we set FS to TAB in the BEGIN section, so it cannot read the fields correctly if there are no TABs in the Input_file. To see why it will not work with spaces, take the example line 1 2 3 4 5 6 and first run the code with TAB as the field separator, as follows.

echo "1              2 3 4 5   6" | awk -F"\t" '{for(i=1;i<=NF;i++){print i "---->" $i}}'
1---->1              2 3 4 5   6

As you can see, there is no TAB in the line, so the whole line is printed as a single field.
Now let's test it without setting the TAB delimiter, as follows.

echo "1              2 3 4 5   6" | awk  '{for(i=1;i<=NF;i++){print i"-->"$i}}'
1-->1
2-->2
3-->3
4-->4
5-->5
6-->6

In the above outputs, the number before the arrow is the field number and the text after the arrow is the field's value. So one thing you could try: if none of the fields in your Input_file have spaces in their values, you could substitute the spaces with TABs and then run the above code. As a hint, you could use awk's gsub function, for example gsub(/ +/,"\t",$0), to do so.
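
As a rough sketch of that hint (the helper name spaces_to_tabs is made up here, and it assumes no field value contains a space, which does not hold for values such as "Not low" in this thread):

```shell
# spaces_to_tabs is a hypothetical helper name. It assumes that no field
# value contains a space of its own.
spaces_to_tabs() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        gsub(/ +/, "\t")      # acts on $0; awk re-splits the record on FS
        print NF, $0          # field count, then the converted record
    }'
}

echo "1              2 3 4 5   6" | spaces_to_tabs
```

Note that every run of spaces collapses into a single TAB, so a field that was blank in the original data cannot be recovered this way; the conversion only helps when each record already has the full number of fields.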

Kindly try it and let us know how it goes.

Thanks,
R. Singh


Here is the output I get using the F113.txt attached. Thank you :).

for file in /home/cmccabe/Desktop/concordance/comparison/update/*.txt ; do
    file1=${file##*/}    # Strip off directory
    getprefix=${file1%%_*.txt}
    file1=$(printf '%s\n' "/home/cmccabe/Desktop/concordance/reference/files/${file1%%_*.txt}_"*.txt) # look for matching file
    if [[ -f "$file1" ]]
    then
          awk '
BEGIN {FS = OFS = "\t"
}
NR == 1 {
outfile = FILENAME
}
FNR == NR {
o[i[++ic] = $1 OFS $2 OFS $3] = $0
}
{for(i=1;i<=19;i++)
{if($i == "")$i = "."}
}
{if($2 OFS $4 OFS $5 in o)
o[$2 OFS $4 OFS $5] = $1 OFS $2 OFS $4 OFS $5 OFS $6 OFS $7 OFS $8 OFS $9 OFS $10 OFS $11 OFS $12 OFS $13 OFS $14 OFS $15 OFS $16 OFS $17 OFS $18 OFS $19           }   
END {for(j = 1; j <= ic; j++)
print o[i[j]] > outfile
}' $file $file1
   fi
done
awk: cmd. line:10: (FILENAME=/home/cmccabe/Desktop/concordance/comparison/update/F113.txt FNR=1) fatal: attempt to use array `i' in a scalar context

output

Missing in IDP but found in Reference:    
CHR    POS    REF    ALT    FUNC    GENE    COVERAGE    PHRED    A[#F,#R]    C[#F,#R]    G[#F,#R]    T[#F,#R]    INS[#F,#R]    DEL[#F,#R]    SNP    MUT    FREQ    SANGER    REGION
74992800    A    G    Not low     Not found
100794363    C    T    Not low     Not found
189931518    A    -    Not low     Not found

Hello cmccabe,

There are 2 points here.
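1st: You are getting the following error:

fatal: attempt to use array `i' in a scalar context

This is because you are using i as an ordinary (scalar) variable in the for loop while the script also uses i as an array, in o[i[++ic] = ...] and in the line print o[i[j]] > outfile .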
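
One way around the clash is to give the loop its own variable. The following is only a sketch on a made-up record: the helper name fill_blanks and the loop variable k are chosen for illustration and are not part of the original script.

```shell
# fill_blanks is a hypothetical helper. The loop variable is named k so
# that it no longer clashes with the array i[] used in the original script.
fill_blanks() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        i[++ic] = $1               # i is an array, as in the original script
        for (k = 1; k <= NF; k++)  # k is a scalar; using i here is the fatal error
            if ($k == "") $k = "."
        print
    }
    END { print ic " record(s), first key: " i[1] }'
}

printf 'chr2\t\tG\t\n' | fill_blanks
```

The loop runs to NF here rather than to a fixed 19. With a fixed upper bound, awk would create the missing fields at the end of any shorter record and fill them with periods, which may or may not be what is wanted.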

2nd: As I explained in my previous post, if your Input_file is not TAB-delimited and has only spaces as delimiters, it is quite difficult to find out which fields are missing in a line/record. With the default field separator, awk treats any run of spaces as a single separator, so we can find out whether a line/record has more or fewer fields than expected, but not which fields are missing, unless there is a rule such as: the 1st field is a string, the 2nd field is a number, and so on.

Thanks,
R. Singh


Just an observation: the reason you gave for inserting a dot (".") instead of a blank field was to keep the output from shifting rightwards if the field is empty. Wouldn't it be easier in this case to simply employ the printf function instead of the print statement? Consider:

printf "%-10s%-10s%-10s\n" "a" "" "b"
printf "%-10s%-10s%-10s\n" "c" "d" ""
printf "%-10s%-10s%-10s\n" "" "e" "fghi"

will result in something like:

a                   b         
c         d                   
          e         fghi      

Which would be what you wanted in first place, no?
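
The same idea inside awk might look like the sketch below. The helper name print_row, the width of 10, and the trailing | (added only to make the padding visible) are illustrative choices, not something from the original script.

```shell
# print_row is a hypothetical helper: it pads three tab-separated input
# fields into left-justified columns of 10 characters each.
print_row() {
    awk 'BEGIN { FS = "\t" }
    { printf "%-10s%-10s%-10s|\n", $1, $2, $3 }'
}

printf 'a\t\tb\n' | print_row
```

Unlike the shell printf, awk's printf needs commas between its arguments. The columns stay aligned only as long as no value is wider than the chosen width.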

I hope this helps.

bakunin


Thank you all :)