The awk below uses the tab-delimited file and reformats each line based on one of three conditions (rules). The 3 rules are for deletion (id1 and id3), snv (id2), and insertion (id4 and id5). I have included all possible combinations of lines from my actual data, which is very large. The awk includes comments but does not produce the desired output. I think my logic is correct but maybe I am missing something or have not included something. Thank you :).
file (tab-delimited)
id1 1 101702547 AG A
id2 15 48782104 G C
id3 1 116268178 GAAA G
id4 1 116268178 GAAA GAAAA
id5 2 228197304 A AATCC
current output
id1 1 101702548 101702547 -
id3 1 116268179 116268178 -
desired output (tab-delimited)
id1 1 101702548 101702549 G -
id2 15 48782104 48782104 G C
id3 1 116268179 116268182 AAA -
id4 1 116268179 116268179 - A
id5 2 228197305 228197305 - TCC
rules (field numbers refer to the input file)
line1: since the length of $4 is greater than the length of $5, the value matching $5 is removed from $4 and a - is placed in $5. The value in $3 has 1 added to it, and $3 plus the original length of $4 is placed in front of $4 (condition 1)
line2: since the length of $4 and the length of $5 are both equal to 1, the value in $3 is copied in front of $4 (condition 2)
line3: since the length of $4 is greater than the length of $5, the value matching $5 is removed from $4 and a - is placed in $5. The value in $3 has 1 added to it, and $3 plus the original length of $4 is placed in front of $4 (condition 1)
line4: since the length of $4 is less than the length of $5, the value matching $4 is removed from $5 and a - is placed in $4. The value in $3 has 1 added to it and the result is copied in front of $4 (condition 3)
line5: since the length of $4 is less than the length of $5, the value matching $4 is removed from $5 and a - is placed in $4. The value in $3 has 1 added to it and the result is copied in front of $4 (condition 3)
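To make the "matching value is removed" part of the rules concrete, here is a small sketch of that step on its own. It assumes the two alleles are the 4th and 5th fields, as in the sample file, and trim_alleles is a name I made up for illustration. Note that gsub removes every occurrence of the shorter allele, not only a leading one, which is how id5 ends up as TCC.

```shell
# trim_alleles REF ALT: hypothetical helper, prints the trimmed REF and ALT
# separated by a tab. Not part of the original script.
trim_alleles() {
  awk -v ref="$1" -v alt="$2" 'BEGIN {
    OFS = "\t"
    if (length(ref) > length(alt)) {          # deletion: keep what is left of ref
      gsub(alt, "", ref); alt = "-"
    } else if (length(ref) < length(alt)) {   # insertion: keep what is left of alt
      gsub(ref, "", alt); ref = "-"
    }                                         # equal lengths (snv): left unchanged
    print ref, alt
  }'
}

trim_alleles AG A         # id1: G and -
trim_alleles G C          # id2: G and C
trim_alleles GAAA G       # id3: AAA and -
trim_alleles GAAA GAAAA   # id4: - and A
trim_alleles A AATCC      # id5: - and TCC
```

The alleles are passed with -v only to keep the sketch short; in the real script they would come from the fields of each line.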
awk
awk 'BEGIN{FS=OFS="\t"} # define fs and output
FNR==NR{ # process each field in each line of file
    if(length($5) > length($6)) { # condition 1 for deletion
        gsub($5,"",$6) # removing matching
        print $1,$2,$3+1,$3+length($4),"-" # print desired output
        next
    }
    if(length($5) == length($6)) { # condition 2 for snv
        print $1,$2,$3,$3,$5,$6 # print desired output
        next
    }
    if(length($5) < length($6)) { # condition 3 for insertion
        gsub($5,"",$6) # removing matching
        print $1,$2,$3+1,"-",$3+1 # print desired output
    }
}' file
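One possible direction, offered only as a sketch: the sample file has five fields, so the two alleles would be $4 and $5 rather than $5 and $6, and the desired output has six columns, so each print needs both alleles. The version below follows that assumption. reformat_variants is a hypothetical wrapper name, and the FNR==NR test is dropped because only one file is read.

```shell
# reformat_variants: hypothetical wrapper name; expects the 5-column
# tab-delimited input (id, chr, pos, ref, alt) on stdin or as file arguments.
reformat_variants() {
  awk 'BEGIN { FS = OFS = "\t" }               # tab in, tab out
  {
    pos = $3; ref = $4; alt = $5
    if (length(ref) > length(alt)) {           # condition 1: deletion
      end = pos + length(ref)                  # end uses the original length of ref
      gsub(alt, "", ref)                       # drop the bases matching alt
      print $1, $2, pos + 1, end, ref, "-"
    } else if (length(ref) == length(alt)) {   # condition 2: snv
      print $1, $2, pos, pos, ref, alt
    } else {                                   # condition 3: insertion
      gsub(ref, "", alt)                       # drop the bases matching ref
      print $1, $2, pos + 1, pos + 1, "-", alt
    }
  }' "$@"
}

# usage on three of the sample lines
printf 'id1\t1\t101702547\tAG\tA\nid2\t15\t48782104\tG\tC\nid4\t1\t116268178\tGAAA\tGAAAA\n' |
  reformat_variants
```

On the real data this would be called as reformat_variants file. The end position is computed before gsub runs, because gsub shortens ref and would otherwise change the length used in the sum.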