awk to update file with numerical difference if condition is met

cmccabe · March 31, 2017, 9:36am

In the file1 below if $9 and $12 are . (dot) then the value in $8 of file1 is used as a key (exact match) to lookup in each $2 of file2 , when a match is found then the value of $4
in file1 is used to look for a range match within +/- 50 using the values in $4 and after in file2 . The number of fields can be variable but will always start at $4 .
For example, ISG15 has 2 fields in it with coordinates starting at $4 and ending at $5 . CR2 has 19 coordinates in it starting at $4 ending at $22 . The value in $1 of file2
tells you how many coordinates there are but the start or first will always be in $4 .

There will only be one range match but if the number is closer to the first value before the - (hyphen) in it the $9 of file1 is updated from a . to the numerical difference between the two numbers with a - (minus) in front. If the number is closer to the second value after the - (hyphen) in it the $9 of
file1 is updated from a . (dot) to the numerical difference between the two numbers with a + (plus) in front. However is the calculated difference is greater than 50 , then >50 is printed in $9 of file1 .

If $9 or $12 of file1 have a value other then . (dot) in them then that line is skipped (nothing happens) and the next line is processed. In file1 lines 2 and 3 are skipped. The awk below will identify these lines and print them, but I am not sure how to do the rest and need some expert help. Thank you :).

file1 tab-delimited

R_Index	Chr	Start	End	Ref	Alt	Func.IDP.refGene	Gene.IDP.refGene	GeneDetail.IDP.refGene	Inheritence	ExonicFunc.IDP.refGene	AAChange.IDP.refGene
1	chr1	948846	948846	-	A	upstream	ISG15	.	.	.	.
2	chr1	948870	948870	C	G	UTR5	ISG15	NM_005101.3:c.-84C>G	.	.
3	chr1	949608	949608	G	A	exonic	ISG15	.	.	nonsynonymous SNV	ISG15:NM_005101.3:exon2:c.248G>A:p.S83N
4	chr1	949925	949925	C	T	downstream	ISG15	.	.	.	.
5	chr1	207646923	207646923	G	A	intronic	CR2	.	.	.	.
6	chr2	3653844	3653844	T	C	intronic	COLEC11	.	.	.	.
7	chr1	154562623	154562625	CCG	-	intronic	ADAR	.	.	.	.
8	chr1	948840	948840	-	C	upstream	ISG15	.	.	.	.

file2 space-delimited

2 ISG15 NM_005101.3 948846-948956 949363-949919
19 CR2 NM_001006658.2 207627644-207627821 207639870-207640257 207641871-207642060 207642144-207642244 207642494-207642577 207643039-207643447 207644084-207644261 207644341-207644432 207644767-207644844 207646116-207646524 207647145-207647230 207647585-207647668 207648168-207648561 207649578-207649764 207651229-207651415 207652601-207652625 207653322-207653398 207658808-207658917 207662486-207663240
6 COLEC11 NM_024027.4 3642421-3642758 3651904-3652060 3660900-3660972 3687867-3687921 3691033-3691129 3691316-3692234
15 ADAR NM_001111.4 154554533-154557519 154557692-154557820 154558228-154558341 154558656-154558839 154560600-154560734 154561026-154561149 154561844-154561938 154562232-154562404 154562737-154562885 154569280-154569414 154569598-154569743 154570303-154570452 154570877-154571061 154573516-154575102 154580467-154580724

desired updated file1

R_Index	Chr	Start	End	Ref	Alt	Func.IDP.refGene	Gene.IDP.refGene	GeneDetail.IDP.refGene	Inheritence	ExonicFunc.IDP.refGene	AAChange.IDP.refGene
1	chr1	948846	948846	-	A	upstream	ISG15	0	.	.	.
2	chr1	948870	948870	C	G	UTR5	ISG15	NM_005101.3:c.-84C>G	.	.   .
3	chr1	949608	949608	G	A	exonic	ISG15	.	.	nonsynonymous SNV	ISG15:NM_005101.3:exon2:c.248G>A:p.S83N
4	chr1	949925	949925	C	T	downstream	ISG15	+6	.	.	.
5	chr1	207646923	207646923	G	A	intronic	CR2	>50	.	.	.
6	chr2	3653844	3653844	T	C	intronic	COLEC11	>50	.	.	.
7	chr1	154562623	154562625	CCG	-	intronic	ADAR	>50	.	.	.
8	chr1	948840	948840	-	C	upstream	ISG15	-6	.	.	.

Description of updated file1

line1:file1 $9 updated to 0 because ISG15 is matched to line 1, $2 of file2 and the value in $4 of file1, 948846 is a exact match to the first cordinate in $4 before the - 
line2:not updated, skipped because $9 or $12 in file1 have a value other then . in them
line3:not updated, skipped because $9 or $12 in file1 have a value other then . in them
line4:file1 $9 updated to +6 because ISG15 is matched to line 1, $2 of file2 and the value in $4 of file1, 949925 is a range match to the second coordinate in $5 after the - 
line5:file1 $9 updated to >50 because CR2 is matched to line 2, $2 of file2 and the value in $4 of file1, 207646923 is a range match to the first coordinate in $14 before the - but the difference of 222 is > 50
line6:file1 $9 updated to >50 because COLEC11 is matched to line 3, $2 of file2 and the value in $4 of file1, 3653844 is a range match to the second coordinate in $2 after the - but the difference of 1784 is > 50
line7:file1 $9 updated to >50 because ADAR is matched to line 4, $2 of file2 and the value in $4 of file1, 154562625 is a range match to the second coordinate in $12 after the - but the difference of112  is > 50
line8: file1 $9 updated to -6 because ISG15 is matched to line 1, $2 of file2 and the value in $4 of file1, 948840 is a range match to the first coordinate in $4 before the -

awk

 awk -F'\t' -v OFS='\t' '{if ($9=="." && $12==".") print }' file1

Don_Cragun · April 1, 2017, 1:42am

There doesn't seem to be anything required in this thread that hasn't been done for you in one or more of the other 340 threads you have submitted in this forum.

Please put a little more effort into showing us that you have learned something from the suggestions we have provided during your membership here of more than two years.

cmccabe · April 1, 2017, 5:09am

I apologize, but have not done anything like this before. I do always check previous threads and look to try something but thos seems different. I will read through them again, but this seemed more complex than others, Thank you :).

Don_Cragun · April 1, 2017, 5:28am

You have seen examples that showed you how to read one file and gather information from that file to be used while processing a second file. You have seen examples that showed you how to split a field into subfields based on a subfield delimiter (in this case the minus sign). You have seen examples that showed you how to create arrays from fields in a line (or two arrays with one array containing data and a related array containing counts of elements in the first array). With more than two years of learning from the examples we have provided in 340 threads, we would like to think that you could put those pieces together to solve this problem on your own (or at least get close to it).

Please give it a try and, if you get stuck, show us what you have accomplished and explain what you can't get to work. We want to help you learn how to write your own awk scripts; not to act as your unpaid programming staff.

cmccabe · April 1, 2017, 9:25pm

I have learned quite a lot in the past couple of years. I guess that am just not too confident, but will try. In writing the detail of what I was trying to do, it seemed very complex. I will try to come up with a workable solution that will hopefully be a start. Thank you and others for all your help in making this scientist better able to handle large data sets and complex issues. I really appreciate it :).

---------- Post updated at 08:25 PM ---------- Previous update was at 04:41 AM ----------

The below awk is what I was able to come up with. I included comments on each line as well. The section in bold is where each condition is checked (that is where the distance is calculated)... I am also not sure if the split is done correctly and accounts for the possibility the coordinates are in multiple fields. I tried to follow the description to get the desired output. Thank you very much:).

awk '        # Run awk to process the following awk script...
BEGIN {    OFS = "\t"        # Set output field separator.
}
FNR == NR {                 # For each line in the 1st input file (file2)...
         A[$1] = $2     # Read each 2nd field in file2 into array A
                next    # process next line in file2
        }
        {
                for ( k in A )   # Read each line in file1 into Array k
                { 
                        if ( $9=="." && $12=="." == A[k] }  # only store file1 lines where field 9 and 12 are a . in them and match them to Array A store the results in updated Array k
                        
                F[k] = $2   # match Array k from file1 with 2nd field of file2 and store in Array k
                }
        
        split($4, a, /[-]/)    # Split the 4th field  of file1 into b[] with hyphen and as subfield separator
        s[++c] = a[1]        # Save the symbol found at the start of this line.
                # c is the current line number in the 1st file.
        m[c] = b[2]        # Save the minimum value found on this line. (left of hyphen)
        M[c] = b[3]        # Save the maximum value found on this line.  (right of hyphen)
        t[c] = $4        # Save the tag found on this line.
        next            # Skip the remaining steps in this script.
           }
        }
---- if then checks here?
        END {
                for ( k in F )     # read each K in variable F to update
                        print k, F[k]   # update each k value   
        
                 break          # break out of this for loop.
        }
}1                # Print the updated
' file2 file1    # End the awk script and name the input files to be processed.

Scrutinizer · April 1, 2017, 9:44pm

I think this case may be a bit more complex than earlier cases as it involves multi-dimensional arrays and looping over them. Try this and see if it works for you and compare it to your own code..

awk -v m=50 '
  NR==FNR {                         # For each line in the 1st input file (file2)...
    for(i=4; i<=NF; i++) {          # for all ranges in $4 and further
      split($i,F,/-/)               # split the range in a left field and a right field
      L[$2,i-3]=F[1]                # Assign them to multi-dimensional associative array L(eft)
      R[$2,i-3]=F[2]                # And R(ight), which will be needed in the second part
      N[$2]=NF-3                    # Record the number of ranges in associative array N for index $2 
    }
  } 

  $9=="." && $12=="." {             # For each line in the 2nd input file (file1) if both field 9 and 12 equal "."
    d=">50"                         # set the default difference to >50
    for(i=1; i<=N[$8]; i++) {       # loop over the number of ranges for key $8
      if((d1=$4-L[$8,i])<0)         # set the left difference for a range
        d1=-d1 
      if((d2=$4-R[$8,i])<0)         # set the right difference for a range
        d2=-d2 
      if(d1<=50 || d2<=50) {        # if either of those difference is less than or equal to 50 
        if(d1<=50 && d2<=50) {      # if they are both less than or equal to 50
          if(d1<d2)                 # calculated the smallest difference. 
            d=-d1                   # If it is the left one then the difference is minus that value
          else                      #
            d=d2                    # If it is the right one then the difference is plus that value
        }
        else {                      # if only one of them is less than or equal to 50
          if(d1<=50)                # if it is the left one
            d=-d1
          else                      # if it is the right one
            d=d2
        }
        break                       # found a match, no need to loop further (assuming there are no overlapping ranges) 
      }
    }
    $9=d                            # set field 9 to the difference that was found
  }
  1                                 # print the record    
' file2 FS='\t' OFS='\t' file1

cmccabe · April 1, 2017, 10:14pm

Amazing, works perfect.

Using the original code I put comments by what I think I understand (not gonna lie most I don't, but maybe eventually I will).... thank you very much for all the great help and continued learning :).

awk -v m=50 '    # set maximum distance for difference and store as variable m
  NR==FNR {      # For each line in the 1st input file (file2)...
    for(i=4; i<=NF; i++) {   # start at field 4 and store in i as a loop (capture variable fields lenght of each line)
      split($i,F,/-/)   # split each i on the - (hyphen)
      L[$2,i-3]=F[1]    # store the left of the - as F[1]
      R[$2,i-3]=F[2]    # store the right of the - asF[2}
    }    # end file2 processing
    if(NF>n)
      n=NF
    next
  } 
  $9=="." && $12=="." {   # only look at lines where fields 9 and 12 have a . (dot) in file1
    d=">50" 
    for(i=1; i<=n; i++) {   # create loop for file1 i variable
      if( ($8,i) in L) {    # loop hrough field 8 of file1 and store in variable L
        if((d1=$4-L[$8,i])<0)
          d1=-d1 
        if((d2=$4-R[$8,i])<0)
          d2=-d2 
        if(d1<=50 || d2<=50) {
          if(d1<=50 && d2<=50) {
            if(d1<d2)
              d=-d1
            else
              d=d2
          }
          else {
            if(d1<=50)
              d=-d1
            else
              d=d2
          }
          break     # break out of this for loop.
        }
      }
    }
    $9=d
  }
  1    
' file2 FS='\t' OFS='\t' file1   # End the awk script and name the input files to be processed with field seperators set as tab.

Scrutinizer · April 1, 2017, 10:28pm

Hi cmccabe, you're welcome. In the mean time I improved the script somewhat and put comments everywhere..

--
As you have noticed, I put 50 in variable m , but still used 50 everywhere, but it could have been m , so the script can be used for other cases where a different value may be needed. I'll leave it up to you to try it out and correct the code so that m is used everywhere instead of 50.

You may also want to change the quick and dirty one letter variable names to more mnemonic names, once the script is to your liking
And add exception error handling, for example when no corresponding value for $8 is found in file 2 ...

Don_Cragun · April 2, 2017, 9:12am

Here is a slightly different approach to handling your problem...

Your specification talked about left and right values in the ranges separated by a minus sign in file2 , my code treats them as low and high values (since the left value always seems to be the low end of a range and the right values always seems to be the high end of a range in your sample data). Instead of using the awk split() function to split out the low and high values, my code uses <space> and <hyphen> as characters in the field separator used when reading in file2 so the low and high values in each range are already split apart when awk reads lines from file2 .

Your specification said that if field 4 was closer to the low end of a range, the difference between the values should be printed with a leading minus sign. My code does that. Scrutinizer's code does that as long as $4 is not the low end of a range (in which case it prints 0 instead of -0 ). Your sample desired output also omitted this minus sign. Without a leading sign, you can't tell if the value in field 4 matched the low end of a range or matched the high end of a range.

Your specification said that if field 4 was closer to the high end of a range, the difference should be printed with a leading plus sign. My code does that. Scrutinizer's code never prints a leading + .

Your specification doesn't say what should happen if the value given is exactly halfway between the low and high values. My code treats this case as if the midpoint is closer to the low end of the range. If this isn't acceptable, you can modify the code to do whatever you want in this case.

Your description said that if field 9 is not a period or field 12 is not a period, the input line should be unchanged in the output. My code and Scrutinizer's code both do that. But, it doesn't match the output that you said should be printed for the 3rd line in file1 . The input line is:

2	chr1	948870	948870	C	G	UTR5	ISG15	NM_005101.3:c.-84C>G	.	.

but you say the desired output for that line should be:

2	chr1	948870	948870	C	G	UTR5	ISG15	NM_005101.3:c.-84C>G	.	.   .

which has an added four spaces and a period added to the end of the line??? I assume this was a typo in the desired sample output you posted.

The code I came up with is:

awk -v extra=50 -v OFS='\t' '
NR == FNR {
	# Process 1st input file (file2).  Note that we have set FS to split
	# fields on <space> and <hyphen> so the ranges have been split into
	# pairs of fields before we get here.
	#	Field 1 tells us how many ranges there are.
	#	Field 2 is the key field to match against field 8 in file1.
	#	Field 3 is ignored.
	#	Even fields 4 through NF-1 are low ends of ranges.
	#	Odd fields 5 through NF are high ends of ranges.
	# Gather the count of ranges and the low and high ends of the ranges
	# for each key.
	count[$2] = $1
	for(i = 1; i <= $1; i++) {
		low[$2, i] = $(2 + 2 * i)
		high[$2, i] = $(3 + 2 * i)
		# Also calculate the midpoint of each range.
		mid[$2, i] = (low[$2, i] + high[$2, i]) / 2
	}
	next	# Skip to the next input record.
}
FNR != 1 && $9 == "." && $12 == "." && $8 in count {
	# Look for a range for the key in this record that contains field 4.
	# Note that the range is extended on both ends by the value specified
	# in extra.
	for(i = 1; i <= count[$8]; i++)
		if($4 >= (low[$8, i] - extra) && $4 <= (high[$8, i] + extra)) {
			# A matching range was found; update field 9...
			if($4 > mid[$8, i]) {
				# The value in this record is closer to the high
				# end of the range...
				sign = "+"
				value = high[$8, i]
			} else {# The value in this record is closer to the low
				# end of the range...
				sign = "-"
				value = low[$8, i]
			}
			# Calculate the absolute value of the difference between
			# field 4 and the closest end of the range.  If the
			# value is closer to the low end of the range and the
			# difference is less than or equal to 50 print it with a
			# leading minus sign; if the value is closer to the high
			# end of the range and the difference is less than or
			# equal to 50, print it with a leading plus sign;
			# otherwise (if the difference is greater than 50),
			# print ">50".
			diff = (value > $4) ? value - $4 : $4 - value
			$9 = (diff > 50) ? ">50" : (sign diff)
			break
		}
	if(i > count[$8]) {
		# No matching range was found.
		$9 = ">50"	# Or do something else to indicate this case???
	}
}
1	# Print the possibly updated record.
' FS='[- ]' file2 FS='\t' file1

In cases where a line containing a periods in fields 9 and 12, but field 8 was not found in file2 , this code leaves the line unchanged with warning that the field 8 value was not found. If the field 8 value is found, but field 4 is not in any of the specified ranges, this code changes field 9 to >50 . The spot where this is done is commented so you can change it to some other string or issue a diagnostic if this is not what you want to happen in this case.

As always, if someone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

With your sample input, the above code produces the output:

R_Index	Chr	Start	End	Ref	Alt	Func.IDP.refGene	Gene.IDP.refGene	GeneDetail.IDP.refGene	Inheritence	ExonicFunc.IDP.refGene	AAChange.IDP.refGene
1	chr1	948846	948846	-	A	upstream	ISG15	-0	.
2	chr1	948870	948870	C	G	UTR5	ISG15	NM_005101.3:c.-84C>G	.	.
3	chr1	949608	949608	G	A	exonic	ISG15	.	.	nonsynonymous SNV	ISG15:NM_005101.3:exon2:c.248G>A:p.S83N
4	chr1	949925	949925	C	T	downstream	ISG15	+6	.
5	chr1	207646923	207646923	G	A	intronic	CR2	>50	.	.	.
6	chr2	3653844	3653844	T	C	intronic	COLEC11	>50	.
7	chr1	154562623	154562625	CCG	-	intronic	ADAR	>50	.	.	.
8	chr1	948840	948840	-	C	upstream	ISG15	-6	.

The differences between what this code produces and what Scrutinizer's code produces are marked in red (the - in the 2nd output line and the + in the 5th output line). Scrutinizer's code also copies the contents of file2 to the output; the code above does not.

I hope this helps.

cmccabe · April 3, 2017, 7:27am

Thank you both very much for the help and explanations... I have a lot to learn to fully understand but these help a lot :).