awk to update unknown value in file using range of another

cmccabe · November 18, 2016, 8:56am

I am trying to use awk to update all the unknown values in $6 of file2 , if the $4 value in file 2 is within the range of $1 of file1 . If there is already a value in $6 other then unknown , it is skipped and the next line is processed. In my awk attempt below the final output is 6 tab-delimited fields . Thank you :).

file1 (space-delimited)

chr1:4714792-4852594 AJAP1
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
chr1:15783224-15798586 CELA2A
chr1:16888814-16940057 NBPF1
chr1:16888814-16939982 NBPF1
chr1:25568728-25664704 C1orf63
chr1:25568747-25664656 C1orf63

file2 (tab-delimited)

chr1    3649533    3649653    chr1:3649533-3649653    .    TP73
chr1    4736396    4736516    chr1:4736396-4736516    .    unknown   
chr1    5923314    5923434    chr1:5923314-5923434    .    NPHP4
chr1    9161991    9162111    chr1:9161991-9162111    .    unknown
chr1    9162050    9162051    chr1:9162050-9162051    .    rs6697376

desired output
--- the second and fourth unknown values are updated based on the $4 value and the range that they fall in $1 of file1

chr1    3649533    3649653    chr1:3649533-3649653    .    TP73
chr1    4736396    4736516    chr1:4736396-4736516    .    AJAP1  
chr1    5923314    5923434    chr1:5923314-5923434    .    NPHP4
chr1    9161991    9162111    chr1:9161991-9162111    .    unknown
chr1    9162050    9162051    chr1:9162050-9162051    .    rs6697376

awk with current output

awk -v OFS='\t' 'NR==FNR{
                           rstart[a[1]]=a[2]
                           rend[a[1]]=a[3]
                           value[a[1]]=$2
                           next} 
     $6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1]
                          {sub(/unknown/,value[$1],$6)}1' file1 file2 |    column -t
chr1  3649533  3649653  chr1:3649533-3649653  .  TP73
chr1  4736396  4736516  chr1:4736396-4736516  .
chr1  5923314  5923434  chr1:5923314-5923434  .  NPHP4
chr1  9161991  9162111  chr1:9161991-9162111  .
chr1  9162050  9162051  chr1:9162050-9162051  .  rs6697376

awk number 2 with output ---- sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t all one one line

awk -v OFS='\t' 'NR==FNR{
                           rstart[a[1]]=a[2]
                           rend[a[1]]=a[3]
                           value[a[1]]=$2
                           next} 
     $6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input |    column -t
chr1  3649533  3649653  chr1:3649533-3649653  .  TP73
chr1  4736396  4736516  chr1:4736396-4736516  .  unknown
chr1  5923314  5923434  chr1:5923314-5923434  .  NPHP4
chr1  9161991  9162111  chr1:9161991-9162111  .  unknown
chr1  9162050  9162051  chr1:9162050-9162051  .  rs6697376

I think I need a split split($1,a,/[:-]/) but the key is not unique, is there a better way?

The only way I can think off to make the key unique is, though I am not sure how to implement i:
----- matching $2 values in file1 are combined with the first lines rstart[a[1]]=a[2] being the start and the last lines rend[a[1]]=a[3] being the end

chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A

Don_Cragun · November 19, 2016, 1:02am

Yes, clearly you need to use the split() call to define the array a[] that you are using, but (as you noted) you can't use the elements of that array as subscripts in another array because the values are not unique. Instead of you an array of minimum values and an array of maximum values indexed by the line number in your first file.

But, I don't understand the output that you say should be produced. Why do you want the output to be (with all occurrences of four spaces in your output replaced by <tab> characters):

chr1	3649533	3649653	chr1:3649533-3649653	.	TP73
chr1	4736396	4736516	chr1:4736396-4736516	.	AJAP1
chr1	5923314	5923434	chr1:5923314-5923434	.	NPHP4
chr1	9161991	9162111	chr1:9161991-9162111	.	unknown
chr1	9162050	9162051	chr1:9162050-9162051	.	rs6697376

instead of:

chr1	3649533	3649653	chr1:3649533-3649653	.	TP73
chr1	4736396	4736516	chr1:4736396-4736516	.	AJAP1
chr1	5923314	5923434	chr1:5923314-5923434	.	NPHP4
chr1	9161991	9162111	chr1:9161991-9162111	.	GPR157
chr1	9162050	9162051	chr1:9162050-9162051	.	rs6697376

?

cmccabe · November 19, 2016, 11:49am

Your output is correct in the post but I do not know how to index correctly. Thank you :).

Don_Cragun · November 19, 2016, 3:57pm

Try something more like:

awk '
BEGIN {	OFS = "\t"
}
FNR == NR {
	split($1, a, /[-:]/)
	s[++c] = a[1]
	m[c] = a[2]
	M[c] = a[3]
	t[c] = $2
	next
}
$6 == "unknown" {
	for(i = 1; i <= c; i++)
		if($1 == s && $2 + 0 >= m + 0 && $3 + 0 <= M + 0) {
			$6 = t
			break
		}
}
1' file1 file2

cmccabe · November 19, 2016, 5:39pm

Thank you very much for your help, so each line is indexed by the minimum and maximum of all matching values in file1 $2 . I added what I hope is close to understanding. Thank you :).

awk '
BEGIN { OFS = "\t"     --- output tab-delimeted
}
FNR == NR {   ---- process alll rows  and columns
        split($1, a, /[-:]/)     ---- split $1 on [-:]
        s[++c] = a[1]            ---- array to read each lline
        m[c] = a[2]       ---- array to store min line value
        M[c] = a[3]       ---- array to store max line value
        t[c] = $2          ---- define min and max match (all matching values used)
        next             ---- process next line
}
$6 == "unknown" {       ----- only use this alue from $6 of file2
        for(i = 1; i <= c; i++)    ---- loop od each line in file
                if($1 == s && $2 + 0 >= m + 0 && $3 + 0 <= M + 0) {  � min and max for each match line
                        $6 = t    ---- update each �unknown�
                        break      ---- break out of process
                }
}
1' file1 file2

Don_Cragun · November 19, 2016, 9:21pm

I would have commented the code this way:

awk '		# Run awk to process the following awk script...
BEGIN {	OFS = "\t"		# Set output field separator.
}
FNR == NR {			# For each line in the 1st input file...
	split($1, a, /[-:]/)	# Split the 1st field into a[] with hyphen and
				# colon as subfield separators.
	s[++c] = a[1]		# Save the symbol found at the start of this line.
				# c is the current line number in the 1st file.
	m[c] = a[2]		# Save the minimum value found on this line.
	M[c] = a[3]		# Save the maximum value found on this line.
	t[c] = $2		# Save the tag found on this line.
	next			# Skip the remaining steps in this script.
}
$6 == "unknown" {		# For lines in the 2nd input file where $6 is
				# the string "unknown"...
	for(i = 1; i <= c; i++)	# Search for an entry in the arrays set while
				# reading the 1st input file where the symbol
				# on that line is the same as $1 on this line
				# and the minimum on that line is <= $2 on this
				# line and the maximum on that line is >= $3 on
				# this line...
		if($1 == s && $2 + 0 >= m + 0 && $3 + 0 <= M + 0) {
			$6 = t	# and if all of those conditions are met,
					# replace the "unknown" with the tag from
					# that line, and
			break		# break out of this for loop.
		}
}
1				# Print the (possibly updated) line from the 2nd
				# file.
' file1 file2	# End the awk script and name the input files to be processed.

cmccabe · November 23, 2016, 8:36am

Thank you very much :).