awk to add text to each line of matching id

The awk below executes as expected if the id in $4 (as in f ) is unique. However, most of my data is like f1 , where the same id can appear multiple times, and I think that is why the awk is not working as expected. I added a comment on the line that I cannot change without causing the script to abort. Each line in f2 is searched and must contain the id, in this case COL1A2 , but that id may not be a single entry in f1 . That is, the id may appear 5 times, and every line in f2 that contains it is searched. The $4 in f1 is used as the id, and for each line the $2 and $3 values are read into the arrays min and max , keyed by $1 and the id.

The $4 in f2 is then split on the _ and read into array . The same id from f1 may appear in multiple lines of f2 ; however, each will have unique $2 and $3 values. The first value of each split will match a $4 id in f1 . The min and max must match the $1 of f2 and be between the $2 and $3 values in f2 . An exact match is not needed, just that the min or max value falls within $2 and $3 . If that is true then exon is printed in $5 of f2 ; if it is not true then intron is printed in $5 . Most of this works as expected; I just did not account for the possibility of multiple entries and am not sure how to adjust for it. Thank you :)

For example, using the contents of f1 , where COL1A2 appears 3 times, each entry or line is searched in f2 . Currently, I believe that since COL1A2 is not unique, no match is found in f2 because the min and max are not set per entry or line. Thank you :).

awk w/ desired output

awk '
 BEGIN{
  SUBSEP=","
}
FNR==NR{
  max[$1,$NF]=$3
  min[$1,$NF]=$2
  next
}
{
 split($4,array,"_")   # How do I change/modify this so it only looks a each line with this id `COL1A2` in it?
}
(($1,array[1]) in max){
if(($2>min[array[5],array[1]] && $2<max[array[5],array[1]]) || ($3>max[array[5],array[1]] && $3<max[array[5],array[1]])){
  print array[5],array[1],min[array[5],array[1]],max[array[5],array[1]],"exon"
  next
}
}
{
  print $0,"intron"}' f f2 

chr7    94024333    94024423    COL1A2_cds_0_0_chr7_94024344_f  0   + intron
chr7    94027049    94027080    COL1A2_cds_1_0_chr7_94027060_f  0   + intron
chr7 COL1A2 94027591 94027701 exon

awk w/ current output

  .... }' f1 f2

chr7    94024333    94024423    COL1A2_cds_0_0_chr7_94024344_f  0   + intron
chr7    94027049    94027080    COL1A2_cds_1_0_chr7_94027060_f  0   + intron
chr7    94027683    94027718    COL1A2_cds_2_0_chr7_94027694_f  0   + intron

contents of f (single COL1A2 entry)

chr7    94027591    94027701    COL1A2

contents of f1 (multiple COL1A2 entries; this is what most of the actual data looks like, though there are a few single entries)

chr7    94027591    94027701    COL1A2  
chr7    94027799    94027811    COL1A2  
chr7    94030799    94030847    COL1A2

contents of f2 (always the same format)

chr7    94024333    94024423    COL1A2_cds_0_0_chr7_94024344_f  0   +
chr7    94027049    94027080    COL1A2_cds_1_0_chr7_94027060_f  0   +
chr7    94027683    94027718    COL1A2_cds_2_0_chr7_94027694_f  0   +

Hi cmccabe,
If we rewrite the code that reads your first input file to be:

FNR==NR{
	max[$1,$NF,++count[$1,$NF]]=$3
	min[$1,$NF,count[$1,$NF]]=$2
	next
}

does that give you enough of a hint for what you then need to do in the loop you need to add in the code that reads your second input file?


Thank you for the hint. I made two adjustments to the script and commented them. The output is the same, so maybe I have the idea but am just not implementing it correctly? Thank you :).

awk '
 BEGIN{
  SUBSEP=","
}
FNR==NR{
  max[$1,$NF,++count[$1,$NF]]=$3  # read with count each line of f1 max
  min[$1,$NF,count[$1,$NF]]=$2
  next
}
{ for (i in count)   # start a loop with setting each line in id to i
 split($4,array,"_")   
}
(($1,array[1],i++) in max){    # search each matching id line in f2
if(($2>min[array[5],array[1]] && $2<max[array[5],array[1]]) || ($3>max[array[5],array[1]] && $3<max[array[5],array[1]])){
  print array[5],array[1],min[array[5],array[1]],max[array[5],array[1]],"exon"
  next
  }
 }
 {
  print $0,"intron"}' f1 f2

chr7    94024333    94024423    COL1A2_cds_0_0_chr7_94024344_f  0   + intron
chr7    94027049    94027080    COL1A2_cds_1_0_chr7_94027060_f  0   + intron
chr7    94027683    94027718    COL1A2_cds_2_0_chr7_94027694_f  0   + intron

Hi cmccabe,
If you create an array with three subscripts, you have to use three subscripts when you try to access an element of that array.
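
As a minimal sketch of that point (not part of your script; the key values are just taken from the sample f1 for illustration), the following stores one element with three subscripts and then tests for it with two and with three subscripts:

```shell
out=$(awk 'BEGIN {
	# Illustrative element: the max of the first COL1A2 range in the sample f1.
	max["chr7", "COL1A2", 1] = 94027701

	# Two subscripts build the key "chr7" SUBSEP "COL1A2", which was never stored.
	two = (("chr7", "COL1A2") in max) ? "found" : "missing"
	# Three subscripts build the same key that was used in the assignment.
	three = (("chr7", "COL1A2", 1) in max) ? "found" : "missing"

	print "two subscripts: " two
	print "three subscripts: " three
}')
printf '%s\n' "$out"
```

The two-subscript test reports missing and the three-subscript test reports found, because awk joins the subscripts into a single string key and the two keys are different strings.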

Are the ranges given in your first input file always in increasing numerical order for each $1,$4 set of values (as in your sample file f1 )? If they are, we can use that information to make your code run faster.

Is the fifth subfield of $4 in your second input file always identical to the $1 value on the same input line (as in your sample files)? If it is, we can use that information to make your code run faster.

You note that your input files' fields are separated by tabs. Do you want the output file to be tab delimited too, or do you want the output to be delimited by spaces as shown in your sample output?

Note that in your original code (and in your updated code) you have the line:

if(($2>min[array[5],array[1]] && $2<max[array[5],array[1]]) || ($3>max[array[5],array[1]] && $3<max[array[5],array[1]])){

and a $3 value can never be less than a max[] value and greater than the same value. Can we assume that you intended to write:

if(($2>min[array[5],array[1]] && $2<max[array[5],array[1]]) || ($3>min[array[5],array[1]] && $3<max[array[5],array[1]])){

or more likely that you meant:

if(($2>=min[array[5],array[1]] && $2<=max[array[5],array[1]]) || ($3>=min[array[5],array[1]] && $3<=max[array[5],array[1]])){

(i.e., are the min-max ranges inclusive of the endpoints or exclusive of the endpoints)?
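
To show what is at stake in that question, here is a small sketch using the first range from the sample f1 and a hypothetical value (the variable v is made up for the example) that sits exactly on the lower endpoint:

```shell
out=$(awk 'BEGIN {
	min = 94027591; max = 94027701   # first range from the sample f1
	v = 94027591                     # hypothetical value equal to the lower endpoint
	print "exclusive: " ((v > min && v < max) ? "in range" : "out of range")
	print "inclusive: " ((v >= min && v <= max) ? "in range" : "out of range")
}')
printf '%s\n' "$out"
```

With the exclusive test the endpoint itself is reported as out of range; with the inclusive test it is in range.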


Sorry... I missed one question concerning the line of code:

if(($2>min[array[5],array[1]] && $2<max[array[5],array[1]]) || ($3>max[array[5],array[1]] && $3<max[array[5],array[1]])){

Is it your intent to print the line containing exon if either endpoint is in an entry in the first input file for that $1,$4 pair, or should it only print the exon line if both endpoints are in range?


Yes, these should always be sorted like in f1 .

Yes, this will always be the case if $4 is found as in f1 .

f1 will always be tab-delimited except for a whitespace after $3 and $4 , but the output would be tab-delimited. I did add OFS="\t" but I think the whitespaces are making that not work.

You are correct in that I meant to be looking for inclusive endpoints, so the >=/<= is what I should have used.

I used the || statement to make sure the script works as expected, but it could be && as both coordinates should lie within the endpoints (I am trying to think of a situation where that is not the case and not coming up with anything).

Thank you very much :).

Thanks for the responses.

Unfortunately, upon looking closer at your example input files, there are no entries in f1 where both endpoints of any line in f2 fall within the ranges specified in f1 . In the last line of f2 $2 falls inside the range specified in the first line in f1 but $3 does not.

And, despite what you said about the input files being tab delimited, the samples you provided don't contain any <tab> characters. And, since you said that the real files you're using do contain some <space>s between fields 3 and 4 and after field 4, the following code assumes that strings of one or more <space> and/or <tab> characters separate fields and that any <space> and <tab> characters after field 4 are to be ignored. (As written, the code shown below will not work if a line in either input file contains any leading <space> or <tab> characters.)

So, given the above and assuming that you just want there to be some overlap between the ranges specified in a line in f1 and in a line in f2 , maybe the following will do what you want:

#!/bin/ksh
awk '
BEGIN{	FS = "[[:blank:]]+|_"
	OFS = "\t"
}
FNR == NR {
	# Note that this code assumes that the min and max ranges in the first
	# input file are presented with the minimum values arranged in
	# increasing order for each $1, $NF set of values.
	max[$1, $4, ++count[$1, $4]] = $3	# Save max value with count of
						# lines with the same $1 and $4
						# value from the first input
						# file.
	min[$1, $4, count[$1, $4]] = $2		# Save corresponding min value.
	next					# Skip to next input line.
}
{	for(i = 1; i <= count[$1, $4]; i++) {
		# If the minimum value on this line is greater than the saved
		# maximum value we are looking at now, this saved range lies
		# entirely below this line; move on to the next saved range.
		if($2 > max[$1, $4, i])
			continue
		# If the maximum value on this line is less than the saved
		# minimum value we are looking at now, this saved range and
		# all later (larger) ones lie above this line, so there cannot
		# be a matching entry for this $1,$4 value pair.
		if($3 < min[$1, $4, i])
			break
		# If the minimum or maximum on this line is within range, we
		# have found what we are looking for.
		if(($2 >= min[$1, $4, i]) || $3 <= max[$1, $4, i]) {
			print $1, $4, min[$1, $4, i], max[$1, $4, i], "exon"
			next
		}
	}

	# No entry was found for this $1,$4 value pair.  Report "intron" found.
	print $0, "intron"
}' "${1:-f1}" f2

This uses <tab> as the output field separator, but on output lines that end in "intron", <space>s in the input will not be converted to <tab>s in the output.

If you run the above script with no operands (or with the operand f1 or with the operand f ) on your sample input files, it will produce the output:

chr7    94024333    94024423    COL1A2_cds_0_0_chr7_94024344_f  0   +	intron
chr7    94027049    94027080    COL1A2_cds_1_0_chr7_94027060_f  0   +	intron
chr7	COL1A2	94027591	94027701	exon

Note that the above code does not set SUBSEP since it was not used in your script and isn't needed in the code shown above. Note also that the field separator I'm using in the code above treats any sequence of one or more <space>s and/or <tab>s as a field separator and also treats every <underscore> as a field separator. That means that the subfields you were splitting into the array named array in your code will all be treated as separate fields in the code above. (That means I don't have to call split() to break that string into subfields.)

The sample files you provided don't test any of the "corner" cases where I might have missed something. I think it will work OK, but I haven't performed enough extensive testing to give you any kind of guarantee.
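
As a rough illustration of how such a separator numbers the fields, the sketch below feeds the last line of the sample f2 through awk and prints the field count and two of the fields. The separator is written here as "[ \t]+|_", which matches the same <space> and <tab> characters as the [[:blank:]] form but without a character class; that spelling is my choice for this example only.

```shell
out=$(printf 'chr7    94027683    94027718    COL1A2_cds_2_0_chr7_94027694_f  0   +\n' |
awk 'BEGIN { FS = "[ \t]+|_" }
{
	print NF    # blanks and underscores both end a field
	print $4    # the id, formerly array[1]
	print $8    # the chromosome subfield, formerly array[5]
}')
printf '%s\n' "$out"
```

This prints 12, COL1A2 and chr7, so $4 can be compared directly against the id saved from the first file.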

Hope this helps,
Don
