awk to change value in field according to another

Don_Cragun · December 10, 2018, 7:29pm

Hi cmccabe,
There were three debugging printf s in the code I suggested, you commented out one of them and removed the code that enabled those debugging statements to be controlled by the presence of operands passed to your script. That seems like a strange combination.

I would have thought that you would realize that we can't tell if your changes to exon.sh are going to work without seeing how you invoke exon.sh . Giving parameters to exon.sh that are ignored when you invoke it doesn't make much sense. (And it seems that the script you showed us in post #20 ignores any parameters that you give it when you invoke it.)

Having a for loop in a script that invokes exon.sh has no effect on exon.sh unless something in that loop passes one or more parameters to exon.sh that exon.sh uses to adjust its behavior (which the code you showed us in post #20 does not do) or you copy the input file you want to process into a file named 00-0000low before you invoke it and copy the data written into 00-0000_filter by exon.sh into whatever file you want to contain the output of that iteration through your loop. That seems like it is a lot of extra copying of data to get what you want, but it would mesh with the code you showed us in post #20.

cmccabe · December 11, 2018, 1:46am

I added a couple parameters in the script to pass from the for loop. It does look like the correct files are being passed to the script, however the output is not correct. Commenting out the printf statements causes syntax errors

awk: cmd. line:9: 		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
awk: cmd. line:9: 		  ^ syntax error
awk: cmd. line:9: 		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
awk: cmd. line:9: 		                 ^ syntax error
awk: cmd. line:9: 		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
awk: cmd. line:9: 		                                       ^ syntax error
awk: cmd. line:10: 		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
awk: cmd. line:10: 		                 ^ syntax error
awk: cmd. line:10: 		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
awk: cmd. line:10: 		                                       ^ syntax error
awk: cmd. line:16: 			i, m[$1, $4, i],
awk: cmd. line:16: 			 ^ syntax error
awk: cmd. line:16: 			i, m[$1, $4, i],
awk: cmd. line:16: 			               ^ syntax error
awk: cmd. line:17: 			i, M[$1, $4, i],

I guess I am not understanding if the parameters are being passed from the for loop what else I am missing. I added the code that allows the operands to be controlled by the parameters passed by the for loop. Thank you :).

for file in /home/cmccabe/folder/less/*.txt ; do
       bname=$(basename "$file")
       pref=${bname%%_*.txt}
       #echo "file:\"$file\"    bname:\"$bname\"    pref:\"$pref\""
       #echo "output will be directed to:\"/home/cmccabe/folder/less/${pref}_output.txt\""
       bash -x /home/cmccabe/folder/less/exon.sh /home/cmccabe/folder/less/all_cdsV2 "$file" > /home/cmccabe/folder/less/${pref}_output.txt
done

exon.sh

#!/bin/sh
awk -v d=$# '
BEGIN {	FS = "[\t_]"
	OFS = "\t"
}
FNR == NR {
	m[$1, $4, ++c[$1, $4]] = $2 + 0
	M[$1, $4, c[$1, $4]] = $3 + 0
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n",
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
	next
}
{	if(d) printf("FNR=%d:\"%s\"\n",FNR,$0)
	for(i = 1; i <= c[$1, $4]; i++) {
		if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n",
			i, m[$1, $4, i],
			i, M[$1, $4, i],
			$2)
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {
			$5 = "exon"
			break
		} else {if(m[$1, $4, i] > $2 + 0) {
				if(m[$1, $4, i] - 10 <= $2 + 0) {
					$5 = "splicing"
					break
				} else {$5 = "intron"
					break
				}
		}
	}
}
	if(i > c[$1, $4])
		$5 = "intron"
}
1' "$1" "$2"

using the bash -x

+ awk -v d=2 '
BEGIN {	FS = "[\t_]"
	OFS = "\t"
}
FNR == NR {
	m[$1, $4, ++c[$1, $4]] = $2 + 0
	M[$1, $4, c[$1, $4]] = $3 + 0
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n",
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
	next
}
{	if(d) printf("FNR=%d:\"%s\"\n",FNR,$0)
	for(i = 1; i <= c[$1, $4]; i++) {
		if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n",
			i, m[$1, $4, i],
			i, M[$1, $4, i],
			$2)
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {
			$5 = "exon"
			break
		} else {if(m[$1, $4, i] > $2 + 0) {
				if(m[$1, $4, i] - 10 <= $2 + 0) {
					$5 = "splicing"
					break
				} else {$5 = "intron"
					break
				}
		}
	}
}
	if(i > c[$1, $4])
		$5 = "intron"
}
1' /home/cmccabe/folder/less/all_cdsV2 /home/cmccabe/folder/less/11-1111_regions.txt

output
00-0000_output.txt (only a few lines)

m[chr1,ADC,1]=33547850,M[chr1,ADC,1]=33547955
m[chr1,ADC,2]=33549554,M[chr1,ADC,2]=33549728
m[chr1,ADC,3]=33557650,M[chr1,ADC,3]=33557823

11-1111_output.txt

m[chr1,ADC,1]=33547850,M[chr1,ADC,1]=33547955
m[chr1,ADC,2]=33549554,M[chr1,ADC,2]=33549728
m[chr1,ADC,3]=33557650,M[chr1,ADC,3]=33557823

Don_Cragun · December 11, 2018, 2:45am

The output you have shown us in post #22 in the two output files is debugging output from exon.sh reading data from /home/cmccabe/folder/less/all_cdsV2 .

If that is all of the output that is stored in those two output files, one might be inclined to believe that /home/cmccabe/folder/less/00-0000_regions.txt and /home/cmccabe/folder/less/11-1111_regions.txt are empty files.

What output do you get from the command:

ls -l /home/cmccabe/folder/less/*.txt

cmccabe · December 11, 2018, 8:19am

ls -l /home/cmccabe/folder/less/*.txt
-rw-rw-r-- 1 cmccabe cmccabe 32254 Dec  7 17:14 /home/cmccabe/folder/less/00-0000_regions.txt
-rw-rw-r-- 1 cmccabe cmccabe 32254 Dec  7 17:14 /home/cmccabe/folder/less/11-1111_regions.txt

Those are only a few lines from /home/cmccabe/folder/less/all_cdsV2 , that file is several thousands of lines.

Both 00-0000_regions.txt and 11-1111_regions.txt have the format below (only a few lines but all are similar)

chrX	153579249	153579469 	FLNA	
chrX	153579888	153580108 	FLNA	
chrX	153579904	153580124 	FLNA

If the debugging lines are being printed then each file is being read but not processed? Thank you very much :).

Don_Cragun · December 11, 2018, 1:37pm

No. Debugging lines are being printed (along with your desired output) because you set d to a non-zero, non-null value in exon.sh . The fact that you implied that the output files only contained three lines of debugging information and none of the output you were expecting tells me that you didn't bother to try to understand how the script I gave you works. That is very discouraging. I guess I'm wasting my time here.

What output do you get if you run the command:

grep -E 'splicing|intron|exon' /home/cmccabe/folder/less/00-0000_output.txt

What output were you hoping to have exon.sh put into the file named /home/cmccabe/folder/less/00-0000_output.txt ?

cmccabe · December 11, 2018, 2:52pm

I have been trying to understand your code and am just not understanding, but I am trying, I know it may seem like I am not but I assure yu that I am and will continue to do so. I added comments to each line and some questions. My understanding is not there completely but hopefully its a start. I apologize for the misleading output description, thousands of lines print, i only showed a few to keep the post short. Thank you :).

#!/bin/sh
awk -v d=$# '   # does this define d as non-zero
BEGIN {	FS = "[\t_]"    # define FS as tab or underscore
	OFS = "\t"      # define output as tab delimited
}
FNR == NR {   # process same line in file 2 as file1 and start processing file2 or /home/cmccabe/folder/less/all_cdsV2
	m[$1, $4, ++c[$1, $4]] = $2 + 0    # split $4 and store $1 and $2in array m, what does the + 0 do?
	M[$1, $4, c[$1, $4]] = $3 + 0      # split $4 and store $1 and $2in array M, what does the + 0 do?
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n", # debugging print
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],  # for m (m=min)?
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])  # for M (M=max)?
	next   # process next line
}
{	if(d) printf("FNR=%d:\"%s\"\n",FNR,$0)  # not sure what this doesI think it prints each line in $file?
	for(i = 1; i <= c[$1, $4]; i++) {     # start a loop using $4 and $1 value
		#if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n", # again not sure?
			i, m[$1, $4, i],  # loop through each m in /home/cmccabe/folder/less/all_cdsV2 for each $1 and $4 of $file
			i, M[$1, $4, i],  # loop through each M in /home/cmccabe/folder/less/all_cdsV2 for each $1 and $4 of $file
			$2)  # not sure what this does?
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {  # if the value of each matching m<=$2 and <=M then print 
			$5 = "exon"  #  exon in $5
			break    # break loop and move to next line
		} else {if(m[$1, $4, i] > $2 + 0) { # if the value of each matching m>=$2 and >=M then print
				if(m[$1, $4, i] - 10 <= $2 + 0) {  # if the value of each matching -10and <=$2 then print
					$5 = "splicing"
					break     # break loop and move to next line
				} else {$5 = "intron"   # print intron in $5
					break  # break loop and move to next line
				}
		}
	}
}
	if(i > c[$1, $4])     # what does this do hasn't intron already been printed?
		$5 = "intron"
}
1' "$1" "$2"   # parameters passed from for loop

  (only a few lines of the thousands to show the desired output results from the grep)
grep -E 'splicing|intron|exon' /home/cmccabe/folder/less/00-0000_output.txt
chr7	30062272	30062492 	FKBP14	splicing
chr7	30065867	30066087 	FKBP14	intron
chr7	30065964	30066184 	FKBP14	exon
chr7	94024268	94024488 	COL1A2	intron

bakunin · December 11, 2018, 9:41pm

$# is the number of arguments a script actually got when it was invoked. Thry this script:

#! /bin/bash
echo number of arguments is: $#
exit 0

and try the following invocations, note their results:

./script.sh
./script.sh one
./script.sh one two
./script.sh one "two two" three
./script.sh ""

I hope this helps.

bakunin

Don_Cragun · December 14, 2018, 1:25pm

Here is an updated and commented version of your exon.sh script...

#!/bin/sh
# exon.sh gene_regions_file coding_exons_file
#
# This script reads the gene_regions_file and the coding_exons_file and
# produces output that classifies each line in the coding_exons_file as "exon",
# "intron", or "splicing".
#
# In addition, this script optionally prints debugging information showing:
#  1.	the minimum and maximum values found for each line read from the
#	gene_regions_file along with the field 1 and 4 values found on the
#	corresponding line and the number of lines seen with those field 1 and
#	4 values,
#  2.	the line number and contents of each line read from the
#	coding_exons_file, and
#  3.	while classifying an entry in the coding_exons_file; print the minimum
#	and maximum values, the sequence number of the minimum and maximum
#	values for the current entry's field 1 and 4 value; and the field 2
#	value of the current entry.
#
# An earlier version of this script turned on debugging output if any arguments
# were given to the script when it was invoked (by using awk -v d=$#).  This
# worked then because the filenames used by that version of the script were
# constants and built into the script rather than passed in as operands (as in
# this version of the script).  Since this version of the script now requires
# two operands, the same results could be achieved by using
# awk -v d=$(($# > 2)).  But, instead of doing that, this version of the
# script allows the user to set the debugging flag to a non-zero, non-null
# value before a file pathname operand if debugging printouts for that file
# are desired:
#	No debugging output:
#		exon.sh gene_regions_file coding_exons_file
#	Enable debugging output for both input files:
#		exon.sh d=1 gene_regions_file coding_exons_file
#	Enable debugging output only for 2nd input file:
#		exon.sh gene_regions_file d=1 coding_exons_file
#	Enable debugging output only for 1st input file:
#		exon.sh d=1 gene_regions_file d= coding_exons_file
#		or:
#		exon.sh d=1 gene_regions_file d=0 coding_exons_file

IAm=${0##*/}	# Save final component of the pathname of this script for
		# diagnostic messages.
if [ $# -lt 2 ]
then	# Print a diagnostic usage message and exit if we don't have at least
	# two arguments (the required input files).  (As explained above,
	# additional operands to enable or disable debugging may be included.
	# Verification of there being two file operands and only appropiate
	# debugging operands is left as an exercise for the reader.  For now,
	# assume that diagnostics from awk will be sufficient if inappropriate
	# arguments are supplied.)
	echo "Usage: $IAm gene-regions_file coding_exons_file" >&2
	exit 1
fi
awk '
BEGIN {	FS = "[\t_]"	# Set input field separators to <tab> and <underscore>.
	OFS = "\t"	# Set output field separator to <tab>.
}
FNR == NR {
	# When the current file record # is the same as the current record #
	# from all files, we are looking at a record from the 1st input file...

	# First, increment the number of entries seen for this pair of field 1
	# & 4 values (c[$1, $4]) and then save the minimum value found in field
	# 2 in this entry.  Adding 0 converts an alphanumeric field to a
	# numeric field by discarding any trailing non-numeric characters from
	# the field.
	m[$1, $4, ++c[$1, $4]] = $2 + 0

	# And, then save the corresonding maximum value found in field 3 in
	# this entry.
	M[$1, $4, c[$1, $4]] = $3 + 0
	# Note that this code will not work if the min-max ranges in the 1st
	# input file are not in increasing order for each field 1 and 4 pair
	# (although the entries for a given pair do not have to be adjacent).

	# If debugging is enabled, print the minimum and maximum values saved
	# from this entry.
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n",
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])

	# Skip the remaining steps in this script for this input record and
	# continue with the next input record.
	next
}
{	# If we are here, we have a record from the 2nd input file.

	# If debugging is enabled, print the record number in this file and the
	# contents of this record.
	if(d) printf("FNR=%d:\"%s\"\n", FNR, $0)

	# Loop through all of the entries for the field 1 & 4 pair values found
	# in the current input record.
	for(i = 1; i <= c[$1, $4]; i++) {
		# If debugging is enabled, print the minimum and maximum values
		# for this entry in the saved m[] and M[] arrays along with the
		# field 2 value from the current input record.
		if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n",
			i, m[$1, $4, i],
			i, M[$1, $4, i],
			$2)
		# If the field 2 value in the current input record is in range
		# for the entry being evaluated in this loop, set the 5th field
		# in this input record to "exon" and break out of this loop.
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {
			$5 = "exon"
			break
		} else {
			# If the minimum field 2 value in the current input
			# record is greater than the numeric value of the field
			# 2 value in the current input record...
			if(m[$1, $4, i] > $2 + 0) {
				# If the field 2 value in the current input
				# record is within 10 of the low end of the
				# range for the entry being evaluated in this
				# iteration of the loop, set the 5th field in
				# this input record to "splicing" and break
				# out of this loop.
				if(m[$1, $4, i] - 10 <= $2 + 0) {
					$5 = "splicing"
					break
				} else {# Otherwise, set the 5th field in this
					# input record to "intron" and break
					# out of this loop.
					$5 = "intron"
					break
				}
			}
		}
	}
	if(i > c[$1, $4])
		# If we fell through the loop (instead of breaking out of it),
		# we need to set the 5th field in this input record to
		# "intron".
		$5 = "intron"
}
1	# Print the updated current input record (with a field 5 value added).
' "$@"	# Mark end of the awk script operand and add the operands provided to
	# this script as operands to awk.

Although written to use /bin/sh as its interpreter, this script will not work with a pure Bourne shell (it uses some shell parameter expansions that are defined by the POSIX shell standards that were not present in the Bourne shell). It has been tested and works with /bin/sh , /bin/bash , and /bin/ksh on macOS Mojave version 10.14.2 with the sample input files provided in this thread with all occurrences of chrx in file2 changed to chrX .

If you want to run this script on a Solaris/SunOS system, change awk in this script to /usr/xpg4/bin/awk or nawk .