awk to change value in field according to another

cmccabe · November 9, 2018, 8:43am

I am trying to use awk to check whether each $2 in file1 falls between $2 and $3 of the matching $4 line of file2 . If it does, $5 in the output should be exon ; if it does not, intron . I think the awk below will do that, but I am struggling to add a calculation so that if the difference is less than 10, $5 is splicing . I have added an example for line 1 as well.

The 5th line is an example of the splicing , because the $2 value in file1 is 2 away from the $2 value in file2 . Thank you :).

file1

chr1	17345304	17345315 	SDHB	
chr1	17345516	17345524 	SDHB	
chr1	93306242	93306261 	RPL5	
chr1	93307262	93307291 	RPL5
chrx	153295819	153296875 	MECP2	
chrx	153295810	153296800 	MECP2

file2 tab-delimited

chr1	17345375	17345453	SDHB_cds_0_0_chr1_17345376_r	0	-
chr1	17349102	17349225	SDHB_cds_1_0_chr1_17349103_r	0	-
chr1	17350467	17350569	SDHB_cds_2_0_chr1_17350468_r	0	-
chr1	17354243	17354360	SDHB_cds_3_0_chr1_17354244_r	0	-
chr1	17355094	17355231	SDHB_cds_4_0_chr1_17355095_r	0	-
chr1	17359554	17359640	SDHB_cds_5_0_chr1_17359555_r	0	-
chr1	17371255	17371383	SDHB_cds_6_0_chr1_17371256_r	0	-
chr1	17380442	17380514	SDHB_cds_7_0_chr1_17380443_r	0	-
chr1	93297671	93297674	RPL5_cds_0_0_chr1_93297672_f	0	+
chr1	93298945	93299015	RPL5_cds_1_0_chr1_93298946_f	0	+
chr1	93299101	93299217	RPL5_cds_2_0_chr1_93299102_f	0	+
chr1	93300335	93300470	RPL5_cds_3_0_chr1_93300336_f	0	+
chr1	93301746	93301949	RPL5_cds_4_0_chr1_93301747_f	0	+
chr1	93303012	93303190	RPL5_cds_5_0_chr1_93303013_f	0	+
chr1	93306107	93306196	RPL5_cds_6_0_chr1_93306108_f	0	+
chr1	93307322	93307422	RPL5_cds_7_0_chr1_93307323_f	0	+
chrX	153295817	153296901	MECP2_cds_0_0_chrX_153295818_r	0	-
chrX	153297657	153298008	MECP2_cds_1_0_chrX_153297658_r	0	-
chrX	153357641	153357667	MECP2_cds_2_0_chrX_153357642_r	0	-

desired output tab-delimited

chr1	17345304	17345315 	SDHB	intron
chr1	17345516	17345524 	SDHB	intron	
chr1	93306242	93306261 	RPL5	intron	
chr1	93307262	93307291 	RPL5	intron
chrx	153295819	153296875	MECP2	exon
chrx	153295810	153296800	MECP2	splicing

awk

awk '
FNR==NR{
  a[$4];
  min[$4]=$2;
  max[$4]=$3;
  next
}
{
  split($4,array,"_");
  print $0,(array[1] in a) && ($2>=min[array[1]] && $2<=max[array[1]])?"exon":"intron"
}
' file1 OFS="\t" file2 > output

example of line 1

a[$4] = SDHB
min[$4] = 17345304
max[$4] = 17345315

array[1] = SDHB, 17345304 >= 17345375 && array[1] = SDHB, 17345315 <= 17345453 ---- intron

vgersh99 · November 9, 2018, 10:22am

hmmm...
$4 in file1 is not unique - the last $4 wins.
Is that what you want?
Or you rather determine min/max per $4 in file1 as you go?

cmccabe · November 9, 2018, 10:45am

The $4 value in file1 is not unique but is meant to ensure that, using line 1 as an example, only SDHB lines are searched or used in the comparison. There may be hundreds of lines in file1 , but only a subset will match the $4 value. Thank you :).

Don_Cragun · November 10, 2018, 2:29am

And, of that subset that match the $4 value, we have to assume that the same subset or a smaller subset may be classified as "intron", the same subset or a smaller subset may be classified as "exon", and the same subset or a smaller subset may be classified as "splicing" when your criteria are applied. How is the subset of lines that match the $4 value supposed to be combined or selected so that only one of the possible results is returned (presumably the one result that you want out of all the ones in the subset that do match)?

bakunin · November 10, 2018, 3:55am

It certainly helps if one understands what this is all about, and since i happen to have a biological researcher at home who explained it to me, here it is (errors/omissions are due to my limited understanding - i was told this is already the kindergarten version of what is really going on):

"exon", short for "expressed region", is a unit of a gene which codes for something like a protein. Think of a "gene" as a text describing something; the "exon" would then be one complete sentence of this text. When DNA is read (so that what it codes is actually produced) it is copied to "RNA" pieces. This process is called RNA-splicing*) and these pieces always contain several whole exons.

"intron", short for "intragenic region", is a (more or less meaningless) part of the DNA between the exons. Think of it as some sort of punctuation and whitespace in the text. It is removed during RNA-splicing so that only the exons make it there.

*) RNA-splicing: the process of producing RNA from DNA works in several steps. First a complete DNA piece is copied, including the introns. Then the real RNA is made from that, omitting the introns and leaving only the exons. This, in fact, is the "splicing".

In the human genome about 1% is exons (so this in fact makes up the whole genetic information), about 25% is introns. The rest is intergenic (that is: between genes and hence completely meaningless).

Thanks to my wife.

bakunin

RudiC · November 10, 2018, 7:55am

Out of sheer curiosity - is that meaningless intergenic rest the info that makes up the "genetic fingerprint" identifying individuals and revealing relationships like parent - child, or siblings? And, regards to your wife for educating us.

cmccabe · November 10, 2018, 9:30am

file2 is a very large file of genes and all associated coding exons. Take the SDHB gene as an example: it is one of the ~22,000 genes in the human genome. A gene is made up of a variable number of exons, introns, and intragenic regions. File2 only lists the coding sequence of a gene, that is, what is currently known to code for a protein product and contribute to the human "genetic makeup".

File1 is created by a script that outputs all regions in a particular gene that may need to be interrogated further. The problem is that not all those regions, defined by the $1 , $2 , and $3 values, may be important to know (there is still a lot unknown about the human genome).... it's complex, as @bakunin kindly described (a big thanks to your wife :).

The introns and intragenic regions regulate/affect exons (both coding and non-coding) but are still largely an unknown. What is known is that coding exons (file2) and splicing (defined as +/- 10) are important.

The value in $4 of file1 is looked up in file2 to return the subset of a gene to use. There may be multiple lines in each file for that gene, but the combination of $2 and $3 will define each line in file1 as intron , exon , or splicing .

I hope this helps, and I apologize for the long post, but I am excited to help share knowledge and really appreciate all the help... thank you very much :).

bakunin · November 11, 2018, 2:32am

After asking her: no. They use so-called "RFLP"s for that purpose and these are part(s) of an exon if i have understood correctly. Here is a link to the Wikipedia article:

https://en.wikipedia.org/wiki/Restriction_fragment_length_polymorphism

bakunin (enlightened by his wife)

Don_Cragun · November 11, 2018, 6:40am

I'm sorry. I appreciate the lessons I'm getting in genomics, but I still don't understand your requirements.

From your description and examples, I'm guessing that even though you haven't said so:

there will be no overlap in $2-$3 value ranges for any two lines in file2 ,
all of the lines in file2 that are associated with a $4 value in file1 are adjacent,
the strings in $4 in file1 and at the start of $4 in file2 are irrelevant to this problem (only the ranges specified by $2-$3 matter other than copying the $4 value in file1 into the output),
if a $2 value in file1 is inside one of the $2-$3 ranges in file2 , then a new 5th field added to file1 should be set to exon in the output (this comes from the examples, but conflicts with several statements in the English requirements),
if a $2 value in file1 is not inside any $2-$3 range in file2 and the difference $2 on some line in file2 minus $2 on a line in file1 is greater than zero and less than eleven, then a 5th field added to file1 should be set to splicing in the output (this also comes from the examples, but conflicts with the stated English requirements), and
otherwise, a 5th field added to file1 should be set to intron .
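
The guessed rules above can be sketched for a single position and a single exon range. This is only an illustration of the decision order (exon first, then splicing, then intron); the helper name classify and its three arguments are hypothetical and not part of any script posted in this thread.

```shell
# Hypothetical helper: classify one file1 start position against one file2 range.
# $1 = $2 from file1, $2 = range start from file2, $3 = range end from file2
classify() {
    awk -v pos="$1" -v lo="$2" -v hi="$3" 'BEGIN {
        if (lo <= pos && pos <= hi)
            print "exon"        # position lies inside the range
        else if (lo > pos && lo - pos <= 10)
            print "splicing"    # position is 1 to 10 before the range start
        else
            print "intron"      # everything else
    }'
}

classify 153295819 153295817 153296901
classify 153295810 153295817 153296901
classify 17345304 17345375 17345453
```

With the sample values from the thread, the three calls correspond to the MECP2 line inside the range, the MECP2 line 7 positions before the range start, and the first SDHB line, which is 71 positions away.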

Please confirm whether or not my guesses are correct. And, if my guesses are not correct, please restate your requirements and give us an example where the stated requirements and the given examples are consistent with each other.

Note that if file2 is sorted on increasing values of field 2 (as in your example) and file1 was sorted on increasing values of field 2, neither file would have to be loaded into memory and both files could be read one line at a time. (This would make the code more complex, but would reduce the amount of memory needed to run your program if that is an issue.) But, in your sample data, file1 is not sorted.

cmccabe · November 11, 2018, 10:01am

there could potentially be overlap in the $2-$3 value ranges, which is why $4 (the gene id) is used, because the same $2-$3 values cannot exist in two different genes. To be extra sure, the combination of $4 and $1 can be used, so that only the gene in $4 on the chromosome in $1 is looked at. That might be better as it will be a unique lookup key for the search.

yes, after the search key or lookup value is found in file2 all its associated lines will be adjacent, one on top of the other...

...... SDBH
...... SDBH
...... SDBH

into the output)
yes, this is true... though the combination of $1 and $4 in file1 may be better to ensure a unique match and values are found faster.

yes this is correct, the conflicts in the english requirements have to do with the nature of the human genome and that it is ever-changing and still full of unknowns. The test being performed or utilized also factors in to it and can add additional complexity/conflicts.

Thank you very much for all of your help :).

awk

awk '
FNR==NR{
  a[$4];
  chr[$4]=$1;
  min[$4]=$2;
  max[$4]=$3;
  next
}
{
  split($4,array,"_");
  print $0,(array[1] in a) && ($2>=min[array[1]] && $2<=max[array[1]] && $1==chr[array[1]])?"exon":"intron"
}
' file1 OFS="\t" file2 > output

Don_Cragun · November 12, 2018, 12:40am

We would really appreciate it if the data you post in your examples was consistent with itself and with the descriptions of the problems you present.

Note that numeric values with a trailing space are not always equivalent to numeric values without a trailing space.

Note also that string values (in this case chromosome names) are case sensitive and chrx in file1 does not match chrX in file2 . Therefore, when you say that we should use $1 and $4 to match values between your two input files, there can never be a match for any chrx information in file2 .

If I change your file1 contents to:

chr1	17345304	17345315 	SDHB	
chr1	17345516	17345524 	SDHB	
chr1	93306242	93306261 	RPL5	
chr1	93307262	93307291 	RPL5
chrX	153295819	153296875 	MECP2	
chrX	153295810	153296800 	MECP2

to match the chromosome names in your file2 (but leaving the trailing spaces in field #3), the following code:

#!/bin/ksh
awk -v d=$# '
BEGIN {	FS = "[\t_]"
	OFS = "\t"
}
FNR == NR {
	m[$1, $4, ++c[$1, $4]] = $2 + 0
	M[$1, $4, c[$1, $4]] = $3 + 0
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n",
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
	next
}
{	if(d) printf("FNR=%d:\"%s\"\n",FNR,$0)
	for(i = 1; i <= c[$1, $4]; i++) {
		if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n",
			i, m[$1, $4, i],
			i, M[$1, $4, i],
			$2)
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {
			$5 = "exon"
			break
		} else {if(m[$1, $4, i] > $2 + 0) {
				if(m[$1, $4, i] - 10 <= $2 + 0) {
					$5 = "splicing"
					break
				} else {$5 = "intron"
					break
				}
			}
		}
	}
	if(i > c[$1, $4])
		$5 = "intron"
}
1' file2 file1

produces the output you said you wanted. Since you have extraneous non-numeric characters in some fields that should be numeric, this code includes safeguards to convert string values that may contain non-numeric characters before performing comparisons. The debugging statements included helped me track down the conflict in your gene names that was keeping my code from producing the output you said you wanted. (To enable debugging, invoke the above script with an argument, any argument.) If you want to use case-insensitive comparisons on field #1 and field #4 values (which would be required to produce the output you say you want from the sample files you provided), I will leave it to you to update the code to do that. If you want to use case-insensitive comparisons you really need to say that in the description of your problem and not just hide it in inconsistent data in your sample input files.
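
One possible way to make the lookup case-insensitive is to normalize both key fields with tolower() before they are used as array subscripts. The sketch below is an assumption about how that could look, not the script above: it keeps only one range per key for brevity, prints a reduced set of fields, and the temporary file names are made up for the demonstration.

```shell
# Sample rows modelled on the thread's data; file names are hypothetical.
tmp=$(mktemp -d)
printf 'chrX\t153295817\t153296901\tMECP2_cds_0_0_chrX_153295818_r\t0\t-\n' > "$tmp/file2"
printf 'chrx\t153295819\t153296875 \tMECP2\n' > "$tmp/file1"

result=$(awk '
BEGIN { FS = "[\t_]"; OFS = "\t" }
{ key = tolower($1) SUBSEP tolower($4) }      # normalized chromosome/gene key
FNR == NR {                                   # first file: store the range
    lo[key] = $2 + 0                          # "+ 0" forces a numeric value
    hi[key] = $3 + 0
    next
}
{   pos = $2 + 0
    print $1, pos, (((key in lo) && lo[key] <= pos && pos <= hi[key]) ? "exon" : "other")
}' "$tmp/file2" "$tmp/file1")

echo "$result"
rm -r "$tmp"
```

Here chrx in the first file and chrX in the second map to the same key, so the lookup succeeds even though the spelling differs.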

The above code was written and tested on macOS Mojave (Version 10.14.1) using the Korn shell. It should work with any shell that uses Bourne shell syntax. If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

cmccabe · November 15, 2018, 8:10am

Thank you very much for your help, I did not realize that there were extra spaces but was able to fix that. Again thank you very much

cmccabe · December 7, 2018, 12:18pm

If I try to use a for loop on the above script (which I called exon.sh) to define $file I get empty output.

for file in path/to/*.txt ; do
     bname=$(basename $file)
     pref=${bname%%_*.txt}
     bash /path/to/exon.sh static $file > path/to/${pref}_output.txt
done

In the for loop the static will never change only the $file variable will (always .txt file). If I hardcode the files to use as part of the script then the desired output is achieved. I did not mention this in the original post because I thought the for loop would be able to be used. However I seem to be using it incorrectly. Thank you :).

bakunin · December 7, 2018, 3:07pm

Please sit down, we need to talk. More specifically, i need to give you "the talk" - and, no, it is not about flowers and bees.... ;-))

What you do might look to you like some "quick hacks" to make your life easier. In fact it is full-fledged software development and you will never be successful in this endeavour if you do not apply the tenets and procedures of software development. You will never be a successful biologist if you, instead of following established good lab practice, do whatever comes to your mind. I take it you learned your trade as a researcher and acquired all these established best practices. It is now time you do the same with this particular aspect of your research.

Instead of giving you an answer i'd like to show you how to methodically apply procedures to hunt down a bug and find out the answer yourself. Even if your code is only a few lines long you apply the same techniques.

cmccabe:

for file in path/to/*.txt ; do
   bname=$(basename $file)
   pref=${bname%%_*.txt}
   bash /path/to/exon.sh static $file > path/to/${pref}_output.txt
done
In the for loop the static will never change only the $file variable will (always .txt file). If I hardcode the files to use as part of the script then the desired output is achieved. I did not mention this in the original post because I thought the for loop would be able to be used.

The first, obvious thing that comes to mind is this discrepancy:

bash /path/to/exon.sh static $file > path/to/${pref}_output.txt

whereas Don Cragun's script from post #1 reads:

#!/bin/ksh

Now, bash is well capable of starting a Korn shell, so that should work, but it is best practice to minimize every possible source of problems. Because the script states its command processor anyway you can change the line to:

/path/to/exon.sh static $file > path/to/${pref}_output.txt

which will perhaps not remedy the problem. Let us get on! The next thing is: if you get empty output you may have empty input. We can see how the involved programs are called and what they are told to do, but there is some uncertainty left, namely the variable contents: we suppose them to be correct, but the way to be sure about something is to test it, so let us test it. For this we change the script a little bit. We do that in single steps, the way walking is done: you take one step at a time, because if you try to take several steps at once you hop up and down but won't go anywhere.

The first thing to test is the for-loop itself. Does it produce all the files we want it to produce? And, while we are at it:

does it produce all the files we want?
does it produce files we do NOT want ("false positives")?
does it not produce files we do want ("false negatives")?

for file in path/to/*.txt ; do
     echo "$file"
     # bname=$(basename $file)
     # pref=${bname%%_*.txt}
     # bash /path/to/exon.sh static $file > path/to/${pref}_output.txt
done

What did you find? Often software does not do what it is supposed to do because of small things we easily overlook: e.g. "path/to/*.txt" lacks a leading "/" to be an absolute path. I understand this is not your real path, but maybe you made the same typo (or a similar one) there as you did here. This test makes sure that - if the correct list is produced - this is not the case. This part will be "provenly correct". Let us assume it is and get on. The next thing we test is the variable expansion:

for file in path/to/*.txt ; do
     echo "$file"
     bname=$(basename $file)
     pref=${bname%%_*.txt}
     echo "bname: \"$bname\"   pref:\"$pref\""
     # bash /path/to/exon.sh static $file > path/to/${pref}_output.txt
done

The first thing i notice is the lacking quoting of the variables. Your code will break when a filename will contain a space. A line like:

variable=something

should, if you are not absolutely 101% sure about what "something" is (and even then, because it doesn't hurt and you should do it habitually right) be quoted:

variable="something"

Therefore:

for file in path/to/*.txt ; do
     echo "$file"
     bname=$(basename "$file")
     pref="${bname%%_*.txt}"
     echo "bname: \"$bname\"   pref:\"$pref\""
     # bash /path/to/exon.sh static "$file" > "path/to/${pref}_output.txt"
done
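
Why the quotes matter can be seen in a small sketch. The file name below is hypothetical and deliberately contains a space; count_args is a made-up helper that only reports how many arguments it received. It assumes a Bourne-style shell such as bash or ksh with the default IFS.

```shell
file='path/to/my sample_regions.txt'   # hypothetical name containing a space

count_args() { echo "$#"; }            # prints the number of arguments it was given

unquoted=$(count_args $file)           # word splitting breaks the name apart
quoted=$(count_args "$file")           # the name stays one argument

echo "unquoted: $unquoted  quoted: $quoted"
```

Without quotes the shell splits the value at the space and the command sees two arguments; with quotes it sees one, which is what basename and the script expect.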

Now, run that. Do the variables all contain the expected values? To be honest, i am suspicious that they do not, for some reason. But you now have the tools to find out - the first step in correcting it.

The last thing, if the variables do produce the correct values, is to test the command itself: instead of running it we just display it. Notice that we need to escape the redirection:

for file in path/to/*.txt ; do
     bname=$(basename "$file")
     pref="${bname%%_*.txt}"
     echo /path/to/exon.sh static "$file" \> "path/to/${pref}_output.txt"
done

Now, this should produce a list of commands. Copy and paste one of them to another window and let it run. There might be some diagnostic message (very common are "file not found", "path does not exist" and similar ones, also an attempt to write to some write protected place, full disks, ....) in case the command is what fails.

In one sentence: we took a complex procedure which didn't work as expected and tested one step after the other until we found the culprit. This is how every scientist works and this is how software developers work. Here is a bonus: you can switch "tracing mode" in the shell on and off so that every command is displayed (to stderr) before it is executed. Try this modification:

set -xv
for file in path/to/*.txt ; do
     bname=$(basename "$file")
     pref="${bname%%_*.txt}"
     echo /path/to/exon.sh static "$file" \> "path/to/${pref}_output.txt"
done
set +xv

set -xv switches on the trace, set +xv switches it off again. You could also only trace certain parts:

for file in path/to/*.txt ; do
     bname=$(basename "$file")
     pref="${bname%%_*.txt}"
     set -xv
     /path/to/exon.sh static "$file" \> "path/to/${pref}_output.txt"
     set +xv
done

This works for Korn shell (ksh) and bash alike.

I hope this helps.

bakunin

cmccabe · December 8, 2018, 8:39am

I truly appreciate the helpful troubleshooting tips and explanations. I will use them to debug and post back. I did learn as a researcher and am always trying to refine and improve my technique. I don't know if I will get there, but I will always try and practice. Biology and programming are essential and interesting. Thank you again.

--- Post updated 12-08-18 at 07:39 AM ---

The variables are all being populated correctly, but using set -xv , thank you... the script never gets past or to the done (it stalls on the bold line).

for file in /home/cmccabe/folder/less/*.txt ; do
     bname=$(basename "$file")
     pref="${bname%%_*.txt}"
     set -xv
     /home/cmccabe/folder/less/exon.sh /home/cmccabe/folder/less/all_cdsV2 "$file" \> /home/cmccabe/folder/less/${pref}_output.txt
done
set +xv

So since the first .txt file never completes, the second one never starts processing. Why this happens seems to be my issue, so now it's a matter of figuring out the why. Thank you for the helpful "talk", I appreciate it :).

I ran bash -x

for file in /home/cmccabe/folder/less/*.txt ; do
     bname=$(basename "$file")
     pref="${bname%%_*.txt}"
     bash -x /home/cmccabe/folder/less/exon.sh /home/cmccabe/folder/less/all_cdsV2 "$file" > /home/cmccabe/folder/less/${pref}_output.txt
done

and confirmed the .txt files are not being passed to the exon script as input.

Don_Cragun · December 8, 2018, 10:47am

Did you try running the logical equivalent of:

for file in path/to/*.txt ; do
     echo "$file"
     bname=$(basename "$file")
     pref="${bname%%_*.txt}"
     echo "bname: \"$bname\"   pref:\"$pref\""
     # bash /path/to/exon.sh static "$file" > "path/to/${pref}_output.txt"
done

with your actual pathnames and operands, as bakunin suggested? Please show us the output that produced! Like bakunin, I find it hard to believe that pref is being set to the value that I would assume you are trying to set (which isn't at all clear to me).

And, when you run the loop:

for file in /home/cmccabe/folder/less/*.txt ; do
     bname=$(basename "$file")
     pref="${bname%%_*.txt}"
     set -xv
     /home/cmccabe/folder/less/exon.sh /home/cmccabe/folder/less/all_cdsV2 "$file" \> /home/cmccabe/folder/less/${pref}_output.txt
done
set +xv

(which has the set +xv line after the done instead of before it like bakunin suggested), the whole purpose of enabling tracing on the invocations of exon.sh is so we can all see the trace output produced. But, you haven't shown us any of the trace output???

Please try this slight modification to the above, and show us the output that it produces:

for file in /home/cmccabe/folder/less/*.txt
do   bname=$(basename "$file")
     pref=${bname%%_*.txt}
     echo "file:\"$file\"    bname:\"$bname\"    pref:\"$pref\""
     echo "output will be directed to:\"/home/cmccabe/folder/less/${pref}_output.txt\""
     #/home/cmccabe/folder/less/exon.sh /home/cmccabe/folder/less/all_cdsV2 "$file" > "/home/cmccabe/folder/less/${pref}_output.txt"
done

If, and only if, that produces the values that you expect for bname , pref , and the pathname of the file you want to produce, then also try running the following:

for file in /home/cmccabe/folder/less/*.txt
do   bname=$(basename "$file")
     pref=${bname%%_*.txt}
     #echo "file:\"$file\"    bname:\"$bname\"    pref:\"$pref\""
     #echo "output will be directed to:\"/home/cmccabe/folder/less/${pref}_output.txt\""
     set -xv
     /home/cmccabe/folder/less/exon.sh /home/cmccabe/folder/less/all_cdsV2 "$file" > "/home/cmccabe/folder/less/${pref}_output.txt"
     set +xv
done

and show us the output that produces.

Note that I am not sure why bakunin suggested using \> in the exon.sh command. Doing that makes the redirection operator and the intended output file become operands to exon.sh instead of being a redirection.
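
The difference between an escaped and a real redirection can be checked with a stand-in command. In this sketch show_args is a hypothetical helper that reports its operands; it takes the place of exon.sh only for the demonstration, and regions.txt is a made-up operand name.

```shell
out=$(mktemp)
show_args() { echo "$# operands: $*"; }    # stand-in for the real script

# Escaped: ">" and the file name are passed to the command as ordinary operands.
escaped=$(show_args all_cdsV2 regions.txt \> "$out")

# Unescaped: the shell performs the redirection; the command sees two operands.
show_args all_cdsV2 regions.txt > "$out"
redirected=$(cat "$out")

echo "$escaped"
echo "$redirected"
rm -f "$out"
```

The first line printed reports four operands, the second reports two, which is why the escaped form is only useful together with echo while debugging.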

I also note that the exon.sh can't be the script that I supplied in post #11. That script didn't look at any of its operands; it only used the presence of one or more operands as a flag to enable debugging printouts. I will assume that you removed the debugging printf statements and the d variable and are using the two operands you are passing to exon.sh as the two filenames processed by that script.

cmccabe · December 8, 2018, 7:27pm

There are two .txt files in the directory:

00-0000_regions.txt and 11-1111_regions.txt

these two .txt files are $bname and $pref is the digits after the _regions.txt is removed.
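
How the prefix is derived can be tried in isolation. The sketch below uses one of the two file names mentioned above plus a second, hypothetical name with two underscores to show how %% (longest matching suffix) differs from % (shortest matching suffix).

```shell
bname='00-0000_regions.txt'
pref=${bname%%_*.txt}          # remove the longest suffix matching "_*.txt"
echo "$pref"

other='a_b_regions.txt'        # hypothetical name with two underscores
longest=${other%%_*.txt}       # strips from the first underscore onward
shortest=${other%_*.txt}       # strips from the last underscore onward
echo "$longest $shortest"
```

For names with a single underscore both forms give the same result; they only differ when the part before _regions.txt itself contains an underscore.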

for file in /home/cmccabe/folder/less/*.txt ; do
     echo "$file"
     bname=$(basename "$file")
     pref="${bname%%_*.txt}"
     echo "bname: \"$bname\"   pref:\"$pref\""
     # /home/cmccabe/folder/less/exon.sh /home/cmccabe/folder/less/all_cdsV2 $file > /home/cmccabe/folder/less/${pref}_output.txt
done
/home/cmccabe/folder/less/00-0000_regions.txt
bname: "00-0000_regions.txt"   pref:"00-0000"
/home/cmccabe/folder/less/11-1111_regions.txt
bname: "11-1111_regions.txt"   pref:"11-1111"

for file in /home/cmccabe/folder/less/*.txt ; do   
       bname=$(basename "$file")
       pref=${bname%%_*.txt}
       echo "file:\"$file\"    bname:\"$bname\"    pref:\"$pref\""
       echo "output will be directed to:\"/home/cmccabe/folder/less/${pref}_output.txt\""
       #/home/cmccabe/folder/less/exon.sh /home/cmccabe/folder/less/all_cdsV2 "$file" > "/home/cmccabe/folder/less/${pref}_output.txt"
done
    file:"/home/cmccabe/folder/less/00-0000_regions.txt"    bname:"00-0000_regions.txt"    pref:"00-0000"
    output will be directed to:"/home/cmccabe/folder/less/00-0000_output.txt"
    file:"/home/cmccabe/folder/less/11-1111_regions.txt"    bname:"11-1111_regions.txt"    pref:"11-1111"

for file in /home/cmccabe/folder/less/*.txt ; do
       bname=$(basename "$file")
       pref=${bname%%_*.txt}
       #echo "file:\"$file\"    bname:\"$bname\"    pref:\"$pref\""
       #echo "output will be directed to:\"/home/cmccabe/folder/less/${pref}_output.txt\""
       set -xv
       /home/cmccabe/folder/less/exon.sh /home/cmccabe/folder/less/all_cdsV2 "$file" > "/home/cmccabe/folder/less/${pref}_output.txt"
       set +xv
done
+ /home/cmccabe/folder/less/exon.sh /home/cmccabe/folder/less/all_cdsV2 /home/cmccabe/folder/less/00-0000_output.txt

The process seems to stall and get stuck on the first file 00-0000_regions.txt .

I did also run bash -x /home/cmccabe/folder/less/exon.sh and can see that the .txt files are not getting passed to the script as input. I'm not sure why but I believe this may help:

I'm not sure I follow but I will re-read and maybe that will help. Thank you :).

for file in /home/cmccabe/folder/less/*.txt ; do      bname=$(basename "$file");      pref="${bname%%_*.txt}";      bash -x /home/cmccabe/folder/less/exon.sh /home/cmccabe/folder/less/all_cdsV2 "$file" > /home/cmccabe/folder/less/${pref}_output.txt; done
+ for file in '/home/cmccabe/folder/less/*.txt'
basename "$file"
++ basename /home/cmccabe/folder/less/00-0000_output.txt
+ bname=00-0000_output.txt
+ pref=00-0000
+ bash -x /home/cmccabe/folder/less/exon.sh /home/cmccabe/folder/less/all_cdsV2 /home/cmccabe/folder/less/00-0000_output.txt
+ awk -v d=2 '
BEGIN {	FS = "[\t_]"
	OFS = "\t"
}
FNR == NR {
	m[$1, $4, ++c[$1, $4]] = $2 + 0
	M[$1, $4, c[$1, $4]] = $3 + 0
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n",
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
	next
}
{	if(d) printf("FNR=%d:\"%s\"\n",FNR,$0)
	for(i = 1; i <= c[$1, $4]; i++) {
		if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n",
			i, m[$1, $4, i],
			i, M[$1, $4, i],
			$2)
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {
			$5 = "exon"
			break
		} else {if(m[$1, $4, i] > $2 + 0) {
				if(m[$1, $4, i] - 10 <= $2 + 0) {
					$5 = "splicing"
					break
				} else {$5 = "intron"
					break
				}
		}
	}
}
	if(i > c[$1, $4])
		$5 = "intron"
}
1'

Don_Cragun · December 8, 2018, 11:38pm

One might note that the script I provided for you in post #11 had file2 and file1 as operands to awk after the awk code operand. The trace output you have shown us for the awk command at the end of post #17 shows that awk is called with only one operand (the awk code as its first and only operand). That code is, as you say, stalling because it is waiting for you to type input into it since it has no file operands specifying what files it is supposed to process!

Clearly, the assumption I stated at the end of post #16 was wrong. I trust that you will fix it.
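
One way such a fix could look is sketched below, under the assumption that the first operand is the exon file and the second is the regions file. The function name exon and the sample file names are hypothetical; the classification logic is a condensed form of the script from earlier in the thread with the debugging output removed. The essential change is the pair of operands after the awk program.

```shell
exon() {
    # $1 = exon file (file2 in the thread), $2 = regions file (file1)
    awk '
    BEGIN { FS = "[\t_]"; OFS = "\t" }
    FNR == NR {                      # first operand: store every range per chromosome/gene
        n = ++c[$1, $4]
        m[$1, $4, n] = $2 + 0
        M[$1, $4, n] = $3 + 0
        next
    }
    {                                # second operand: classify each region
        pos = $2 + 0
        $5 = "intron"                # default when nothing else applies
        for (i = 1; i <= c[$1, $4]; i++) {
            if (m[$1, $4, i] <= pos && pos <= M[$1, $4, i]) { $5 = "exon"; break }
            if (m[$1, $4, i] > pos) {
                if (m[$1, $4, i] - 10 <= pos) $5 = "splicing"
                break
            }
        }
        print
    }' "$1" "$2"                     # the file operands the traced awk call was missing
}

# Small made-up sample to exercise the function.
tmp=$(mktemp -d)
printf 'chrX\t153295817\t153296901\tMECP2_cds_0_0_chrX_153295818_r\t0\t-\n' > "$tmp/all_cds"
printf 'chrX\t153295819\t153296875\tMECP2\nchrX\t153295810\t153296800\tMECP2\n' > "$tmp/00-0000_regions.txt"
result=$(exon "$tmp/all_cds" "$tmp/00-0000_regions.txt")
echo "$result"
rm -r "$tmp"
```

In a standalone script the function body would simply be the script body, so that the loop can call it with the static file and "$file" as its two operands.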

bakunin · December 9, 2018, 5:52am

Exactly this was the point. When i suggested debugging the command i suggested to use an echo command in front of it. This echo would have been redirected instead of displayed. This was why i suggested to escape the redirection to display - for debugging purposes - the commands which would be issued by the script.

The last thing, if the variables do produce the correct values, is to test the command itself: instead of running it we just display it. Notice that we need to escape the redirection:
[...]
echo /path/to/exon.sh static "$file" \> "path/to/${pref}_output.txt"

Of course the escaped redirection has to be reverted back to normal once the debugging is over.

bakunin

cmccabe · December 10, 2018, 10:39am

In the original code below, the bold is the static file, the one that is always used. The italics output is set by pref , and $file is the underlined portion and depends on each .txt file in the directory.

Since one of the operands will never change and the other is set by the for loop, are you saying (sorry for my confusion). I added comments as well. Thank you :).

#!/bin/sh
awk -v d=$# '    (the d=$# is removed because the files are dependent on the for loop)
BEGIN {	FS = "[\t_]"
	OFS = "\t"
}
FNR == NR {
	m[$1, $4, ++c[$1, $4]] = $2 + 0
	M[$1, $4, c[$1, $4]] = $3 + 0
	if(d) printf("m[%s,%s,%d]=%s,M[%s,%s,%d]=%s\n",
		$1, $4, c[$1, $4], m[$1, $4, c[$1, $4]],
		$1, $4, c[$1, $4], M[$1, $4, c[$1, $4]])
	next
}
{	#if(d) printf("FNR=%d:\"%s\"\n",FNR,$0) (remove this line as it assumes d is typed in)
	for(i = 1; i <= c[$1, $4]; i++) {
		if(d) printf("m[%d]=%d,M[%d]=%d,$2=%d\n",
			i, m[$1, $4, i],
			i, M[$1, $4, i],
			$2)
		if(m[$1, $4, i] <= $2 && $2 <= M[$1, $4, i]) {
			$5 = "exon"
			break
		} else {if(m[$1, $4, i] > $2 + 0) {
				if(m[$1, $4, i] - 10 <= $2 + 0) {
					$5 = "splicing"
					break
				} else {$5 = "intron"
					break
				}
		}
	}
}
	if(i > c[$1, $4])
		$5 = "intron"
}
1' all_cdsV2 00-0000low > 00-0000_filter