How to get min and max values using awk?

Hi,

I need your kind help to get the min and max values from a file, based on the value in $5.

File1

SP12.3	stc	2240806	2240808	+	ID1_N003	 ID2_N003T0
SP12.3	sto	2241682	2241684	+	ID1_N003	 ID2_N003T0
SP12.3	XE	2239943	2240011	+	ID1_N003	 ID2_N003T0
SP12.3	XE	2240077	2241254	+	ID1_N003	 ID2_N003T0
SP12.3	CD	2240806	2241254	+	ID1_N003	 ID2_N003T0
SP12.3	XE	2241471	2241684	+	ID1_N003	 ID2_N003T0
SP12.3	CD	2241471	2241681	+	ID1_N003	 ID2_N003T0
SP12.3	stc	2245127	2245129	+	ID1_N005	 ID2_N005T0
SP12.3	sto	2246954	2246956	+	ID1_N005	 ID2_N005T0
SP12.3	XE	2244762	2247195	+	ID1_N005	 ID2_N005T0
SP12.3	CD	2245127	2246953	+	ID1_N005	 ID2_N005T0
SP12.3	stc	2253115	2253117	-	ID1_N006	 ID2_N006T0
SP12.3	sto	2249759	2249761	-	ID1_N006	 ID2_N006T0
SP12.3	XE	2253090	2254054	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2253090	2253117	-	ID1_N006	 ID2_N006T0
SP12.3	XE	2252492	2252908	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2252492	2252908	-	ID1_N006	 ID2_N006T0
SP12.3	XE	2251730	2251882	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2251730	2251882	-	ID1_N006	 ID2_N006T0
SP12.3	XE	2251591	2251664	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2251591	2251664	-	ID1_N006	 ID2_N006T0
SP12.3	XE	2249887	2251530	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2249887	2251530	-	ID1_N006	 ID2_N006T0
SP12.3	XE	2249087	2249821	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2249762	2249821	-	ID1_N006	 ID2_N006T0
SP12.3	stc	2252073	2252075	-	ID1_N006	 ID2_N006T1
SP12.3	sto	2249759	2249761	-	ID1_N006	 ID2_N006T1
SP12.3	XE	2252492	2252973	-	ID1_N006	 ID2_N006T1
SP12.3	XE	2251730	2252227	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2251730	2252075	-	ID1_N006	 ID2_N006T1
SP12.3	XE	2251591	2251664	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2251591	2251664	-	ID1_N006	 ID2_N006T1
SP12.3	XE	2249887	2251530	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2249887	2251530	-	ID1_N006	 ID2_N006T1
SP12.3	XE	2249090	2249821	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2249762	2249821	-	ID1_N006	 ID2_N006T1
SP12.5	stc	3001307	3001309	+	ID1_N01140	ID2_N01140T0
SP12.5	sto	3005026	3005028	+	ID1_N01140	ID2_N01140T0
SP12.5	XE	3000439	3001397	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3001307	3001397	+	ID1_N01140	ID2_N01140T0
SP12.5	XE	3001572	3002765	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3001572	3002765	+	ID1_N01140	ID2_N01140T0
SP12.5	XE	3002821	3004797	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3002821	3004797	+	ID1_N01140	ID2_N01140T0
SP12.5	XE	3004855	3004929	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3004855	3004929	+	ID1_N01140	ID2_N01140T0
SP12.5	XE	3004994	3005417	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3004994	3005025	+	ID1_N01140	ID2_N01140T0

I tried the following code:

awk -F"\t" '$2=="CD"{if ($5~/\+/) {print $1"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7} else {print $1"\t"$4"\t"$3"\t"$5"\t"$6"\t"$7}}' file1

But the result shows every line containing the "CD" pattern, like below:

SP12.3	CD	2240806	2241254	+	ID1_N003	 ID2_N003T0
SP12.3	CD	2241471	2241681	+	ID1_N003	 ID2_N003T0
SP12.3	CD	2245127	2246953	+	ID1_N005	 ID2_N005T0
SP12.3	CD	2253090	2253117	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2252492	2252908	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2251730	2251882	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2251591	2251664	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2249887	2251530	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2249762	2249821	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2251730	2252075	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2251591	2251664	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2249887	2251530	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2249762	2249821	-	ID1_N006	 ID2_N006T1
SP12.5	CD	3001307	3001397	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3001572	3002765	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3002821	3004797	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3004855	3004929	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3004994	3005025	+	ID1_N01140	ID2_N01140T0

The output that I actually want shows only the min and max values where the "CD" pattern is found, and it should depend on the value in $5. If $5 is "+", then $3 of the first "CD" line and $4 of the last "CD" line for each ID2 ($7) are printed as $3 and $4 of the output, respectively. If $5 is "-", then $4 of the first "CD" line and $3 of the last "CD" line for each ID2 ($7) are printed as $4 and $3, respectively, like below:

SP12.3	CD	2240806	2241681	+	ID1_N003	 ID2_N003T0
SP12.3	CD	2249762	2253117	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2249762	2252075	-	ID1_N006	 ID2_N006T1
SP12.5	CD	3001307	3005025	+	ID1_N01140	ID2_N01140T0

If there is only one CD line for an ID2 ($7), that line should be omitted. I would appreciate your help with this. Thanks

I don't understand your selection of the left value for the "+" sign, nor of the right value for the "-" sign. With this code

awk     '$2 != "CD"     {next}
         !($7 in EXT3)  {EXT3[$7]=EXT4[$7]= -1E100 * ($5"1")}
                        {CNT[$7]++;SGN[$7]=$5}
         $5 == "+"      {if ($3 > EXT3[$7]) EXT3[$7] = $3
                         if ($4 > EXT4[$7]) EXT4[$7] = $4}
         $5 == "-"      {if ($3 < EXT3[$7]) EXT3[$7] = $3
                         if ($4 < EXT4[$7]) EXT4[$7] = $4}

         END            {for (i in EXT3) if (2 <= CNT[i]) print "SP12.3", "CD", EXT3[i], EXT4[i], SGN[i], substr (i, 2, 8), i}
        ' FS="\t" OFS="\t" file

I get the result

SP12.3    CD    2249762    2249821    -    ID2_N006     ID2_N006T1
SP12.3    CD    2249762    2249821    -    ID2_N006     ID2_N006T0
SP12.3    CD    2241471    2241681    +    ID2_N003     ID2_N003T0

which does not match your requirement for the above-mentioned values...
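
For anyone puzzled by the initialisation `-1E100 * ($5"1")` in the script above: `$5"1"` concatenates the sign with 1, giving the string "+1" or "-1", which awk converts to a number when it is multiplied. Here is a minimal sketch of just that step. The function name `sentinel_demo` is made up for illustration, and it reads the sign from $1 rather than $5.

```shell
# Hypothetical helper: reads one sign ("+" or "-") per line on stdin.
sentinel_demo() {
    awk '{
        # "+" becomes "+1" and gives -1E100; "-" becomes "-1" and gives +1E100
        start = -1E100 * ($1 "1")
        print $1, (start < 0 ? "max search" : "min search")
    }'
}

printf '+\n-\n' | sentinel_demo
```

So a "+" group starts from a very low value and keeps the largest $3 and $4 it sees, while a "-" group starts from a very high value and keeps the smallest.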

Hi RudiC,

Thanks a lot for your quick response.
I am not really clear about your question above, but I am extracting info on gene features, and that is how the region of the coding sequence is found.

I tried your code, but it did not give accurate results on my real data. I tried changing and playing around with it, but the result is still not correct. Below is a sample of the result that I got:

SP12.5	CD	2249762	2249821	-	ID2_N006	 ID2_N006T1
SP12.5	CD	3004994	3005025	+	D2_N0114	ID2_N01140T0
SP12.5	CD	2249762	2249821	-	ID2_N006	 ID2_N006T0
SP12.5	CD	2241471	2241681	+	ID2_N003	 ID2_N003T0

If you don't mind, can you explain your code? The above data is just a sample. For $1, I have many different values, not only SP12.3. So I changed print "SP12.3" to print "$1", but the output is still wrong. Thanks

awk '
	$2=="CD" {
		key=$5"|"$9"|";
		($3>A[key"max"] || A[key"max"]=="")? A[key"max"]=$3:"";
		($4>A[key"max"] || A[key"max"]=="")? A[key"max"]=$4:"";
		($3<A[key"min"] || A[key"min"]=="")? A[key"min"]=$3:"";
		($4<A[key"min"] || A[key"min"]=="")? A[key"min"]=$4:"";
		!(key in line)? line[key]=$0: "";
		count[$9]++;
	}
	END {
		for(key in line) {
			split(key,s,"|");
			if(count[s[2]] > 1) {
				sub(/[0-9]+\s+[0-9]+/, A[key"min"]" "A[key"max"], line[key]);
				print line[key];
			}
		}
	}
' file

Hi jethrow,

Thanks so much for your response. I tried your code, but the result is not accurate:

SP12.5	CD	3001307	3001397	+	ID1_N01140	ID2_N01140T0
SP12.3	CD	2240806	2241254	+	ID1_N003	 ID2_N003T0
SP12.3	CD	2251730	2252075	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2253090	2253117	-	ID1_N006	 ID2_N006T0

It is no wonder that the results you are getting are not what you want. Your description of how to process the input is so vague that we do not understand what you want.

The code you showed us prints parts of every line with "CD" in the 2nd field. For those lines, it throws away fields 2, 8, and 9; and, if $5 is "+", it swaps fields 3 and 4 before printing the remainder of the line. But, the output you say you want shows every field (keeping fields 2, 8, and 9). And if fields 3 and 4 have been swapped, it isn't obvious to me.

You mentioned ID2 ($7), but it looks like you are looking for the minimum $3 value and the maximum $4 value for each different value in field 9 (not field 7). And from the data shown, I don't see that the + or - in field 5 makes any difference at all.

You have shown us data where fields 1, 6, and 8 are all constants. You have said that $1 may change, but you haven't given any indication of how, or if, that should affect the output produced.

Please give us a clear English description of what you are trying to do and explain what the meaning is for each of the fields in your file.

Also, lots of gene data that we're asked to help with has huge files to process. If that is the case here as well, any details you can give us about the data may help speed up the process considerably. For example, what you have shown us could be sorted with field 1, 5, or 9 as a primary sort key. If data is to be grouped using field 9 as a key and the input is sorted on field 9, we can produce any needed output every time the contents of field 9 changes (as opposed to accumulating all of the input into memory and processing everything at the end).
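
If the lines for each ID2 really are contiguous, the change-of-key idea described above could look something like the sketch below. It is only a sketch under assumptions: the function name `cd_span` and the variable names are made up, the grouping key is taken to be $7 alone, the input is tab-separated with seven fields as in the sample, and the +/- rule is the one from the first post ($3 of the first CD line and $4 of the last for "+", the other way round for "-").

```shell
# Hypothetical helper: reads tab-separated records on stdin, grouped by $7.
cd_span() {
    awk -F'\t' -v OFS='\t' '
        # Print the finished group, but only if it had at least two CD lines.
        function emit() {
            if (n >= 2) {
                if (sign == "-") print chr, "CD", last3, first4, sign, id1, id2
                else             print chr, "CD", first3, last4, sign, id1, id2
            }
            n = 0
        }
        $2 != "CD" { next }
        # First CD line of a new group: flush the previous group, start a new one.
        $7 != id2 || n == 0 {
            emit()
            chr = $1; first3 = $3; first4 = $4; sign = $5; id1 = $6; id2 = $7
        }
        # Every CD line: count it and remember the latest $3 and $4.
        { n++; last3 = $3; last4 = $4 }
        END { emit() }
    '
}
```

Called as `cd_span < File1`, this keeps only one group in memory at a time and prints the groups in input order. On the sample in the first post it should give the four lines listed there as the desired output.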

We also need to know up front whether or not it is important that the output be in the same order as the input.

And, finally: just saying that the code you were given didn't give you accurate results is useless information. Show us the output you got, the output you wanted, and explain why (based on your description of what you wanted) the output you got was wrong! Help us help you!


Hi Don Cragun,

Thank you for your comments. Forgive me for the vague description. I just edited my question and sample above, and I tried my best to explain my issue. My data is long and huge and has different conditions, so I tried to keep the sample simple, but it seems that created more confusion. My mistake. Thanks

Hi redse171,
Thanks for the update. That gives us a better idea of what you are trying to do, although the awk script you have shown us will not produce the output you showed us for the sample input you provided. (Your awk script doesn't copy the CD field to the output.)

I haven't dug into all of the details again yet, but I think that if we get answers to the following, we'll be able to help you write a script that will work:

  1. Do you want the output to contain the "CD" field from the input?
  2. Will all lines with the same combination of $5, $6, and $7 values be on contiguous lines in your input file? (The answer to this is "yes" for your sample input. Does it hold true for your real, huge input files?)
  3. If the answer to #2 is no, does the order of lines in your output file matter?

The answers to Don Cragun's questions above may invalidate the assumptions on which this is based. Try

awk     '$2 != "CD"     {next}                                          # not a "CD" line -> no action
         !($7 in LINE)  {LINE[$7]=$0}                                   # new $7? Keep line with first occurrence of $3/$4 in memory
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}               # count $7 lines and keep last $3 and $4

         END            {for (i in LINE) if (CNT[i]>=2) {               # for the lines recorded, if count = 1: discard
                                 match (LINE[i],"[0-9]*\t[0-9]*\t[+-]") # search for $3 $4 +- pattern (you can use constants here if
                                                                        # sure the file structure remains identical all over)
                                 if (substr (LINE[i], RSTART+RLENGTH-1, 1) == "-") {    # take decision on + or -
                                        POS=RSTART                      # where to replace
                                        STR=E3[i]}                      # what to put in
                                  else {POS=RSTART+8
                                        STR=E4[i]}
                                 print  substr (LINE[i], 1, POS-2),     # print first part of line, dep. on sign
                                        STR,                            #       replacement string
                                        substr (LINE[i], POS+8)         #       last part
                                }
                        }
        ' FS="\t" OFS="\t" file
SP12.3    CD    2249762    2252075    -    ID1_N006     ID2_N006T1
SP12.5    CD    3001307    3005025    +    ID1_N01140    ID2_N01140T0
SP12.3    CD    2249762    2253117    -    ID1_N006     ID2_N006T0
SP12.3    CD    2240806    2241681    +    ID1_N003     ID2_N003T0

Hi Don Cragun,

To answer your questions:-

  1. Yes, I need to have the "CD" field in my output file, as shown in my sample output.
  2. Yes, that also holds for my huge input files.

thanks.

---------- Post updated at 10:12 AM ---------- Previous update was at 10:07 AM ----------

Hi RudiC,

I tried your code, and thanks so much for your explanations. It seems to work on my real input file, except that a few lines look a little bit weird. I am checking on that now and will try playing around with your code. I will give feedback as soon as possible. Thanks

---------- Post updated at 09:25 PM ---------- Previous update was at 10:12 AM ----------

Hi,

Just to give feedback: I modified RudiC's code to suit my real data. The code worked well with the sample data, but it assumes a fixed number and position of digits in $3 and $4, which does not hold in my real, huge file. So I split LINE into segments with split() and take the values from the segments (following the awk manual). Thanks to RudiC for the code and the explanations, which helped me understand it better. Below is the modified code, which gives the results I wanted.

awk     '$2 != "CD"     {next}                                          
         !($7 in LINE)  {LINE[$7]=$0}                                   
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}               

         END            {for (i in LINE) if (CNT[i]>=2) {
                                 match (LINE[i],"[0-9]*\t[0-9]*\t[+-]")

                                 if (substr (LINE[i], RSTART+RLENGTH-1, 1) == "-") {
                                        POS=RSTART
                                        STR=E3[i]
                                 split(LINE[i], seg, "\t")
                                 print  seg[1], seg[2],
                                        STR,
                                        seg[4], seg[5], seg[6], seg[7]
                                 }
                                 else {POS=RSTART+7
                                       STR=E4[i]
                                 split(LINE[i], seg, "\t")
                                 print  seg[1], seg[2], seg[3],     
                                        STR,                           
                                        seg[5], seg[6], seg[7] 

                                 }
                                }
                        }
        ' FS="\t" OFS="\t" File1

My first code was not informative enough, as I had no idea how to find the min and max in my input file; what I gave just extracted all lines with the CD pattern. The help that I got here is awesome and helped me learn and understand better. Thanks a lot!

Hi redse171,
I'm very glad that RudiC was able to help you find a solution to your problem. Note that if you need to use split() to correctly group your fields, you don't need to also use match() and substr() to determine whether you have a + or - in field 5; you can just look directly at seg[5] after you call split(). You can then simplify your code to something like:

awk     '$2 != "CD"     {next}                                          
         !($7 in LINE)  {LINE[$7]=$0}                                   
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}               

         END            {for (i in LINE) if (CNT[i]>=2) {
                                 split(LINE[i], seg)
                                 if (seg[5] == "-") {
                                         print  seg[1], seg[2], E3[i],
                                                seg[4], seg[5], seg[6], seg[7]
                                 } else {
                                         print  seg[1], seg[2], seg[3],
                                                E4[i], seg[5], seg[6], seg[7]
                                 }
			 }
                        }
        ' FS="\t" OFS="\t" File1

and get the same results.

Hope this helps,
Don

Further simplification:

awk     '$2 != "CD"     {next}
         !($7 in LINE)  {LINE[$7]=$0}
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}

         END            {for (i in LINE) if (CNT[i]>=2) {
                                split(LINE[i], seg)
                                if (seg[5] == "-")      seg[3] = E3[i]
                                else                    seg[4] = E4[i]
                                print  seg[1], seg[2], seg[3], seg[4], seg[5], seg[6], seg[7]
                         }
                        }
        ' FS="\t" OFS="\t" file
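
One thing to keep in mind with the scripts above: `for (i in LINE)` visits the keys in an unspecified order, which is why the output shown earlier in the thread is not in input order. If the order matters (Don Cragun's question 3), one way is to remember the keys in the order they first appear. A sketch of that, where `cd_span_ordered`, `ORDER` and `N` are names made up here and the rest follows the simplified script:

```shell
# Hypothetical helper: same logic as the simplified script, but prints
# the groups in the order in which each $7 value was first seen.
cd_span_ordered() {
    awk -F'\t' -v OFS='\t' '
        $2 != "CD"     { next }
        !($7 in LINE)  { LINE[$7] = $0; ORDER[++N] = $7 }   # remember first-seen order
                       { CNT[$7]++; E3[$7] = $3; E4[$7] = $4 }
        END {
            for (k = 1; k <= N; k++) {
                i = ORDER[k]
                if (CNT[i] < 2) continue                     # single CD line: skip
                split(LINE[i], seg, "\t")
                if (seg[5] == "-") seg[3] = E3[i]
                else               seg[4] = E4[i]
                print seg[1], seg[2], seg[3], seg[4], seg[5], seg[6], seg[7]
            }
        }
    '
}
```

Usage would be `cd_span_ordered < file`. Unlike a change-of-key approach it still holds every group in memory until the end, so it does not need the groups to be contiguous.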

Hi Don,

It does help! It's just that I needed to add a tiny part (the assignments seg[3]= and seg[4]= in the print statements), or else it won't show $4 in my output.

awk     '$2 != "CD"     {next}                                          
         !($7 in LINE)  {LINE[$7]=$0}                                   
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}               

         END            {for (i in LINE) if (CNT[i]>=2) {
                                 split(LINE[i], seg)
                                 if (seg[5] == "-") {
                                         print  seg[1], seg[2], seg[3]= E3[i],
                                                seg[4], seg[5], seg[6], seg[7]
                                 } else {
                                         print  seg[1], seg[2], seg[3],
                                                seg[4]=E4[i], seg[5], seg[6], seg[7]
                                 }
			 }
                        }
        ' FS="\t" OFS="\t" file1


Thanks a bunch!

---------- Post updated at 09:32 AM ---------- Previous update was at 09:31 AM ----------

Hi RudiC,

This is a lot cleaner!! Many thanks