Find smallest between replicates ID

giuliangiuseppe · July 23, 2014, 4:48am

Hi All
I need to find, for each replicated ID (column 1), the line with the smallest value (column 3).
Input file:

a name1 1200
a name2 800
b name1 100
b name2 150
b name3 4

output:

a name2 800
b name3 4

Do you have any suggestions?

Thank you!

Don_Cragun · July 23, 2014, 5:20am

Given what you have learned from your earlier thread, "Output minimum and maximum values for replicates ID", what have you tried to solve this problem on your own?

giuliangiuseppe · July 23, 2014, 6:14am

Hi Don Cragun and thank you for your reply!
Unfortunately, the command from the previous thread does not work (I resolved that issue on my own with a completely different approach).

the command was

awk '{idx=$1 FS $2}FNR==1{a3[idx]=$3}{a3[idx]=(a3[idx]>$3)?a3[idx]:$3;a4[idx]=($4>a4[idx])?$4:a4[idx]} END{for(i in a3)print i,a3[i],a4[i]}' myFile

with myFile:

a x 1 4
a x 2 5
b x 5 10
b x 6 12
c x 8 15
c x 6 12

the output is:

a x 2 5
b x 6 12
c x 8 15

As you can see, column 3 does not report the smallest value.
I tried changing the command a little, without success.

Giuliano

Don_Cragun · July 23, 2014, 12:09pm

Yes. In your previous thread you wanted to print the maximum value for the 4th column and the minimum value for the 3rd column. Now you have an easier job; you just want to print the line that has the minimum value for the 3rd column (and there is no 4th column).

How did you try to change that code to get what you need for this problem?

What did it do?

giuliangiuseppe · July 23, 2014, 2:53pm

I tried this one (assuming a file with 2 columns, the first column being the ID)

awk '{idx=$1}FNR==1{a3[idx]=$2}{a3[idx]=(a3[idx]>$2)?$2:a3[idx]} END{for(i in a3)print i,a3[i]}' myFile

But I have a problem: the command outputs only the first line!

Akshay_Hegde · July 23, 2014, 3:42pm

This might help you

awk '{
	# the ID (possibly replicated) is in column 1
	col = $1

	# the value to be compared is in column 3
	value = $3

	# count how many times each ID occurs
	rep[col]++
      }
      {
	# if col is not yet an index of array a, or the value stored in
	# a[col] is greater than the value on the current line ($3),
	# then update array a
	if(!(col in a) || a[col] > value)
	{
		a[col] = value

		# save the required output (you can also write $1 OFS $2 etc.);
		# indexed by ID so equal minimum values of different IDs
		# cannot overwrite each other; used in the END block
		output[col] = $0
	}
      }
   END{
	# loop through the rep array
	for(i in rep)
	{
		# if the count is greater than 1, the ID is replicated,
		# so print the saved line holding its smallest value
		if(rep[i] > 1)
			print output[i]
	}
      }'    file

Don_Cragun · July 23, 2014, 3:56pm

giuliangiuseppe:

I tried this one (assuming a file with 2 columns, the first column being the ID)
awk '{idx=$1}FNR==1{a3[idx]=$2}{a3[idx]=(a3[idx]>$2)?$2:a3[idx]} END{for(i in a3)print i,a3[i]}' myFile
But I have a problem: the command outputs only the first line!

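The FNR==1{a3[idx]=$2} clause above (marked in red in the original post), which is the only portion of your code that stores an input value in the array a3 without first comparing it, is only executed when FNR==1 (i.e., only when you are looking at the 1st line of the current input file). For IDs first seen on later lines, a3[idx] starts out empty and (for positive values) stays empty, so when you print the array at the end, only that one element has a value.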
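A minimal sketch of one way to repair the two-column attempt, assuming column 1 is the ID and column 2 is the value: seed a3[idx] the first time each ID is seen (not just on the first line of the file), then keep the smaller value. The temporary file and its contents are made up for illustration, and the +0 and the final sort are additions that were not in the original command.

```shell
# Hypothetical two-column sample (ID, value); not taken from the thread.
tmp=$(mktemp)
printf '%s\n' 'a 1200' 'a 800' 'b 100' 'b 150' 'b 4' > "$tmp"

# Store the value when the ID is new or the current value is smaller.
# The +0 forces a numeric comparison; sort makes the output order stable.
result=$(awk '
{ idx = $1 }
!(idx in a3) || $2 + 0 < a3[idx] + 0 { a3[idx] = $2 }
END { for (i in a3) print i, a3[i] }' "$tmp" | sort)

printf '%s\n' "$result"
rm -f "$tmp"
```

With this sample the script prints "a 800" and "b 4". The test idx in a3 checks for the key without creating it, which is why it can replace the FNR==1 condition.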

The following uses similar logic to the code provided by Akshay Hegde, but will also print a line for keys that only appear once in your input file:

awk '
!($1 in d) || f3[$1] > $3 {
	d[$1] = $0
	f3[$1] = $3
}
END {	for(i in d)
		print d[i]
}' myFile

which produces:

a name2 800
b name3 4

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
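One caveat: awk does not specify the order in which for (i in d) visits the keys, so the lines are not guaranteed to come out sorted by ID. A small sketch, assuming sorted output is wanted, that pipes the result through sort. The temporary file stands in for myFile and holds the sample input from the first post; the +0 on both sides of the comparison is a defensive addition to force a numeric comparison.

```shell
# Temporary stand-in for myFile, filled with the sample input from the thread.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
a name1 1200
a name2 800
b name1 100
b name2 150
b name3 4
EOF

# Keep, per ID, the line with the smallest 3rd field, then sort by ID.
sorted=$(awk '
!($1 in d) || f3[$1] + 0 > $3 + 0 {
	d[$1] = $0
	f3[$1] = $3
}
END {	for(i in d)
		print d[i]
}' "$tmp" | sort -k1,1)

printf '%s\n' "$sorted"
rm -f "$tmp"
```

Sorting after awk keeps the awk program unchanged; the price is one extra process in the pipeline.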

Akshay_Hegde · July 23, 2014, 4:02pm

Thanks, Don, for the note. I was totally confused by the title "Find smallest between replicates ID" and the sample input provided, so I added one more array.

Don_Cragun · July 23, 2014, 4:16pm

Hi Akshay,
I agree that the title is an unusual sequence of words in English. I assume that the submitter is not a native English speaker and interpreted it to mean that what was wanted was the line containing the smallest value in the 3rd column for each different key found in the 1st column.

The code you supplied interpreted it to mean that what was wanted was the line containing the smallest value in the 3rd column for each different key found in the first column if that key is present on more than one line.

I think both are reasonable guesses. We'll have to let giuliangiuseppe further clarify what was wanted if neither of us guessed correctly.

PS Note that both of our guesses fit with the given sample input and the desired sample output given by giuliangiuseppe.
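The two interpretations only differ once an ID appears on a single line. A sketch that makes the difference visible, assuming an extra line "c name1 7" is appended to the sample input (that line, the helper name min_lines, and its only_rep switch are all made up for illustration):

```shell
# Sample input from the thread plus one hypothetical non-replicated ID (c).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
a name1 1200
a name2 800
b name1 100
b name2 150
b name3 4
c name1 7
EOF

# min_lines FILE ONLY_REP: hypothetical helper. For each ID in column 1 it
# prints the line with the smallest column 3. When ONLY_REP is 1, IDs that
# occur on only one line are skipped.
min_lines() {
	awk -v only_rep="$2" '
	{ count[$1]++ }
	!($1 in best) || $3 + 0 < min[$1] + 0 { best[$1] = $0; min[$1] = $3 }
	END {
		for (id in best)
			if (only_rep + 0 == 0 || count[id] > 1)
				print best[id]
	}' "$1" | sort -k1,1
}

every_key=$(min_lines "$tmp" 0)
replicated_only=$(min_lines "$tmp" 1)
printf 'every key:\n%s\n\nreplicated keys only:\n%s\n' "$every_key" "$replicated_only"
rm -f "$tmp"
```

Under the first interpretation the line "c name1 7" is printed; under the second it is dropped, because c occurs only once.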

giuliangiuseppe · July 24, 2014, 10:48am

Thank you very much, Don Cragun and Akshay Hegde!
I found your posts very useful, especially because I have not been studying awk for long!

Giuliano

---------- Post updated 24-07-14 at 04:48 PM ---------- Previous update was 23-07-14 at 11:17 PM ----------

Hi
Well, I am sorry for my English. However, Don Cragun, you are right.
Thank you again for the help and explanation.

Giuliano