Bash script to print the smallest floating point number in a row that is not 0

Hello,

I have often found bash to be difficult when it comes to floating point numbers. I have data with rows of tab delimited floating point numbers. I need to find the smallest number in each row that is not 0.0. Numbers can be negative and they do not come in any particular order for a given row.

I guess I would do something like read each row into an array and then sort it but I am not quite sure if that would work with floating point numbers.

Thanks for any suggestions,

LMHmedchem

With more than 300 posts you should know that posting your OS, shell, and tools' versions plus some representative input usually helps. I'm too tired to build an input sample myself...
Does it have to be bash, or are text tools like awk welcome as well?

Sorry, I am a bit tired myself.

This is some sample input. In theory there could be up to 100 columns or so.

1   1.83958   0.0       0.0   0.0  -0.330313
2   0.450996  0.112848  0.0   0.0   0.136161
3  22.8728    0.0       0.0   0.0   0.0

The output would be,

1  -0.330313
2   0.112848
3  22.8728

I added a row index for clarity.

At the moment I am running this under cygwin but will probably run under opensuse as well. This is my bash version,

GNU bash, version 4.1.10(4)-release (i686-pc-cygwin)

This could be with any tool I have under cygwin. I use sed and awk most frequently but I also have perl, ruby, python, etc.

LMHmedchem

What output do you want if all fields are zero?

I assume that you know that sed is not a common tool for this project, and, as you said, bash isn't well known for handling floating point values. If you want to process your file entirely in shell code, ksh would be a good choice. Otherwise, as you well know, awk is perfectly suited to problems like this.

What have you tried to solve this on your own? Where are you stuck?

Sorry, I was in a pretty bad accident last night and I just got back from the Hospital. I probably won't be able to respond more completely until tomorrow. I will answer your questions in my next post.

Thanks,

LMHmedchem

I'm sorry to hear about your accident. Take the time you need to recover; we'll be ready to help when you get back to us.

Thanks, this board is always a great help.

For data like,

index	name	col_1	col_2	col_3	col_4	col_5
1	name_1	6.55903	0	0	0	3.44097
2	name_2	6.73342	0	4.45826	0	5.80832
3	name_3	6.7876	0	9.04868	0	8.16372
4	name_4	7.07704	0	2.06362	-0.6363	0.6673
5	name_5	0	13.15	0	10.4517	3.39833

This version seems to work,

#! /bin/sh

# assumes that the input file has a header row
# assumes that the first column is the index and the second column is a name
# assumes that all columns after the first two contain data

# input file name
input_file=$1
# output file name
output_file=$2

awk ' NR>1 { split($0, line_array, "\t");
             id=line_array[1];
             name=line_array[2];
             delete line_array[1];
             delete line_array[2];
             asort(line_array);
             for(x in line_array) {
                if(line_array[x] != 0) { print id "\t" name "\t" line_array[x];  break; }
             }
             delete line_array;
           }' $input_file > $output_file

giving the output,

1	name_1	3.44097
2	name_2	4.45826
3	name_3	6.7876
4	name_4	-0.6363
5	name_5	3.39833

In short, it parses each row into an array with split(), assigns the first two positions to the id and name variables, and then deletes the first two positions. The array is then sorted with asort(). Finally, the array is checked and the first element that is not 0 is printed along with the name and index. I believe that this gives me the smallest non-zero number.

I don't know the type that is used for the array, so I don't know if the above will work if 0 in the input file is actually 0.0, or 0.0000, etc. It is not really possible for input rows to be all 0, but I guess that should be trapped. I didn't think that awk had asort(). I think Cygwin actually calls gawk for awk commands, but I'm not sure.

Will this work as I have it now?

LMHmedchem

Hi LMHmedchem,
Well done. I see no reason why the code you have will not meet your requirements.

You might want to consider this alternative that should be slightly faster (since it doesn't split or sort the input, and only looks at each field once):

#!/bin/sh

# assumes that the input file has a header row
# assumes that the first column is the index and the second column is a name
# assumes that all columns after the first two contain data

# input file name (defaults to "file")
input_file=${1:-file}
# output file name (defaults to "output")
output_file=${2:-output}

awk '
BEGIN {	FS = OFS = "\t"
}
NR > 1 {m = $3
	for(i = 4; i <= NF; i++)
		if($i && ($i < m || !m))
			m = $i
	if(m)	print $1, $2, m
}' "$input_file" > "$output_file"

Note that even if you don't want default filenames to be supplied when your script is invoked with fewer than two operands, you should still quote the filenames in the last line of your script in case someone invokes your script with a quoted operand containing an IFS character.
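To illustrate the quoting point, here is a minimal sketch. The helper count_args and the filename 'my data.txt' are made up for this example; it only shows how an unquoted expansion is split on the space, under the default IFS.

```shell
# count_args is a hypothetical helper: it reports how many arguments it received.
count_args() {
    echo "$#"
}

# A made-up filename containing a space (an IFS character).
input_file='my data.txt'

# Unquoted: the shell splits the value on the space, so two arguments arrive.
unquoted=$(count_args $input_file)

# Quoted: the value is passed through as a single argument.
quoted=$(count_args "$input_file")

echo "unquoted: $unquoted argument(s)"
echo "quoted:   $quoted argument(s)"
```

With the unquoted form, awk would be asked to open two files ("my" and "data.txt") instead of one.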

Thank you for the solution. I have not seen that syntax to supply a default argument. I have always just tested the value of the argument and set a value if there isn't one. Your suggestion is much preferable because it's not always easy to tell which argument is missing.

If I understand, you set the IO delimiters, and then start with the second row, NR > 1. You assign the value of the third element to m, m = $3, and then compare each other element to m starting with element 4 up to the number of fields, for(i=4; i<=NF; i++).

The comparison, if($i && ($i < m || !m)), is a bit unclear to me. I would guess that if($i) is true if $i is not 0, like evaluating a boolean. The ($i < m) keeps track of the lowest value, but I don't know what the || !m ("or not m") does.

Also, is there any case where if(m) print $1, $2, m will not evaluate as true? I guess if $3=0 and all the rest of the elements are also 0 that statement will not be true. Is that how you trap against all 0 values in a row? If so, I would probably add an else there to print the id and name also with a user message to keep the output fully determined.

You haven't had to explicitly insert the tabs to the print statement since you specified the delimiter with OFS.

I have tried this with input where 0 is 0.0 and 0.0000 and it still works. How does that work with no explicit types?

Sorry for all the questions.

Thanks,

LMHmedchem

Note that if you want to use the default value for the 1st operand and specify a different (non-default) value for the 2nd operand, you still have to supply both operands. To specify a default input filename of "file" and a non-default output filename of "new-output", you can use either:

your_script_name file new-output

or:

your_script_name "" new-output
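As a small sketch of how the ${N:-default} expansion behaves in those cases, the function show_files below is a hypothetical stand-in for the script's argument handling, not part of the script itself:

```shell
# show_files is a hypothetical stand-in for the argument handling in the script.
show_files() {
    # ${1:-file} expands to "file" when $1 is unset OR empty.
    input_file=${1:-file}
    output_file=${2:-output}
    echo "$input_file -> $output_file"
}

show_files                   # both defaults apply
show_files data.txt          # only the output default applies
show_files "" new-output     # the empty first operand falls back to "file"
```

The colon in ${1:-file} is what makes the empty string count as missing; without it (${1-file}), only an unset operand would trigger the default.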

Yes, that reading of the delimiters and the loop is correct.

The variable m holds the minimum found so far in columns 3 through NF on the current input line (or the zero from column 3 if all values seen so far are zero). The if($i && ...) prevents the current column from resetting m if $i is zero or an empty string. When $i is non-zero, the $i < m || !m test resets m in two cases: the current column value is less than the minimum found so far, or m is still zero because no non-zero value has been seen yet on this line. The !m part is what lets a first non-zero value that is greater than zero replace that zero.
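To make that concrete, here is a rough trace of m across row 5 of the sample data (0, 13.15, 0, 10.4517, 3.39833). The steps variable and the trace format are only for illustration; the comparison itself is the one from the script.

```shell
# Feed row 5 of the sample data to awk and record m after each field is examined.
trace=$(printf '5\tname_5\t0\t13.15\t0\t10.4517\t3.39833\n' |
awk 'BEGIN { FS = OFS = "\t" }
{
	m = $3
	steps = "start m=" m
	for (i = 4; i <= NF; i++) {
		# Same test as in the script: skip zeros, keep the smaller value,
		# and let the first non-zero value replace an initial zero.
		if ($i && ($i < m || !m))
			m = $i
		steps = steps "; after $" i " m=" m
	}
	print steps
}')

echo "$trace"
```

Here column 3 is 0, so 13.15 is accepted only because of !m; the second 0 is skipped, and the later, smaller values replace m through $i < m.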

I used that code because it produces no output if all columns on a line have zero (or empty string) values just like the code you used to print results:

             for(x in line_array) {
                if(line_array[x] != 0) { print id "\t" name "\t" line_array[x];  break; }
             }

If you want to print whatever was in field 3 if all values are zero, change:

	if(m)	print $1, $2, m

to:

	print $1, $2, m

If you want alternative text to be printed instead you can either use something like:

	if(m)	print $1, $2, m
	else	print $1, $2, "all-zero"

or more concisely:

	print $1, $2, m ? m : "all-zero"
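A quick way to check the all-zero branch is to run it on two made-up rows, one of which contains only zeros written in different ways. The input below is invented for the test and has no header row, so the NR > 1 condition is left out; the parentheses around the conditional expression are a defensive choice to avoid any parsing ambiguity inside print.

```shell
# Two hypothetical rows: the first is all zeros, the second has non-zero data.
result=$(printf '1\tname_1\t0\t0.0\t0.0000\n2\tname_2\t0\t-1.5\t2.25\n' |
awk 'BEGIN { FS = OFS = "\t" }
{
	m = $3
	for (i = 4; i <= NF; i++)
		if ($i && ($i < m || !m))
			m = $i
	# Print the minimum non-zero value, or a marker when every value is zero.
	print $1, $2, (m ? m : "all-zero")
}')

echo "$result"
```

Every input row then produces exactly one output row, which keeps the output aligned with the input.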

Yes, setting OFS is what makes print put a tab between its comma-separated arguments.

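When both values being compared are numeric values, awk performs a numeric comparison and all zero values compare equal. Input fields that look like numbers count as numeric for this purpose, so 0, 0.0, and 0.0000 read from your file are all treated as zero.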
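A small sketch of that behaviour, using invented input: the first command compares fields read from input, the second compares a string constant against a number. The variable names field_test and const_test are just for this example.

```shell
# Fields read from input that look like numbers are compared numerically,
# so 0, 0.0 and 0.0000 are all equal and all count as false.
field_test=$(printf '0\t0.0\t0.0000\n' |
awk -F '\t' '{
	r = ($1 == $2 && $2 == $3 && !$3) ? "all zero" : "differ"
	print r
}')

# A string constant in the awk program is different: "0.0" compared with the
# number 0 is a string comparison ("0.0" versus "0"), so they are not equal.
const_test=$(awk 'BEGIN {
	r = ("0.0" == 0) ? "equal" : "not equal"
	print r
}')

echo "fields:   $field_test"
echo "constant: $const_test"
```

This is why the script works unchanged whether the zeros in the file are written as 0, 0.0, or 0.0000.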

Never apologize for asking questions. That is why we're here and that is how you learn.