Simplifying awk/sed short pipeline

Xterra · November 20, 2017, 1:49am

I have a file like this:

FileName,Well,Sample Description,Size [bp],Calibrated Conc. [ng/�l],Assigned Conc. [ng/�l],Peak Molarity [nmol/l],Area,% Integrated Area,Peak Comment,Observations
2017-11-15 - 13.49.50.D1000,EL1,Electronic Ladder,25,5.22,,321,0.803,,,Lower Marker
2017-11-15 - 13.49.50.D1000,EL1,Electronic Ladder,50,2.25,,69.3,0.347,11.11,,
2017-11-15 - 13.49.50.D1000,EL1,Electronic Ladder,100,2.37,,36.5,0.365,11.71,,
2017-11-15 - 13.49.50.D1000,EL1,Electronic Ladder,200,2.47,,19.0,0.380,12.20,,
2017-11-15 - 13.49.50.D1000,EL1,Electronic Ladder,300,2.55,,13.1,0.392,12.56,,
2017-11-15 - 13.49.50.D1000,EL1,Electronic Ladder,400,2.57,,9.87,0.395,12.66,,
2017-11-15 - 13.49.50.D1000,EL1,Electronic Ladder,500,2.71,,8.33,0.416,13.36,,
2017-11-15 - 13.49.50.D1000,EL1,Electronic Ladder,700,2.46,,5.41,0.379,12.15,,
2017-11-15 - 13.49.50.D1000,EL1,Electronic Ladder,1000,2.89,,4.44,0.444,14.25,,
2017-11-15 - 13.49.50.D1000,EL1,Electronic Ladder,1500,6.50,6.50,6.67,1.000,,,Upper Marker
2017-11-15 - 13.49.50.D1000,A1,,25,8.64,,531,1.329,,,Lower Marker
2017-11-15 - 13.49.50.D1000,A1,,78,1.62,,31.9,0.249,59.03,,
2017-11-15 - 13.49.50.D1000,A1,,99,1.13,,17.5,0.173,40.97,,
2017-11-15 - 13.49.50.D1000,A1,,1500,6.50,6.50,6.67,1.000,,,Upper Marker
2017-11-15 - 13.49.50.D1000,B1,,25,9.27,,570,1.426,,,Lower Marker
2017-11-15 - 13.49.50.D1000,B1,,43,1.40,,50.4,0.215,19.94,,
2017-11-15 - 13.49.50.D1000,B1,,66,1.24,,29.1,0.191,17.72,,
2017-11-15 - 13.49.50.D1000,B1,,85,0.866,,15.7,0.133,12.37,,
2017-11-15 - 13.49.50.D1000,B1,,111,1.11,,15.3,0.170,15.81,,
2017-11-15 - 13.49.50.D1000,B1,,189,0.525,,4.27,0.081,7.50,,
2017-11-15 - 13.49.50.D1000,B1,,226,0.395,,2.70,0.061,5.65,,
2017-11-15 - 13.49.50.D1000,B1,,270,0.333,,1.90,0.051,4.76,,
2017-11-15 - 13.49.50.D1000,B1,,717,0.380,,0.816,0.058,5.43,,
2017-11-15 - 13.49.50.D1000,B1,,969,0.758,,1.20,0.117,10.83,,
2017-11-15 - 13.49.50.D1000,B1,,1500,6.50,6.50,6.67,1.000,,,Upper Marker
2017-11-15 - 13.49.50.D1000,C1,,25,8.39,,516,1.291,,,Lower Marker
2017-11-15 - 13.49.50.D1000,C1,,44,0.597,,20.8,0.092,3.40,,
2017-11-15 - 13.49.50.D1000,C1,,95,1.10,,17.7,0.169,6.26,,
2017-11-15 - 13.49.50.D1000,C1,,116,0.856,,11.3,0.132,4.87,,
2017-11-15 - 13.49.50.D1000,C1,,388,15.0,,59.5,2.309,85.47,,
2017-11-15 - 13.49.50.D1000,C1,,1500,6.50,6.50,6.67,1.000,,,Upper Marker
2017-11-15 - 13.49.50.D1000,D1,,25,6.10,,375,0.939,,,Lower Marker
2017-11-15 - 13.49.50.D1000,D1,,221,2.83,,19.7,0.435,7.41,,
2017-11-15 - 13.49.50.D1000,D1,,554,28.7,,79.8,4.418,75.22,,
2017-11-15 - 13.49.50.D1000,D1,,808,5.37,,10.2,0.826,14.06,,
2017-11-15 - 13.49.50.D1000,D1,,1500,6.50,6.50,6.67,1.000,,,Upper Marker
2017-11-15 - 13.49.50.D1000,D1,,2645,1.27,,0.736,0.195,3.31,,
2017-11-15 - 13.49.50.D1000,E1,,25,8.31,,511,1.279,,,Lower Marker
2017-11-15 - 13.49.50.D1000,E1,,399,48.7,,188,7.494,66.32,,
2017-11-15 - 13.49.50.D1000,E1,,564,13.6,,37.1,2.092,18.51,,
2017-11-15 - 13.49.50.D1000,E1,,793,11.1,,21.6,1.715,15.17,,
2017-11-15 - 13.49.50.D1000,E1,,1500,6.50,6.50,6.67,1.000,,,Upper Marker
2017-11-15 - 13.49.50.D1000,F1,,25,9.27,,570,1.425,,,Lower Marker
2017-11-15 - 13.49.50.D1000,F1,,403,53.3,,203,8.193,65.36,,
2017-11-15 - 13.49.50.D1000,F1,,570,16.2,,43.8,2.493,19.89,,
2017-11-15 - 13.49.50.D1000,F1,,796,12.0,,23.2,1.849,14.75,,
2017-11-15 - 13.49.50.D1000,F1,,1500,6.50,6.50,6.67,1.000,,,Upper Marker
2017-11-15 - 13.49.50.D1000,G1,,25,13.3,,816,2.041,,,Lower Marker
2017-11-15 - 13.49.50.D1000,G1,,106,0.740,,10.8,0.114,0.66,,
2017-11-15 - 13.49.50.D1000,G1,,413,66.7,,249,10.257,59.46,,
2017-11-15 - 13.49.50.D1000,G1,,586,18.9,,49.7,2.913,16.89,,
2017-11-15 - 13.49.50.D1000,G1,,813,16.8,,31.7,2.577,14.94,,
2017-11-15 - 13.49.50.D1000,G1,,937,9.03,,14.8,1.390,8.06,,
2017-11-15 - 13.49.50.D1000,G1,,1500,6.50,6.50,6.67,1.000,,,Upper Marker
2017-11-15 - 13.49.50.D1000,H1,,25,28.6,,1760,4.407,,,edited Lower Marker
2017-11-15 - 13.49.50.D1000,H1,,41,0.781,,29.3,0.120,1.35,,
2017-11-15 - 13.49.50.D1000,H1,,55,4.88,,136,0.751,8.43,,
2017-11-15 - 13.49.50.D1000,H1,,86,4.55,,81.4,0.701,7.86,,
2017-11-15 - 13.49.50.D1000,H1,,377,21.5,,87.8,3.305,37.09,,
2017-11-15 - 13.49.50.D1000,H1,,525,8.07,,23.7,1.241,13.93,,
2017-11-15 - 13.49.50.D1000,H1,,749,18.2,,37.3,2.793,31.34,,
2017-11-15 - 13.49.50.D1000,H1,,1500,6.50,6.50,6.67,1.000,,,Upper Marker

And I am using the following script to :
1) remove first line
2)Identify rows where columns 4 displays values between 410 and 570, with values >0.5 in column 5.
3)If more than one meet the requirements, "combine" rows and add up values in column 5 and sort the entries.

sed "1d" input.txt | gawk -F "," "!/EL/{ if ( $4 > 410 && $4 < 570 && $5 > 0.5 ) print $2, $5; else print $2, 0 }" | gawk "{a[$1]+=$2}END{for(i in a){print i, a}}" | sed "s/\([A-z]\)\([1-9]\) /\10\2 /" | sort > output.txt

I then find the minimal value listed on the second column -excluding zeros:

gawk "NR == 1 || $2 < min && $2 > 0 {line = $2; min = $2}END{print line}" output.txt > minimum.txt

And multiply that value for 15 and divide it for each instance listed on the first output file:

gawk "FNR == NR{est = $1 * 15; next}{if ($2 > 0) print $1, $2, est / $2; else print $0, 0}" minimum.txt output.txt > final.txt

I really would like to improve this small pipeline. I dont want to keep using | to "stitch" together different AWK and sed scripts -cannot use anything else but gawk/sed. I also dont want to generate multiple temporary files (output and minimum) -which I am pretty sure are not needed (I just could not think of a better way to do it.
I am using GnuWin32/cmd to run the pipeline using a .bat file -not allowed to use CygWin on this box.
Any help will be greatly appreciated

MadeInGermany · November 20, 2017, 4:09am

First of all, you must put in 'ticks' not "quotes" around the embedded g/awk code, because in the latter $1 $2 ... are substituted with positional shell parameters. The shell does this before it passes the code string to g/awk.

RudiC · November 20, 2017, 5:41am

Probability is high that ALL your code can be combined into one single awk script. Unfortunately your min works only if the first line in output.txt has non-null value, so with your sample data the rest of the algorithm (which I don't really understand) doesn't work out, always printing the input line followed by a zero.
Any corrections to the algorithm?

Xterra · November 20, 2017, 8:49am

I get the following message if I use single quotes:

gawk: '!/EL/{
gawk: ^ invalid char ''' in expression
sed: couldn't write 72 items to stdout: Invalid argument

It runs ok using double quotes

You right! Let me work a bit more on that. Any suggestions to improve the first script so I dont keep using | to stitch it all together?

sed "1d" input.txt | gawk -F "," "!/EL/{ if ( $4 > 410 && $4 < 570 && $5 > 0.5 ) print $2, $5; else print $2, 0 }" | gawk "{a[$1]+=$2}END{for(i in a){print i, a}}" | sed "s/\([A-z]\)\([1-9]\) /\10\2 /" | sort > output.txt

RudiC · November 20, 2017, 9:18am

How about

awk -F "," '

NR==1 ||                                                        # get rid of first line (replace first sed command)
/EL/    {next                                                   # get rid of lines containing "EL"
        }

        {if ($4 <= 410 || $4 >= 570 || $5 <= 0.5) $5 = 0        # implement condition for $4 and $5 values
         a[$2] += $5                                            # sum up $5 values (replace second awk script)
         if ($5?$5:MIN < MIN) MIN = $5                          # determine minimum value (replace third awk script)
        }

END     {MIN *= 15                                              # implement minimum multiplication
         for (i in a)   {T = a                               # run through all collected index (= $2) values
                         IX = substr(i,1,1) "0" substr (i,2)    # modify $2 string (replace second sed script)
                         print  IX, T, T?MIN/T:0                # print them all
                        }
        }
' file
A01 0 0
B01 0 0
C01 0 0
D01 28.7 4.21777
E01 13.6 8.90074
F01 0 0
G01 66.7 1.81484
H01 8.07 15

Should a sorting step be necessary, pipe the output through sort , or use gawk 's sorting algorithms (which my mawk doesn't provide).

Xterra · December 4, 2017, 5:21pm

Thanks Rudy!
Worked like a charm

wbport · December 4, 2017, 6:05pm

You might want to give this version a try:

rev=""
if [ $# -gt 0 ]
  then
  if [ $1 = '-r' ]
  then
    rev="r"
#   echo $rev
    shift
  fi
fi
ls -l $* |grep -v "total " |sort +4n$rev -5 +8 |more -e

It lists all files (or star name convention) in increasing size or, with "-r" as the first parm, in decreasing size. Ties are alphabetical by file name.