ksh comparing current and previous lines

paulie · October 12, 2011, 9:42pm

Hi,

I am currently trying to work out how to compare one line with the last line I have read in via ksh. I have a file containing sorted output from a previous sort command, so all the lines are already in order; the file looks something like the sample shown below. Each line has a name and the time taken to perform a certain task, so I need to filter on the first column (aaa, bbb, ccc, etc.) and average the data for each group of matching names and for each non-matching one. I am not interested in the addition part yet, as I already have that working in my script using a perl command because of the decimal places.

I.e., if the current line does not match the previous one, write out its name and its data; if the current line matches the previous one, add up the data, and so on until a line does not match, then write out the data. I thought it would be really easy, but my scripting is just not working and I keep getting an extra line.

The result of the sort:

B0000003,1.888
B0000024,0.728
B0000024,3.308
B0000024,1.948
B0000027,1.185
B0000030,3.287
B0000030,2.688

This is what I have tried, but it does not seem to work: regardless of how I place my ifs and elses, the result does not show the correct counter for each match.

InputFile=input_file.sort
 
    while read line
    do

        NAME=`echo $line | cut -d"," -f1`
        TIME=`echo $line | cut -d"," -f2`
 
        if [[ "$COUNT" -gt "1" ]] then
            if [[ $NAME != $PREV_NAME ]] then
               echo "$COUNT:$PREV_NAME"
               COUNT=1
            fi
        fi

 
        if [[ $NAME != $PREV_NAME ]] then
            COUNT=1
            echo "1,$NAME"
        elif [[ $NAME = $PREV_NAME ]] then
            let COUNT=$COUNT+1
        fi
 
 
        PREV_NAME=$NAME
    done <"$InputFile"

and this is the output I get from a valid input file.

Doc Name, Duration (secs)
1,Doc Name
B0000003,0.192494
1,B0000003                         < Error
B0000003,1.740529
B0000024,5.409181
2:B0000003                         < Correct
1,B0000024                         < Error
B0000024,5.409181
1,B0000024
B0000024,5.524508
B0000024,5.569696
B0000024,6.264533
B0000027,0.265375
4:B0000024                        < Correct
1,B0000027                        < Error
B0000027,0.298291
B0000027,1.157592
etc

As you can see it does work sort of, but it also puts in an extra line that is not needed.

Any help would be appreciated and Thanks!
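
For reference, the loop described above is a classic control-break: report a group when the name changes, and report the last group once more after the loop ends, because no following line triggers it. In the script as posted, the "1,$NAME" line is echoed on every name change and the final group is never written out. The following is only a minimal sketch of that idea, assuming the file is already sorted on the first field; the function name count_groups and the sample file are made up for illustration, and it uses plain POSIX syntax that ksh also accepts.

```shell
# count_groups FILE: print "count:name" for each run of identical names.
# Assumes FILE is already sorted on the first comma-separated field.
count_groups() {
    prev_name=""
    count=0
    while IFS=, read -r name time
    do
        if [ "$name" = "$prev_name" ]; then
            count=$((count + 1))            # same group: keep counting
        else
            # Name changed: report the finished group (nothing to report before the first line).
            if [ "$count" -gt 0 ]; then
                echo "$count:$prev_name"
            fi
            prev_name=$name
            count=1
        fi
    done < "$1"
    # The last group has no following line to trigger the report, so flush it here.
    if [ "$count" -gt 0 ]; then
        echo "$count:$prev_name"
    fi
}

# Sample data copied from the post above.
sample=$(mktemp)
printf '%s\n' B0000003,1.888 B0000024,0.728 B0000024,3.308 B0000024,1.948 \
    B0000027,1.185 B0000030,3.287 B0000030,2.688 > "$sample"
report=$(count_groups "$sample")
echo "$report"
rm -f "$sample"
```

With the sample data this prints 1:B0000003, 3:B0000024, 1:B0000027 and 2:B0000030, one per line. A header line such as "Doc Name, Duration (secs)" would be counted as its own group unless it is skipped first.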

rdcwayx · October 12, 2011, 9:53pm

I don't fully understand your request, but if you just need to calculate the average for each name, try awk's arrays; there is no need to sort the data first.

awk -F, '{a[$1]++;b[$1]+=$2}END{for (i in a) print i, b[i]/a[i] |"sort -n"}'  infile

B0000003 1.888
B0000024 1.99467
B0000027 1.185
B0000030 2.9875
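
The one-liner above can be read more easily when spread out. This is a sketch of the same technique (one array for the count, one for the running sum, both indexed by name), using the sample data from the first post; the variable names count, sum and averages are illustrative choices, and the temporary file stands in for the real input file.

```shell
# Sample data copied from the first post (name,time).
infile=$(mktemp)
cat > "$infile" <<'EOF'
B0000003,1.888
B0000024,0.728
B0000024,3.308
B0000024,1.948
B0000027,1.185
B0000030,3.287
B0000030,2.688
EOF

# count[name] holds the number of lines seen, sum[name] the total time.
averages=$(awk -F, '
{
    count[$1]++          # one more line for this name
    sum[$1] += $2        # add its time to the running total
}
END {
    # "for (name in count)" visits names in no particular order, hence the sort.
    for (name in count)
        print name, sum[name] / count[name]
}' "$infile" | sort)

echo "$averages"
rm -f "$infile"
```

Because the arrays are keyed by name, the input order does not matter, which is why the data does not need to be sorted beforehand.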

As a new joiner, read this first:

How to Use Code Tags in The UNIX and Linux Forums

paulie · October 12, 2011, 10:12pm

Holy Toledo Batman

One line instead of about 30 lines in my script, shows you my level of scripting skill. I will need to read up on what is going on there, but it does seem to work.

Yes, that does work correctly; however, I also need to work out the number of occurrences, average, minimum and maximum times for each one. So what you have there is absolutely spot on for what I asked. You are awesome.

Can that be done in the same line by any chance, to give output like this?

Name Number Average Minimum Maximum
B0000024 10 6.9783 1.9203 12.7652
L0000002 1 10.5767 10.5767 10.5767
L0000003 3 17.5713 9.09821 23.9881

Thanks


I will make sure I put my code into the correct code tags.

agama · October 12, 2011, 11:03pm

Awk is wonderful

I have a different opinion on cramming everything onto one line, especially if you're putting it into a script. A multi-line version is easier to read and can be commented. Working from rdcwayx's start, here's how I'd add the extra bits, but it's not the only way:


awk -F, '
{
    if( !a[$1] )
        min[$1] = max[$1] = $2+0;   # initialise min/max on first hit of each name; adding zero ensures numeric comparison
    else
    {
        if( $2 > max[$1]  )        # capture larger/smaller values 
            max[$1] = $2+0;
        if( $2 < min[$1]  )
            min[$1] = $2+0;
    }

    a[$1]++;
    b[$1]+=$2;
}
END {
    for (i in a)
        printf( "%s %d %.3f %.3f %.3f\n", i, a[i], b[i]/a[i], min[i], max[i] ) | "sort -n"
}'  inputfile

I prefer printf() because I can control the number of decimal places in the fractional output (3 places in this case).
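
Two details from the script above can be checked in isolation. This is a small sketch, not part of the original script: the first command contrasts print (which uses awk's default number format, OFMT, normally "%.6g") with an explicit printf format, and the second shows why adding zero matters when values might otherwise be compared as strings. The variable names formatted and compared are illustrative.

```shell
# print falls back to OFMT ("%.6g" by default); printf uses the format you give it.
formatted=$(awk 'BEGIN { avg = 5.984 / 3; print avg; printf("%.3f\n", avg) }')
echo "$formatted"

# String values compare character by character; adding zero forces a numeric comparison.
compared=$(awk 'BEGIN {
    a = "10"; b = "9"
    as_strings = (a < b)            # "10" sorts before "9" as text
    as_numbers = (a + 0 < b + 0)    # 10 is not less than 9
    print as_strings, as_numbers
}')
echo "$compared"
```

The first command prints 1.99467 and then 1.995; the second prints "1 0", meaning the string comparison says "10" is less than "9" while the numeric comparison does not.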

rdcwayx · October 12, 2011, 11:04pm

awk -F, '{a[$1]++;b[$1]+=$2;min[$1]=(min[$1]==""||min[$1]>$2)?$2:min[$1];max[$1]=max[$1]>$2?max[$1]:$2}
    END{for (i in a) print i, a[i],b[i]/a[i],min[i],max[i] |"sort -n"}' infile

B0000003 1 1.888 1.888 1.888
B0000024 3 1.99467 0.728 3.308
B0000027 1 1.185 1.185 1.185
B0000030 2 2.9875 2.688 3.287

paulie · October 12, 2011, 11:29pm

I hate not being able to work things out myself, but sometimes someone else just knows the best way to do things.

I appreciate the input from both of you, and both ways work exactly as I wanted.

You have both been very helpful and gave me great insight into awk's capabilities, which I have been reading up on since this thread started. It is a lot more powerful than I expected, or than I have been using it for in the past.

Again, thank you both for your time and input!