File updation on matching key

I have input file like Input.dat with below content

RRD       0Z91YUn000000Lk 9000100001 103020151117   STMT151117155527001            0000 2 000000 000004
RRD       0Z91YUn00000ysj 9000100001 103020151117   STMT151117155527001            0000 3 000000 000003
RRD       0Z91YUn00001vGh 9000100002 103020151117   STMT151117155527002            0000 1 000000 000007
RRD       0Z91YUn00002tng 9000100002 103020151117   STMT151117155527002            0000 0 000000 000006
RRD       0Z91YUn00003qBe 9000100003 103020151117   STMT151117155527003            0000 1 000000 000007
RRD       0Z91YUn00004oid 9000100003 103020151117   STMT151117155527003            0000 0 000000 000006
RRD       0Z91YUn00005l6b 9000100004 103020151117   STMT151117155527004            0000 1 000000 000007
RRD       0Z91YUn00006jda 9000100004 103020151117   STMT151117155527004            0000 0 000000 000006

1				2				3		4					5						6	7	8 		9  <----- column number (This line is not part of file)

Here 3 column is the key column and 7th and 9th are the value columns. I want to do some arithematic operations on 7th & 9th column and update that
value to 8th column on matching of 3rd key column.
example : here first two row has same key column(3rd column) (9000100001) so I want to do
((((7th column value + 9th column value)of row 1)+((7th column value + 9th column value)of row 2))/2)
ie (((2+4)+(3+3))/2)=000006 (leading zeros for complete 6 digit number)
so i have to update the first 2 rows(key is 9000100001)of origal Input.dat files 8th column with 000006 value
So final output of above example is as

RRD       0Z91YUn000000Lk 9000100001 103020151117    STMT151117155527001            0000 2 000006 000004
RRD       0Z91YUn00000ysj 9000100001 103020151117    STMT151117155527001            0000 3 000006 000003
RRD       0Z91YUn00001vGh 9000100002 103020151117    STMT151117155527002            0000 1 000007 000007
RRD       0Z91YUn00002tng 9000100002 103020151117    STMT151117155527002            0000 0 000007 000006
RRD       0Z91YUn00003qBe 9000100003 103020151117    STMT151117155527003            0000 1 000007 000007
RRD       0Z91YUn00004oid 9000100003 103020151117    STMT151117155527003            0000 0 000007 000006
RRD       0Z91YUn00005l6b 9000100004 103020151117    STMT151117155527004            0000 1 000007 000007
RRD       0Z91YUn00006jda 9000100004 103020151117    STMT151117155527004            0000 0 000007 000006

I have done script for this requirement as below : Final_Count.sh

#!/bin/sh
JOBFOLDER=/opt/hpx/exstream/StatementAppA/inputs/POC/testSort
input_file=$JOBFOLDER/Input.dat
count_output_file=$JOBFOLDER/Count_output.dat
final_updated_file=$JOBFOLDER/Final_updated.dat
awk ' BEGIN {FS=" "}
function round(A) {
return int( A + 0.5 )
}
 {a[$3]+=$7; b[$3]+=$9} END{for(x in a) print x,round((a[x]+b[x])/2)
}' $input_file >>$count_output_file
echo "--- `basename $count_output_file `---"
cat $count_output_file
echo "---------------- ` basename $final_updated_file ` --------------"
awk ' BEGIN {FS=" "} FNR==NR{a[$1]=$2;next}
{
    if(a[$3] >= 0 && a[$3] <= 9)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"00000"a[$3]:"NA",$9,$10
    }
    else if(a[$3] >= 10 && a[$3] <= 99)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"0000"a[$3]:"NA",$9,$10
    }
    else if(a[$3] >= 100 && a[$3] <= 999)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"000"a[$3]:"NA",$9,$10
    }
    else if(a[$3] >= 1000 && a[$3] <= 9999)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"00"a[$3]:"NA",$9,$10
    }
    else if(a[$3] >= 10000 && a[$3] <= 99999)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"0"a[$3]:"NA",$9,$10
    }
    else if(a[$3] >= 100000 && a[$3] <= 999999)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?a[$3]:"NA",$9,$10
    }
}' $count_output_file $input_file >> $final_updated_file
echo "----------------------------------------------------"
cat $final_updated_file
rm $final_updated_file $count_output_file

and output is :
--- Count_output.dat ---

9000100002 7
9000100003 7
9000100004 7
9000100001 6

---------------- Final_updated.dat --------------
----------------------------------------------------

RRD       0Z91YUn000000Lk 9000100001 103020151117    STMT151117155527001            0000 2 000006 000004
RRD       0Z91YUn00000ysj 9000100001 103020151117    STMT151117155527001            0000 3 000006 000003
RRD       0Z91YUn00001vGh 9000100002 103020151117    STMT151117155527002            0000 1 000007 000007
RRD       0Z91YUn00002tng 9000100002 103020151117    STMT151117155527002            0000 0 000007 000006
RRD       0Z91YUn00003qBe 9000100003 103020151117    STMT151117155527003            0000 1 000007 000007
RRD       0Z91YUn00004oid 9000100003 103020151117    STMT151117155527003            0000 0 000007 000006
RRD       0Z91YUn00005l6b 9000100004 103020151117    STMT151117155527004            0000 1 000007 000007
RRD       0Z91YUn00006jda 9000100004 103020151117    STMT151117155527004            0000 0 000007 000006

-------------------------------------------------------------------------------------------------------------------

Here in this script for final updated file I am using static printing like

print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"00"a[$3]:"NA",$9,$10

so here I am printing spaces after column value manually. I want this as dynamic.
In short print all column with as it is(with space) and just update 8th column.
Please can you help me with this ?

Please refrain double posts in terms of highjacking your own other thread and start using code tags, thanks.

#!/bin/sh
JOBFOLDER=/opt/hpx/exstream/StatementAppA/inputs/POC/testSort
input_file=$JOBFOLDER/Input.dat
count_output_file=$JOBFOLDER/Count_output.dat
final_updated_file=$JOBFOLDER/Final_updated.dat
awk ' BEGIN {FS=" "}
function round(A) {
return int( A + 0.5 )
}
 {a[$3]+=$7; b[$3]+=$9} END{for(x in a) print x,round((a[x]+b[x])/2)
}' $input_file >>$count_output_file
echo "--- `basename $count_output_file `---"
cat $count_output_file
echo "---------------- ` basename $final_updated_file ` --------------"
awk ' BEGIN {FS=" "} FNR==NR{a[$1]=$2;next}
{
    if(a[$3] >= 0 && a[$3] <= 9)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"00000"a[$3]:"NA",$9,$10
    }
    else if(a[$3] >= 10 && a[$3] <= 99)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"0000"a[$3]:"NA",$9,$10
    }
    else if(a[$3] >= 100 && a[$3] <= 999)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"000"a[$3]:"NA",$9,$10
    }
    else if(a[$3] >= 1000 && a[$3] <= 9999)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"00"a[$3]:"NA",$9,$10
    }
    else if(a[$3] >= 10000 && a[$3] <= 99999)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"0"a[$3]:"NA",$9,$10
    }
    else if(a[$3] >= 100000 && a[$3] <= 999999)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?a[$3]:"NA",$9,$10
    }
}' $count_output_file $input_file >> $final_updated_file
echo "----------------------------------------------------"
cat $final_updated_file
rm $final_updated_file $count_output_file

Your calculations seem to assume that there will always be exactly two lines with a given 3rd field value. Are these pairs of lines always adjacent (as in your sample input and output)? Can there be just one or more than two lines with a given 3rd field value (and, if so, should the division be by 2 or by the number of lines with that 3rd field value)?

Why are you printing ten fields to your output file when there are only nine input fields? This is adding a trailing space to each line of your output file that is not present in the sample output that you say you want???

Does the width of any of your input columns vary, or are the column widths for each column constant for files this script will process?

Your calculations seem to assume that there will always be exactly two lines with a given 3rd field value. Are these pairs of lines always adjacent (as in your sample input and output)? Can there be just one or more than two lines with a given 3rd field value (and, if so, should the division be by 2 or by the number of lines with that 3rd field value)?

Why are you printing ten fields to your output file when there are only nine input fields? This is adding a trailing space to each line of your output file that is not present in the sample output that you say you want???

Does the width of any of your input columns vary, or are the column widths for each column constant for files this script will process?

You didn't answer the question about whether the calculations done in your 1st awk script should always divide by 2 or should instead divide by the number of lines containing a given 3rd column value.

And, it is past my bedtime and I have an early morning meeting tomorrow. I have working code (with no temp files and only one awk script) now for fixed width columns and two input lines per 3rd column value. I'll work on adjustments for varying field widths and varying numbers of lines per 3rd column value tomorrow afternoon (assuming you answer the questions above and below before I get up later this morning).

Do you need to print the count_output_fie contents you're producing now, or is that just debug information? You are currently removing both of the files your script is creating at the end of the script. Am I correct in assuming that these are temp files that do not need to be kept when your script exits?

My current script just prints the output produced by your 2nd awk script. Is that sufficient (with the other changes discussed above)?

Here is a new example to explain in detail-
new Input.dat will have content like

RRD       0Z91YUn000000Lk 9000100001 103020151117   STMT151117155527001            0000 2 000000 000004
RRD       0Z91YUn00000ysj 9000100001 103020151117   STMT151117155527001            0000 3 000000 000003
RRD       0Z91YUn00001vGh 9000100002 103020151117   STMT151117155527002            0000 1 000000 000007
RRD       0Z91YUn00002tng 9000100002 103020151117   STMT151117155527002            0000 0 000000 000006
RRD       0Z91YUn00003qBe 9000100003 103020151117   STMT151117155527003            0000 1 000000 000007
RRD       0Z8xgNq000000Lk 9000100005 33820160129    STMT151117155527001            0000 5 000000 000006
RRD       0Z8xgNq00000hs6 9000100005 33820160129    STMT151117155527001            0000 3 000000 000006
RRD       0Z8xgNq00001HxO 9000100006 33820160129    STMT151117155527002            0000 0 000000 000006
RRD       0Z91YUn000000Lk 9000100001 103020151117    STMT151117155527001            0000 2 000006 000004
RRD       0Z91YUn00000ysj 9000100001 103020151117    STMT151117155527001            0000 3 000006 000003
RRD       0Z91YUn00001vGh 9000100002 103020151117    STMT151117155527002            0000 1 000007 000007
RRD       0Z91YUn00002tng 9000100002 103020151117    STMT151117155527002            0000 0 000007 000006
RRD       0Z91YUn00003qBe 9000100003 103020151117    STMT151117155527003            0000 1 000004 000007
RRD       0Z8xgNq000000Lk 9000100005 33820160129    STMT151117155527001            0000 5 000010 000006
RRD       0Z8xgNq00000hs6 9000100005 33820160129    STMT151117155527001            0000 3 000010 000006
RRD       0Z8xgNq00001HxO 9000100006 33820160129    STMT151117155527002            0000 0 000003 000006
RRD       0Z91YUn000000Lk 9000100001 103020151117    STMT151117155527001            0000 2 000006 000004
RRD       0Z91YUn00000ysj 9000100001 103020151117    STMT151117155527001            0000 3 000006 000003
RRD       0Z91YUn00001vGh 9000100002 103020151117    STMT151117155527002            0000 1 000007 000007
RRD       0Z91YUn00002tng 9000100002 103020151117    STMT151117155527002            0000 0 000007 000006
RRD       0Z91YUn00003qBe 9000100003 103020151117    STMT151117155527003            0000 1 000004 000007
RRD       0Z8xgNq000000Lk 9000100005 33820160129     STMT151117155527001            0000 5 000010 000006
RRD       0Z8xgNq00000hs6 9000100005 33820160129     STMT151117155527001            0000 3 000010 000006
RRD       0Z8xgNq00001HxO 9000100006 33820160129     STMT151117155527002            0000 0 000003 000006

---------- Post updated at 02:12 PM ---------- Previous update was at 02:03 PM ----------

#!/bin/sh
JOBFOLDER=/opt/hpx/exstream/StatementAppA/inputs/POC/testSort
input_file=$JOBFOLDER/Input.dat
count_output_file=$JOBFOLDER/Count_output.dat
final_updated_file=$JOBFOLDER/Final_updated.dat
awk ' BEGIN {FS=" "}
function round(A) {
return int( A + 0.5 )
}
 {a[$3]+=$7; b[$3]+=$9} END{for(x in a) print x,round((a[x]+b[x])/2)
}' $input_file >>$count_output_file
echo "--- `basename $count_output_file `---"
cat $count_output_file
echo "---------------- ` basename $final_updated_file ` --------------"
awk ' BEGIN {FS=" "} FNR==NR{a[$1]=$2;next}
{
    if(a[$3] >= 0 && a[$3] <= 9)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"00000"a[$3]:"NA",$9
    }
    else if(a[$3] >= 10 && a[$3] <= 99)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"0000"a[$3]:"NA",$9
    }
    else if(a[$3] >= 100 && a[$3] <= 999)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"000"a[$3]:"NA",$9
    }
    else if(a[$3] >= 1000 && a[$3] <= 9999)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"00"a[$3]:"NA",$9
    }
    else if(a[$3] >= 10000 && a[$3] <= 99999)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?"0"a[$3]:"NA",$9
    }
    else if(a[$3] >= 100000 && a[$3] <= 999999)
    {
        print $1"      ",$2,$3,$4"   ",$5"           ",$6,$7,a[$3]?a[$3]:"NA",$9
    }
}' $count_output_file $input_file >> $final_updated_file
echo "----------------------------------------------------"
cat $final_updated_file
rm $final_updated_file $count_output_file

Inferring from lines 5 and 8, the division should be done by constant 2, not the line count per key. Under the assumption that field 8, embedded in field separators, is NOT matched anywhere else in the line, try

awk 'NR == FNR {SUM[$3]+=$7+$9; next} {sub (FS $8 FS, FS sprintf ("%06d", SUM[$3]/2) FS)}1' file file
RRD       0Z91YUn000000Lk 9000100001 103020151117   STMT151117155527001            0000 2 000006 000004
RRD       0Z91YUn00000ysj 9000100001 103020151117   STMT151117155527001            0000 3 000006 000003
RRD       0Z91YUn00001vGh 9000100002 103020151117   STMT151117155527002            0000 1 000007 000007
RRD       0Z91YUn00002tng 9000100002 103020151117   STMT151117155527002            0000 0 000007 000006
RRD       0Z91YUn00003qBe 9000100003 103020151117   STMT151117155527003            0000 1 000004 000007
RRD       0Z8xgNq000000Lk 9000100005 33820160129    STMT151117155527001            0000 5 000010 000006
RRD       0Z8xgNq00000hs6 9000100005 33820160129    STMT151117155527001            0000 3 000010 000006
RRD       0Z8xgNq00001HxO 9000100006 33820160129    STMT151117155527002            0000 0 000003 000006

Should the line count matter, try

awk 'NR == FNR {SUM[$3]+=$7+$9; CNT[$3]++; next} {sub (FS $8 FS, FS sprintf ("%06d", SUM[$3]/CNT[$3]) FS)}1' file file
1 Like

Hi Rudic,

Thanks for your update on this. But it is not working for my case. And one more thing I have not got your code

awk 'NR == FNR {SUM[$3]+=$7+$9; next} {sub (FS $8 FS, FS sprintf ("%06d", SUM[$3]/2) FS)}1' file file

Here I have not got "file" means which file ? I have used your code for my input but it not worked.
I brief u about input and output . Input as below

RRD       0Z9He4r000000Lk 200741     00120160129    STMT151117055527001            0000 1 000000 000001
RRD       0Z9He4r00001Gnl 200745     00120160129    STMT151117055527001            0000 1 000000 000001
RRD       0Z9He4r00001IPC 3025339512 33820160129    STMT151117055527001            0000 1 000000 000005
RRD       0Z9He4r00000pTO 4025339512 33820160129    STMT151117055527001            0000 0 000000 000002
RRD       0Z9He4r000001xE 5025339518 33820160129    STMT151117055527001            0000 0 000000 000004
RRD       0Z9He4r00000SHB 7025339518 33820160129    STMT151117055527001            0000 1 000000 000003
RRD       0Z9He4r00000XXX 7025339518 33823333333    STMT151117055527001            0000 3 000000 000002

in the Input given column 3 is the key column. consider row 1

RRD       0Z9He4r000000Lk 200741     00120160129    STMT151117055527001            0000 1 000000 000001

Here key is 200741. So search for 200741 in remaining rows of column 3. Here is no other key like 200741. So here calculation will be like
((7th column value + 9th column value)of row 1)/2 ie (1+1)/2=1. So update column 8 of row 1 with value 000001.

Explanation 2 : Consider row 6

RRD       0Z9He4r00000SHB 7025339518 33820160129    STMT151117055527001            0000 1 000000 000003

here key is 7025339518. So search for 7025339518 in remaining rows of column 3. Here you will find key 7025339518 in row 7 as well. so for this the calculation
will be like ((7th column value + 9th column value)of row 6 + (7th column value + 9th column value)of row 7)/2 => ((1+3)+(3+2))/2=5. So update column 8 of row 6
and 7 with value 000005

So Expected output for above result is :

RRD       0Z9JS1l000000Lk 200741     00120160129    STMT151117055527001            0000 1 000001 000001
RRD       0Z9JS1l00001Gnl 200745     00120160129    STMT151117055527001            0000 1 000001 000001
RRD       0Z9JS1l000001xE 5025339518 33820160129    STMT151117055527001            0000 0 000002 000004
RRD       0Z9JS1l00000pTO 4025339512 33820160129    STMT151117055527001            0000 0 000001 000002
RRD       0Z9JS1l00001IPC 3025339512 33820160129    STMT151117055527001            0000 1 000003 000005
RRD       0Z9JS1l00000SHB 7025339518 33820160129    STMT151117055527001            0000 1 000005 000003
RRD       0Z9He4r00000XXX 7025339518 33823333333    STMT151117055527001            0000 3 000005 000002

Please update code accordingly..

---------- Post updated at 08:42 PM ---------- Previous update was at 08:00 PM ----------

Hi Ridic,
your code worked. :slight_smile:

awk 'NR == FNR {SUM[$3]+=$7+$9; next} {sub (FS $8 FS, FS sprintf ("%06d", SUM[$3]/2) FS)}1' $input_file $input_file

for below input :

RRD       0Z9He4r000000Lk 200741     00120160129    STMT151117055527001            0000 1 000000 000001
RRD       0Z9He4r00001Gnl 200745     00120160129    STMT151117055527001            0000 1 000000 000001
RRD       0Z9He4r00001IPC 3025339512 33820160129    STMT151117055527001            0000 1 000000 000005
RRD       0Z9He4r00000pTO 4025339512 33820160129    STMT151117055527001            0000 0 000000 000002
RRD       0Z9He4r000001xE 5025339518 33820160129    STMT151117055527001            0000 0 000000 000004
RRD       0Z9He4r00000SHB 7025339518 33820160129    STMT151117055527001            0000 1 000000 000003
RRD       0Z9He4r00000XXX 7025339518 33823333333    STMT151117055527001            0000 3 000000 000002

output as below :

RRD       0Z9He4r000000Lk 200741     00120160129    STMT151117055527001            0000 1 000001 000001
RRD       0Z9He4r00001Gnl 200745     00120160129    STMT151117055527001            0000 1 000001 000001
RRD       0Z9He4r00001IPC 3025339512 33820160129    STMT151117055527001            0000 1 000003 000005
RRD       0Z9He4r00000pTO 4025339512 33820160129    STMT151117055527001            0000 0 000001 000002
RRD       0Z9He4r000001xE 5025339518 33820160129    STMT151117055527001            0000 0 000002 000004
RRD       0Z9He4r00000SHB 7025339518 33820160129    STMT151117055527001            0000 1 000004 000003
RRD       0Z9He4r00000XXX 7025339518 33823333333    STMT151117055527001            0000 3 000004 000002

Just one minor updation needed. If the division result is decimal value ie for row 6 and 7 it is 4.5 but it shows it as 4 . Can we make it as round value as 5. ie if result is 6.5 then it should be display 7

WHAT is not working for your case?

What code then have you got?

The approach of these forums could be "Teach me to do it myself" ("Hilf mir, es selbst zu tun"; is a quote from Maria Montessori). So the preferred method would be to show an effort by yourself and ask for improvement potential, not a request like above.

It is appreciated that you learned your lesson reg. code tags! But ... quote tags are for quotes, i.e. if you copy other posts or threads, or even outside contributions, use those. NOT for your own contributions!

Reg. your problems:
For "file" read "your input file". Supply it twice to the script.

my solution applied to your new sample data yields:

RRD       0Z9He4r000000Lk 200741     00120160129    STMT151117055527001            0000 1 000001 000001
RRD       0Z9He4r00001Gnl 200745     00120160129    STMT151117055527001            0000 1 000001 000001
RRD       0Z9He4r00001IPC 3025339512 33820160129    STMT151117055527001            0000 1 000003 000005
RRD       0Z9He4r00000pTO 4025339512 33820160129    STMT151117055527001            0000 0 000001 000002
RRD       0Z9He4r000001xE 5025339518 33820160129    STMT151117055527001            0000 0 000002 000004
RRD       0Z9He4r00000SHB 7025339518 33820160129    STMT151117055527001            0000 1 000004 000003
RRD       0Z9He4r00000XXX 7025339518 33823333333    STMT151117055527001            0000 3 000004 000002

You seem to want the result rounded, not truncated? Put SUM[$3]/2+.5) into the sprintf statement.

How about the specification: I want field 8 to be the rounded sum of the $7 and $9 averages per key value?

1 Like

Obviously your (late) update crossed my post referring to your former post.

Thank you Ridic for your update. I will remember your quotes.
And rounding result value code also worked. I liked "Teach me to do it myself" thing and will do the same in future whenever needed.

Thanks again for help.

I'm glad you got it to work. If you'd like to show us how you changed the code to perform the rounding you wanted, we might be able to offer suggestions for alternative ways to do it. (And, having your completed solution in the thread helps others who may have a similar problem in the future if they can see how you solved the problem.)

If field 8 always 6 characters? Or, can the length of that field vary from file to file too? (Making that field variable length isn't complicated, if the length varies.)

Can the original value in field 8 ever appear as the value of field 2, 3, 4, 5, 6, or 7 in any of your input files? (As long as the 1st field is always left aligned and the aren't any empty fields in your input files, it wouldn't be too hard to be sure that field 8 is modified in the output and not an earlier field with the same contents. But, the code is faster and simpler as provided in RudiC's suggestion if this isn't a concern for your input files.)