Awk/sed summation of one column based on some entry in first column

kshitij · December 7, 2019, 5:02am

Hi all,

I have an input file like the one below.

Input file


6  ddk/djhdj/djhdj/Q  10 0.5 
    dhd/jdjd.djd.nd/QB 01 0.5 
    hdhd/jd/jd/jdj/Q  10 0.5
512 hd/hdh/gdh/Q 01 0.5
      jdjd/jd/ud/j/QB 10 0.5 
      HD/jsj/djd/Q  01 0.5 
71 hdh/jjd/dj/jd/Q  10 0.5
    jd/jdld/je/j/QB 01 0.5
    IDP/jd/jdd/Q 10 0.5 
    1K/JDJ/JDJK/QL 01 0.5

I need to sum the 4th column per group, where a new group starts on every line that has an entry in the first column, and print each group's sum as a 5th column on that line.

My output file

6  ddk/djhdj/djhdj/Q  10 0.5  1.5
    dhd/jdjd.djd.nd/QB 01 0.5 
    hdhd/jd/jd/jdj/Q  10 0.5
512 hd/hdh/gdh/Q 01 0.5  1.5
      jdjd/jd/ud/j/QB 10 0.5 
      HD/jsj/djd/Q  01 0.5 
71 hdh/jjd/dj/jd/Q  10 0.5 2.0
    jd/jdld/je/j/QB 01 0.5
    IDP/jd/jdd/Q 10 0.5 
    1K/JDJ/JDJK/QL 01 0.5

I tried something like this, but it does not produce the right values:

awk '
BEGIN { FS=OFS="\t" }
NR==FNR { s[$1]+=$4; next }
{ print $0,s[$1] }
' Final.tran.map.pattern5

Please let me know.

RudiC · December 7, 2019, 5:57am

Try

tac file | awk -F" +" '{SUM += $4} $1 {$5 = SUM; SUM = 0} 1' | tac
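In case it helps, here is the same pipeline spread over several lines with comments. Treat it as a sketch: the sample data is copied from the first post with trailing blanks dropped, the variable names `input` and `result` are only for illustration, and it assumes `tac` (GNU coreutils) is installed.

```shell
# Sample data copied from the question, written to a temporary file
input=$(mktemp)
cat > "$input" <<'EOF'
6  ddk/djhdj/djhdj/Q  10 0.5
    dhd/jdjd.djd.nd/QB 01 0.5
    hdhd/jd/jd/jdj/Q  10 0.5
512 hd/hdh/gdh/Q 01 0.5
      jdjd/jd/ud/j/QB 10 0.5
      HD/jsj/djd/Q  01 0.5
71 hdh/jjd/dj/jd/Q  10 0.5
    jd/jdld/je/j/QB 01 0.5
    IDP/jd/jdd/Q 10 0.5
    1K/JDJ/JDJK/QL 01 0.5
EOF

# The first tac feeds the file bottom-up, so the line that starts a group
# is seen after all other lines of that group; the second tac restores the order.
result=$(tac "$input" | awk -F' +' '
  { SUM += $4 }             # add the value of every line to the running total
  $1 { $5 = SUM; SUM = 0 }  # non-empty first field: store the total in $5, reset
  1                         # print every line
' | tac)

printf '%s\n' "$result"
rm -f "$input"
```

Two things to be aware of: assigning to `$5` makes awk rebuild that line with single spaces between fields, and the bare `$1` pattern would also be false for a group whose first column is a literal 0.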

kshitij · December 7, 2019, 6:25am

Thanks a lot, RudiC.

Could you give a brief explanation of this code? I don't understand why we are setting the field separator to " +".

Thanks and Regards
Kshitij Kulshreshtha

RavinderSingh13 · December 7, 2019, 6:34am

Hello kshitij,

Could you please try the following?

awk '
FNR==NR{
  if($0~/^[0-9]+/){
     ++count
  }
  sum[count]+=$NF
  next
}
/^[0-9]+/{
  print $0,sum[++var]
  next
}
1
'  Input_file  Input_file

Output will be as follows.

6  ddk/djhdj/djhdj/Q  10 0.5  1.5
    dhd/jdjd.djd.nd/QB 01 0.5
    hdhd/jd/jd/jdj/Q  10 0.5
512 hd/hdh/gdh/Q 01 0.5 1.5
      jdjd/jd/ud/j/QB 10 0.5
      HD/jsj/djd/Q  01 0.5
71 hdh/jjd/dj/jd/Q  10 0.5 2
    jd/jdld/je/j/QB 01 0.5
    IDP/jd/jdd/Q 10 0.5
    1K/JDJ/JDJK/QL 01 0.5

Thanks,
R. Singh

RudiC · December 7, 2019, 6:36am

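Your input file is not very consistent in its use of spaces as field separators. It has one or two between fields, up to six from the beginning of the line to $2 when $1 is missing, and zero or one at the end of the line. That's what " +" is for: it is a regular expression that matches one or more spaces. With it, a leading run of spaces counts as a separator, so $1 is empty on those lines and the value stays in $4. See man regex for further reference.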
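A quick way to see the difference is to split one of the indented lines both ways. This is just a small sketch; the variable names are made up for illustration.

```shell
# A line whose first column is missing, as in the sample input
line='    dhd/jdjd.djd.nd/QB 01 0.5 '

# Default splitting ignores leading and trailing blanks: 3 fields, value in $3
default_split=$(printf '%s\n' "$line" | awk '{ print NF ": " $3 }')

# With -F" +" the leading run of spaces is a separator: $1 is empty, value in $4
regex_split=$(printf '%s\n' "$line" | awk -F' +' '{ print NF ": [" $1 "] " $4 }')

echo "$default_split"   # 3: 0.5
echo "$regex_split"     # 5: [] 0.5
```

The fifth field in the second case is the empty string after the trailing space.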

RavinderSingh13 · December 7, 2019, 7:23am

Hello kshitij,

Here is a line-by-line explanation of my code.

awk '                            ##Start of the awk program.
FNR==NR{                         ##FNR==NR is true only while Input_file is being read the first time.
  if($0~/^[0-9]+/){              ##If the line starts with a digit, a new group begins.
     ++count                     ##Increment the group counter.
  }                              ##End of the if block.
  sum[count]+=$NF                ##Add the last field ($NF) to the sum of the current group.
  next                           ##Skip the remaining rules for this line.
}                                ##End of the FNR==NR block (first pass).
/^[0-9]+/{                       ##Second pass: if the line starts with a digit, do the following.
  print $0,sum[++var]            ##Print the line followed by the sum of the next group; var counts the groups seen in this pass.
  next                           ##Skip the remaining rules for this line.
}                                ##End of the /^[0-9]+/ block.
1                                ##Print all other lines unchanged.
'  Input_file  Input_file        ##Input_file is given twice so that awk reads it in two passes.
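For comparison, the file can also be processed in a single pass by buffering each group until the next one starts. This is a different technique from the two-pass code above, given only as a sketch: the names `sum_groups` and `emit_group` are made up for illustration, and it assumes the value to add is always the last field of the line.

```shell
# Reads the named files (or standard input) once and appends each group total
# to the line that starts the group.
sum_groups() {
  awk '
    # Print the buffered group; the total goes on its first line only
    function emit_group(   i) {
      for (i = 1; i <= n; i++)
        print buf[i] (i == 1 ? " " total : "")
      n = 0
      total = 0
    }
    /^[0-9]/ { emit_group() }          # a line starting with a digit closes the previous group
    { buf[++n] = $0; total += $NF }    # buffer the line and add its last field
    END { emit_group() }               # print the final group
  ' "$@"
}
```

It would be called as `sum_groups Input_file`. Unlike a solution that assigns to a field, it leaves the original spacing of every line untouched.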

Thanks,
R. Singh