Do replace operation and awk to sum multiple columns if another column has duplicate values

as7951 · May 4, 2018, 11:42am

Hi Experts,
Please bear with me, i need help
I am learning AWk and stuck up in one issue.
First point : I want to sum up column value for column 7, 9, 11,13 and column15 if rows in column 5 are duplicates.No action to be taken for rows where value in column 5 is unique.
Second point : For duplicate rows, need to make them as single row in output file.
Third point : For duplicate rows, want to put all value of column 2 other from input file in column 16 in format (IN272018000235^IN27201523963) in output file other than value which is stored(IN27201800023963) in column 2 in output file.

Ex - In output file, please see column 16 it should have value as IN272018000235^IN27201523963(taken from column2) only from first three rows(duplicate) .Here we have not taken 3rd value(IN27201800023963) in column 16 of output file from column 2 of input file bcz we have already stored in column 2(IN27201800023963) of output file.

Have record of thousand of line..so duplicate rows are not limited to three, there can be 40, 50 any number of duplicate rows. So accordingly need to have value in column 57.

Please let me know if still my query is not clear

Input File

a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p
    POS|IN27201800023963|2018-04-24||27AACCE5198E1ZJ||500|0||9|45|9|45|||
    POS|IN272018000235|2018-04-24||27AACCE5198E1ZJ||500|0||9|45|9|45|||
    POS|IN27201523963|2018-04-24||27AACCE5198E1ZJ||500|0||9|45|9|45|||
    POS|IN27201800022938|2018-04-05||27AAJFH2012G1ZS||2|4||6||7|8|||

Expected Output File

   a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p
    POS|IN27201800023963|2018-04-24||27AACCE5198E1ZJ||1500|0||9|135|9|135|||IN272018000235^IN27201523963|
    POS|IN27201800022938|2018-04-05||27AAJFH2012G1ZS||2|4||6||7|8|||

Below is code tried to sum up value but giving output as

awk 'BEGIN{ FS = OFS = "|" }
       NR > 1{
           if ($5 == f5) {
               $7 += f7; $9 += f9; $11 += f11; $13 += f13;
               $15 += f15; $16 = $16 f2"^"$2
           } else { print rec }
       }
       {
           rec = $0; f5 = $5; f7 = $7; f9 = $9;
           f11 = $11; f13 = $13; f15 = $15; f2 = $2
       }
       END{ print rec }' input.txt

Code output: Not correct because getting column 2 value "IN27201523963" also in column 16 instead it should be like "IN272018000235^IN27201800023963"
The value which is present in column13 should not be present in column16.

POS|IN27201523963|2018-04-24||27AACCE5198E1ZJ||1500|0|0|9|135|9|135||0|IN272018000235^IN27201523963
POS|IN27201800022938|2018-04-05||27AAJFH2012G1ZS||2|4||6||7|8|||

rbatte1 · May 4, 2018, 12:39pm

In what way is it 'not working'? Some output/errors would be useful to see.

Thanks, in advance,
Robin

as7951 · May 4, 2018, 12:51pm

Hi rbatte1,

The code is not complete and giving error "Unmatched '." .

Corona688 · May 4, 2018, 2:19pm

There was an unmatched '.

awk 'BEGIN {FS=OFS="|"} FNR==1 {a[$1]+= (f4[key]+=$4;f6[key]+=$6;f7[key]+=$7;f8[key]+=$8;f10[key]+=$10;f11[key]+=$11;)}' input.txt > output.txt

Try it now?

RudiC · May 4, 2018, 2:36pm

I don't think it's worth trying, as there are quite some errors in that one-liner. Amongst others: the a[$1] assignment is missing (unless you mean the parenthesized part, which a) is an error, b) returns nothing), key is undefined, all the arithmetics is done for the first line ONLY, no output is printed.

Don_Cragun · May 4, 2018, 2:45pm

Is this a homework assignment? Homework and coursework questions can only be posted in the Homework & Coursework forum under special homework rules.

Please review the rules, which you agreed to when you registered, if you have not already done so.

If you did not post homework, please explain the source of the data you have shown us and explain why you are trying to do this. The input you have shown us does not match the description you have provided, and the output you say you want from the sample input you provided cannot be derived from the instructions you have provided. If this is not homework, please also explain where you found the start of the awk script that you have shown us, and why you are unable to complete it. (Adding a missing single-quote as noted in the diagnostic you mentioned in post #3 in this thread should not be hard for most awk or shell script writers.)

If you did post homework in the main forums, please review the guidelines for posting homework and repost.

as7951 · May 4, 2018, 11:59pm

Hi Rudic,
I am confused and not able to complete the code for above query.
can you please help.

---------- Post updated at 10:59 PM ---------- Previous update was at 10:57 PM ----------

Hi Don,
it's not a home work assignment,
i am little confused here and not able to resolve above query, as it is seems complex to me
, i need your help in this.
I write multiple code but none is working.

Don_Cragun · May 5, 2018, 12:08am

I repeat:

as7951 · May 5, 2018, 6:28am

Hi RudiC,

Please check i have put my code and its response error in my post.
please check and help

Don_Cragun · May 5, 2018, 12:39pm

Hi as7951,
I am very disappointed by your activities in the first day in the life of this thread....

You started out with a post that contained inconsistent data that had sample inputs that did not match the sample output you said you wanted to produce, and code that came from an unspecified source with mismatched quotes and uninitialized variables. These problems were mentioned by other posts in this thread over the next 12 hours and then six hours later you made all of those comments look silly by silently replacing the sample input and sample output, and by replacing the code you had originally provided with code that will at least run without producing mismatched quote diagnostic messages. Those changes mean that anyone reading your thread now will be confused about what post #2 through post #7 are talking about.

In the future, if you want to change the samples you provided for input, output, and code you're discussing; add them in a new post instead of replacing them in an earlier post.

RudiC · May 5, 2018, 2:24pm

I'm confused as well as I lost track of what is the input and what are the requirements and errors. What $13 is present in $16 in the (modified) post#1? Where are the summations in post#1 desired output?

RudiC · May 6, 2018, 3:09am

Trying hard to infer from your verbal specification and (incosistent, as said before) samples, and polishing my crystal ball to its extreme, I came up with

awk -F\| '
NR == 1         {print
                 next
                }
                {for (i=7; i<=15; i+=2) if ((SUM[i,$5] $i) != "") SUM[i,$5] += $i
                }

!REC[$5]        {REC[$5] = $0
                 next
                }
                {F16[$5] = F16[$5] DL[$5] $2
                 DL[$5]  = "^"
                }
END             {for (r in REC) {$0 = REC[r]
                                 for (i=7; i<=15; i+=2) $i = SUM[i,r]
                                 $16 = F16[r]
                                 print 
                                }
                }
' OFS="|" file

How close would that be?

Don_Cragun · May 6, 2018, 2:22pm

Like RudiC, I find the statement in the updated post #1:

to be confusing and to make no sense. It might make sense if the "column13" in that statement had been replaced by "column 2".

Like RudiC, I find it strange that an output field that is supposed to be a sum of one or more input fields is sometimes shown to have a sum that is an empty field. (I would expect a sum of one or more empty or non-empty fields to have a numeric value. But the desired output shown in post #1 has some empty fields that are supposed to be sums.)

Here is an alternative to the code RudiC suggested in post #12:

awk -v sum_fields='7,9,11,13,15' '
BEGIN {	# Run before reading 1st input file record.  Set input and output field
	# separators and create array sf[] with field values being the field
	# numbers of fields that are to be summed in the output records.  Also
	# set nsf to the number of output fields to be summed.
	FS = OFS = "|"
	nsf = split(sum_fields, sf, /,/)
}
NR == 1 {
	# Save header record to be used as first output record and skip to next
	# input record.
	rec = $0
	next
}
{	# Process each non-header input record:
	if ($5 == f[5]) {
		# This adds up fields to be summed in input records that have
		# the same ID as the previous input record.
		# Add in the previously accumulated values for the fields to
		# be summed.
		for(i = 1; i <= nsf; i++)
			f[sf] = $sf += f[sf]
		# Update the field 16 value and set the field separator for
		# any remaining additions to this field in subsequent records.
		f[16] = $16 = f[16] sep $2
		sep = "^"
		# Reset field 2 to the value that was in the first input record
		# in this group.
		$2 = f[2]
	} else {
		# This is the first record in a new set.
		# Print the accumulated results for the last set of input
		# records and clear the field separator for field 16.
		print rec
		sep = ""
		# Turn empty fields to be summed into zero fields, uave initial
		# values for fields 2 and 5, and clear field 16.
		for(i = 1; i <= nsf; i++)
			f[sf] = $sf += 0
		f[2] = $2
		f[5] = $5
		f[16] = $16 = ""
	}
	# Save current updated input record.  It will be an output record if the
	# next input record has a different ID.
	rec = $0
}
END {	# After processing the last input record, print the last outptu record.
	print rec
}' input.txt

The output produced by the code RudiC suggested in post #12 will have output records in random order. His code will also combine output from input records with the same ID in field #5 whether or not those records are adjacent in the input file. And, fields being summed that only sum empty fields, the output for those fields will be empty fields.

The output produced by the code I suggested above will have output records in the same order as the records that are found in the input file. But, it will only combine output from input records with the same ID in field #5 if those records are adjacent in the input file. For fields being being summed that only sum empty fields, the output for those fields will be zero fields.

With either of these suggestions, if you want to run these on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

With the input currently in post #1, my suggestion above produces the output:

a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p
    POS|IN27201800023963|2018-04-24||27AACCE5198E1ZJ||1500|0|0|9|135|9|135||0|IN272018000235^IN27201523963
    POS|IN27201800022938|2018-04-05||27AAJFH2012G1ZS||2|4|0|6|0|7|8||0|

instead of the output that was requested:

   a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p
    POS|IN27201800023963|2018-04-24||27AACCE5198E1ZJ||1500|0||9|135|9|135|||IN272018000235^IN27201523963|
    POS|IN27201800022938|2018-04-05||27AAJFH2012G1ZS||2|4||6||7|8|||

I see no reason why spaces have been added to the start of the 1st line of output nor why an empty 17th field has been added to the 2nd line of the desired output. As you can see, the code I provided does not provide either of these requested, but unexplained anomalies.