Awk: subset of fields as variable with sprint

beca123456 · May 5, 2016, 12:00am

Dear Unix Gurus,

input:

A|1|2|3|4|5
B|1|2|3|4|3
C|1|2|3|4|1
D|1|9|3|4|12

output:

A_(5);B(3);C(1)|1|2|3|4
D_(12)|1|9|3|4

Details:
If $2, $3, $3, $5 are identical, concatenate $1 and associated $NF together in the first field.
But I am trying to do the above by passing the identical fields to a sprintf variable.

With my code below, sprintf variable write all the fields on a new line, so I assume it comes from the for loop in the main block, but I don't see what is wrong:

gawk 'BEGIN{FS=OFS="|"; start=2; end=5}{for(i=start; i<=end; i++){var=sprintf("%s%s",$i, i<e ? OFS : "\n")}; common[var] = common[var] (common[var]? ";" : "") $1"_("$NF")"}END{for(j in common){print common[j] OFS j}}' input

RavinderSingh13 · May 5, 2016, 1:08am

Hello beca123456,

Not sure if in output shown A_(5) and D_(12) where _ is required, if not then following may help you in same. If you are not considering the order of the output then following may help you.

awk -F"|" '{XYZ[$2 FS $3 FS $3 FS $4]=XYZ[$2 FS $3 FS $3 FS $4]?XYZ[$2 FS $3 FS $3 FS $4]";"$1"("$NF")":$1"("$NF")"} END{for(i in XYZ){print XYZ FS i}}'  Input_file
OR
awk -F"|" '{Q=$2 FS $3 FS $3 FS $4;XYZ[Q]=XYZ[Q]?XYZ[Q]";"$1"("$NF")":$1"("$NF")"} END{for(i in XYZ){print XYZ FS i}}'  Input_file

Output will be as follows then.

D(12)|1|9|9|3
A(5);B(3);C(1)|1|2|2|3

If you need the correct order in output too as per your Input_file then following may help you in same.

awk -F"|" 'FNR==NR{XYZ[$2 FS $3 FS $4 FS $5]=XYZ[$2 FS $3 FS $4 FS $5]?XYZ[$2 FS $3 FS $4 FS $5]";"$1"("$NF")":$1"("$NF")";next} (($2 FS $3 FS $4 FS $5) in XYZ){print XYZ[$2 FS $3 FS $4 FS $5] OFS $2 FS $3 FS $4 FS $5;delete XYZ[$2 FS $3 FS $4 FS $5]}' Input_file OFS="|" Input_file
OR
awk -F"|" 'FNR==NR{Q=$2 FS $3 FS $4 FS $5;XYZ[Q]=XYZ[Q]?XYZ[Q]";"$1"("$NF")":$1"("$NF")";next} {P=$2 FS $3 FS $4 FS $5} (P in XYZ){print XYZ[P] OFS P;delete XYZ[P]}' Input_file OFS="|"  Input_file

Output will be as follows then.

A(5);B(3);C(1)|1|2|3|4
D(12)|1|9|3|4

If above is not completing your requirements then please show us more Input_file and expected output with more specific details on same, hope this helps you.
EDIT: Adding a non-one liner form of solutions now.

awk -F"|" '{
                XYZ[$2 FS $3 FS $3 FS $4]=XYZ[$2 FS $3 FS $3 FS $4]?XYZ[$2 FS $3 FS $3 FS $4]";"$1"("$NF")":$1"("$NF")"
           }
           END{
                for(i in XYZ){
                                print XYZ FS i
                             }
              }
          ' Input_file
  
 
awk -F"|" 'FNR==NR{
                        Q=$2 FS $3 FS $4 FS $5;
                        XYZ[Q]=XYZ[Q]?XYZ[Q]";"$1"("$NF")":$1"("$NF")";
                        next
                  }
                  {
                        P=$2 FS $3 FS $4 FS $5
                  }
                  (P in XYZ){
                                print XYZ[P] OFS P;
                                delete XYZ[P]
                            }
          ' Input_file OFS="|" Input_file

Thanks,
R. Singh

beca123456 · May 5, 2016, 1:35am

Thanks for your help !

However, the point of my post was to use sprintf to store the identical fields in a variable ($2, $3, $4, $5 here). I need to use sprintf because in my real file there are more than 30 fields that need to be identical between records in order to concatenate the first and last field together.
As I would like to avoid writing the 30 fields, I was thinkingusing a for loop and sprintf

Sorry for the confusion

RudiC · May 5, 2016, 3:07am

Without further commenting on your approach, I can see that in the conditional assignment i<e will always be FALSE as e is undefined thus zero.

Actually I can't see the advantages of using sprintf , so why do you insist on it?

beca123456 · May 5, 2016, 5:46pm

Correct. This is a typo. My code should have been:

gawk 'BEGIN{FS=OFS="|"; start=2; end=5}{for(i=start; i<=end; i++){var=sprintf("%s%s",$i, i<end ? OFS : "\n")}; common[var] = common[var] (common[var]? ";" : "") $1"_("$NF")"}END{for(j in common){print common[j] OFS j}}' input

Alright. Let me be more explicit.

input:

A|1|2|3|4|A|Q|W|S|E|D|F|R|C|D|E|S|S|W|Q|D|C|E|F|E|V|F|R|R|E|W|Q|Z|V|L|L|H|5
B|1|2|3|4|A|Q|W|S|E|D|F|R|C|D|E|S|S|W|Q|D|C|E|F|E|V|F|R|R|E|W|Q|Z|V|L|L|H|3
C|1|2|3|4|A|Q|W|S|E|D|F|R|C|D|E|S|S|W|Q|D|C|E|F|E|V|F|R|R|E|W|Q|Z|V|L|L|H|1
D|1|2|3|4|S|D|E|Q|O|P|V|S|E|R|E|E|E|E|J|A|R|J|L|U|E|L|O|P|I|T|A|L|Y|D|T|B|12

output:

A_(5);B(3);C(1)|1|2|3|4|A|Q|W|S|E|D|F|R|C|D|E|S|S|W|Q|D|C|E|F|E|V|F|R|R|E|W|Q|Z|V|L|L|H
D_(12)|1|2|3|4|S|D|E|Q|O|P|V|S|E|R|E|E|E|E|J|A|R|J|L|U|E|L|O|P|I|T|A|L|Y|D|T|B

The common fields between records starting by A, B, and C are fields from $2 up to $36.
Since I don't want to write the following because it is error-prone:

gawk 'BEGIN{FS=OFS="|"}{common[$2FS$3FS$4FS$5FS$6FS$7FS$8FS$9FS$10FS$11FS$12FS$13FS$14FS$15FS$16FS$17FS$18FS$19FS$20$FS$21FS$22FS$23FS$24FS$25FS$26FS$27FS$28FS$29FS$30FS$31FS$32FS$33FS$34FS$35FS$36] = common[$2FS$3FS$4FS$5FS$6FS$7FS$8FS$9FS$10FS$11FS$12FS$13FS$14FS$15FS$16FS$17FS$18FS$19FS$20$FS$21FS$22FS$23FS$24FS$25FS$26FS$27FS$28FS$29FS$30FS$31FS$32FS$33FS$34FS$35FS$36] (common[$2FS$3FS$4FS$5FS$6FS$7FS$8FS$9FS$10FS$11FS$12FS$13FS$14FS$15FS$16FS$17FS$18FS$19FS$20$FS$21FS$22FS$23FS$24FS$25FS$26FS$27FS$28FS$29FS$30FS$31FS$32FS$33FS$34FS$35FS$36]? ";" : "") $1"_("$NF")"}END{for(j in common){print common[j] OFS j}}' input

I thought I could replace

common[$2FS$3FS$4FS$5FS$6FS$7FS$8FS$9FS$10FS$11FS$12FS$13FS$14FS$15FS$16FS$17FS$18FS$19FS$20$FS$21FS$22FS$23FS$24FS$25FS$26FS$27FS$28FS$29FS$30FS$31FS$32FS$33FS$34FS$35FS$36]

by

for(i=2; i<=36; i++){var=sprintf("%s%s",$i, i<36 ? OFS : "\n"); common[var]}

Don_Cragun · May 6, 2016, 1:12am

You might want to consider an alternative approach. No sprintf() calls, no for loops to gather your keys, and no long lists of explicitly copied arguments; just a single substr() call. And, it doesn't care how many input fields are in your keys.

awk -F'|' '
{	key = substr($0, length($1) + 1, length - length($1 FS $NF))
	if(key in list)
		list[key] = list[key] ";" $1 "(" $NF ")"
	else {	list[key] = $1 "_(" $NF ")"
		keylist[++keycount] = key
	}
}
END {	for(i = 1; i <= keycount; i++)
		print list[keylist]  keylist
}' file

If file is a file that contains slightly modified versions of your two input samples:

A|1|2|3|4|5
B|1|2|3|4|3
C|1|2|3|4|1
D|1|9|3|4|12
E|1|2|3|4|A|Q|W|S|E|D|F|R|C|D|E|S|S|W|Q|D|C|E|F|E|V|F|R|R|E|W|Q|Z|V|L|L|H|5
F|1|2|3|4|A|Q|W|S|E|D|F|R|C|D|E|S|S|W|Q|D|C|E|F|E|V|F|R|R|E|W|Q|Z|V|L|L|H|3
G|1|2|3|4|A|Q|W|S|E|D|F|R|C|D|E|S|S|W|Q|D|C|E|F|E|V|F|R|R|E|W|Q|Z|V|L|L|H|1
H|1|2|3|4|S|D|E|Q|O|P|V|S|E|R|E|E|E|E|J|A|R|J|L|U|E|L|O|P|I|T|A|L|Y|D|T|B|12

it produces the output:

A_(5);B(3);C(1)|1|2|3|4
D_(12)|1|9|3|4
E_(5);F(3);G(1)|1|2|3|4|A|Q|W|S|E|D|F|R|C|D|E|S|S|W|Q|D|C|E|F|E|V|F|R|R|E|W|Q|Z|V|L|L|H
H_(12)|1|2|3|4|S|D|E|Q|O|P|V|S|E|R|E|E|E|E|J|A|R|J|L|U|E|L|O|P|I|T|A|L|Y|D|T|B

If someone wants to try this on a Solaris/SunOS system, change awk in this script to /usr/xpg4/bin/awk or nawk .

Obviously, you can convert my suggestion to a 1-liner; but I'll take this more readable, more easily maintained version of the code over a 1-liner any day.

beca123456 · May 6, 2016, 12:08pm

Thanks Don Cragun, this is brilliant !