awk - Remove duplicates during array build

Greetings Experts,
Issue: within an awk script, remove duplicate occurrences from the values accumulated in an array (the array index is two fields joined by a single space character).

Description: I am processing 2 files using awk and, during processing, I am building an array that ends up with duplicate values. How can I delete the duplicates within awk, without moving out of it? To put it in a simple way, I am building the array as

awk -F "@" '
......
v_array[$1 OFS $2]=(v_array[$1 OFS $2] ? v_array[$1 OFS $2] "," $3 : $3)
.....
' file1.txt  file2.txt

File1.txt

col1          col2          col3
abc           def           xyz
abc           efg           pqr
abc           def           qrs
stu           vwx           yz
abc           def           xyz

current contents in v_array:

v_array[abc def]=xyz,qrs,xyz
v_array[abc efg]=pqr
v_array[stu vwx]=yz

Expected contents in v_array: as you can see, xyz is repeated for the combination abc def; hence it needs to be picked up only once.

v_array[abc def]=xyz,qrs   
v_array[abc efg]=pqr
v_array[stu vwx]=yz

Ordering is not required. It can be xyz,qrs or qrs,xyz

I can check for the presence of $3 in the accumulated value of v_array using the split function, as

if (($1 OFS $2) in v_array) {
    v_dup_check = "not present"
    v_cnt = split(v_array[$1 OFS $2], v_a_tmp, ",")
    for (k = 1; k <= v_cnt; k++) {
        if (v_a_tmp[k] == $3) { v_dup_check = "present" }
    }
    if (v_dup_check == "not present") {
        v_array[$1 OFS $2] = v_array[$1 OFS $2] "," $3
    }
} else {
    v_array[$1 OFS $2] = $3
}

This is what I can think of as of now; I hope there is a much better approach to handle this within awk.

Also, how do I sort the array indices and the array values after the array build completes? (I am still learning awk through the forums.) I mean:
v_array[$1 OFS $2] -- how to process the elements in the order of the index $1 OFS $2
and also, how to sort on the array values:
v_array[$1 OFS $2]=$3 -- how to process the elements in the order of $3

Thank you for your valuable time..

Edit:
Please note that, for further processing, the array index v_array[$1 OFS $2] should not be changed.

Make (and test) the combination of $1, $2, and $3 unique:

awk  '
FNR == 1        {next
                }
!T[$1,$2,$3]++  {v_array[$1 OFS $2]=v_array[$1 OFS $2] ? v_array[$1 OFS $2] "," $3 : $3
                }
END             {for (v in v_array) print v, v_array[v]
                }
' file
abc def xyz,qrs
abc efg pqr
stu vwx yz
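A side note on the idiom (my explanation, not part of the original answer): the comma in T[$1,$2,$3] joins the fields with SUBSEP (default "\034"), so the key is a different string than $1 OFS $2 OFS $3, though either works as a dedup key. The pattern !T[key]++ is true only the first time a key is seen, because the post-increment returns the old value (unset, i.e. 0, i.e. false). A minimal demonstration with made-up rows:

```shell
# !T[$1,$2,$3]++ lets a row through only on its first appearance;
# the comma builds a SUBSEP-joined key from the three fields.
printf 'a b c\na b c\na b d\n' | awk '!T[$1,$2,$3]++'
# prints:
# a b c
# a b d
```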

While some awk implementations provide a sort function, and you could build one yourself in others, piping through sort might be the easiest way to get what you want:

. . . | sort -k3
abc efg pqr
abc def xyz,qrs
stu vwx yz
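If GNU awk (gawk) is available, sorting can also stay inside awk: gawk 4.0+ supports PROCINFO["sorted_in"] to control the order of a for (i in array) loop (plain awk and mawk do not). A sketch, assuming gawk and reusing the sample data from above:

```shell
# Sketch: gawk-only array traversal order via PROCINFO["sorted_in"].
gawk 'BEGIN {
    v_array["stu" OFS "vwx"] = "yz"
    v_array["abc" OFS "efg"] = "pqr"
    v_array["abc" OFS "def"] = "xyz,qrs"
    PROCINFO["sorted_in"] = "@ind_str_asc"   # iterate indices in string order
    for (i in v_array) print i, v_array[i]
    # use "@val_str_asc" instead to iterate in order of the values
}'
# prints:
# abc def xyz,qrs
# abc efg pqr
# stu vwx yz
```

gawk also offers asort()/asorti() when you want the sorted copy in a separate array rather than a traversal order.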

I aligned your post to suit my requirement and it works great. As always, awesome; thank you, RudiC.
I have a small question here. My script, which uses 2 input files, is facing issues when reading the second file:

awk -F "@" '
NR==FNR { ....; next }
        { # second file processing
        }
' file1.txt file2.txt

It doesn't even read the 2nd file; I tested with some print statements in the second-file block and they never get printed. When I changed it to if (NR!=FNR), the same part works great.

awk -F "@" '
{ if (NR==FNR) {
      ....; next
  }
  if (NR!=FNR) {
      # second file processing
  }
}
' file1.txt file2.txt

and this works. I am not complaining about awk; I am sure that I messed up somewhere and am not able to figure it out. But if (NR!=FNR) comes to my rescue at this point, hence I am using it.

Though I know it is not possible to figure out the issue without looking at the script and files, I am looking for a guess; did someone ever face a similar issue?
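One guess, since the script isn't shown: the classic way NR==FNR misbehaves is when the first file is empty. An empty file contributes no lines, so NR==FNR remains true while the second file is read, and the `next` skips every one of its lines, which matches the symptom exactly. A hypothetical reproduction (file names invented for the demo):

```shell
# When the first file is empty, NR==FNR stays true for the second file.
printf '' > empty.txt
printf 'a@b@c\n' > data.txt

awk -F "@" 'NR==FNR {next} {print "second:", $0}' empty.txt data.txt
# prints nothing: every data.txt line still satisfies NR==FNR

# One portable workaround: compare FILENAME against ARGV[1] instead.
awk -F "@" 'FILENAME==ARGV[1] {next} {print "second:", $0}' empty.txt data.txt
# prints: second: a@b@c
```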
!T[$1 OFS $2 OFS $3]++ should have worked in my script. But, strangely, it didn't. So I just tweaked it as

if (!T[$1 OFS $2 OFS $3]) {
    v_array[$1 OFS $2] = (v_array[$1 OFS $2] ? v_array[$1 OFS $2] "," $3 : $3)
    T[$1 OFS $2 OFS $3] = "1"
}

and this works. I am not sure what difference the ++ version and the if version make, as they should be identical.
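For what it's worth, the two forms should indeed be equivalent: !T[k]++ tests the old value (unset is 0, i.e. false) and then increments, which is the same first-time-only gate as the explicit if plus assignment. A quick self-contained check using the sample rows from the first post:

```shell
# Both dedup idioms applied to the same input should print the same rows.
data='abc def xyz
abc efg pqr
abc def qrs
stu vwx yz
abc def xyz'
printf '%s\n' "$data" | awk '!T[$1 OFS $2 OFS $3]++'
printf '%s\n' "$data" | awk '{ if (!T[$1 OFS $2 OFS $3]) { print; T[$1 OFS $2 OFS $3] = "1" } }'
# each prints the four unique rows:
# abc def xyz
# abc efg pqr
# abc def qrs
# stu vwx yz
```

If the ++ form misbehaved in the real script, the cause is more likely elsewhere (e.g. the NR==FNR issue above or field splitting) than in the idiom itself.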

Thank you for your time.

There shouldn't be any NR == FNR nor NR != FNR; I simply put in FNR == 1 to exclude the header line(s). The scriptlet should work on any number of files supplied to it as one single stream of data (unless you terribly messed something up).
