Remove duplicate nodes

Hi all,

I have a list of node pairs, each pair separated by a comma and followed by its associated value. For example:

b0015,b1224    1.1
b0015,b2576    1.4
b0015,b3162    2.5
b0528,b1086    1.7
b0528,b1269    5.4
b0528,b3602    2.1
b0948,b2581    3.2
b1224,b0015    1.1
b1086,b0528    1.7

Here, b0015,b1224 and b1224,b0015 should be considered the same, i.e. duplicates (similarly b0528,b1086 and b1086,b0528), and one of each such pair needs to be removed from the list. So the desired output would be:

b0015,b1224    1.1
b0015,b2576    1.4
b0015,b3162    2.5
b0528,b1086    1.7
b0528,b1269    5.4
b0528,b3602    2.1
b0948,b2581    3.2

Any help would be highly appreciated.

Thanks in advance.

Try:

awk '{  split($1, f, /,/)
        if((f[2]","f[1]) in o) next
        o[$1]
        print
}' input

As always, if you are using a Solaris/SunOS system, use /usr/xpg4/bin/awk or nawk instead of awk.

Note that this won't skip an input line if the 1st field contains the same two nodes in the same order; it will just skip the line if the 1st field contains the same two nodes in reverse order. This script will also skip lines even if the second field contains a different value than the previously printed entry. If this isn't what you want, you need to give more complete requirements.
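
If same-order duplicates should be dropped as well, one possible approach is to build an order-independent key from the two nodes before checking for a repeat. This is only a sketch and not part of the script above; the function name dedupe_pairs and the array name seen are made up for illustration. Like the original script, it ignores the value in the second field when deciding what counts as a duplicate.

```shell
# Hypothetical helper (not from the thread): drop a line if its node pair
# was already seen in either order.
dedupe_pairs() {
    awk '{
        split($1, f, /,/)
        # Order-independent key: the node that sorts first as a string goes
        # first. Appending "" forces a string comparison.
        key = (f[1] "" < f[2] "") ? f[1] "," f[2] : f[2] "," f[1]
        if (key in seen) next   # pair already printed, in some order
        seen[key]               # remember this pair
        print
    }' "$@"
}

# Usage: dedupe_pairs input
```

Only the first line seen for each pair is kept, so a later line with a different value for the same pair is still dropped.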

This works for me (using gawk):

gawk -F',| ' '!(a[$1,$2]++ + a[$2,$1]++)'

Thanks for the help, but although it successfully removes the duplicates in column 1, it does not print the last (value) column along with it.

Broken awk? See the hints posted by Don Cragun.

This should work with any recent awk (/usr/xpg4/bin/awk or nawk on Solaris systems); it doesn't use any non-standard gawk extensions.

Unlike the script I gave, this won't print any duplicated nodes when the nodes appear in the same order.
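
To make it easier to see what the one-liner above does, here is an expanded form of the same logic. It is only a sketch: the function name dedupe_counted and the variable names fwd and rev are invented for this illustration, and the field separator is copied unchanged from the one-liner, so it assumes the columns are separated by spaces.

```shell
# Hypothetical expansion of the one-liner (names are illustrative only).
dedupe_counted() {
    awk -F',| ' '{
        fwd = a[$1, $2]++   # count stored under (node1, node2) before this line
        rev = a[$2, $1]++   # count stored under (node2, node1) before this line
        # Every line bumps both keys, so a non-zero sum means the pair was
        # already seen, in the same order or reversed.
        if (fwd + rev == 0) print
    }' "$@"
}

# Usage: dedupe_counted input
```

The whole input line is printed, including the value column. One side effect of this counting scheme is that a line whose two nodes are identical is never printed, because the two increments hit the same array element.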

Perhaps there are tabs present in the input file?

awk -F'[, \t]' ...

I don't think this would make a difference, as the OP reported that the given FS is already good enough to detect the duplicates.

Cragun, your script is working fine and gives the desired results. Thank you very much.