Remove duplicate nodes

AshwaniSharma09 · February 12, 2013, 3:12am

Hi all,

I have a list of node pairs, each written as two node names separated by a comma and followed by an associated value. For example:

b0015,b1224    1.1
b0015,b2576    1.4
b0015,b3162    2.5
b0528,b1086    1.7
b0528,b1269    5.4
b0528,b3602    2.1
b0948,b2581    3.2
b1224,b0015    1.1
b1086,b0528    1.7

Here, b0015,b1224 and b1224,b0015 should be treated as duplicates (likewise b0528,b1086 and b1086,b0528), and one line of each such pair needs to be removed from the list. So the desired output would be:

b0015,b1224    1.1
b0015,b2576    1.4
b0015,b3162    2.5
b0528,b1086    1.7
b0528,b1269    5.4
b0528,b3602    2.1
b0948,b2581    3.2

Any help would be highly appreciated.

Thanks in advance.

Don_Cragun · February 12, 2013, 4:38am

Try:

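awk '{  split($1, f, /,/)              # f[1], f[2] = the two node names
        if((f[2]","f[1]) in o) next    # skip if the reversed pair was already printed
        o[$1]                          # remember this pair as written
        print
}' input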

As always, if you are using a Solaris/SunOS system, use /usr/xpg4/bin/awk or nawk instead of awk.

Note that this won't skip an input line if the 1st field contains the same two nodes in the same order; it will just skip the line if the 1st field contains the same two nodes in reverse order. This script will also skip lines even if the second field contains a different value than the previously printed entry. If this isn't what you want, you need to give more complete requirements.
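If the value should also count when deciding what is a duplicate, one possible variant is sketched below. It is only an illustration of that alternative requirement, not something the original poster asked for; the function name dedup_with_value and the array name seen are made up for this example. Like the script above, it only looks for pairs in reverse order.

```shell
# Sketch only: drop a reversed pair only when its value matches as well.
# dedup_with_value is a hypothetical helper name, not part of the thread.
dedup_with_value() {
    awk '{
        split($1, f, /,/)            # f[1], f[2] = the two node names
        rev = f[2] "," f[1]          # the same pair written in reverse order
        if ((rev, $2) in seen) next  # reversed pair with the same value: skip
        seen[$1, $2]                 # remember pair (as written) plus value
        print
    }' "$@"
}
```

Called as dedup_with_value input, it prints b,a 2 even after a,b 1 has been printed, because the values differ.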

user8 · February 12, 2013, 5:42am

This works for me (using gawk):

gawk -F',| ' '!(a[$1,$2]++ + a[$2,$1]++)'

AshwaniSharma09 · February 12, 2013, 10:54am

Thanks for the help, but although it successfully removes the duplicates in column 1, it does not print the last (value) column along with them.

user8 · February 12, 2013, 11:04am

Broken awk? See the hints posted by Don Cragun.

Don_Cragun · February 12, 2013, 12:30pm

This should work with any recent awk (/usr/xpg4/bin/awk or nawk on Solaris systems); it doesn't use any non-standard gawk extensions.

Unlike the first script I gave, this won't print any duplicated nodes when the nodes appear in the same order.
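The script this post refers to is not reproduced in the thread as shown here. As a rough sketch of the behaviour described (no pair printed twice, whichever order the nodes appear in), and not the original script, one approach is to build an order-independent key by always putting the smaller node name first. The names dedup_pairs, key and seen are assumptions made for this example. It uses only standard awk features.

```shell
# Sketch only: print the first line seen for each unordered node pair.
# dedup_pairs is a hypothetical helper name, not part of the thread.
dedup_pairs() {
    awk '{
        split($1, f, /,/)            # f[1], f[2] = the two node names
        # Appending "" forces a string comparison, so the key is built
        # the same way for a,b and b,a.
        if ((f[1] "") < (f[2] ""))
            key = f[1] "," f[2]
        else
            key = f[2] "," f[1]
        if (!(key in seen)) {        # first time this pair shows up
            seen[key]
            print
        }
    }' "$@"
}
```

Called as dedup_pairs input, it keeps the first line for each pair and drops later ones regardless of node order or value.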

Scrutinizer · February 12, 2013, 12:32pm

Perhaps there are tabs present in the input file?

awk -F'[, \t]' ...
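For illustration, a tab-tolerant field separator could be combined with the one-liner posted earlier in the thread. That pairing is an assumption, since the program is elided in the post above; the + after the bracket expression is also an addition here, so that a run of blanks or tabs counts as a single separator. The function name dedup_any_blank is hypothetical.

```shell
# Sketch only: split on commas, spaces and tabs (one or more in a row),
# then print a line only if neither ordering of its pair was seen before.
# dedup_any_blank is a hypothetical helper name, not part of the thread.
dedup_any_blank() {
    awk -F'[, \t]+' '!(seen[$1, $2]++ + seen[$2, $1]++)' "$@"
}
```

Called as dedup_any_blank input, it prints whole lines, value included. Note that a line whose two nodes are identical is never printed by this expression, because both counters refer to the same array element.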

user8 · February 12, 2013, 12:45pm

I don't think this would make a difference, as the OP reported that the given FS is already good enough to detect the duplicates.

AshwaniSharma09 · February 12, 2013, 3:08pm

Cragun, your script is working fine and gives the desired results. Thank you very much.