Reducing redundancy in a file

Dear All,
I have to reduce the redundancy of a file that is like this:

a b 0
a c 0
a f 1
b a 1
b a 0
b c 1
d f 0 
g h 1
f d 1

Basically, this file describes a network with its nodes and edges. The nodes are the different letters, and each line is an edge; the third column gives the direction (in particular, 0 means the edge goes from left to right, 1 means vice versa).

As you may notice, some interactions are duplicated. For example the interactions:

a b 0
b a 1

a-->b
b<--a

are exactly the same. In the first line the interaction goes from a to b (0 means the interaction goes from left to right); in the second line the interaction still goes from a to b (1 means the interaction goes from right to left).

What I would like is to filter the file above and output a file like this:

a b 0
a c 0
a f 1
b a 0
b c 1
d f 0 
g h 1

So, all the duplicated interactions are removed.

Note that the interactions

a b 0
b a 0

are not the same! Both go from left to right, but the starting node is different:
a-->b
b-->a

Hope this is clear.

Best

Giuliano

Any attempts/thoughts/ideas from your side?


Howsoever, try

awk '
($2,$1) in B &&
B[$2,$1] != $3  {next
                }
!(($1,$2) in B) {B[$1,$2] = $3
                }
END     {for (b in B)   {split (b, C, SUBSEP)
                         print C[1], C[2], B[b]
                        }
        }
' file
a b 0
a c 0
a f 1
b a 0
b c 1
d f 0
g h 1

The order of the output lines cannot be guaranteed.

Well, basically I can filter the file and exclude the interactions that are single.

a b 0
a c 0 
a f 1 
b a 1 
b a 0 
b c 1 
d f 0  
g h 1 
f d 1

In this case, for each row I could check whether the value in column 1 is present in column 2 and vice versa. If so, the interaction is bidirectional.

But I still have no idea how to apply the subsequent filter, which is the most important part.

I am trying to concatenate the columns and sort, but I really can't figure anything out.

Best

Giuliano

Although the above solution works for the sample given, it will fail for others, e.g. the sequence of a b 0 and a b 1 . Try this instead:

awk '
($2,$1,!$3) in B        {next                   # reversed counterpart already stored
                        }
                        {B[$0]                  # remember the whole line as a key
                        }
END                     {for (b in B)   {split (b, C)
                                         print C[1], C[2], C[3]
                                        }
                        }
' SUBSEP=" " file

try also:

awk '!a[($3) ? $1 : $2, ($3) ? $2 : $1]++' infile

You could also write that so it only evaluates field 3 once:

awk '!($3 ? A[$2,$1]++ : A[$1,$2]++)' infile
perl -lane '!$seen{$F[2]?"$F[1] $F[0] 0":$_}++ and print' giuliangiuseppe.input
a b 0
a c 0
a f 1
b a 0
b c 1
d f 0
g h 1

Wouldn't it be much easier (from the viewpoint of understandability) to first transform the file into a format where the third column is always zero, i.e. if you have a line

X Y 1

you would replace it by

Y X 0

After this, you could simply use

sort -u

to remove duplicates.

By the way, if you ensure that the third column is always 0, it becomes redundant and you could remove it completely, making the file format even simpler.
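A minimal sketch of that two-step approach, assuming the input is in a file named file:

```shell
# Rewrite every "X Y 1" line (meaning Y -> X) as "Y X 0",
# so the third column is always 0, then drop duplicates.
awk '$3 == 1 {print $2, $1, 0; next} {print $1, $2, 0}' file |
sort -u
```

Note that this changes the representation of the 1-direction edges (e.g. a f 1 becomes f a 0), so the output looks different from the sample above, although it describes the same set of interactions.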
