Reducing redundancy in a file

Dear All,
I have to reduce the redundancy of a file that is like this:

a b 0
a c 0
a f 1
b a 1
b a 0
b c 1
d f 0 
g h 1
f d 1

Basically, this file describes a network with its nodes and edges. The nodes are the different letters, and each line is an edge; the third column gives the direction (in particular, 0 means the edge goes from left to right, 1 means vice versa).

As you may notice, some interactions are duplicated. For example the interactions:

a b 0
b a 1

a-->b
b<--a

are exactly the same. In the first line the interaction goes from a to b (0 means the interaction goes from left to right); in the second line the interaction still goes from a to b (1 means the interaction goes from right to left).

What I would like is to filter the file above and output a file like this:

a b 0
a c 0
a f 1
b a 0
b c 1
d f 0 
g h 1

So, all the duplicated interactions are removed.

Note that the interactions

a b 0
b a 0

are not the same! Both go from left to right, but the starting node is different:
a-->b
b-->a

Hope this is clear.

Best

Giuliano

Any attempts/thoughts/ideas from your side?


Howsoever, try

awk '
($2,$1) in B &&
B[$2,$1] != $3  {next
                }
!(($1,$2) in B) {B[$1,$2] = $3
                }
END     {for (b in B)   {split (b, C, SUBSEP)
                         print C[1], C[2], B[b]
                        }
        }
' file
a b 0
a c 0
a f 1
b a 0
b c 1
d f 0
g h 1

The order of the output lines cannot be guaranteed.

Well, basically I can filter the file and exclude the interactions that are single.

a b 0
a c 0 
a f 1 
b a 1 
b a 0 
b c 1 
d f 0  
g h 1 
f d 1

In this case, for each row I could check whether the value in column 1 is present in column 2 and vice versa. If so, the interaction is bidirectional.

But I still have no idea how to apply the subsequent filter, which is the most important part.

I am trying to concatenate the columns and sort, but I really can't figure anything out.

Best

Giuliano

Although the above solution works for the sample given, it will fail for others, e.g. the sequence of a b 0 and a b 1 . Try this instead:

awk '
($2,$1,!$3) in B        {next                   # reversed counterpart already stored
                        }
                        {B[$0]                  # remember the whole line as a key
                        }
END                     {for (b in B)   {split (b, C)
                                         print C[1], C[2], C[3]
                                        }
                        }
' SUBSEP=" " file

try also:

awk '!a[($3) ? $1 : $2, ($3) ? $2 : $1]++' infile

You could also write that so it only evaluates field 3 once:

awk '!($3 ? A[$2,$1]++ : A[$1,$2]++)' infile
perl -lane '!$seen{$F[2]?"$F[1] $F[0] 0":$_}++ and print' giuliangiuseppe.input
a b 0
a c 0
a f 1
b a 0
b c 1
d f 0
g h 1

Wouldn't it be much easier (from the viewpoint of understandability) to first transform the file into a format where the third column is always zero, i.e. if you have a line

X Y 1

you would replace it by

Y X 0

After this, you could simply use

sort -u

to remove duplicates.

By the way, if you ensure that the third column is always 0, it becomes redundant and you could remove it completely, making the file format even simpler.
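A minimal sketch of that two-step approach, assuming the input is in a file named file:

```shell
# Rewrite every "X Y 1" line (meaning Y -> X) as "Y X 0",
# so the third column is always 0, then drop duplicates.
awk '$3 == 1 {print $2, $1, 0; next} {print $1, $2, 0}' file |
sort -u
```

Note that this changes the representation of the 1-direction edges (e.g. a f 1 becomes f a 0), so the output looks different from the sample above, although it describes the same set of interactions.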
