Reporting where values match and mismatch across three columns

MDeBiasse · March 26, 2015, 2:14pm

Hello,
I have values in three columns. Some values occur in all three columns, other values are present in only one or two columns. I would like to be able to see where the matches and mismatches occur. Thanks in advance for any advice!

I have:

A     B     C
1     1     2
2     3     3
3     4     5
4            6
5

I would like:

A     B     C
1     1      X
2     X      2
3     3      3
4     4      X
5     X      5
X     X      6

RavinderSingh13 · March 26, 2015, 3:02pm

Hello MDeBiasse,

If I caught your logic correctly then last line should be X 6 X Following may help you in same. Please let me know if this helps.

awk -vs1=`cat Input_file | wc -l` 'BEGIN{print "A" OFS "B" OFS "C"} FNR==NR && NR>1{A[$1];B[$2];C[$3];next} END{o=1;for(k=1;k<=s1;k++){S=(k in A)?o:"X";U=(k in B)?o:"X";V=(k in C)?o:"X";print S OFS U OFS V;S=U=V="";o++}}' Input_file Input_file

Where input_file is as follows:

Output will be as follows.

A B C
1 1 X
2 X 2
3 3 3
4 4 X
5 X 5
X 6 X

EDIT: Seems moderator has modified post now by adding code tags, thank you for same. Then Following is the input file per user.
Input_file:

cat testtest13
A     B     C
1     1     2
2     3     3
3     4     5
4           6
5

Then following code may help.

awk -vs1=`cat testtest13 | wc -l` -F"     " 'BEGIN{print "A" OFS "B" OFS "C"} FNR==NR && NR>1{sub(/[[:space:]]/,X,$1);sub(/[[:space:]]/,X,$2);sub(/[[:space:]]/,X,$3);A[$1];B[$2];C[$3];next} END{o=1;for(k=1;k<=s1;k++){S=(k in A)?o:"X";U=(k in B)?o:"X";V=(k in C)?o:"X";print S OFS U OFS V;S=U=V="";o++}}' testtest13 testtest13

Output will be as follows.

A B C
1 1 X
2 X 2
3 3 3
4 4 X
5 X 5
X X 6

EDIT: Adding a non one liner form for solution following.

awk -vs1=`cat testtest13 | wc -l` -F"     " 'BEGIN                      {print "A" OFS "B" OFS "C"}
                                             FNR==NR && NR>1            {sub(/[[:space:]]/,X,$1);
                                                                         sub(/[[:space:]]/,X,$2);
                                                                         sub(/[[:space:]]/,X,$3);
                                                                         A[$1];
                                                                         B[$2];
                                                                         C[$3];
                                                                         next
                                                                        }
                                             END                        {o=1;
                                                                                for(k=1;k<=s1;k++){
                                                                                                        S=(k in A)?o:"X";
                                                                                                        U=(k in B)?o:"X";
                                                                                                        V=(k in C)?o:"X";
                                                                                                        print S OFS U OFS V;
                                                                                                        S=U=V="";
                                                                                                        o++
                                                                                                  }
                                                                        }
                                             ' testtest13 testtest13

Thanks,
R. Singh

RudiC · March 26, 2015, 3:45pm

Try (untested):

awk  '            {A[$1]=$1; B[$2]=$2; C[$3]=$3; for (i=1; i<=3; i++) D[$i]}
         END     {for (d in D) print A[d]+0, B[d]+0, C[d]+0}
        ' file

Replace 0 s with X s and sort, if need be.

RavinderSingh13 · March 26, 2015, 4:01pm

Hello MDeBiasse,

My solution in post#2 will not look for the highest number in all columns it will look for the number of lines in input file, now let's say our input file is as follows.
Input_file:

cat testtest13
A     B     C
1     1     12
2     3     3
3     14     5
4           6
5

Now following code may help.

awk -F"     " 'BEGIN{print "A" OFS "B" OFS "C"} FNR==NR && NR>1{k=0;for(i=1;i<=NF;i++){k=k<$i?$i:k};q=q<k?k:q;sub(/[[:space:]]/,X,$1);sub(/[[:space:]]/,X,$2);sub(/[[:space:]]/,X,$3);A[$1];B[$2];C[$3];next} END{o=1;for(k=1;k<=q;k++){S=(k in A)?o:"X";U=(k in B)?o:"X";V=(k in C)?o:"X";print S OFS U OFS V;S=U=V="";o++}}' testtest13 testtest13

Output will be as follows.

A B C
1 1 X
2 X X
3 3 3
4 X X
5 X 5
X X 6
X X X
X X X
X X X
X X X
X X X
X X 12
X X X
X 14 X

Hope this helps, will be glad if this works as per your expectations.

Thanks,
R. Singh

durden_tyler · March 26, 2015, 9:22pm

If the delimiter is a Tab character, then:

$
$ cat -t data1.txt
A^IB^IC
1^I1^I2
2^I3^I3
3^I4^I5
4^I^I6
5
$
$ perl -lne '@tokens = split/\t/;
             if ($.==1) { @cols = @tokens; next }
             foreach $i (0..$#tokens) {
                 next if $tokens[$i] eq "";
                 if (not defined $occurs{$tokens[$i]}) {
                     $occurs{$tokens[$i]} = [ "X", "X", "X" ]
                 }
                 $occurs{$tokens[$i]}->[$i] = $tokens[$i]
             }
             END {
                 print join " ", @cols;
                 foreach $k (sort keys %occurs) {
                     print join " ", @{$occurs{$k}}
                 }
             }
            ' data1.txt
A B C
1 1 X
2 X 2
3 3 3
4 4 X
5 X 5
X X 6
$
$

Otherwise, if the delimiter is one-or-more-blank-spaces and the file is fixed-format (i.e. the "6" in the penultimate line is in the same column as "C" in the first line), then:

$
$ cat -t data2.txt
A     B       C
1     1       2
2     3       3
3     4       5
4             6
5
$
$
$ perl -lne 'if ($.==1) {
                 @cols = /^(\S+)\s+(\S+)\s+(\S+)/;
                 $template = sprintf("A%d A%d A*", $-[2]-$-[1], $-[3]-$-[2]);
                 next;
             }
             @tokens = unpack($template, $_);
             foreach $i (0..$#tokens) {
                 next if $tokens[$i] eq "";
                 if (not defined $occurs{$tokens[$i]}) {
                     $occurs{$tokens[$i]} = [ "X", "X", "X" ]
                 }
                 $occurs{$tokens[$i]}->[$i] = $tokens[$i]
             }
             END {
                 print join " ", @cols;
                 foreach $k (sort keys %occurs) {
                     print join " ", @{$occurs{$k}}
                 }
             }
            ' data2.txt
A B C
1 1 X
2 X 2
3 3 3
4 4 X
5 X 5
X X 6
$
$

MDeBiasse · March 27, 2015, 10:03am

Hello RudiC. I found your script to work well. However, when I change the 0's to X's, the output remains with 0's. Do you have any suggestions on how to fix this? Thank you!

RudiC · March 27, 2015, 2:01pm

Try

awk     'NR==1          {print; next}
                        {A[$1]=$1; B[$2]=$2; C[$3]=$3; for (i=1; i<=3; i++) if ($i) D[$i]}
         END            {for (d in D)   {TMP=sprintf ("%s\t%s\t%s", A[d]+0, B[d]+0, C[d]+0)
                                         gsub (/0/, "X", TMP)
                                         print TMP
                                        }
                        }
        ' FS="\t" file
A    B    C
1    1    X
2    X    2
3    3    3
4    4    X
5    X    5
X    X    6