Reporting where values match and mismatch across three columns

Hello,
I have values in three columns. Some values occur in all three columns, other values are present in only one or two columns. I would like to be able to see where the matches and mismatches occur. Thanks in advance for any advice!

I have:

A     B     C
1     1     2
2     3     3
3     4     5
4            6
5

I would like:

A     B     C
1     1      X
2     X      2
3     3      3
4     4      X
5     X      5
X     X      6

Hello MDeBiasse,

If I caught your logic correctly then last line should be X 6 X Following may help you in same. Please let me know if this helps.

awk -vs1=`cat Input_file | wc -l` 'BEGIN{print "A" OFS "B" OFS "C"} FNR==NR && NR>1{A[$1];B[$2];C[$3];next} END{o=1;for(k=1;k<=s1;k++){S=(k in A)?o:"X";U=(k in B)?o:"X";V=(k in C)?o:"X";print S OFS U OFS V;S=U=V="";o++}}' Input_file Input_file
 

Where input_file is as follows:

A B C
1 1 2
2 3 3
3 4 5
4 6
5

Output will be as follows.

A B C
1 1 X
2 X 2
3 3 3
4 4 X
5 X 5
X 6 X

EDIT: Seems moderator has modified post now by adding code tags, thank you for same. Then Following is the input file per user.
Input_file:

cat testtest13
A     B     C
1     1     2
2     3     3
3     4     5
4           6
5

Then following code may help.

awk -vs1=`cat testtest13 | wc -l` -F"     " 'BEGIN{print "A" OFS "B" OFS "C"} FNR==NR && NR>1{sub(/[[:space:]]/,X,$1);sub(/[[:space:]]/,X,$2);sub(/[[:space:]]/,X,$3);A[$1];B[$2];C[$3];next} END{o=1;for(k=1;k<=s1;k++){S=(k in A)?o:"X";U=(k in B)?o:"X";V=(k in C)?o:"X";print S OFS U OFS V;S=U=V="";o++}}' testtest13 testtest13

Output will be as follows.

A B C
1 1 X
2 X 2
3 3 3
4 4 X
5 X 5
X X 6

EDIT: Adding a non one liner form for solution following.

awk -vs1=`cat testtest13 | wc -l` -F"     " 'BEGIN                      {print "A" OFS "B" OFS "C"}
                                             FNR==NR && NR>1            {sub(/[[:space:]]/,X,$1);
                                                                         sub(/[[:space:]]/,X,$2);
                                                                         sub(/[[:space:]]/,X,$3);
                                                                         A[$1];
                                                                         B[$2];
                                                                         C[$3];
                                                                         next
                                                                        }
                                             END                        {o=1;
                                                                                for(k=1;k<=s1;k++){
                                                                                                        S=(k in A)?o:"X";
                                                                                                        U=(k in B)?o:"X";
                                                                                                        V=(k in C)?o:"X";
                                                                                                        print S OFS U OFS V;
                                                                                                        S=U=V="";
                                                                                                        o++
                                                                                                  }
                                                                        }
                                             ' testtest13 testtest13

Thanks,
R. Singh

1 Like

Try (untested):

awk  '            {A[$1]=$1; B[$2]=$2; C[$3]=$3; for (i=1; i<=3; i++) D[$i]}
         END     {for (d in D) print A[d]+0, B[d]+0, C[d]+0}
        ' file

Replace 0 s with X s and sort, if need be.

1 Like

Hello MDeBiasse,

My solution in post#2 will not look for the highest number in all columns it will look for the number of lines in input file, now let's say our input file is as follows.
Input_file:

cat testtest13
A     B     C
1     1     12
2     3     3
3     14     5
4           6
5

Now following code may help.

awk -F"     " 'BEGIN{print "A" OFS "B" OFS "C"} FNR==NR && NR>1{k=0;for(i=1;i<=NF;i++){k=k<$i?$i:k};q=q<k?k:q;sub(/[[:space:]]/,X,$1);sub(/[[:space:]]/,X,$2);sub(/[[:space:]]/,X,$3);A[$1];B[$2];C[$3];next} END{o=1;for(k=1;k<=q;k++){S=(k in A)?o:"X";U=(k in B)?o:"X";V=(k in C)?o:"X";print S OFS U OFS V;S=U=V="";o++}}' testtest13 testtest13

Output will be as follows.

A B C
1 1 X
2 X X
3 3 3
4 X X
5 X 5
X X 6
X X X
X X X
X X X
X X X
X X X
X X 12
X X X
X 14 X

Hope this helps, will be glad if this works as per your expectations.

Thanks,
R. Singh

1 Like

If the delimiter is a Tab character, then:

$
$ cat -t data1.txt
A^IB^IC
1^I1^I2
2^I3^I3
3^I4^I5
4^I^I6
5
$
$ perl -lne '@tokens = split/\t/;
             if ($.==1) { @cols = @tokens; next }
             foreach $i (0..$#tokens) {
                 next if $tokens[$i] eq "";
                 if (not defined $occurs{$tokens[$i]}) {
                     $occurs{$tokens[$i]} = [ "X", "X", "X" ]
                 }
                 $occurs{$tokens[$i]}->[$i] = $tokens[$i]
             }
             END {
                 print join " ", @cols;
                 foreach $k (sort keys %occurs) {
                     print join " ", @{$occurs{$k}}
                 }
             }
            ' data1.txt
A B C
1 1 X
2 X 2
3 3 3
4 4 X
5 X 5
X X 6
$
$

Otherwise, if the delimiter is one-or-more-blank-spaces and the file is fixed-format (i.e. the "6" in the penultimate line is in the same column as "C" in the first line), then:

$
$ cat -t data2.txt
A     B       C
1     1       2
2     3       3
3     4       5
4             6
5
$
$
$ perl -lne 'if ($.==1) {
                 @cols = /^(\S+)\s+(\S+)\s+(\S+)/;
                 $template = sprintf("A%d A%d A*", $-[2]-$-[1], $-[3]-$-[2]);
                 next;
             }
             @tokens = unpack($template, $_);
             foreach $i (0..$#tokens) {
                 next if $tokens[$i] eq "";
                 if (not defined $occurs{$tokens[$i]}) {
                     $occurs{$tokens[$i]} = [ "X", "X", "X" ]
                 }
                 $occurs{$tokens[$i]}->[$i] = $tokens[$i]
             }
             END {
                 print join " ", @cols;
                 foreach $k (sort keys %occurs) {
                     print join " ", @{$occurs{$k}}
                 }
             }
            ' data2.txt
A B C
1 1 X
2 X 2
3 3 3
4 4 X
5 X 5
X X 6
$
$
1 Like

Hello RudiC. I found your script to work well. However, when I change the 0's to X's, the output remains with 0's. Do you have any suggestions on how to fix this? Thank you!

Try

awk     'NR==1          {print; next}
                        {A[$1]=$1; B[$2]=$2; C[$3]=$3; for (i=1; i<=3; i++) if ($i) D[$i]}
         END            {for (d in D)   {TMP=sprintf ("%s\t%s\t%s", A[d]+0, B[d]+0, C[d]+0)
                                         gsub (/0/, "X", TMP)
                                         print TMP
                                        }
                        }
        ' FS="\t" file
A    B    C
1    1    X
2    X    2
3    3    3
4    4    X
5    X    5
X    X    6