Data filtering and category assigning

Please consider the following file, I have many groups which can be of 3 types, T1 (Serial_Number 1) T2 (Serial_Number 2) and T1*T2 (all other Serial_Number).

I want to only consider groups that have both T1 and T2 present and their values are different from each other. In the example file, Group3 and Group 5 are not to be considered for the same reasons.
Important to mention that the data is not sorted, so T1, T2 and T1*T2 rows are scattered in the file, in no particular order.

Group	Type	Value	Serial_number
Group1	T1	aa	1
Group1	T2	tt	2
Group1	T1*T2	at	3
Group1	T1*T2	tt	4
Group2	T1	gg	1
Group2	T2	tt	2
Group2	T1*T2	gg	3
Group2	T1*T2	tt	4
Group2	T1*T2	gt	5
Group3	T1	gg	1
Group3	T2	gg	2
Group3	T1*T2	gg	3
Group3	T1*T2	gg	5
Group4	T1	gg	1
Group4	T2	tt	2
Group4	T1*T2	gt	4
Group4	T1*T2	gg	5
Group5	T1	gg	1
Group5	T1*T2	gt	5

I want to add a column to the output , only for types T1*T2 that states if they match the corrsponding value of T1 in the group, or T2 in the group or doesnt match any of T1 or T2.

For example for Group1, the value of T1*T2 (Serial_number 3) is 'at' which
doesnt match its T1 value of 'aa' or T2 value of 'tt'. So it is 'different'

For Group1, the value of T1*T2 (Serial_number 4) is 'tt' which matches T2 value of 'tt' , so it assigned 'T2-like'

Group	Type	Value	Serial_number	Similar_To
Group1	T1*T2	at	3	different
Group1	T1*T2	tt	4	T2-like
Group2	T1*T2	gg	3	T1-like
Group2	T1*T2	tt	4	T2-like
Group2	T1*T2	gt	5	different
Group4	T1*T2	gt	4	different
Group4	T1*T2	gg	5	T1-like

This is my feeble attempt, which doesn't work.

awk 
' {
   if(!($1 in grp)) {
      grp[$1]++
      type[$1]=$2
      val[$1,1]=$3 FS $4
      next
   }
  NR != 1 {
 grp[$1]++
 type[$2]++
 val[$3]++
 a[$1,$2]=a[$1,$2]" "$3
 
 if($3=="T1") categ="T1-like"
 else if ($3=="P2") categ="T2-like"
 else categ="different"
 
 if($3="T1*T2")
   for (i=1;i<length(grp);i++)
   	if (grp==grp[i-1])
   		catg1=categ
   print $1 FS $2 FS $3 FS $4 FS catg1
 
 }' infile  
 

Strange, complex conditions ... there might be a more elegant solution, but try

awk     'NR==1          {print $0, "similar to"}
         NR==FNR        {if ($2 !~/\*/) {L[$1,$2]=$3
                                         G[$1]++
                                         T[$2]}
                         next}
         $2 ~ /\*/      {C=0
                         K="diff"
                         for (i in T) if (L[$1,i] == $3) {C++; K=i}
                         if (C<2 && G[$1]>=2) print $0, K  
                        }
        ' SUBSEP="," OFS="\t" file4 file4
Group    Type    Value    Serial_number    similar to
Group1    T1*T2    at    3    diff
Group1    T1*T2    tt    4    T2
Group2    T1*T2    gg    3    T1
Group2    T1*T2    tt    4    T2
Group2    T1*T2    gt    5    diff
Group4    T1*T2    gt    4    diff
Group4    T1*T2    gg    5    T1
1 Like

Thank you, my only concern is with reading these data files twice. They can be upto 25Gbs in the future, right now, they are in manageable range.

Is you input file (other than the header line) always in sorted order with the Group (field 1) being the primary sort key and the Serial Number (field 4) being the secondary sort key (as in your sample input file)?

When present, can the Type T1 and T2 Values be anything other than "aa", "gg", or "tt"?

1 Like

Don, they are sorted

-k1,1 -k4,4

.

T1 and T2 values can be among the set of 6 : aa,gg,tt,cc,dd,ii

Important to note that the groups with T1 and T2 values equal and groups with either/both T1 and T2 being absent must be ignored,please consider that as a preprocessing filter if you will.

Thanks for helping !

Could this help you ?
script name - Forum1.pl

#!/usr/bin/perl

my %hashTree;
my @array;

while (<>) {
  chomp;
  @array=split;
  $hashTree{$array[0]}{$array[3]}=[$array[1] ,$array[2]];
}

my ($hashTreeRef,$groupName,$srNo,$status,$groupName);

foreach ( keys %hashTree ) {
  undef $hashTreeRef;
  undef $groupName;

  $hashTreeRef=$hashTree{$_};
  $groupName=$_;

  undef $srNo;
  undef $status;
  undef $vT1;
  undef $vT2;
  undef $vT1T2;

  foreach ( sort { $a <=> $b} keys %$hashTreeRef ) {
    $srNo=$_;
    if($$hashTreeRef{$_}[0] eq "T1") { $vT1=$$hashTreeRef{$_}[1]; }
    if($$hashTreeRef{$_}[0] eq "T2") { $vT2=$$hashTreeRef{$_}[1]; }
    if($$hashTreeRef{$_}[0] eq 'T1*T2') {
      if ( ( defined ($vT1) and defined ($vT2) ) and $vT1 ne $vT2 ) {
          $vT1T2=$$hashTreeRef{$_}[1];
          $status= $vT1T2 eq $vT1 ? "T1-like" : $vT1T2 eq $vT2 ? "T2-like" : "different";
          printf "$groupName\t$$hashTreeRef{$_}[0]\t$vT1T2\t$srNo\t$status\n";
      }
    }
  }
}

Invocation

perl Forum1.pl infile
1 Like

Hello Praveen, thank you for the solution. However there might be some problem. The number of rows returned is much less than what is in the input

 $ cat prav
Group77 T1      AA      Snum048PX-02052
Group77 T2      GG      Snum060PX-02118
Group77 T1*T2   AA      Snum046PX-03072
Group77 T1*T2   AA      Snum048PX-00003
Group77 T1*T2   AA      Snum048PX-00008
Group77 T1*T2   AA      Snum048PX-00010
Group77 T1*T2   AA      Snum048PX-00014
Group77 T1*T2   AA      Snum048PX-00015
Group77 T1*T2   AA      Snum048PX-00016
Group77 T1*T2   AA      Snum048PX-00019

perl prav.pl prav
Group77 T1*T2   AA      Snum048PX-00014 T1-like
Group77 T1*T2   AA      Snum048PX-00008 T1-like
Group77 T1*T2   AA      Snum048PX-00016 T1-like
Group77 T1*T2   AA      Snum048PX-00003 T1-like

Hello Rudi,

For the following set of data no output should be produced since T1 and T2 values are same. Would you please look into this problem?

$ cat  rudi
Group78 T1      AA      Snum048PX02065
Group78 T2      AA      Snum060PX02052
Group78 T1*T2   AA      Snum048PX02068
Group78 T1*T2   AA      Snum048PX02069
Group78 T1*T2   AA      Snum048PX02070
Group78 T1*T2   AG      Snum048PX02093


$ awk     'NR==FNR        {if ($2 !~/\*/) {L[$1,$2]=$3
>                                          G[$1]++
>                                          T[$2]}
>                          next}
>          $2 ~ /\*/      {C=0
>                          K="diff"
>                          for (i in T) if (L[$1,i] == $3) {C++; K=i}
>                          if (C<2 && G[$1]>=2) print $0, K
>                         }
>         ' SUBSEP="," OFS="\t" rudi rudi 
Group78 T1*T2   AG      Snum048PX02093  diff

It produce wrong result bcoz of in your first post serial no is number and in your last post serial no alpha numeric. Modified my code. should give desired o/p.

#!/usr/bin/perl

my %hashTree;
my @array;

while (<>) {
  chomp;
  @array=split;
  $hashTree{$array[0]}{$array[3]}=[$array[1] ,$array[2]];
}

my ($hashTreeRef,$groupName,$srNo,$status,$groupName);

foreach ( keys %hashTree ) {
  undef $hashTreeRef;
  undef $groupName;

  $hashTreeRef=$hashTree{$_};
  $groupName=$_;

  undef $srNo;
  undef $status;
  undef $vT1;
  undef $vT2;
  undef @vT1T2;

  foreach ( keys %$hashTreeRef ) {
    #printf  "$_ ------\n";
    $srNo=$_;
    if($$hashTreeRef{$_}[0] eq "T1") { $vT1=$$hashTreeRef{$_}[1]; $vT1Flag=1;}
    if($$hashTreeRef{$_}[0] eq "T2") { $vT2=$$hashTreeRef{$_}[1]; $vT2Flag=1;}
    if($$hashTreeRef{$_}[0] eq 'T1*T2') {
          push @vT1T2,$$hashTreeRef{$_}[1];
    }
  }
  if ( ( defined ($vT1) and defined ($vT2) ) and $vT1 ne $vT2 ) {
          foreach my $i (@vT1T2) {
            $status= $vT1T2[$i] eq $vT1 ? "T1-like" : $vT1T2[$i] eq $vT2 ? "T2-like" : "different";
            printf "$groupName\t$$hashTreeRef{$_}[0]\t$vT1T2[$i]\t$srNo\t$status\n";
      }
    }

}

I don't think that's a valid group, as either T1 or T2 should be "GG" according to the last line.