Count multiple columns and print original file

nans · May 26, 2017, 7:00am

Hello, I have two tab files with headers

File1: with 4 columns

header1    header2    header3    header4
44    a    bb    1
57    c    ab    4
64    d    d    5

File2: with 26 columns

header1..    header5    header6    header7 ... header 22...header26
id1 44    a    bb    
id2 57    c    ab    
id3 64    d    d
id4 103  e g

Output

header1..    header5    header6    header7 ... header 22...header26
id2.. 57    c    ab...    4 ...
id3.. 64    d    d... 5 ...
id4.. 103  e g ...   Unknown ...

I want to compare File1.$1$2$3 with File2.$5$6$7 and print its value in column 22 of File2 only if it matches value of '4' or '5' from File1.$4 or has no entry in File1

I started by trying to comparing columns to see if it produces any output but have been stuck since

awk -F, 'NR == FNR {
  a[$1FS$2FS$3] = $5FS$6$FS7; next 
  }
$4 in a {
  print $0, a[$4]
  }' OFS='\t' file1.txt file2.txt > output.txt

Any help is appreciated. Thank you

Scrutinizer · May 26, 2017, 7:24am

Try something like this:

awk '
  NR == FNR {
    a[$1,$2,$3] = $4
    next 
  }

  ($5,$6,$7) in a {
    if ($22 == a[$5,$6,$7]){
      print
    }
    next
  }

  {
    $22="Unknown"
    print
  }
' FS='\t' OFS='\t' file1.txt file2.txt > output.txt

nans · May 26, 2017, 7:36am

@Scruitizer
Thank you but it seems to be printing everything as 'Unknown' and not the matching values of 4/5 if present in File 1.

Scrutinizer · May 26, 2017, 9:19am

Your requirements are not immediately clear to me. Perhaps this is more what you mean:

awk '
  NR == FNR {
    a[$1,$2,$3] = $4
    next 
  } 

  FNR==1 {
    print
    next
  }

  {
    if(($5,$6,$7) in a) {
      $22 = a[$5,$6,$7]
    }
    else {
      $22="Unknown"
    }
    print
  }
' FS='\t' OFS='\t' file1 file2

nans · May 26, 2017, 9:34am

Sorry, maybe I should have been more clear. So File 1, which has 4 columns, is my reference file. the 4th column has frequencies value from 1 to 5

The second file with 26 columns is my query file and I want all the rows in the output file, that has a frequency (File1.$4) with no value(ie not in file 1) or 4 or 5 (which is in file1).

To put it the other way around, i want to delete all rows that has frequency value 1,2 and 3 and keep everything else by matching File1.$1$2$3 with File2.$5$6$7

Thank you

Scrutinizer · May 26, 2017, 1:52pm

OK, so like this?

awk '
  NR == FNR {
    a[$1,$2,$3] = $4
    next 
  } 

  FNR==1 {
    print
    next
  }

  ($5,$6,$7) in a {
    if (a[$5,$6,$7]>3){
      $22=a[$5,$6,$7]
      print
    }
    next
  }

  {
    $22="Unknown"
    print
  }
' FS='\t' OFS='\t' file1 file2

nans · May 31, 2017, 3:43am

yes, thank you very much.