Matched a pattern from multiple columns

redse171 · February 7, 2014, 10:26am

Hi,

I need to extract an info in $1 based on a matched pattern in $2,$3,$4, and $5.
The sample input file as follows:-

ID                 Pat1    Pat2    Pro1     use1 

add41              M                 M        M
add87              M        M        M        M
add32                       M        M        M
add66              M        
add55              M        M        M        M

The output should be:-

add87
add55

My input file is tab-delimited and i tried coding using awk:-

awk 'x[$2,$3,$4,$5]++' FS=="\t" inputfile

but it gives me the whole ID with 'M'. I just need those that have M in all columns (2,3,4,5). Appreciate your kind help. Thanks

vgersh99 · February 7, 2014, 10:33am

to start with:

awk 'NF==5 {print $1}' myFile

redse171 · February 7, 2014, 10:46am

Hi vgersh99,

Your code does work. thanks so much!
But i have another issue. If i want to get output on a specified columns, how should i do it? if i set the NF==2, i will get the result for any 2 columns that have "M" but i would not know which columns they are. unless i print the whole thing and check which columns. But, this going to make me check all of them and i have huge files to work on. Therefore, i need to set the specific columns. Let say, i want to get ID which $2 and $3 have "M". How should i do it? thanks

vgersh99 · February 7, 2014, 10:56am

awk -f red.awk myFile
OR
awk -v cols='2 3' -f  red.awk myFile

where red.awk:

BEGIN {
  if (!cols) cols="2 3 4 5"
  if (!val) val="M"
  nc=split(cols, colsA, FS)
}
NF==5 {
  p=1
  for(i=1;i<=nc;i++)
    if ($colsA != val) {
       p=0
       break
    }
  if (p) print $1
}

redse171 · February 7, 2014, 11:57am

Hi vgersh99,

i tried your above codes, but the result is wrong. if possible, can you please explain to me the coding?
also, just for clarification, for the example that i gave earlier ($2 and $3), it means that i want to extract ID in condition where both columns has "M" at the same time. i might need to extract ID where $2 and $5 have "M" at the same time too. thanks

Akshay_Hegde · February 7, 2014, 12:30pm

Could this help you ?

$ cat file
ID                 Pat1    Pat2    Pro1     use1 

add41              M                 M        M
add87              M        M        M        M
add32                       M        M        M
add66              M        
add55              M        M        M        M

awk 'NR==1{
           n=split(col,A," ")
          }
          {
           flag=0
           for(i=1;i<=n;i++)
           if($A==key)++flag
           if(flag==n)print $1
          }' col='2 3 4 5' key='M' file

Resulting

add87
add55

redse171 · February 7, 2014, 12:45pm

Hi Akhsay,

Your code does work for the first condition that i mentioned above. But if i choose $2 and $3 to have "M" at the same time, it didnt work. ok, i will give another example here:-

ID                 Pat1    Pat2    Pro1     use1 

add11              M        M        
add17              M                M        M
add32              M        M        
add66              M        M        
add55              M        M       M       
add47                       M       M

If i choose $2 and $3, the results should be

add11
add32
add66

thanks.

Akshay_Hegde · February 7, 2014, 1:13pm

redse171:

Hi Akhsay,

Your code does work for the first condition that i mentioned above. But if i choose $2 and $3 to have "M" at the same time, it didnt work. ok, i will give another example here:-
ID                 Pat1    Pat2    Pro1     use1 

add11              M        M        
add17              M                M        M
add32              M        M        
add66              M        M        
add55              M        M       M       
add47                       M       M   
If i choose $2 and $3, the results should be
add11
add32
add66
thanks.

Why not ?

add11
add32
add66
add55

post output of cat -A from your new input.

redse171 · February 7, 2014, 2:01pm

I just want "M" that exist in 2 columns not more or less than that. add55 has M in column 4, which is not what i want. thanks

vgersh99 · February 8, 2014, 10:23am

awk -v cols='2 3' -f red.awk myFile

red.awk:

BEGIN {
  FS="\t"
  if (!cols) cols="2 3 4 5"
  if (!val) val="M"
  nc=split(cols, colsA, " ")
}
{
  p=0
  for(i=1;i<=nc;i++)
    ($colsA==val)?p++:p--
  if (p==nc && p+1==NF) print $1
}

if it still doesn't work, please provide the output of cat -vet myFile using the code tags.

redse171 · February 8, 2014, 11:36am

Hi,

i guess it is the problem with FS. Below is cat results:-

ID                                 use1                           P00dfd|Pat_2          P00dddf|PRO1                        P00ddff|Pat1          $
Add7G00710                                                                                               M                                                                     $
Add6G11920                         M                                                                     M                                                                     $
Add6G00630                                                                                               M                                                                     $
Add5G10140                                                                                               M                                                                     $
Add4G00470                                                                                               M                                                                     $
Add3G02060                                                                                               M                                                                     $
Add2G17360                                                            M                                  M                                                                     $
Add2G12550                                                                                               M                                                                     $
Add2G05350                         M                                  M                                                                                                        $
Add6G00710                                                                                               M                                                                     $
Add5G01350                                                                                               M                                                                     $
Add4G14230                                                                                               M                                                                     $
Add4G06080                         M                                  M                                  M                                  M                                  $
Add4G03920                                                                                               M                                                                     $
Add3G11760                                                                                               M                                                                     $
Add3G02890                                                                                               M                                                                     $
Add1G11830                                                            M                                  M                                  M                                  $
Add2G12710                                                                                               M                                                                     $
Add5G13290                         M                                  M                                                                     $

Actually my data was in vertical. I used an awk script to make it horizontal so that it is easier for me to see where "M" is. But, when i do column -t to clean it, the alignment goes haywire. I am still trying to figure out how to deal with this. If you have suggestion, pls let me know. Thanks

vgersh99 · February 8, 2014, 12:02pm

I don't know what you mean by vertical vs horizontal, but what you quoted is not TAB separated fields, but rather multi-space separated fields.
Pls post the original file before any transformations.

redse171 · February 8, 2014, 12:27pm

Hi,

a sample of original data as follows:-

Add7G00710	213	235	P00dddf|PRO1	M
Add6G11920    	65	511	use1		M                                                                     
Add6G11920     	133	149	P00dddf|PRO1    M                                                                                                                                          
Add6G00630     	78	510	P00dddf|PRO1    M                                                                     
Add5G10140     	93	539	P00dddf|PRO1    M                                                                     
Add4G00470     	10	471	P00dddf|PRO1    M
Add4G00470     	109	351	P00dddf|PRO1    M                                                                     
Add3G02060     	1	299	P00dddf|PRO1    M
Add2G17360     	1	116	P00dfd|Pat_2    M                                                         
Add2G17360     	27	478	P00dddf|PRO1    M                                                                     
Add2G12550     	5	22	P00dddf|PRO1    M                                                                     
Add2G05350     	1	450	use1            M        
Add2G05350     88	624	P00dfd|Pat_2    M                                                                                                        
Add6G00710     11	447	P00dddf|PRO1    M                                                                     
Add5G01350     149	164	P00dddf|PRO1    M                                                                     
Add4G14230     11	447	P00dddf|PRO1    M                                                                     
Add4G06080     132	157	use1            M                                                                  
Add4G06080     113	514	P00dfd|Pat_2    M 
Add4G06080     24	481	P00dddf|PRO1    M 
Add4G06080     320	337	P00ddff|Pat1    M 
Add4G03920     75	507	P00dddf|PRO1    M                                                                     
Add3G11760     27	530	P00dddf|PRO1    M                                                                     
Add3G02890     19	473	P00dddf|PRO1    M                                                                     
Add1G11830     36	472	P00dddf|PRO1    M  
Add1G11830     66	561	P00ddff|Pat1    M     
Add1G11830     27	536	P00ddfd|Pat_2   M                                  
Add2G12710     23	543	P00dddf|PRO1    M                                                                     
Add5G13290     5	22	use1            M  
Add5G13290     37	54	P00dfd|Pat_2	M

There are duplicates as they are different in $2 and $3. But, whichever duplicates, i just take one. for example, Add4G00470 in $1 has appeared twice due to value in $2 and $3. Since they are in the same group (in $4), i take it as 1 only to my output. Thanks.

Scrutinizer · February 8, 2014, 12:48pm

Please note that your latest sample differs considerably from your samples in posts #1, #7 and #11 and these also differ from one another. Is this the correct sample this time?

*snip*

redse171 · February 8, 2014, 1:23pm

Hi Scrutinizer,

Sorry for the confusion. The first sample (#1) is just a toy example. When things didnt work out, then i gave the real sample post (#7 and #11). However, since there is issue with field separator, vgersh99 asked for the original file. Therefore, the latest data is the original data before being transformed into #7 or #11 as per request. Thanks

juzz4fun · February 10, 2014, 9:32am

Hi,
From what I understand, you want to print first column whenever there are exactly two "M"s in 2nd, 3rd, 4th and 5th columns combined. If that is the case (and file is tab delimited), I would create a korn shell script as below and name it as script.ksh :

#!/bin/ksh

infile=$1

nawk -F"        " <${infile} \
     'BEGIN{cnt=0}           \
      NR>=2{hdr=$1; if ( $2 == "M" ) cnt++;
                    if ( $3 == "M" ) cnt++;
                    if ( $4 == "M" ) cnt++;
                    if ( $5 == "M" ) cnt++;
                    if ( cnt == 2 ) print hdr; cnt=0}'

Then I execute this script as below:

script.ksh infile
where infile is the tab delimited file of interest.
Hope this helps

redse171 · February 10, 2014, 10:11am

Hi juzz4fun,

thanks so much for the codes. I have been thinking about the FS issue. I have decided to work on the original file post #13 (before transformation) to get what i wanted instead of input file in post #11. It is much easier i guess as the original file is tab delimeted. Thanks