Print only the matched patterns from each line, not the whole line

Hi,

I am trying to extract some patterns from each line of a file. The input file is space-delimited, but I cannot use a fixed column number to get the value after the "IN" or "OUT" pattern, because the number of words before it varies and there can be several spaces before the digits that I need to print. I need to print three items from each line: the ID in the third field (for example DSU17281T6), the "IN" or "OUT" keyword, and the first number that follows it.

inputfile

>RP: 123 DSU17281T6 DSU17281  Dressrossa crassa PT7T0 hypo prot (124 aa) OUT   0 
>RP: 286 DSU17282T0 DSU17282  Dressrossa crassa PT7T0 hypo prot (287 aa) OUT   5   51   70  111  130  170  189  204  223  234  253 
>RP: 110 DSU17283T0 DSU17283  Dressrossa crassa PT7T0 hypo prot (111 aa) OUT   0 
>RP: 230 DSU17284T2 DSU17284  Dressrossa crassa PT7T0 hypo prot (231 aa)  IN   1   18   35 
>RP: 54 DSU16024T3 DSU16024  Dressrossa crassa PT7T0 mo ATP unit 8 (55 aa) OUT   1   13   32 
>RP: 261 DSU16025T2 DSU16025  Dressrossa crassa PT7T0 mo ATP unit 6 (262 aa) OUT   7   41   60   96  118  127  146  153  172  183  206  213  231  236  254 
>RP: 480 DSU16026T0 DSU16026  Dressrossa crassa PT7T0 mo (481 aa)  IN   3   41   58   96  113  120  137 
>RP: 74 DSU16027T1 DSU16027  Dressrossa crassa PT7T0 mo ATP unit 9 (75 aa)  IN   2   11   35   48   72 
>RP: 250 DSU16028T0 DSU16028  Dressrossa crassa PT7T0 mo cytochrome c oxidase subunit 2 (251 aa) OUT   2   40   59   78   97 

Expected output (tab-delimited)

DSU17281T6	OUT	0 
DSU17282T0	OUT	5
DSU17283T0	OUT	0 
DSU17284T2	IN	1 
DSU16024T3	OUT	1 
DSU16025T2	OUT	7 
DSU16026T0	IN	3 
DSU16027T1	IN	2 
DSU16028T0	OUT	2 

I have tried many things, but none of them gave me what I want. The best I could do is this:

grep -wE "DSU.*T[0-9]|IN[[:space:]]*[0-9]|OUT[[:space:]]*[0-9]"

It shows that the patterns I want are matched, but it still prints the whole line. I then changed "grep -wE" to "grep -oE", but the matches are printed on separate lines, as shown below. I need them on the same line, as in my expected output above:

DSU17281T6	
OUT	0 
DSU17282T0	
OUT	5
DSU17283T0	
OUT	0 
DSU17284T2	
IN	1 
DSU16024T3	
OUT	1 
DSU16025T2	
OUT	7 
DSU16026T0	
IN	3 
DSU16027T1	
IN	2 
DSU16028T0	
OUT	2 

I also tried sed and awk, but I always get the whole line printed. Can anyone show me what I need to change here? Also, may I know how to do this in sed and awk? Thanks.
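
On the grep attempt: "grep -o" prints every match on its own line, so the matches have to be joined again afterwards. Note also that "DSU.*T[0-9]" is greedy and can run on to a later "T" followed by a digit (such as the one in PT7T0). Here is a minimal sketch, assuming each line yields exactly two matches (the ID and the IN/OUT part); the tighter pattern and the temporary sample file are my own assumptions, not part of the original command:

```shell
# Two sample records from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
>RP: 123 DSU17281T6 DSU17281  Dressrossa crassa PT7T0 hypo prot (124 aa) OUT   0 
>RP: 230 DSU17284T2 DSU17284  Dressrossa crassa PT7T0 hypo prot (231 aa)  IN   1   18   35 
EOF

# grep -o emits one match per line; "paste - -" joins each pair of lines
# with a tab; "tr -s" turns the remaining spaces into tabs and squeezes
# each run of tabs down to a single tab.
result=$(grep -oE 'DSU[0-9]+T[0-9]+|(IN|OUT)[[:space:]]+[0-9]+' "$input" \
  | paste - - \
  | tr -s ' ' '\t')

printf '%s\n' "$result"
rm -f "$input"
```

If a line ever produces more or fewer than two matches, the pairing done by paste goes out of step, so this only suits input as regular as the sample.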

This should work, although it is untested:

awk '{for(i=1;i<=NF;i++) if($i ~ "^(IN|OUT)$") print $3,$i,$(i+1)}' file

$ awk  'function p(regex){match($0,regex);return substr($0,RSTART,RLENGTH)}{print p("DSU[0-9]+T[0-9]"),p("(IN|OUT)[[:space:]]+[0-9]")}' file

DSU17281T6 OUT   0
DSU17282T0 OUT   5
DSU17283T0 OUT   0
DSU17284T2 IN   1
DSU16024T3 OUT   1
DSU16025T2 OUT   7
DSU16026T0 IN   3
DSU16027T1 IN   2
DSU16028T0 OUT   2

For tab-separated fields:

$ awk  'function p(regex){match($0,regex); return substr($0,RSTART,RLENGTH)}{s = p("DSU[0-9]+T[0-9]") FS p("(IN|OUT)[[:space:]]+[0-9]"); gsub(/[[:space:]]+/,OFS,s); print s}' OFS='\t'  file

Hi shamrock,

It worked as expected. I just needed to add OFS="\t" at the end. Thanks a lot! :)
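
For anyone reading later, the field-loop approach with a tab output separator might look roughly like the sketch below. It assumes the ID is always the third field; the "-v OFS" form, the "break" and the temporary sample file are my additions for illustration:

```shell
# Two sample records from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
>RP: 286 DSU17282T0 DSU17282  Dressrossa crassa PT7T0 hypo prot (287 aa) OUT   5   51   70  111 
>RP: 74 DSU16027T1 DSU16027  Dressrossa crassa PT7T0 mo ATP unit 9 (75 aa)  IN   2   11   35   48   72 
EOF

# Default field splitting collapses runs of spaces, so the number that
# follows IN/OUT is simply the next field. OFS makes print emit tabs.
result=$(awk -v OFS='\t' '{
  for (i = 1; i < NF; i++)
    if ($i ~ /^(IN|OUT)$/) { print $3, $i, $(i + 1); break }
}' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

Setting OFS on the command line after the program (OFS='\t' before the file name) should behave the same way for this script.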

Hi Akshay Hegde,

It worked perfectly. Thanks a lot :)
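
Since the question also asked about sed, here is one possible sketch. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do) and that each line holds exactly one ID of the form DSU<digits>T<digits>; the "tab" variable and the temporary sample file are only there for illustration:

```shell
# Two sample records from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
>RP: 54 DSU16024T3 DSU16024  Dressrossa crassa PT7T0 mo ATP unit 8 (55 aa) OUT   1   13   32 
>RP: 480 DSU16026T0 DSU16026  Dressrossa crassa PT7T0 mo (481 aa)  IN   3   41   58   96  113  120  137 
EOF

# A literal tab for the replacement text; not every sed understands "\t" there.
tab=$(printf '\t')

# Capture the ID, the IN/OUT keyword and the first number after it, then
# replace the whole line with the three captures. With -n and the "p" flag,
# lines that do not match are not printed at all.
result=$(sed -nE "s/^.*(DSU[0-9]+T[0-9]+).*[[:space:]](IN|OUT)[[:space:]]+([0-9]+).*/\1${tab}\2${tab}\3/p" "$input")

printf '%s\n' "$result"
rm -f "$input"
```

The idea is the same as in the awk answers: match the whole line and keep only the captured parts, rather than trying to make the tool print "just the match".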