Match 2 patterns together

abh.kumar · March 3, 2015, 6:06pm

How can I quickly print the lines of a datafile that contain both patterns from a row of another file? Maybe awk can do it much faster than bash.

Patternfile

ID1 PAT11 PAT12
ID1 PAT21 PAT22
ID2 PAT31 PAT32

datafile

headerline
rgthhrhhhhhtnjttntjjtjtjtjtjtjtjjjtPAT31rf3fffffPAT32efgreggeeeeggge
fgegegPAT11.ewdwd88weded((gfefggegrg!///*...PAT12uuhhggggg
rgthhrhhhhhtnjttntjjtjtjtjtjtjtjjjtPAT41rf3fffffPAT32efgreggeeeeggge
fgegegPAT21.ewdwd88weded((gfefggegrg!///*..ttttuuu.PAT22uuhhggggg
fgegegPAT11.ewdwd88weded((gfefggegrg!///*...PAT12uuhhggggg====

The outputs must be split by the ID (col1) that the patterns belong to.

Outputs

ID1

headerline
fgegegPAT11.ewdwd88weded((gfefggegrg!///*...PAT12uuhhggggg
fgegegPAT21.ewdwd88weded((gfefggegrg!///*..ttttuuu.PAT22uuhhggggg
fgegegPAT11.ewdwd88weded((gfefggegrg!///*...PAT12uuhhggggg====


ID2

headerline
rgthhrhhhhhtnjttntjjtjtjtjtjtjtjjjtPAT31rf3fffffPAT32efgreggeeeeggge

My attempt in bash is very slow:

while read pat
do
while read data
do
if  grep -q $pat[1] $data
if  grep -q $pat[2] $data
echo $data >> $pat[0]
fi
fi
done < datafile
done < patfile

clx · March 4, 2015, 12:47am

A few points:

You don't need the inner loop: grep takes file names as arguments, not strings. If you want to search a string, you have to pass it on STDIN.
You seem to be using an array, but it doesn't work like that: `read pat` stores the whole line in a single variable.
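
To illustrate both points, here is a minimal sketch. The helper name `has_both` is made up for this example, and `-F` assumes the patterns are literal strings rather than regular expressions. The string is piped to grep on STDIN, and `read` splits each pattern line into named variables instead of an array.

```shell
# has_both STRING PAT1 PAT2: succeed if STRING contains both literal patterns.
# (has_both is a hypothetical helper name, not something from the question.)
has_both() {
    # printf feeds the string to grep on stdin; -F = literal match, -q = quiet
    printf '%s\n' "$1" | grep -qF -- "$2" &&
    printf '%s\n' "$1" | grep -qF -- "$3"
}

# read splits a whitespace-delimited line into the named variables
while read -r id pat1 pat2; do
    if has_both "xxPAT11yyPAT12zz" "$pat1" "$pat2"; then
        echo "match for $id"
    fi
done <<EOF
ID1 PAT11 PAT12
ID2 PAT31 PAT32
EOF
```

This still starts two grep processes per string, so it only shows the mechanics; it is not a fast way to scan a large datafile.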

Since your pattern file is whitespace-delimited:

while read -r id pat1 pat2
do
  echo "$id" >> results_file # print ID
  echo >> results_file # print newline
  grep -- "$pat1" datafile | grep -- "$pat2" >> results_file  # print lines matching both patterns
done < patfile
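
If you want one output file per ID with the header line on top, as in the question, the loop could be adapted roughly like this. It is only a sketch: the function name `split_by_id` is made up, `-F` assumes the patterns are literal strings, and the output files are named after the IDs and written to the current directory.

```shell
# split_by_id PATFILE DATAFILE: write one output file per ID.
# (split_by_id is a hypothetical helper name.)
split_by_id() {
    patfile=$1
    datafile=$2
    while read -r id pat1 pat2; do
        # start each ID's file with the header line, once per ID
        [ -e "$id" ] || head -n 1 "$datafile" > "$id"
        # skip the header, keep the lines containing both literal patterns
        tail -n +2 "$datafile" | grep -F -- "$pat1" | grep -F -- "$pat2" >> "$id"
    done < "$patfile"
}

# usage: split_by_id patfile datafile
```

Remove old output files before running it, otherwise the results are appended to them. Lines come out grouped by pattern row rather than in datafile order, and the datafile is still read once per pattern row, so this will be slow for a long pattern file.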

RudiC · March 4, 2015, 3:53am

Try

awk     'FNR==NR        {SP[$1,NR]=$2".*"$3; ID[$1]
                         next
                        }
         FNR==1         {for (i in ID) print > i
                         next
                        }
                        {for (s in SP) if ($0 ~ SP[s]) {split (s, FN, SUBSEP); print > FN[1]}
                        }
        ' patfile datafile
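
Note that the regex built from $2".*"$3 only matches when the first pattern comes before the second on the line, and it treats the patterns as regular expressions. If the patterns are literal strings and may appear in either order, a variant using index() might look like the sketch below. The wrapper name `split_by_id_awk` and the array names are made up for this example.

```shell
# split_by_id_awk PATFILE DATAFILE: one output file per ID, single pass over the data.
# (split_by_id_awk is a hypothetical wrapper name.)
split_by_id_awk() {
    awk '
    FNR == NR {                      # first file: remember every pattern row
        id[NR] = $1; p1[NR] = $2; p2[NR] = $3; n = NR
        next
    }
    FNR == 1 {                       # header line of the datafile
        for (k = 1; k <= n; k++)
            if (!(id[k] in seen)) {  # write the header once per ID
                seen[id[k]]
                print > (id[k])
            }
        next
    }
    {                                # data line: literal match, any order
        for (k = 1; k <= n; k++)
            if (index($0, p1[k]) && index($0, p2[k]))
                print > (id[k])
    }
    ' "$1" "$2"
}

# usage: split_by_id_awk patfile datafile
```

As in the original, a line that matches two rows of the same ID is printed twice, and each ID keeps one output file open until awk exits.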