Extracting records from file1 that match a pattern in file2

Hi guys,
Can you help me solve this problem?
I have two files, file1 and file2, as follows:
===FILE1====
>LOC21
MASSKFCTVLSLALFLVLLTHANSAELFSFNFQTFNAANLILQGNASVSSSGQLRLTEVKSNGEPKVASL
VASFATAFTFNILAPILSNSADGLAFALVPVGSQPKFNGGFLGLFQNVTYDP
>LOC05
MASSKFSTVLSLALFLVLLTHANSAELFSFNFQTFNAANLILQGNASVSSSGQLRLTEVKSNGEPKVASL
GRAFYSAPIQIWDSTTGKVASFATAFTFNILAPILSNSADGLAFALVPVGSQPKFNGGFLGLFQNVTYDP
AKVLITYDSSTKLLVASLVYPSGS
>LOC48
MASLQTQMISFYAIFLSILLTTILFFKVNSTGEITSFSIPKFRPDQPNLIFQGGGYTTKEKLTLTKAVK

====FILE2====
LOC21
LOC48

I want to write the complete records from FILE1 (each record starts with a '>' sign) that match the patterns in FILE2 into a new file FILE3, which should look like this:
>LOC21
MASSKFCTVLSLALFLVLLTHANSAELFSFNFQTFNAANLILQGNASVSSSGQLRLTEVKSNGEPKVASL
VASFATAFTFNILAPILSNSADGLAFALVPVGSQPKFNGGFLGLFQNVTYDP
>LOC48
MASLQTQMISFYAIFLSILLTTILFFKVNSTGEITSFSIPKFRPDQPNLIFQGGGYTTKEKLTLTKAVK

Your help is highly appreciated :)

Thanks

Try this awk program:

awk '
NR==FNR { keys[">" $1]++ ; next }
/^>/    { selected = ($1 in keys) }
selected
' FILE2 FILE1

Jean-Pierre.
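
For anyone unfamiliar with the two-file idiom, here is the same program with comments, run on a cut-down copy of the sample data. This is only a sketch: the temporary directory, the shortened placeholder sequences (AAAA, BBBB, ...) and the result variable are scaffolding I added, not part of the answer above.

```shell
#!/bin/sh
# Scaffolding for the sketch: small copies of the sample files in a temp dir.
tmp=$(mktemp -d)
printf '%s\n' '>LOC21' 'AAAA' 'BBBB' '>LOC05' 'CCCC' '>LOC48' 'DDDD' > "$tmp/FILE1"
printf '%s\n' 'LOC21' 'LOC48' > "$tmp/FILE2"

result=$(awk '
# NR==FNR holds only while the first file on the command line (FILE2) is read.
# Store each name with a leading ">" so it looks like the first header field.
NR==FNR { keys[">" $1]++ ; next }
# Header line of FILE1: decide whether this record is wanted.
/^>/    { selected = ($1 in keys) }
# Pattern without action: print every line while "selected" is true.
selected
' "$tmp/FILE2" "$tmp/FILE1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Redirect the awk output to FILE3 to get the file asked for in the question.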

I couldn't get aigles' code to work on my Cygwin.

Anyway, this is my version:

#! /bin/sh

if [ $# -ne 2 ]; then
        echo "Usage: $0 <file1> <file2>"
        exit 1
fi

awk -v f2="$2" '
BEGIN {
        ok=0
        count=1
        # Load the names from file2, prefixed with ">" to match header lines
        while ( (getline line < f2) > 0 ) {
                file2[count]=sprintf(">%s",line)
                ++count
        }
        close(f2)
}
/^>LOC/ {
        for (i=1;i<count;++i) {
                if ($0 == file2[i]) {
                        print $0
                        ok=1
                        next
                }
        }
        ok=0
}
ok==1 {
        print
}' "$1"

Run it

$ ./file.sh file1 file2
>LOC21
MASSKFCTVLSLALFLVLLTHANSAELFSFNFQTFNAANLILQGNASVSSSGQLRLTEVKSNGEPKVASL
VASFATAFTFNILAPILSNSADGLAFALVPVGSQPKFNGGFLGLFQNVTYDP
>LOC48
MASLQTQMISFYAIFLSILLTTILFFKVNSTGEITSFSIPKFRPDQPNLIFQGGGYTTKEKLTLTKAVK

Thanks, Jean-Pierre.

I tried to run your code, but it didn't produce any output or error.

smriti.

Thanks for your help. The code is running perfectly, but I have one more problem.

Actually, the line beginning with '>' contains other words as well, and I have different files in which LOC can be something else, like ABC or GNL, but the first three letters after '>' will be the same. I solved that by replacing the line
/^>LOC/ {
with
/^>/ {

My file looks like this:
>LOC21 ths is a seq of protein bla-bla-bla
MASSKFCTVLSLALFLVLLTHANSAELFSFNFQTFNAANLILQGNASVSSSGQLRLTEVKSNGEPKVASL
VASFATAFTFNILAPILSNSADGLAFALVPVGSQPKFNGGFLGLFQNVTYDP

So when I tried it on my actual file, it didn't work. As far as I understand, the words separated by spaces in the header line (beginning with '>') are causing the trouble.

I would be thankful if you could help me sort this out.

Cheers! :)
smriti
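
A possible explanation, assuming the loop-based script above is the one being run: it compares the whole header line ($0) with each name, so any extra words after the ID make the comparison fail. Below is a minimal sketch that compares only the first whitespace-separated field ($1) instead. The array name wanted, the sample header text and the temp files are made up for the illustration.

```shell
#!/bin/sh
# Scaffolding: a header with extra words, plus one unwanted record.
tmp=$(mktemp -d)
printf '%s\n' '>LOC21 some description text' 'AAAA' '>LOC05 other text' 'CCCC' > "$tmp/file1"
printf '%s\n' 'LOC21' > "$tmp/file2"

result=$(awk -v f2="$tmp/file2" '
BEGIN {
        # Load the wanted names, prefixed with ">" to match header lines.
        while ((getline line < f2) > 0)
                wanted[">" line] = 1
        close(f2)
}
/^>/ {
        # Compare only the first field, so trailing words are ignored.
        ok = ($1 in wanted)
}
ok
' "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Using the names as array indexes also removes the need for the inner for loop, since the "in" test does the lookup directly.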

The script works fine on my box with your example data files.

> cat -n smriti1.dat
     1  >LOC21
     2  MASSKFCTVLSLALFLVLLTHANSAELFSFNFQTFNAANLILQGNASVSSSGQLRLTEVKSNGEPKVASL
     3  VASFATAFTFNILAPILSNSADGLAFALVPVGSQPKFNGGFLGLFQNVTYDP
     4  >LOC05
     5  MASSKFSTVLSLALFLVLLTHANSAELFSFNFQTFNAANLILQGNASVSSSGQLRLTEVKSNGEPKVASL
     6  GRAFYSAPIQIWDSTTGKVASFATAFTFNILAPILSNSADGLAFALVPVGSQPKFNGGFLGLFQNVTYDP
     7  AKVLITYDSSTKLLVASLVYPSGS
     8  >LOC48
     9  MASLQTQMISFYAIFLSILLTTILFFKVNSTGEITSFSIPKFRPDQPNLIFQGGGYTTKEKLTLTKAVK
> cat -n smriti2.dat
     1  LOC21
     2  LOC48
> cat -n smriti.sh
     1  awk '
     2  NR==FNR { keys[">" $1]++ ; next }
     3  /^>/    { selected = ($1 in keys) }
     4  selected
     5  ' smriti2.dat smriti1.dat
     6
> smriti.sh
>LOC21
MASSKFCTVLSLALFLVLLTHANSAELFSFNFQTFNAANLILQGNASVSSSGQLRLTEVKSNGEPKVASL
VASFATAFTFNILAPILSNSADGLAFALVPVGSQPKFNGGFLGLFQNVTYDP
>LOC48
MASLQTQMISFYAIFLSILLTTILFFKVNSTGEITSFSIPKFRPDQPNLIFQGGGYTTKEKLTLTKAVK
>

Beware: the input files must be given in the order FILE2 FILE1.
If you specify FILE1 FILE2, the script doesn't produce any result.

Jean-Pierre.

Hi aigles,

It's working now, although I had put the files in the correct order before as well.

Actually, I tried to run it as a single line on the command line. I think it shouldn't make any
difference.
But anyway, it's working fine now, and it solved my other problem as well, since it works even if my header line (the one beginning with '>') contains more words.

Thanks, Jean.
smriti
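
On the single-line question: as far as I know, the program should behave the same on one line, provided each rule stays intact and the quotes are closed properly. A sketch of the one-line form, using made-up temp files and shortened placeholder sequences:

```shell
#!/bin/sh
# Scaffolding: one header carries extra words to show that $1 still matches.
tmp=$(mktemp -d)
printf '%s\n' '>LOC21 extra words here' 'AAAA' '>LOC05' 'CCCC' '>LOC48' 'DDDD' > "$tmp/FILE1"
printf '%s\n' 'LOC21' 'LOC48' > "$tmp/FILE2"

# Same three rules as the multi-line program, written on one line.
result=$(awk 'NR==FNR { keys[">" $1]++ ; next } /^>/ { selected = ($1 in keys) } selected' "$tmp/FILE2" "$tmp/FILE1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

If the one-line version prints nothing, it is worth checking that nothing was lost when joining the lines, for example the closing quote or the space before the file names.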

I tried the same script on a slightly modified file where the header line is
>LOC_Os01g57570.1|12001.m11908|protein minor allergen Alt a 7, putative, [expressed]
MAVKVYVVYYSMYGHVAKLAEEIKKGASSIEGVEAKIWQVPETLHEEVLGKMGAPPKPDV
PTITPQELTEADGILFGFP

in place of

>LOC575
MAVKVYVVYYSMYGHVAKLAEEIKKGASSIEGVEAKIWQVPETLHEEVLGKMGAPPKPDV
PTITPQELTEADGILFGFP

The script is not working on this. Can you please help me out?

Thanks
Smriti

What are the contents of FILE2?

Hi aigles,

Here are the file details:

===FILE1===
>LOC_Os01g57570.1|12001.m11908|protein minor allergen Alt a 7, [expressed]
MAVKVYVVYYSMYGHVAKLAEEIKKGASSIEGVEAKIWQVPETLHEEVLGKMGAPPKPDV
PTITPQELTEADGILFGFP
>LOC_Os01g57640.1|12001.m11908|protein lectin 7, (putative), expressed
MAVKVYVVYYSMYGHVAKLAEEIKKGASSIEGVEAKIWQVPETLHEEVLGKMGAPPKPDV
PTITPQELTEADGILFGFPTRFGMMAAQMKAFFDATGGLWSEQSLAGKPAGIFFS
>LOC_Os01g57000.2|12001.m43222|protein minor allergen Alt a 7
MAVKVYVVYYSMYGHVAKLAEEIKKGASSIEGVEAKIWQVPETLHEEVLGKMGAPPKPDV
PTITPQELTEADGILFGFPTRFGMMAAQMKAFFDATGGLWSEQSL

====FILE2====
LOC_Os01g57570
LOC_Os01g57000

The LOC can be any three letters, such as ABC or GNL, but they will be the same in every header (line starting with a '>' symbol).

Thanks :)
smriti

If the first '.' character in the header lines of FILE1 can act as a field separator:

awk -F. '
NR==FNR { keys[">" $1]++ ; next }
/^>/    { selected = ($1 in keys) }
selected
' FILE2 FILE1

If that is not the case, but all values in FILE2 have the same length:

awk '
NR==1   { key_length = length($1)+1 }
NR==FNR { keys[">" $1]++ ; next }
/^>/    { selected = (substr($0, 1, key_length) in keys) }
selected
' FILE2 FILE1

Otherwise:

awk '
NR==FNR { keys[">" $1] = length($1)+1 ; next }
/^>/    {
   selected = 0;
   for (k in keys) {
      if (substr($0,1,keys[k]) == k) {
         selected = 1;
         break;
      }
   }
}
selected
' FILE2 FILE1

Jean-Pierre.
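
One caveat with plain prefix matching: a name in FILE2 that happens to be the beginning of a longer ID would also select that record. Here is a sketch of a stricter variant, under the assumption (mine, not stated in the thread) that an ID consists only of letters, digits and underscores. The truncated name LOC_Os01g5700 in FILE2 is hypothetical and is there only to show that it no longer matches LOC_Os01g57000.

```shell
#!/bin/sh
# Scaffolding: headers taken from the example above, placeholder sequences.
tmp=$(mktemp -d)
printf '%s\n' \
  '>LOC_Os01g57570.1|12001.m11908|protein minor allergen Alt a 7, [expressed]' 'SEQ1' \
  '>LOC_Os01g57640.1|12001.m11908|protein lectin 7, (putative), expressed' 'SEQ2' \
  '>LOC_Os01g57000.2|12001.m43222|protein minor allergen Alt a 7' 'SEQ3' > "$tmp/FILE1"
printf '%s\n' 'LOC_Os01g57570' 'LOC_Os01g5700' > "$tmp/FILE2"

result=$(awk '
NR==FNR { keys[">" $1] = 1 ; next }
/^>/    {
   # Assumption: an ID is ">" followed by letters, digits or underscores.
   # match() sets RSTART and RLENGTH to the position and length of the ID.
   selected = 0
   if (match($0, /^>[A-Za-z0-9_]+/))
      selected = (substr($0, RSTART, RLENGTH) in keys)
}
selected
' "$tmp/FILE2" "$tmp/FILE1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the whole ID is extracted before the lookup, this also avoids looping over every key for each header, which may matter for large FILE2 lists.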

The code is running perfectly. Thanks a lot.

smriti :)