Matching the entries and printing data

Hi all,

I have a file of IDs that I want to compare against a second file in order to retrieve the sequence for each ID.

File 1 with ID

Q7L8J4
Q676U5
Q8NAA4
Q5TYW2
Q5SQ80
Q5VUR7
Q4UJ75
Q96IX9
Q7Z4T9
Q6NTF7
Q8IZP0
Q9NYB9
Q9P2A4
O14639
Q9ULW3
Q969K4
Q15057
Q5T8D3
Q8N7X0
Q9Y2D8
Q8TED9
Q8N4X5

File 2 with sequence information

>sp|Q7L8J4|3BP5L_HUMAN SH3 domain-binding protein 5-like OS=Homo sapiens GN=SH3BP5L PE=1 SV=1
MAELRQVPGGRETPQGELRPEVVEDEVPRSPVAEEPGGGGSSSSEAKLSPREEEELDPRI
QEELEHLNQASEEINQVELQLDEARTTYRRILQESARKLNTQGSHLGSCIEKARPYYEAR
RLAKEAQQETQKAALRYERAVSMHNAAREMVFVAEQGVMADKNRLDPTWQEMLNHATCKV
NEAEEERLRGEREHQRVTRLCQQAEARVQALQKTLRRAIGKSRPYFELKAQFSQILEEHK
AKVTELEQQVAQAKTRYSVALRNLEQISEQIHARRRGGLPPHPLGPRRSSPVGAEAGPED
MEDGDSGIEGAEGAGLEEGSSLGPGPAPDTDTLSLLSLRTVASDLQKCDSVEHLRGLSDH
VSLDGQELGTRSGGRRGSDGGARGGRHQRSVSL
>sp|Q676U5|A16L1_HUMAN Autophagy-related protein 16-1 OS=Homo sapiens GN=ATG16L1 PE=1 SV=2
MSSGLRAADFPRWKRHISEQLRRRDRLQRQAFEEIILQYNKLLEKSDLHSVLAQKLQAEK
HDVPNRHEISPGHDGTWNDNQLQEMAQLRIKHQEELTELHKKRGELAQLVIDLNNQMQRK
DREMQMNEAKIAECLQTISDLETECLDLRTKLCDLERANQTLKDEYDALQITFTALEGKL
RKTTEENQELVTRWMAEKAQEANRLNAENEKDSRRRQARLQKELAEAAKEPLPVEQDDDI
EVIVDETSDHTEETSPVRAISRAATKRLSQPAGGLLDSITNIFGRRSVSSFPVPQDNVDT
HPGSGKEVRVPATALCVFDAHDGEVNAVQFSPGSRLLATGGMDRRVKLWEVFGEKCEFKG
SLSGSNAGITSIEFDSAGSYLLAASNDFASRIWTVDDYRLRHTLTGHSGKVLSAKFLLDN
ARIVSGSHDRTLKLWDLRSKVCIKTVFAGSSCNDIVCTEQCVMSGHFDKKIRFWDIRSES
IVREMELLGKITALDLNPERTELLSCSRDDLLKVIDLRTNAIKQTFSAPGFKCGSDWTRV
VFSPDGSYVAAGSAEGSLYIWSVLTGKVEKVLSKQHSSSINAVAWSPSGSHVVSVDKGCK
AVLWAQY
>sp|Q8NAA4|A16L2_HUMAN Autophagy-related protein 16-2 OS=Homo sapiens GN=ATG16L2 PE=2 SV=2
MAGPGVPGAPAARWKRHIVRQLRLRDRTQKALFLELVPAYNHLLEKAELLDKFSKKLQPE
PNSVTPTTHQGPWEESELDSDQVPSLVALRVKWQEEEEGLRLVCGEMAYQVVEKGAALGT
LESELQQRQSRLAALEARVAQLREARAQQAQQVEEWRAQNAVQRAAYEALRAHVGLREAA
LRRLQEEARDLLERLVQRKARAAAERNLRNERRERAKQARVSQELKKAAKRTVSISEGPD
TLGDGMRERRETLALAPEPEPLEKEACEKWKRPFRSASATSLTLSHCVDVVKGLLDFKKR
RGHSIGGAPEQRYQIIPVCVAARLPTRAQDVLDAHLSEVNAVRFGPNSSLLATGGADRLI
HLWNVVGSRLEANQTLEGAGGSITSVDFDPSGYQVLAATYNQAAQLWKVGEAQSKETLSG
HKDKVTAAKFKLTRHQAVTGSRDRTVKEWDLGRAYCSRTINVLSYCNDVVCGDHIIISGH
NDQKIRFWDSRGPHCTQVIPVQGRVTSLSLSHDQLHLLSCSRDNTLKVIDLRVSNIRQVF
RADGFKCGSDWTKAVFSPDRSYALAGS

I want to compare the two files by matching each ID in the first file against the ID enclosed between the pipe characters ("|") in the header lines of the second file. If they match, the complete sequence record should be printed.

For example, Q676U5 appears in both the first file and the second file, so the output should look something like the following.
Expected output

>sp|Q676U5|A16L1_HUMAN Autophagy-related protein 16-1 OS=Homo sapiens GN=ATG16L1 PE=1 SV=2
MSSGLRAADFPRWKRHISEQLRRRDRLQRQAFEEIILQYNKLLEKSDLHSVLAQKLQAEK
HDVPNRHEISPGHDGTWNDNQLQEMAQLRIKHQEELTELHKKRGELAQLVIDLNNQMQRK
DREMQMNEAKIAECLQTISDLETECLDLRTKLCDLERANQTLKDEYDALQITFTALEGKL
RKTTEENQELVTRWMAEKAQEANRLNAENEKDSRRRQARLQKELAEAAKEPLPVEQDDDI
EVIVDETSDHTEETSPVRAISRAATKRLSQPAGGLLDSITNIFGRRSVSSFPVPQDNVDT
HPGSGKEVRVPATALCVFDAHDGEVNAVQFSPGSRLLATGGMDRRVKLWEVFGEKCEFKG
SLSGSNAGITSIEFDSAGSYLLAASNDFASRIWTVDDYRLRHTLTGHSGKVLSAKFLLDN
ARIVSGSHDRTLKLWDLRSKVCIKTVFAGSSCNDIVCTEQCVMSGHFDKKIRFWDIRSES
IVREMELLGKITALDLNPERTELLSCSRDDLLKVIDLRTNAIKQTFSAPGFKCGSDWTRV
VFSPDGSYVAAGSAEGSLYIWSVLTGKVEKVLSKQHSSSINAVAWSPSGSHVVSVDKGCK
AVLWAQY

Thanks in advance

I assume you want any entry from the second file to be printed if its ID is in the first. This might do it:

awk -F "|" '
    NR == FNR { idx[$1] = 1; next; }
    /^>sp/ { snarf = $2 in idx  }
    snarf
' file1 file2 >output-file
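
To see how the flag approach behaves, here is a small self-contained check. The two miniature input files are made up for illustration (shortened placeholder sequences, not real data), and the temporary directory is just a convenience; the awk program itself is the same as the one above, with parentheses added around the `in` test for readability.

```shell
# Hypothetical miniature inputs: placeholder sequences, not real data.
tmp=$(mktemp -d)
printf '%s\n' Q676U5 Q8NAA4 > "$tmp/file1"
cat > "$tmp/file2" <<'EOF'
>sp|Q7L8J4|3BP5L_HUMAN example one
MAELRQ
VPGGRE
>sp|Q676U5|A16L1_HUMAN example two
MSSGLR
AADFPR
EOF

# First file (NR == FNR): remember every ID as an array key.
# Second file: on each ">sp" header, set the flag to 1 if field 2
# (the text between the pipes) is a known ID, otherwise to 0.
# The bare pattern "snarf" prints the current line while the flag is non-zero,
# so the header and all sequence lines up to the next header are printed.
awk -F '|' '
    NR == FNR { idx[$1] = 1; next }
    /^>sp/    { snarf = ($2 in idx) }
    snarf
' "$tmp/file1" "$tmp/file2" > "$tmp/out"

cat "$tmp/out"
```

With these inputs only the Q676U5 header and its two sequence lines are printed; the Q7L8J4 record is skipped because its ID is not in the first file.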

Hi,

It does work, but it also prints records whose IDs are not in file 1. These extra records are printed after the last entry from file 1. Why does this happen?

Try the code below. Put it in a script such as test.sh, make it executable, and run it.

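# replace /path/to/file1 with the location of your ID file
for i in $(cat /path/to/file1)
do
# -F matches the ID literally; only the matching header lines are printed
grep -F "|$i|" file2 >> outputfile
done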

thanks

Unsure. I tested my code with a file that contained entries that weren't matched in file 1 and never got anything but what was matched. The only thing I can think of is that it's an awk version issue. Do you get the same results with this:

awk -F "|" '
    NR == FNR { idx[$1] = 1; next; }
    /^>sp/ {  snarf = idx[$2]+0; }
    snarf
' file1 file2 >output-file

Are the data lines split, or have you wrapped the output in your demo file2?
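
One other possibility worth ruling out, offered as a guess rather than a diagnosis: the flag is only updated on headers that start with `>sp`. If the real file2 also contains headers with a different prefix, those records would inherit the flag from the preceding `>sp` header and be printed after a match. The sketch below assumes such a header exists (the `>tr` line and its ID `A0A000` are hypothetical, since the sample shows only `>sp` headers) and compares the original pattern with one that resets the flag on every `>` header.

```shell
# Hypothetical inputs: the ">tr" record and its ID are invented for illustration.
tmp=$(mktemp -d)
printf '%s\n' Q676U5 > "$tmp/file1"
cat > "$tmp/file2" <<'EOF'
>sp|Q676U5|A16L1_HUMAN example
MSSGLR
>tr|A0A000|HYPOTHETICAL example
MAGPGV
EOF

# Original pattern: the flag changes only on ">sp" headers,
# so the ">tr" record after a match is printed as well.
awk -F '|' '
    NR == FNR { idx[$1] = 1; next }
    /^>sp/    { snarf = ($2 in idx) }
    snarf
' "$tmp/file1" "$tmp/file2" > "$tmp/out_sp"

# Variation: re-evaluate the flag on every header line, whatever its prefix.
awk -F '|' '
    NR == FNR { idx[$1] = 1; next }
    /^>/      { snarf = ($2 in idx) }
    snarf
' "$tmp/file1" "$tmp/file2" > "$tmp/out_any"

echo "with /^>sp/:"; cat "$tmp/out_sp"
echo "with /^>/:";   cat "$tmp/out_any"
```

If the extra records in the real output have headers that do not begin with `>sp`, the second form may be enough; if they do begin with `>sp`, the cause lies elsewhere.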