Retrieve records from one file by comparing IDs against a second file

Hi all,

I have one file with IDs:

Q8NDM7
P0C1S8
Q8TF30
Q9BRP8
O00258
Q6AWC2
Q9ULE0
Q702N8
A4UGR9
Q13426
Q6P2D8
Q9ULM3
A8MXQ7

I want to compare the ID file with a second file that holds complete information about these IDs, and also about other IDs that are not in the ID file. As a result I want only the information for the entries listed in the ID file. The second file contains records such as:

ID   3BP5L_HUMAN             Reviewed;         393 AA.
AC   Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3;
DT   05-FEB-2008, integrated into UniProtKB/Swiss-Prot.
DT   05-JUL-2004, sequence version 1.
DT   05-SEP-2012, entry version 71.
DE   RecName: Full=SH3 domain-binding protein 5-like;
DE            Short=SH3BP-5-like;
GN   Name=SH3BP5L; Synonyms=KIAA1720; ORFNames=UNQ2766/PRO7133;
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC   Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RC   TISSUE=Brain;
RX   MEDLINE=21082932; PubMed=11214970; DOI=10.1093/dnares/7.6.347;
RA   Nagase T., Kikuno R., Hattori A., Kondo Y., Okumura K., Ohara O.;
RT   "Prediction of the coding sequences of unidentified human genes. XIX.
RT   The complete sequences of 100 new cDNA clones from brain which code
RT   for large proteins in vitro.";
RL   DNA Res. 7:347-355(2000).
RN   [2] //

We need more records from the second file to see how they are separated. Also, please post the desired output for that sample data.

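As @bartus11 says, your two sample files aren't connected in any way: none of the IDs in the first file appears in the record you posted. Usually (depending on the grep version you have installed) grep -f file1 file2 would do the job of finding all lines in file2 that contain an ID from file1. Note that this returns only the matching lines, not the whole record around them.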
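To make the grep -f behaviour concrete, here is a minimal sketch on a cut-down stand-in for the data. The file names ids.txt and records.dat and the shortened records are assumptions for illustration only; -w (whole-word match) is not in POSIX but is supported by GNU and BSD grep.

```shell
# Hypothetical sample files in a scratch directory (names are made up).
tmp=$(mktemp -d)
printf '%s\n' Q7L8J4 Q13426 > "$tmp/ids.txt"
cat > "$tmp/records.dat" <<'EOF'
ID   3BP5L_HUMAN             Reviewed;         393 AA.
AC   Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3;
DE   RecName: Full=SH3 domain-binding protein 5-like;
DR   IntAct; Q7L8J4; 2.
//
ID   A16L1_HUMAN             Reviewed;         607 AA.
AC   Q676U5; A3EXK9;
//
EOF

# -F: treat patterns as fixed strings, -w: match whole words only,
# -f: read one pattern per line from the ID file.
matches=$(grep -F -w -f "$tmp/ids.txt" "$tmp/records.dat")
printf '%s\n' "$matches"

rm -rf "$tmp"
```

On this sample only the AC and DR lines that mention Q7L8J4 are printed. The ID line, the DE line and the rest of the record are not, which is why grep alone is not enough when the whole record is wanted.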

So you want to extract everything from an ID entry to the next ID entry if the AC entry in the record contains an "ID" which is present in your file?

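#!/usr/bin/perl

use strict;
use warnings;

# Load the wanted accessions, one per line.
open(my $id_file, "<", "id_file") or die "Cannot open id_file: $!"; # list of ids
chomp(my @ids = <$id_file>);
close $id_file;
my %id_check = map { $_ => 1 } grep { length } @ids;

# Buffer each record and print it at its "//" terminator if any
# accession on its AC line(s) is in the list. Deciding per record
# (rather than per AC line) copes with accessions wrapped over several
# AC lines and stops output leaking into the next record.
open(my $records, "<", "tmp.dat") or die "Cannot open tmp.dat: $!"; # records of the form above
my ($record, $wanted) = ("", 0);
while (<$records>) {
    $record .= $_;
    if (/^AC\s/) {
        for my $id (/\s+([^;\s]+);/g) {
            $wanted = 1 if $id_check{$id};
        }
    }
    if (m{^//}) {
        print $record if $wanted;
        ($record, $wanted) = ("", 0);
    }
}
print $record if $wanted; # last record, in case the final "//" is missing
close $records;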
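The same record-wise filtering can also be sketched in awk. This is only an illustration under stated assumptions: each record ends with a line containing just "//" (as in UniProtKB flat files), the ID file has one accession per line, and the file names ids.txt and records.dat plus the shortened records are made up for the example.

```shell
# Hypothetical sample files in a scratch directory (names are made up).
tmp=$(mktemp -d)
printf '%s\n' Q7L8J4 Q13426 > "$tmp/ids.txt"
cat > "$tmp/records.dat" <<'EOF'
ID   3BP5L_HUMAN             Reviewed;         393 AA.
AC   Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3;
DE   RecName: Full=SH3 domain-binding protein 5-like;
//
ID   A16L1_HUMAN             Reviewed;         607 AA.
AC   Q676U5; A3EXK9; A3EXL0; B6ZDH0; Q6IPN1; Q6UXW4; Q6ZVZ5; Q8NCY2;
AC   Q96JV5; Q9H619;
DE   RecName: Full=Autophagy-related protein 16-1;
//
EOF

selected=$(awk '
    # First file: remember every wanted accession.
    NR == FNR { if (NF) wanted[$1] = 1; next }
    # Second file: buffer the current record line by line.
    { record = record $0 "\n" }
    # AC lines may repeat; check each accession (minus its trailing ";").
    /^AC / {
        for (i = 2; i <= NF; i++) {
            acc = $i
            sub(/;$/, "", acc)
            if (acc in wanted) keep = 1
        }
    }
    # End of record: print the buffer if it matched, then reset.
    /^\/\/$/ {
        if (keep) printf "%s", record
        record = ""
        keep = 0
    }
' "$tmp/ids.txt" "$tmp/records.dat")
printf '%s\n' "$selected"

rm -rf "$tmp"
```

Here only the 3BP5L_HUMAN record is printed, from its ID line through its closing "//". Buffering until the terminator means the decision is made once per record, so an accession that appears on a second AC line still selects the whole record. One caveat of the NR == FNR idiom: it assumes the ID file is not empty.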

Hi all,

Thanks for the replies.

Here is a sample of two records; each record is terminated by "//":

ID   3BP5L_HUMAN             Reviewed;         393 AA.
AC   Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3;
DT   05-FEB-2008, integrated into UniProtKB/Swiss-Prot.
DT   05-JUL-2004, sequence version 1.
DT   05-SEP-2012, entry version 71.
DE   RecName: Full=SH3 domain-binding protein 5-like;
DE            Short=SH3BP-5-like;
GN   Name=SH3BP5L; Synonyms=KIAA1720; ORFNames=UNQ2766/PRO7133;
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC   Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RC   TISSUE=Brain;
RX   MEDLINE=21082932; PubMed=11214970; DOI=10.1093/dnares/7.6.347;
RA   Nagase T., Kikuno R., Hattori A., Kondo Y., Okumura K., Ohara O.;
RT   "Prediction of the coding sequences of unidentified human genes. XIX.
RT   The complete sequences of 100 new cDNA clones from brain which code
RT   for large proteins in vitro.";
RL   DNA Res. 7:347-355(2000).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RC   TISSUE=Amygdala;
RX   MEDLINE=21154917; PubMed=11230166; DOI=10.1101/gr.GR1547R;
RA   Wiemann S., Weil B., Wellenreuther R., Gassenhuber J., Glassl S.,
RA   Ansorge W., Boecher M., Bloecker H., Bauersachs S., Blum H.,
RA   Lauber J., Duesterhoeft A., Beyer A., Koehrer K., Strack N.,
RA   Mewes H.-W., Ottenwaelder B., Obermaier B., Tampe J., Heubner D.,
RA   Wambutt R., Korn B., Klein M., Poustka A.;
RT   "Towards a catalog of human genes and proteins: sequencing and
RT   analysis of 500 novel complete protein coding human cDNAs.";
RL   Genome Res. 11:422-435(2001).
RN   [3]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RX   MEDLINE=22887296; PubMed=12975309; DOI=10.1101/gr.1293003;
RA   Clark H.F., Gurney A.L., Abaya E., Baker K., Baldwin D.T., Brush J.,
RA   Chen J., Chow B., Chui C., Crowley C., Currell B., Deuel B., Dowd P.,
RA   Eaton D., Foster J.S., Grimaldi C., Gu Q., Hass P.E., Heldens S.,
RA   Huang A., Kim H.S., Klimowski L., Jin Y., Johnson S., Lee J.,
RA   Lewis L., Liao D., Mark M.R., Robbie E., Sanchez C., Schoenfeld J.,
RA   Seshagiri S., Simmons L., Singh J., Smith V., Stinson J., Vagts A.,
RA   Vandlen R.L., Watanabe C., Wieand D., Woods K., Xie M.-H.,
RA   Yansura D.G., Yi S., Yu G., Yuan J., Zhang M., Zhang Z., Goddard A.D.,
RA   Wood W.I., Godowski P.J., Gray A.M.;
RT   "The secreted protein discovery initiative (SPDI), a large-scale
RT   effort to identify novel human secreted and transmembrane proteins: a
RT   bioinformatics assessment.";
RL   Genome Res. 13:2265-2270(2003).
RN   [4]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RX   PubMed=14702039; DOI=10.1038/ng1285;
RA   Ota T., Suzuki Y., Nishikawa T., Otsuki T., Sugiyama T., Irie R.,
RA   Wakamatsu A., Hayashi K., Sato H., Nagai K., Kimura K., Makita H.,
RA   Sekine M., Obayashi M., Nishi T., Shibahara T., Tanaka T., Ishii S.,
RA   Yamamoto J., Saito K., Kawai Y., Isono Y., Nakamura Y., Nagahari K.,
RA   Murakami K., Yasuda T., Iwayanagi T., Wagatsuma M., Shiratori A.,
RA   Sudo H., Hosoiri T., Kaku Y., Kodaira H., Kondo H., Sugawara M.,
RA   Takahashi M., Kanda K., Yokoi T., Furuya T., Kikkawa E., Omura Y.,
RA   Abe K., Kamihara K., Katsuta N., Sato K., Tanikawa M., Yamazaki M.,
RA   Ninomiya K., Ishibashi T., Yamashita H., Murakawa K., Fujimori K.,
RA   Tanai H., Kimata M., Watanabe M., Hiraoka S., Chiba Y., Ishida S.,
RA   Ono Y., Takiguchi S., Watanabe S., Yosida M., Hotuta T., Kusano J.,
RA   Kanehori K., Takahashi-Fujii A., Hara H., Tanase T.-O., Nomura Y.,
RA   Togiya S., Komai F., Hara R., Takeuchi K., Arita M., Imose N.,
RA   Musashino K., Yuuki H., Oshima A., Sasaki N., Aotsuka S.,
RA   Yoshikawa Y., Matsunawa H., Ichihara T., Shiohata N., Sano S.,
RA   Moriya S., Momiyama H., Satoh N., Takami S., Terashima Y., Suzuki O.,
RA   Nakagawa S., Senoh A., Mizoguchi H., Goto Y., Shimizu F., Wakebe H.,
RA   Hishigaki H., Watanabe T., Sugiyama A., Takemoto M., Kawakami B.,
RA   Yamazaki M., Watanabe K., Kumagai A., Itakura S., Fukuzumi Y.,
RA   Fujimori Y., Komiyama M., Tashiro H., Tanigami A., Fujiwara T.,
RA   Ono T., Yamada K., Fujii Y., Ozaki K., Hirao M., Ohmori Y.,
RA   Kawabata A., Hikiji T., Kobatake N., Inagaki H., Ikema Y., Okamoto S.,
RA   Okitani R., Kawakami T., Noguchi S., Itoh T., Shigeta K., Senba T.,
RA   Matsumura K., Nakajima Y., Mizuno T., Morinaga M., Sasaki M.,
RA   Togashi T., Oyama M., Hata H., Watanabe M., Komatsu T.,
RA   Mizushima-Sugano J., Satoh T., Shirai Y., Takahashi Y., Nakagawa K.,
RA   Okumura K., Nagase T., Nomura N., Kikuchi H., Masuho Y., Yamashita R.,
RA   Nakai K., Yada T., Nakamura Y., Ohara O., Isogai T., Sugano S.;
RT   "Complete sequencing and characterization of 21,243 full-length human
RT   cDNAs.";
RL   Nat. Genet. 36:40-45(2004).
RN   [5]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX   PubMed=16710414; DOI=10.1038/nature04727;
RA   Gregory S.G., Barlow K.F., McLay K.E., Kaul R., Swarbreck D.,
RA   Dunham A., Scott C.E., Howe K.L., Woodfine K., Spencer C.C.A.,
RA   Jones M.C., Gillson C., Searle S., Zhou Y., Kokocinski F.,
RA   McDonald L., Evans R., Phillips K., Atkinson A., Cooper R., Jones C.,
RA   Hall R.E., Andrews T.D., Lloyd C., Ainscough R., Almeida J.P.,
RA   Ambrose K.D., Anderson F., Andrew R.W., Ashwell R.I.S., Aubin K.,
RA   Babbage A.K., Bagguley C.L., Bailey J., Beasley H., Bethel G.,
RA   Bird C.P., Bray-Allen S., Brown J.Y., Brown A.J., Buckley D.,
RA   Burton J., Bye J., Carder C., Chapman J.C., Clark S.Y., Clarke G.,
RA   Clee C., Cobley V., Collier R.E., Corby N., Coville G.J., Davies J.,
RA   Deadman R., Dunn M., Earthrowl M., Ellington A.G., Errington H.,
RA   Frankish A., Frankland J., French L., Garner P., Garnett J., Gay L.,
RA   Ghori M.R.J., Gibson R., Gilby L.M., Gillett W., Glithero R.J.,
RA   Grafham D.V., Griffiths C., Griffiths-Jones S., Grocock R.,
RA   Hammond S., Harrison E.S.I., Hart E., Haugen E., Heath P.D.,
RA   Holmes S., Holt K., Howden P.J., Hunt A.R., Hunt S.E., Hunter G.,
RA   Isherwood J., James R., Johnson C., Johnson D., Joy A., Kay M.,
RA   Kershaw J.K., Kibukawa M., Kimberley A.M., King A., Knights A.J.,
RA   Lad H., Laird G., Lawlor S., Leongamornlert D.A., Lloyd D.M.,
RA   Loveland J., Lovell J., Lush M.J., Lyne R., Martin S.,
RA   Mashreghi-Mohammadi M., Matthews L., Matthews N.S.W., McLaren S.,
RA   Milne S., Mistry S., Moore M.J.F., Nickerson T., O'Dell C.N.,
RA   Oliver K., Palmeiri A., Palmer S.A., Parker A., Patel D., Pearce A.V.,
RA   Peck A.I., Pelan S., Phelps K., Phillimore B.J., Plumb R., Rajan J.,
RA   Raymond C., Rouse G., Saenphimmachak C., Sehra H.K., Sheridan E.,
RA   Shownkeen R., Sims S., Skuce C.D., Smith M., Steward C.,
RA   Subramanian S., Sycamore N., Tracey A., Tromans A., Van Helmond Z.,
RA   Wall M., Wallis J.M., White S., Whitehead S.L., Wilkinson J.E.,
RA   Willey D.L., Williams H., Wilming L., Wray P.W., Wu Z., Coulson A.,
RA   Vaudin M., Sulston J.E., Durbin R.M., Hubbard T., Wooster R.,
RA   Dunham I., Carter N.P., McVean G., Ross M.T., Harrow J., Olson M.V.,
RA   Beck S., Rogers J., Bentley D.R.;
RT   "The DNA sequence and biological annotation of human chromosome 1.";
RL   Nature 441:315-321(2006).
RN   [6]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RA   Mural R.J., Istrail S., Sutton G.G., Florea L., Halpern A.L.,
RA   Mobarry C.M., Lippert R., Walenz B., Shatkay H., Dew I., Miller J.R.,
RA   Flanigan M.J., Edwards N.J., Bolanos R., Fasulo D., Halldorsson B.V.,
RA   Hannenhalli S., Turner R., Yooseph S., Lu F., Nusskern D.R.,
RA   Shue B.C., Zheng X.H., Zhong F., Delcher A.L., Huson D.H.,
RA   Kravitz S.A., Mouchard L., Reinert K., Remington K.A., Clark A.G.,
RA   Waterman M.S., Eichler E.E., Adams M.D., Hunkapiller M.W., Myers E.W.,
RA   Venter J.C.;
RL   Submitted (JUL-2005) to the EMBL/GenBank/DDBJ databases.
RN   [7]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RC   TISSUE=Colon, and Lung;
RX   PubMed=15489334; DOI=10.1101/gr.2596504;
RG   The MGC Project Team;
RT   "The status, quality, and expansion of the NIH full-length cDNA
RT   project: the Mammalian Gene Collection (MGC).";
RL   Genome Res. 14:2121-2127(2004).
RN   [8]
RP   PHOSPHORYLATION [LARGE SCALE ANALYSIS] AT SER-343; SER-350 AND
RP   SER-362, AND MASS SPECTROMETRY.
RC   TISSUE=Cervix carcinoma;
RX   PubMed=18669648; DOI=10.1073/pnas.0805139105;
RA   Dephoure N., Zhou C., Villen J., Beausoleil S.A., Bakalarski C.E.,
RA   Elledge S.J., Gygi S.P.;
RT   "A quantitative atlas of mitotic phosphorylation.";
RL   Proc. Natl. Acad. Sci. U.S.A. 105:10762-10767(2008).
RN   [9]
RP   PHOSPHORYLATION [LARGE SCALE ANALYSIS] AT SER-362, AND MASS
RP   SPECTROMETRY.
RC   TISSUE=Leukemic T-cell;
RX   PubMed=19690332; DOI=10.1126/scisignal.2000007;
RA   Mayya V., Lundgren D.H., Hwang S.-I., Rezaul K., Wu L., Eng J.K.,
RA   Rodionov V., Han D.K.;
RT   "Quantitative phosphoproteomic analysis of T cell receptor signaling
RT   reveals system-wide modulation of protein-protein interactions.";
RL   Sci. Signal. 2:RA46-RA46(2009).
CC   -!- SIMILARITY: Belongs to the SH3BP5 family.
CC   -!- SEQUENCE CAUTION:
CC       Sequence=BAB21811.1; Type=Erroneous initiation;
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------
DR   EMBL; AB051507; BAB21811.1; ALT_INIT; mRNA.
DR   EMBL; AL136569; CAB66504.1; -; mRNA.
DR   EMBL; AY358453; AAQ88818.1; -; mRNA.
DR   EMBL; AK056382; BAB71171.1; -; mRNA.
DR   EMBL; AL732583; CAI18798.1; -; Genomic_DNA.
DR   EMBL; CH471257; EAW57534.1; -; Genomic_DNA.
DR   EMBL; BC010871; AAH10871.1; -; mRNA.
DR   EMBL; BC017254; AAH17254.1; -; mRNA.
DR   IPI; IPI00028359; -.
DR   RefSeq; NP_085148.1; NM_030645.1.
DR   UniGene; Hs.298573; -.
DR   ProteinModelPortal; Q7L8J4; -.
DR   IntAct; Q7L8J4; 2.
DR   MINT; MINT-1688351; -.
DR   PhosphoSite; Q7L8J4; -.
DR   DMDM; 74749902; -.
DR   PRIDE; Q7L8J4; -.
DR   Ensembl; ENST00000366472; ENSP00000355428; ENSG00000175137.
DR   GeneID; 80851; -.
DR   KEGG; hsa:80851; -.
DR   UCSC; uc001iev.1; human.
DR   CTD; 80851; -.
DR   GeneCards; GC01M249104; -.
DR   H-InvDB; HIX0160026; -.
DR   HGNC; HGNC:29360; SH3BP5L.
DR   HPA; HPA038068; -.
DR   neXtProt; NX_Q7L8J4; -.
DR   PharmGKB; PA142670923; -.
DR   eggNOG; NOG263345; -.
DR   GeneTree; ENSGT00390000018500; -.
DR   HOGENOM; HOG000190360; -.
DR   HOVERGEN; HBG057307; -.
DR   InParanoid; Q7L8J4; -.
DR   OMA; GVRGGRH; -.
DR   OrthoDB; EOG4PZJ78; -.
DR   GenomeRNAi; 80851; -.
DR   NextBio; 71284; -.
DR   ArrayExpress; Q7L8J4; -.
DR   Bgee; Q7L8J4; -.
DR   CleanEx; HS_SH3BP5L; -.
DR   Genevestigator; Q7L8J4; -.
DR   InterPro; IPR007940; SH3-bd_5.
DR   PANTHER; PTHR19423; SH3_bd_5; 1.
DR   Pfam; PF05276; SH3BP5; 1.
PE   1: Evidence at protein level;
KW   Coiled coil; Complete proteome; Phosphoprotein; Reference proteome.
FT   CHAIN         1    393       SH3 domain-binding protein 5-like.
FT                                /FTId=PRO_0000317508.
FT   COILED       59    140       Potential.
FT   COILED      169    272       Potential.
FT   COMPBIAS     37     40       Poly-Gly.
FT   COMPBIAS     41     44       Poly-Ser.
FT   COMPBIAS     52     55       Poly-Glu.
FT   MOD_RES     343    343       Phosphoserine.
FT   MOD_RES     350    350       Phosphoserine.
FT   MOD_RES     362    362       Phosphoserine.
FT   MOD_RES     378    378       Phosphoserine (By similarity).
SQ   SEQUENCE   393 AA;  43499 MW;  3693431765F90FDC CRC64;
     MAELRQVPGG RETPQGELRP EVVEDEVPRS PVAEEPGGGG SSSSEAKLSP REEEELDPRI
     QEELEHLNQA SEEINQVELQ LDEARTTYRR ILQESARKLN TQGSHLGSCI EKARPYYEAR
     RLAKEAQQET QKAALRYERA VSMHNAAREM VFVAEQGVMA DKNRLDPTWQ EMLNHATCKV
     NEAEEERLRG EREHQRVTRL CQQAEARVQA LQKTLRRAIG KSRPYFELKA QFSQILEEHK
     AKVTELEQQV AQAKTRYSVA LRNLEQISEQ IHARRRGGLP PHPLGPRRSS PVGAEAGPED
     MEDGDSGIEG AEGAGLEEGS SLGPGPAPDT DTLSLLSLRT VASDLQKCDS VEHLRGLSDH
     VSLDGQELGT RSGGRRGSDG GARGGRHQRS VSL
//
ID   A16L1_HUMAN             Reviewed;         607 AA.
AC   Q676U5; A3EXK9; A3EXL0; B6ZDH0; Q6IPN1; Q6UXW4; Q6ZVZ5; Q8NCY2;
AC   Q96JV5; Q9H619;
DT   12-APR-2005, integrated into UniProtKB/Swiss-Prot.
DT   12-APR-2005, sequence version 2.
DT   05-SEP-2012, entry version 92.
DE   RecName: Full=Autophagy-related protein 16-1;
DE   AltName: Full=APG16-like 1;
GN   Name=ATG16L1; Synonyms=APG16L; ORFNames=UNQ9393/PRO34307;
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC   Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 1), AND VARIANT ALA-300.
RC   TISSUE=Fetal brain;
RX   PubMed=15620219; DOI=10.1080/10425170400004104;
RA   Zheng H., Ji C., Li J., Jiang H., Ren M., Lu Q., Gu S., Mao Y.,
RA   Xie Y.;
RT   "Cloning and analysis of human Apg16L.";
RL   DNA Seq. 15:303-305(2004).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [MRNA] (ISOFORMS 2 AND 5), AND ASSOCIATION OF
RP   VARIANT ALA-300 WITH SUSCEPTIBILITY TO IBD10.
RX   PubMed=17200669; DOI=10.1038/ng1954;
RA   Hampe J., Franke A., Rosenstiel P., Till A., Teuber M., Huse K.,
RA   Albrecht M., Mayr G., De La Vega F.M., Briggs J., Guenther S.,
RA   Prescott N.J., Onnie C.M., Haesler R., Sipos B., Foelsch U.R.,
RA   Lengauer T., Platzer M., Mathew C.G., Krawczak M., Schreiber S.;
RT   "A genome-wide association scan of nonsynonymous SNPs identifies a
RT   susceptibility variant for Crohn disease in ATG16L1.";
RL   Nat. Genet. 39:207-211(2007).
RN   [3]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA] (ISOFORM 3).
RX   MEDLINE=22887296; PubMed=12975309; DOI=10.1101/gr.1293003;
RA   Clark H.F., Gurney A.L., Abaya E., Baker K., Baldwin D.T., Brush J.,
RA   Chen J., Chow B., Chui C., Crowley C., Currell B., Deuel B., Dowd P.,
RA   Eaton D., Foster J.S., Grimaldi C., Gu Q., Hass P.E., Heldens S.,
RA   Huang A., Kim H.S., Klimowski L., Jin Y., Johnson S., Lee J.,
RA   Lewis L., Liao D., Mark M.R., Robbie E., Sanchez C., Schoenfeld J.,
RA   Seshagiri S., Simmons L., Singh J., Smith V., Stinson J., Vagts A.,
RA   Vandlen R.L., Watanabe C., Wieand D., Woods K., Xie M.-H.,
RA   Yansura D.G., Yi S., Yu G., Yuan J., Zhang M., Zhang Z., Goddard A.D.,
RA   Wood W.I., Godowski P.J., Gray A.M.;
RT   "The secreted protein discovery initiative (SPDI), a large-scale
RT   effort to identify novel human secreted and transmembrane proteins: a
RT   bioinformatics assessment.";
RL   Genome Res. 13:2265-2270(2003).
RN   [4]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA] (ISOFORM 4), AND NUCLEOTIDE
RP   SEQUENCE [LARGE SCALE MRNA] OF 55-607 (ISOFORM 2).
RC   TISSUE=Brain, Placenta, and Small intestine;
RX   PubMed=14702039; DOI=10.1038/ng1285;
RA   Ota T., Suzuki Y., Nishikawa T., Otsuki T., Sugiyama T., Irie R.,
RA   Wakamatsu A., Hayashi K., Sato H., Nagai K., Kimura K., Makita H.,
RA   Sekine M., Obayashi M., Nishi T., Shibahara T., Tanaka T., Ishii S.,
RA   Yamamoto J., Saito K., Kawai Y., Isono Y., Nakamura Y., Nagahari K.,
RA   Murakami K., Yasuda T., Iwayanagi T., Wagatsuma M., Shiratori A.,
RA   Sudo H., Hosoiri T., Kaku Y., Kodaira H., Kondo H., Sugawara M.,
RA   Takahashi M., Kanda K., Yokoi T., Furuya T., Kikkawa E., Omura Y.,
RA   Abe K., Kamihara K., Katsuta N., Sato K., Tanikawa M., Yamazaki M.,
RA   Ninomiya K., Ishibashi T., Yamashita H., Murakawa K., Fujimori K.,
RA   Tanai H., Kimata M., Watanabe M., Hiraoka S., Chiba Y., Ishida S.,
RA   Ono Y., Takiguchi S., Watanabe S., Yosida M., Hotuta T., Kusano J.,
RA   Kanehori K., Takahashi-Fujii A., Hara H., Tanase T.-O., Nomura Y.,
RA   Togiya S., Komai F., Hara R., Takeuchi K., Arita M., Imose N.,
RA   Musashino K., Yuuki H., Oshima A., Sasaki N., Aotsuka S.,
RA   Yoshikawa Y., Matsunawa H., Ichihara T., Shiohata N., Sano S.,
RA   Moriya S., Momiyama H., Satoh N., Takami S., Terashima Y., Suzuki O.,
RA   Nakagawa S., Senoh A., Mizoguchi H., Goto Y., Shimizu F., Wakebe H.,
RA   Hishigaki H., Watanabe T., Sugiyama A., Takemoto M., Kawakami B.,
RA   Yamazaki M., Watanabe K., Kumagai A., Itakura S., Fukuzumi Y.,
RA   Fujimori Y., Komiyama M., Tashiro H., Tanigami A., Fujiwara T.,
RA   Ono T., Yamada K., Fujii Y., Ozaki K., Hirao M., Ohmori Y.,
RA   Kawabata A., Hikiji T., Kobatake N., Inagaki H., Ikema Y., Okamoto S.,
RA   Okitani R., Kawakami T., Noguchi S., Itoh T., Shigeta K., Senba T.,
RA   Matsumura K., Nakajima Y., Mizuno T., Morinaga M., Sasaki M.,
RA   Togashi T., Oyama M., Hata H., Watanabe M., Komatsu T.,
RA   Mizushima-Sugano J., Satoh T., Shirai Y., Takahashi Y., Nakagawa K.,
RA   Okumura K., Nagase T., Nomura N., Kikuchi H., Masuho Y., Yamashita R.,
RA   Nakai K., Yada T., Nakamura Y., Ohara O., Isogai T., Sugano S.;
RT   "Complete sequencing and characterization of 21,243 full-length human
RT   cDNAs.";
RL   Nat. Genet. 36:40-45(2004).
RN   [5]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX   PubMed=15815621; DOI=10.1038/nature03466;
RA   Hillier L.W., Graves T.A., Fulton R.S., Fulton L.A., Pepin K.H.,
RA   Minx P., Wagner-McPherson C., Layman D., Wylie K., Sekhon M.,
RA   Becker M.C., Fewell G.A., Delehaunty K.D., Miner T.L., Nash W.E.,
RA   Kremitzki C., Oddy L., Du H., Sun H., Bradshaw-Cordum H., Ali J.,
RA   Carter J., Cordes M., Harris A., Isak A., van Brunt A., Nguyen C.,
RA   Du F., Courtney L., Kalicki J., Ozersky P., Abbott S., Armstrong J.,
RA   Belter E.A., Caruso L., Cedroni M., Cotton M., Davidson T., Desai A.,
RA   Elliott G., Erb T., Fronick C., Gaige T., Haakenson W., Haglund K.,
RA   Holmes A., Harkins R., Kim K., Kruchowski S.S., Strong C.M.,
RA   Grewal N., Goyea E., Hou S., Levy A., Martinka S., Mead K.,
RA   McLellan M.D., Meyer R., Randall-Maher J., Tomlinson C.,
RA   Dauphin-Kohlberg S., Kozlowicz-Reilly A., Shah N.,
RA   Swearengen-Shahid S., Snider J., Strong J.T., Thompson J., Yoakum M.,
RA   Leonard S., Pearman C., Trani L., Radionenko M., Waligorski J.E.,
RA   Wang C., Rock S.M., Tin-Wollam A.-M., Maupin R., Latreille P.,
RA   Wendl M.C., Yang S.-P., Pohl C., Wallis J.W., Spieth J., Bieri T.A.,
RA   Berkowicz N., Nelson J.O., Osborne J., Ding L., Meyer R., Sabo A.,
RA   Shotland Y., Sinha P., Wohldmann P.E., Cook L.L., Hickenbotham M.T.,
RA   Eldred J., Williams D., Jones T.A., She X., Ciccarelli F.D.,
RA   Izaurralde E., Taylor J., Schmutz J., Myers R.M., Cox D.R., Huang X.,
RA   McPherson J.D., Mardis E.R., Clifton S.W., Warren W.C.,
RA   Chinwalla A.T., Eddy S.R., Marra M.A., Ovcharenko I., Furey T.S.,
RA   Miller W., Eichler E.E., Bork P., Suyama M., Torrents D.,
RA   Waterston R.H., Wilson R.K.;
RT   "Generation and annotation of the DNA sequences of human chromosomes 2
RT   and 4.";
RL   Nature 434:724-731(2005).
RN   [6]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RA   Mural R.J., Istrail S., Sutton G.G., Florea L., Halpern A.L.,
RA   Mobarry C.M., Lippert R., Walenz B., Shatkay H., Dew I., Miller J.R.,
RA   Flanigan M.J., Edwards N.J., Bolanos R., Fasulo D., Halldorsson B.V.,
RA   Hannenhalli S., Turner R., Yooseph S., Lu F., Nusskern D.R.,
RA   Shue B.C., Zheng X.H., Zhong F., Delcher A.L., Huson D.H.,
RA   Kravitz S.A., Mouchard L., Reinert K., Remington K.A., Clark A.G.,
RA   Waterman M.S., Eichler E.E., Adams M.D., Hunkapiller M.W., Myers E.W.,
RA   Venter J.C.;
RL   Submitted (JUL-2005) to the EMBL/GenBank/DDBJ databases.
RN   [7]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA] OF 114-607 (ISOFORM 2).
RC   TISSUE=Mammary gland;
RX   PubMed=15489334; DOI=10.1101/gr.2596504;
RG   The MGC Project Team;
RT   "The status, quality, and expansion of the NIH full-length cDNA
RT   project: the Mammalian Gene Collection (MGC).";
RL   Genome Res. 14:2121-2127(2004).
RN   [8]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA] OF 513-607.
RC   TISSUE=Testis;
RX   PubMed=17974005; DOI=10.1186/1471-2164-8-399;
RA   Bechtel S., Rosenfelder H., Duda A., Schmidt C.P., Ernst U.,
RA   Wellenreuther R., Mehrle A., Schuster C., Bahr A., Bloecker H.,
RA   Heubner D., Hoerlein A., Michel G., Wedler H., Koehrer K.,
RA   Ottenwaelder B., Poustka A., Wiemann S., Schupp I.;
RT   "The full-ORF clone resource of the German cDNA consortium.";
RL   BMC Genomics 8:399-399(2007).
RN   [9]
RP   PHOSPHORYLATION [LARGE SCALE ANALYSIS] AT SER-287; SER-290 AND
RP   SER-304, AND MASS SPECTROMETRY.
RC   TISSUE=Cervix carcinoma;
RX   PubMed=17924679; DOI=10.1021/pr070152u;
RA   Yu L.-R., Zhu Z., Chan K.C., Issaq H.J., Dimitrov D.S., Veenstra T.D.;
RT   "Improved titanium dioxide enrichment of phosphopeptides from HeLa
RT   cells and high confident phosphopeptide identification by cross-
RT   validation of MS/MS and MS/MS/MS spectra.";
RL   J. Proteome Res. 6:4150-4162(2007).
RN   [10]
RP   PHOSPHORYLATION [LARGE SCALE ANALYSIS] AT SER-287, AND MASS
RP   SPECTROMETRY.
RC   TISSUE=Cervix carcinoma;
RX   PubMed=18669648; DOI=10.1073/pnas.0805139105;
RA   Dephoure N., Zhou C., Villen J., Beausoleil S.A., Bakalarski C.E.,
RA   Elledge S.J., Gygi S.P.;
RT   "A quantitative atlas of mitotic phosphorylation.";
RL   Proc. Natl. Acad. Sci. U.S.A. 105:10762-10767(2008).
RN   [11]
RP   ASSOCIATION OF VARIANT ALA-300 WITH SUSCEPTIBILITY TO IBD10.
RX   PubMed=17435756; DOI=10.1038/ng2032;
RA   Rioux J.D., Xavier R.J., Taylor K.D., Silverberg M.S., Goyette P.,
RA   Huett A., Green T., Kuballa P., Barmada M.M., Datta L.W.,
RA   Shugart Y.Y., Griffiths A.M., Targan S.R., Ippoliti A.F.,
RA   Bernard E.-J., Mei L., Nicolae D.L., Regueiro M., Schumm L.P.,
RA   Steinhart A.H., Rotter J.I., Duerr R.H., Cho J.H., Daly M.J.,
RA   Brant S.R.;
RT   "Genome-wide association study identifies new susceptibility loci for
RT   Crohn disease and implicates autophagy in disease pathogenesis.";
RL   Nat. Genet. 39:596-604(2007).
CC   -!- FUNCTION: Plays an essential role in autophagy (By similarity).
CC   -!- SUBUNIT: Homooligomer. Interacts with ATG5. Part of either the
CC       minor and major complexes respectively composed of 4 sets of
CC       ATG12-ATG5 and ATG16L1 (400 kDa) or 8 sets of ATG12-ATG5 and
CC       ATG16L1 (800 kDa) (By similarity).
CC   -!- INTERACTION:
CC       Q9GZQ8:MAP1LC3B; NbExp=2; IntAct=EBI-535909, EBI-373144;
CC       Q9BXW4:MAP1LC3C; NbExp=4; IntAct=EBI-535909, EBI-2603996;
CC   -!- SUBCELLULAR LOCATION: Cytoplasm (By similarity). Preautophagosomal
CC       structure membrane; Peripheral membrane protein (By similarity).
CC       Note=Localized to preautophagosomal structure (PAS) where it is
CC       involved in the membrane targeting of ATG5 (By similarity).
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=5;
CC       Name=1; Synonyms=APG16L beta;
CC         IsoId=Q676U5-1; Sequence=Displayed;
CC       Name=2;
CC         IsoId=Q676U5-2; Sequence=VSP_013386;
CC         Note=May be produced at very low levels due to a premature stop
CC         codon in the mRNA, leading to nonsense-mediated mRNA decay;
CC       Name=3;
CC         IsoId=Q676U5-3; Sequence=VSP_013387, VSP_013388;
CC         Note=No experimental confirmation available;
CC       Name=4;
CC         IsoId=Q676U5-4; Sequence=VSP_013389, VSP_013390;
CC         Note=No experimental confirmation available;
CC       Name=5;
CC         IsoId=Q676U5-5; Sequence=VSP_013389, VSP_013386;
CC         Note=No experimental confirmation available;
CC   -!- DISEASE: Genetic variations in ATG16L1 are associated with
CC       susceptibility to inflammatory bowel disease type 10 (IBD10)
CC       [MIM:611081]. IBD is characterized by a chronic relapsing
CC       intestinal inflammation. IBD is subdivided into Crohn disease (CD)
CC       and ulcerative colitis phenotypes. IBD10 individuals show the
CC       phenotype characteristic to CD. It may involve any part of the
CC       gastrointestinal tract, but most frequently the terminal ileum and
CC       colon. CD is commonly classified as autoimmune disease.
CC   -!- SIMILARITY: Belongs to the WD repeat ATG16 family.
CC   -!- SIMILARITY: Contains 7 WD repeats.
CC   -!- SEQUENCE CAUTION:
CC       Sequence=BAB15448.1; Type=Erroneous translation; Note=Wrong choice of CDS;
CC       Sequence=BAB55412.1; Type=Erroneous initiation;
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------
DR   EMBL; AY398617; AAR32130.1; -; mRNA.
DR   EMBL; EF079889; ABN48554.1; -; mRNA.
DR   EMBL; EF079890; ABN48555.1; -; mRNA.
DR   EMBL; AY358182; AAQ88549.1; -; mRNA.
DR   EMBL; AK026330; BAB15448.1; ALT_SEQ; mRNA.
DR   EMBL; AK027854; BAB55412.1; ALT_INIT; mRNA.
DR   EMBL; AK123876; BAC85713.1; -; mRNA.
DR   EMBL; AC013726; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   EMBL; CH471063; EAW71034.1; -; Genomic_DNA.
DR   EMBL; BC071846; AAH71846.1; -; mRNA.
DR   EMBL; AL834526; CAD39182.1; -; mRNA.
DR   IPI; IPI00432751; -.
DR   IPI; IPI00446614; -.
DR   IPI; IPI00470446; -.
DR   IPI; IPI00555905; -.
DR   IPI; IPI00797150; -.
DR   RefSeq; NP_001177195.1; NM_001190266.1.
DR   RefSeq; NP_001177196.1; NM_001190267.1.
DR   RefSeq; NP_060444.3; NM_017974.3.
DR   RefSeq; NP_110430.5; NM_030803.6.
DR   RefSeq; NP_942593.2; NM_198890.2.
DR   UniGene; Hs.529322; -.
DR   ProteinModelPortal; Q676U5; -.
DR   SMR; Q676U5; 310-606.
DR   DIP; DIP-27552N; -.
DR   IntAct; Q676U5; 20.
DR   MINT; MINT-1141152; -.
DR   STRING; Q676U5; -.
DR   PhosphoSite; Q676U5; -.
DR   DMDM; 62510482; -.
DR   PRIDE; Q676U5; -.
DR   DNASU; 55054; -.
DR   Ensembl; ENST00000347464; ENSP00000318259; ENSG00000085978.
DR   Ensembl; ENST00000373525; ENSP00000362625; ENSG00000085978.
DR   Ensembl; ENST00000392017; ENSP00000375872; ENSG00000085978.
DR   Ensembl; ENST00000392020; ENSP00000375875; ENSG00000085978.
DR   GeneID; 55054; -.
DR   KEGG; hsa:55054; -.
DR   UCSC; uc002vty.3; human.
DR   UCSC; uc002vtz.3; human.
DR   UCSC; uc002vua.3; human.
DR   CTD; 55054; -.
DR   GeneCards; GC02P234118; -.
DR   HGNC; HGNC:21498; ATG16L1.
DR   HPA; HPA012577; -.
DR   MIM; 610767; gene.
DR   MIM; 611081; phenotype.
DR   neXtProt; NX_Q676U5; -.
DR   Orphanet; 206; Crohn disease.
DR   PharmGKB; PA134902949; -.
DR   eggNOG; COG2319; -.
DR   GeneTree; ENSGT00670000097918; -.
DR   HOGENOM; HOG000112569; -.
DR   HOVERGEN; HBG050534; -.
DR   OrthoDB; EOG4SXNC8; -.
DR   GenomeRNAi; 55054; -.
DR   NextBio; 58531; -.
DR   ArrayExpress; Q676U5; -.
DR   Bgee; Q676U5; -.
DR   CleanEx; HS_ATG16L1; -.
DR   Genevestigator; Q676U5; -.
DR   GermOnline; ENSG00000085978; Homo sapiens.
DR   GO; GO:0005776; C:autophagic vacuole; ISS:UniProtKB.
DR   GO; GO:0034045; C:pre-autophagosomal structure membrane; IEA:UniProtKB-SubCell.
DR   GO; GO:0000045; P:autophagic vacuole assembly; NAS:UniProtKB.
DR   GO; GO:0051260; P:protein homooligomerization; NAS:UniProtKB.
DR   GO; GO:0015031; P:protein transport; IEA:UniProtKB-KW.
DR   Gene3D; G3DSA:2.130.10.10; WD40/YVTN_repeat-like; 2.
DR   InterPro; IPR013923; Autophagy-rel_prot_16.
DR   InterPro; IPR020472; G-protein_beta_WD-40_rep.
DR   InterPro; IPR015943; WD40/YVTN_repeat-like_dom.
DR   InterPro; IPR001680; WD40_repeat.
DR   InterPro; IPR019775; WD40_repeat_CS.
DR   InterPro; IPR017986; WD40_repeat_dom.
DR   Pfam; PF08614; ATG16; 1.
DR   Pfam; PF00400; WD40; 5.
DR   PRINTS; PR00320; GPROTEINBRPT.
DR   SMART; SM00320; WD40; 7.
DR   SUPFAM; SSF50978; WD40_like; 1.
DR   PROSITE; PS00678; WD_REPEATS_1; 3.
DR   PROSITE; PS50082; WD_REPEATS_2; 6.
DR   PROSITE; PS50294; WD_REPEATS_REGION; 1.
PE   1: Evidence at protein level;
KW   Alternative splicing; Autophagy; Coiled coil; Complete proteome;
KW   Cytoplasm; Membrane; Phosphoprotein; Polymorphism; Protein transport;
KW   Reference proteome; Repeat; Transport; WD repeat.
FT   CHAIN         1    607       Autophagy-related protein 16-1.
FT                                /FTId=PRO_0000050848.
FT   REPEAT      320    359       WD 1.
FT   REPEAT      364    403       WD 2.
FT   REPEAT      406    445       WD 3.
FT   REPEAT      447    484       WD 4.
FT   REPEAT      486    525       WD 5.
FT   REPEAT      532    573       WD 6.
FT   REPEAT      575    607       WD 7.
FT   COILED       78    230       Potential.
FT   MOD_RES     287    287       Phosphoserine.
FT   MOD_RES     289    289       Phosphoserine (By similarity).
FT   MOD_RES     290    290       Phosphoserine.
FT   MOD_RES     304    304       Phosphoserine.
FT   VAR_SEQ      70    213       Missing (in isoform 4 and isoform 5).
FT                                /FTId=VSP_013389.
FT   VAR_SEQ     266    284       Missing (in isoform 2 and isoform 5).
FT                                /FTId=VSP_013386.
FT   VAR_SEQ     334    368       Missing (in isoform 4).
FT                                /FTId=VSP_013390.
FT   VAR_SEQ     443    470       IKTVFAGSSCNDIVCTEQCVMSGHFDKK -> EEIQSLCLC
FT                                ICLDVSVEVCVCTSEPAFM (in isoform 3).
FT                                /FTId=VSP_013387.
FT   VAR_SEQ     471    607       Missing (in isoform 3).
FT                                /FTId=VSP_013388.
FT   VARIANT     300    300       T -> A (associated with susceptibility to
FT                                IBD10; dbSNP:rs2241880).
FT                                /FTId=VAR_021834.
FT   VARIANT     307    307       E -> K (in dbSNP:rs1866878).
FT                                /FTId=VAR_053386.
FT   CONFLICT    151    151       K -> R (in Ref. 6; BAB55412).
FT   CONFLICT    328    328       V -> A (in Ref. 6; BAB55412).
FT   CONFLICT    529    529       P -> T (in Ref. 6; BAB55412).
SQ   SEQUENCE   607 AA;  68265 MW;  5A5816AE2CF03CA0 CRC64;
     MSSGLRAADF PRWKRHISEQ LRRRDRLQRQ AFEEIILQYN KLLEKSDLHS VLAQKLQAEK
     HDVPNRHEIS PGHDGTWNDN QLQEMAQLRI KHQEELTELH KKRGELAQLV IDLNNQMQRK
     DREMQMNEAK IAECLQTISD LETECLDLRT KLCDLERANQ TLKDEYDALQ ITFTALEGKL
     RKTTEENQEL VTRWMAEKAQ EANRLNAENE KDSRRRQARL QKELAEAAKE PLPVEQDDDI
     EVIVDETSDH TEETSPVRAI SRAATKRLSQ PAGGLLDSIT NIFGRRSVSS FPVPQDNVDT
     HPGSGKEVRV PATALCVFDA HDGEVNAVQF SPGSRLLATG GMDRRVKLWE VFGEKCEFKG
     SLSGSNAGIT SIEFDSAGSY LLAASNDFAS RIWTVDDYRL RHTLTGHSGK VLSAKFLLDN
     ARIVSGSHDR TLKLWDLRSK VCIKTVFAGS SCNDIVCTEQ CVMSGHFDKK IRFWDIRSES
     IVREMELLGK ITALDLNPER TELLSCSRDD LLKVIDLRTN AIKQTFSAPG FKCGSDWTRV
     VFSPDGSYVA AGSAEGSLYI WSVLTGKVEK VLSKQHSSSI NAVAWSPSGS HVVSVDKGCK
     AVLWAQY
//

The expected output for one record, when a matching entry such as "Q7L8J4" is found, is:

ID   3BP5L_HUMAN             Reviewed;         393 AA.
AC   Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3;
DT   05-FEB-2008, integrated into UniProtKB/Swiss-Prot.
DT   05-JUL-2004, sequence version 1.
DT   05-SEP-2012, entry version 71.
DE   RecName: Full=SH3 domain-binding protein 5-like;
DE            Short=SH3BP-5-like;
GN   Name=SH3BP5L; Synonyms=KIAA1720; ORFNames=UNQ2766/PRO7133;
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC   Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RC   TISSUE=Brain;
RX   MEDLINE=21082932; PubMed=11214970; DOI=10.1093/dnares/7.6.347;
RA   Nagase T., Kikuno R., Hattori A., Kondo Y., Okumura K., Ohara O.;
RT   "Prediction of the coding sequences of unidentified human genes. XIX.
RT   The complete sequences of 100 new cDNA clones from brain which code
RT   for large proteins in vitro.";
RL   DNA Res. 7:347-355(2000).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RC   TISSUE=Amygdala;
RX   MEDLINE=21154917; PubMed=11230166; DOI=10.1101/gr.GR1547R;
RA   Wiemann S., Weil B., Wellenreuther R., Gassenhuber J., Glassl S.,
RA   Ansorge W., Boecher M., Bloecker H., Bauersachs S., Blum H.,
RA   Lauber J., Duesterhoeft A., Beyer A., Koehrer K., Strack N.,
RA   Mewes H.-W., Ottenwaelder B., Obermaier B., Tampe J., Heubner D.,
RA   Wambutt R., Korn B., Klein M., Poustka A.;
RT   "Towards a catalog of human genes and proteins: sequencing and
RT   analysis of 500 novel complete protein coding human cDNAs.";
RL   Genome Res. 11:422-435(2001).
RN   [3]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RX   MEDLINE=22887296; PubMed=12975309; DOI=10.1101/gr.1293003;
RA   Clark H.F., Gurney A.L., Abaya E., Baker K., Baldwin D.T., Brush J.,
RA   Chen J., Chow B., Chui C., Crowley C., Currell B., Deuel B., Dowd P.,
RA   Eaton D., Foster J.S., Grimaldi C., Gu Q., Hass P.E., Heldens S.,
RA   Huang A., Kim H.S., Klimowski L., Jin Y., Johnson S., Lee J.,
RA   Lewis L., Liao D., Mark M.R., Robbie E., Sanchez C., Schoenfeld J.,
RA   Seshagiri S., Simmons L., Singh J., Smith V., Stinson J., Vagts A.,
RA   Vandlen R.L., Watanabe C., Wieand D., Woods K., Xie M.-H.,
RA   Yansura D.G., Yi S., Yu G., Yuan J., Zhang M., Zhang Z., Goddard A.D.,
RA   Wood W.I., Godowski P.J., Gray A.M.;
RT   "The secreted protein discovery initiative (SPDI), a large-scale
RT   effort to identify novel human secreted and transmembrane proteins: a
RT   bioinformatics assessment.";
RL   Genome Res. 13:2265-2270(2003).
RN   [4]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RX   PubMed=14702039; DOI=10.1038/ng1285;
RA   Ota T., Suzuki Y., Nishikawa T., Otsuki T., Sugiyama T., Irie R.,
RA   Wakamatsu A., Hayashi K., Sato H., Nagai K., Kimura K., Makita H.,
RA   Sekine M., Obayashi M., Nishi T., Shibahara T., Tanaka T., Ishii S.,
RA   Yamamoto J., Saito K., Kawai Y., Isono Y., Nakamura Y., Nagahari K.,
RA   Murakami K., Yasuda T., Iwayanagi T., Wagatsuma M., Shiratori A.,
RA   Sudo H., Hosoiri T., Kaku Y., Kodaira H., Kondo H., Sugawara M.,
RA   Takahashi M., Kanda K., Yokoi T., Furuya T., Kikkawa E., Omura Y.,
RA   Abe K., Kamihara K., Katsuta N., Sato K., Tanikawa M., Yamazaki M.,
RA   Ninomiya K., Ishibashi T., Yamashita H., Murakawa K., Fujimori K.,
RA   Tanai H., Kimata M., Watanabe M., Hiraoka S., Chiba Y., Ishida S.,
RA   Ono Y., Takiguchi S., Watanabe S., Yosida M., Hotuta T., Kusano J.,
RA   Kanehori K., Takahashi-Fujii A., Hara H., Tanase T.-O., Nomura Y.,
RA   Togiya S., Komai F., Hara R., Takeuchi K., Arita M., Imose N.,
RA   Musashino K., Yuuki H., Oshima A., Sasaki N., Aotsuka S.,
RA   Yoshikawa Y., Matsunawa H., Ichihara T., Shiohata N., Sano S.,
RA   Moriya S., Momiyama H., Satoh N., Takami S., Terashima Y., Suzuki O.,
RA   Nakagawa S., Senoh A., Mizoguchi H., Goto Y., Shimizu F., Wakebe H.,
RA   Hishigaki H., Watanabe T., Sugiyama A., Takemoto M., Kawakami B.,
RA   Yamazaki M., Watanabe K., Kumagai A., Itakura S., Fukuzumi Y.,
RA   Fujimori Y., Komiyama M., Tashiro H., Tanigami A., Fujiwara T.,
RA   Ono T., Yamada K., Fujii Y., Ozaki K., Hirao M., Ohmori Y.,
RA   Kawabata A., Hikiji T., Kobatake N., Inagaki H., Ikema Y., Okamoto S.,
RA   Okitani R., Kawakami T., Noguchi S., Itoh T., Shigeta K., Senba T.,
RA   Matsumura K., Nakajima Y., Mizuno T., Morinaga M., Sasaki M.,
RA   Togashi T., Oyama M., Hata H., Watanabe M., Komatsu T.,
RA   Mizushima-Sugano J., Satoh T., Shirai Y., Takahashi Y., Nakagawa K.,
RA   Okumura K., Nagase T., Nomura N., Kikuchi H., Masuho Y., Yamashita R.,
RA   Nakai K., Yada T., Nakamura Y., Ohara O., Isogai T., Sugano S.;
RT   "Complete sequencing and characterization of 21,243 full-length human
RT   cDNAs.";
RL   Nat. Genet. 36:40-45(2004).
RN   [5]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX   PubMed=16710414; DOI=10.1038/nature04727;
RA   Gregory S.G., Barlow K.F., McLay K.E., Kaul R., Swarbreck D.,
RA   Dunham A., Scott C.E., Howe K.L., Woodfine K., Spencer C.C.A.,
RA   Jones M.C., Gillson C., Searle S., Zhou Y., Kokocinski F.,
RA   McDonald L., Evans R., Phillips K., Atkinson A., Cooper R., Jones C.,
RA   Hall R.E., Andrews T.D., Lloyd C., Ainscough R., Almeida J.P.,
RA   Ambrose K.D., Anderson F., Andrew R.W., Ashwell R.I.S., Aubin K.,
RA   Babbage A.K., Bagguley C.L., Bailey J., Beasley H., Bethel G.,
RA   Bird C.P., Bray-Allen S., Brown J.Y., Brown A.J., Buckley D.,
RA   Burton J., Bye J., Carder C., Chapman J.C., Clark S.Y., Clarke G.,
RA   Clee C., Cobley V., Collier R.E., Corby N., Coville G.J., Davies J.,
RA   Deadman R., Dunn M., Earthrowl M., Ellington A.G., Errington H.,
RA   Frankish A., Frankland J., French L., Garner P., Garnett J., Gay L.,
RA   Ghori M.R.J., Gibson R., Gilby L.M., Gillett W., Glithero R.J.,
RA   Grafham D.V., Griffiths C., Griffiths-Jones S., Grocock R.,
RA   Hammond S., Harrison E.S.I., Hart E., Haugen E., Heath P.D.,
RA   Holmes S., Holt K., Howden P.J., Hunt A.R., Hunt S.E., Hunter G.,
RA   Isherwood J., James R., Johnson C., Johnson D., Joy A., Kay M.,
RA   Kershaw J.K., Kibukawa M., Kimberley A.M., King A., Knights A.J.,
RA   Lad H., Laird G., Lawlor S., Leongamornlert D.A., Lloyd D.M.,
RA   Loveland J., Lovell J., Lush M.J., Lyne R., Martin S.,
RA   Mashreghi-Mohammadi M., Matthews L., Matthews N.S.W., McLaren S.,
RA   Milne S., Mistry S., Moore M.J.F., Nickerson T., O'Dell C.N.,
RA   Oliver K., Palmeiri A., Palmer S.A., Parker A., Patel D., Pearce A.V.,
RA   Peck A.I., Pelan S., Phelps K., Phillimore B.J., Plumb R., Rajan J.,
RA   Raymond C., Rouse G., Saenphimmachak C., Sehra H.K., Sheridan E.,
RA   Shownkeen R., Sims S., Skuce C.D., Smith M., Steward C.,
RA   Subramanian S., Sycamore N., Tracey A., Tromans A., Van Helmond Z.,
RA   Wall M., Wallis J.M., White S., Whitehead S.L., Wilkinson J.E.,
RA   Willey D.L., Williams H., Wilming L., Wray P.W., Wu Z., Coulson A.,
RA   Vaudin M., Sulston J.E., Durbin R.M., Hubbard T., Wooster R.,
RA   Dunham I., Carter N.P., McVean G., Ross M.T., Harrow J., Olson M.V.,
RA   Beck S., Rogers J., Bentley D.R.;
RT   "The DNA sequence and biological annotation of human chromosome 1.";
RL   Nature 441:315-321(2006).
RN   [6]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RA   Mural R.J., Istrail S., Sutton G.G., Florea L., Halpern A.L.,
RA   Mobarry C.M., Lippert R., Walenz B., Shatkay H., Dew I., Miller J.R.,
RA   Flanigan M.J., Edwards N.J., Bolanos R., Fasulo D., Halldorsson B.V.,
RA   Hannenhalli S., Turner R., Yooseph S., Lu F., Nusskern D.R.,
RA   Shue B.C., Zheng X.H., Zhong F., Delcher A.L., Huson D.H.,
RA   Kravitz S.A., Mouchard L., Reinert K., Remington K.A., Clark A.G.,
RA   Waterman M.S., Eichler E.E., Adams M.D., Hunkapiller M.W., Myers E.W.,
RA   Venter J.C.;
RL   Submitted (JUL-2005) to the EMBL/GenBank/DDBJ databases.
RN   [7]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RC   TISSUE=Colon, and Lung;
RX   PubMed=15489334; DOI=10.1101/gr.2596504;
RG   The MGC Project Team;
RT   "The status, quality, and expansion of the NIH full-length cDNA
RT   project: the Mammalian Gene Collection (MGC).";
RL   Genome Res. 14:2121-2127(2004).
RN   [8]
RP   PHOSPHORYLATION [LARGE SCALE ANALYSIS] AT SER-343; SER-350 AND
RP   SER-362, AND MASS SPECTROMETRY.
RC   TISSUE=Cervix carcinoma;
RX   PubMed=18669648; DOI=10.1073/pnas.0805139105;
RA   Dephoure N., Zhou C., Villen J., Beausoleil S.A., Bakalarski C.E.,
RA   Elledge S.J., Gygi S.P.;
RT   "A quantitative atlas of mitotic phosphorylation.";
RL   Proc. Natl. Acad. Sci. U.S.A. 105:10762-10767(2008).
RN   [9]
RP   PHOSPHORYLATION [LARGE SCALE ANALYSIS] AT SER-362, AND MASS
RP   SPECTROMETRY.
RC   TISSUE=Leukemic T-cell;
RX   PubMed=19690332; DOI=10.1126/scisignal.2000007;
RA   Mayya V., Lundgren D.H., Hwang S.-I., Rezaul K., Wu L., Eng J.K.,
RA   Rodionov V., Han D.K.;
RT   "Quantitative phosphoproteomic analysis of T cell receptor signaling
RT   reveals system-wide modulation of protein-protein interactions.";
RL   Sci. Signal. 2:RA46-RA46(2009).
CC   -!- SIMILARITY: Belongs to the SH3BP5 family.
CC   -!- SEQUENCE CAUTION:
CC       Sequence=BAB21811.1; Type=Erroneous initiation;
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------
DR   EMBL; AB051507; BAB21811.1; ALT_INIT; mRNA.
DR   EMBL; AL136569; CAB66504.1; -; mRNA.
DR   EMBL; AY358453; AAQ88818.1; -; mRNA.
DR   EMBL; AK056382; BAB71171.1; -; mRNA.
DR   EMBL; AL732583; CAI18798.1; -; Genomic_DNA.
DR   EMBL; CH471257; EAW57534.1; -; Genomic_DNA.
DR   EMBL; BC010871; AAH10871.1; -; mRNA.
DR   EMBL; BC017254; AAH17254.1; -; mRNA.
DR   IPI; IPI00028359; -.
DR   RefSeq; NP_085148.1; NM_030645.1.
DR   UniGene; Hs.298573; -.
DR   ProteinModelPortal; Q7L8J4; -.
DR   IntAct; Q7L8J4; 2.
DR   MINT; MINT-1688351; -.
DR   PhosphoSite; Q7L8J4; -.
DR   DMDM; 74749902; -.
DR   PRIDE; Q7L8J4; -.
DR   Ensembl; ENST00000366472; ENSP00000355428; ENSG00000175137.
DR   GeneID; 80851; -.
DR   KEGG; hsa:80851; -.
DR   UCSC; uc001iev.1; human.
DR   CTD; 80851; -.
DR   GeneCards; GC01M249104; -.
DR   H-InvDB; HIX0160026; -.
DR   HGNC; HGNC:29360; SH3BP5L.
DR   HPA; HPA038068; -.
DR   neXtProt; NX_Q7L8J4; -.
DR   PharmGKB; PA142670923; -.
DR   eggNOG; NOG263345; -.
DR   GeneTree; ENSGT00390000018500; -.
DR   HOGENOM; HOG000190360; -.
DR   HOVERGEN; HBG057307; -.
DR   InParanoid; Q7L8J4; -.
DR   OMA; GVRGGRH; -.
DR   OrthoDB; EOG4PZJ78; -.
DR   GenomeRNAi; 80851; -.
DR   NextBio; 71284; -.
DR   ArrayExpress; Q7L8J4; -.
DR   Bgee; Q7L8J4; -.
DR   CleanEx; HS_SH3BP5L; -.
DR   Genevestigator; Q7L8J4; -.
DR   InterPro; IPR007940; SH3-bd_5.
DR   PANTHER; PTHR19423; SH3_bd_5; 1.
DR   Pfam; PF05276; SH3BP5; 1.
PE   1: Evidence at protein level;
KW   Coiled coil; Complete proteome; Phosphoprotein; Reference proteome.
FT   CHAIN         1    393       SH3 domain-binding protein 5-like.
FT                                /FTId=PRO_0000317508.
FT   COILED       59    140       Potential.
FT   COILED      169    272       Potential.
FT   COMPBIAS     37     40       Poly-Gly.
FT   COMPBIAS     41     44       Poly-Ser.
FT   COMPBIAS     52     55       Poly-Glu.
FT   MOD_RES     343    343       Phosphoserine.
FT   MOD_RES     350    350       Phosphoserine.
FT   MOD_RES     362    362       Phosphoserine.
FT   MOD_RES     378    378       Phosphoserine (By similarity).
SQ   SEQUENCE   393 AA;  43499 MW;  3693431765F90FDC CRC64;
     MAELRQVPGG RETPQGELRP EVVEDEVPRS PVAEEPGGGG SSSSEAKLSP REEEELDPRI
     QEELEHLNQA SEEINQVELQ LDEARTTYRR ILQESARKLN TQGSHLGSCI EKARPYYEAR
     RLAKEAQQET QKAALRYERA VSMHNAAREM VFVAEQGVMA DKNRLDPTWQ EMLNHATCKV
     NEAEEERLRG EREHQRVTRL CQQAEARVQA LQKTLRRAIG KSRPYFELKA QFSQILEEHK
     AKVTELEQQV AQAKTRYSVA LRNLEQISEQ IHARRRGGLP PHPLGPRRSS PVGAEAGPED
     MEDGDSGIEG AEGAGLEEGS SLGPGPAPDT DTLSLLSLRT VASDLQKCDS VEHLRGLSDH
     VSLDGQELGT RSGGRRGSDG GARGGRHQRS VSL
//

Right now I am receiving the following error. Here kaavya.pl contains the following program:

#!/usr/bin/perl

use strict;
use warnings;

open(my $id_file, "<", "id_file"); # list of ids
my $in_record=0;
my @ids=<$id_file>;
close $id_file;
chomp(@ids);
my %id_check;
map {$_++} @id_check{@ids};
open(my $records, "<", "tmp.dat"); # records of the form above
my $head;
while(<$records>){
    $head=$_ if (/^ID/);
    if (/^AC/){
        $in_record=0;
        my @entries=$_=~/\s+([^;]+);/g;
        for my$id(@entries){
            $in_record=1 if ($id_check{$id});
        }
    print $head if $in_record;
    }
print if $in_record;
}
bash-3.2$ perl kaavya.pl
readline() on closed filehandle $records at kaavya.pl line 15.
bash-3.2$

Hi again Manigrover,

Have you copied the records to tmp.dat (or changed the names used in the open statements within the script)?

You can also modify the script to report a failure to open the files, as follows:

#!/usr/bin/perl

use strict;
use warnings;

open(my $id_file, '<', 'id_file')|| die "Could not open id_file\n\t$!";
my $in_record=0;
my @ids=<$id_file>;
close $id_file;
chomp(@ids);
my %id_check;
map {$_++} @id_check{@ids};
open(my $records, '<', 'tmp.dat')|| die "Could not open tmp.dat\n\t$!";
my $head;
while(<$records>){
    $head=$_ if (/^ID/);
    if (/^AC/){
        $in_record=0;
        my @entries=$_=~/\s+([^;]+);/g;
        for my $id(@entries){
            $in_record=1 if ($id_check{$id});
        }
    print $head if $in_record;
    }
print if $in_record;
}
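
The "readline() on closed filehandle" message reported earlier is what Perl prints when an open fails and the script carries on reading anyway. As a minimal sketch of the same idea as the die checks above, the helper below stops at the failing open and names the file. The sub name open_or_die and the file name no_such_file.dat are made up for illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): open a file for reading,
# or die with the file name and the system error held in $!.
sub open_or_die {
    my ($path) = @_;
    open(my $fh, '<', $path)
        or die "Could not open $path: $!\n";
    return $fh;
}

# Usage: eval traps the die so the message can be shown without
# ending the script. no_such_file.dat is assumed not to exist.
my $fh = eval { open_or_die('no_such_file.dat') };
print "open failed: $@" unless $fh;
```

Because "or" binds more loosely than "||", the "or die" form also works when the parentheses around the arguments to open are left out; with "||" the parentheses are required, as in the script above.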

Thank you

---------- Post updated at 04:21 AM ---------- Previous update was at 04:07 AM ----------

Hi Skrynesaver,

I am having a problem with the output. For every alternate record, the information is missing. An example of my output is given below: the information for AC Q8IZP0 is not printed, and the output goes straight to the next record. This happens for every alternate record.

ID   ABI1_HUMAN              Reviewed;         508 AA.
ID   ABI1_HUMAN              Reviewed;         508 AA.
AC   Q8IZP0; A9Z1Y6; B4DQ58; O15147; O76049; O95060; Q5T2R3; Q5T2R4;
ID   ABI3_HUMAN              Reviewed;         366 AA.
AC   Q9P2A4; C9IZN8; Q9H0P6;
DT   19-JUL-2004, integrated into UniProtKB/Swiss-Prot.
DT   18-MAY-2010, sequence version 2.
DT   05-SEP-2012, entry version 93.
DE   RecName: Full=ABI gene family member 3;
DE   AltName: Full=New molecule including SH3;
DE            Short=Nesh;
GN   Name=ABI3; Synonyms=NESH;

Ah, my fault: I based the logic on the first record posted, but a record can have multiple AC lines. If I am correct in assuming that only one ID line is possible per record, the following should work.

#!/usr/bin/perl

use strict;
use warnings;

open(my $id_file, '<', 'id_file')|| die "Could not open id_file\n\t$!";
my $in_record=0;
my @ids=<$id_file>;
close $id_file;
chomp(@ids);
my %id_check;
map {$_++} @id_check{@ids};
open(my $records, '<', 'tmp.dat')|| die "Could not open tmp.dat\n\t$!";
my $head;
while(<$records>){
    if (/^ID/){
        $head=$_
        $in_record=0;
    }
    if (/^AC/){
        my @entries=$_=~/\s+([^;]+);/g;
        for my $id(@entries){
            $in_record=1 if ($id_check{$id});
        }
    print $head if $in_record;
    }
    print if $in_record;
}
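
The difference between this version and the previous one is where $in_record is reset. The following is only a rough sketch of that difference, not the script itself: kept_lines is an illustrative helper, the second AC line is made up, and the ID line handling ($head) is left out to keep it short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract every accession from one AC line (same regex as the script).
sub accessions {
    my ($line) = @_;
    return $line =~ /\s+([^;]+);/g;
}

# Illustrative helper: return the lines that would be printed when the
# flag is cleared on lines starting with $reset_on ("AC" or "ID").
sub kept_lines {
    my ($reset_on, $wanted, @lines) = @_;
    my $in_record = 0;
    my @kept;
    for my $line (@lines) {
        $in_record = 0 if $line =~ /^\Q$reset_on\E/;
        if ($line =~ /^AC/) {
            for my $id (accessions($line)) {
                $in_record = 1 if $wanted->{$id};
            }
        }
        push @kept, $line if $in_record;
    }
    return @kept;
}

my %wanted = (Q8IZP0 => 1);
my @record = (
    "ID   ABI1_HUMAN              Reviewed;         508 AA.\n",
    "AC   Q8IZP0; A9Z1Y6; B4DQ58; O15147; O76049; O95060; Q5T2R3; Q5T2R4;\n",
    "AC   ZZZZZ1; ZZZZZ2;\n",    # made-up second AC line
    "DT   (rest of record)\n",   # placeholder for the remaining lines
);

for my $tag (qw(AC ID)) {
    my @kept = kept_lines($tag, \%wanted, @record);
    printf "reset on %s: %d line(s) kept\n", $tag, scalar @kept;
}
```

Resetting on every AC line switches printing off again at the second AC line, which matches the truncated output reported above; resetting on the ID line keeps the flag set until the next record starts.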

Hi,
I am getting this error.

bash-3.2$ perl kaavya.pl
Scalar found where operator expected at kaavya.pl line 18, near "$_
        $in_record"
        (Missing operator before $in_record?)
syntax error at kaavya.pl line 18, near "$_
        $in_record"
Execution of kaavya.pl aborted due to compilation errors.
The assignment to $head is missing its terminating semicolon. That line should read:

        $head=$_;

Oh, and one more thing:

#!/usr/bin/perl

use strict;
use warnings;

open(my $id_file, '<', 'id_file')|| die "Could not open id_file\n\t$!";
my $in_record=0;
my @ids=<$id_file>;
close $id_file;
chomp(@ids);
my %id_check;
map {$_++} @id_check{@ids};
open(my $records, '<', 'tmp.dat')|| die "Could not open tmp.dat\n\t$!";
my $head;
while(<$records>){
    if (/^ID/){
        $head=$_ ;
        $in_record=0;
    }
    if (/^AC/){
        my @entries=$_=~/\s+([^;]+);/g;
        for my $id(@entries){
            $in_record=1 if ($id_check{$id});
        }
        if ($in_record){
            print $head ;
            $head="";
        }
    }
    print if $in_record;
}

This will prevent the duplicate ID lines from being printed.
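
A different way to approach the same task, offered only as a sketch and not as the method used in this thread, is to read one whole record at a time by setting the input record separator $/ to the "//" terminator line. Every AC line of a record is then checked before anything is printed, so a match on a second AC line still prints the record from its ID line. The sub name matching_records and the cut-down two-record sample are assumptions for illustration; real input would come from tmp.dat and id_file as in the scripts above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the records (as whole strings) that list a wanted ID on any
# of their AC lines. $records_fh is a filehandle on the flat file and
# $wanted is a hash reference keyed by ID.
sub matching_records {
    my ($records_fh, $wanted) = @_;
    local $/ = "//\n";    # one <$records_fh> read returns one record
    my @hits;
    while (my $record = <$records_fh>) {
        # Gather the accessions from every AC line of this record.
        my @acs = map  { /([^\s;]+);/g }
                  grep { /^AC\s/ }
                  split /^/m, $record;
        push @hits, $record if grep { $wanted->{$_} } @acs;
    }
    return @hits;
}

# Lookup hash built directly from the list of IDs.
my %wanted = map { $_ => 1 } qw(Q7L8J4);

# Cut-down sample: only the ID and AC lines of two records.
my $sample = <<'END';
ID   3BP5L_HUMAN             Reviewed;         393 AA.
AC   Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3;
//
ID   ABI3_HUMAN              Reviewed;         366 AA.
AC   Q9P2A4; C9IZN8; Q9H0P6;
//
END

# An in-memory filehandle stands in for tmp.dat here.
open(my $fh, '<', \$sample) or die "Could not open in-memory sample: $!\n";
print matching_records($fh, \%wanted);
close $fh;
```

The trade-off is that a whole record is held in memory at once, which should be fine for records of the size shown in this thread.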