Extract particular lines and make them into a spreadsheet

mskcc · October 4, 2005, 2:58pm

Hi Masters,

I know this isn't a new issue, but I couldn't find any similar threads, so I have to bother you. Here is my input file (genomic data). The file has many records, each separated by //. Within each record there is only one ID line and one GN line.

ID 3HAO_HUMAN STANDARD; PRT; 286 AA.
AC P46952; Q8N6N9;
DT 01-NOV-1995 (Rel. 32, Created)
DT 01-NOV-1995 (Rel. 32, Last sequence update)
DT 10-MAY-2005 (Rel. 47, Last annotation update)
DE 3-hydroxyanthranilate 3,4-dioxygenase (EC 1.13.11.6) (3-HAO) (3-
DE hydroxyanthranilic acid dioxygenase) (3-hydroxyanthranilate
DE oxygenase).
GN Name=HAAO;
OS Homo sapiens (Human).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae;
OC Homo.
OX NCBI_TaxID=9606;
//
ID A4GCT_HUMAN STANDARD; PRT; 340 AA.
AC Q9UNA3;
DT 28-FEB-2003 (Rel. 41, Created)
DT 28-FEB-2003 (Rel. 41, Last sequence update)
DT 13-SEP-2005 (Rel. 48, Last annotation update)
DE Alpha-1,4-N-acetylglucosaminyltransferase (EC 2.4.1.-) (Alpha4GnT).
GN Name=A4GNT;
OS Homo sapiens (Human).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae;
OC Homo.
OX NCBI_TaxID=9606;
//
................
What I need to do is extract part of the GN and ID lines and put them into this format. Thanks in advance.

GN ID
HAAO 3HAO_HUMAN
A4GNT A4GCT_HUMAN
.... ....

blowtorch · October 4, 2005, 3:27pm

There is definitely a better way to do this, but right now this is all I could think of (test.tmp holds all your records):

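#!/usr/bin/ksh
# Second field of each ID line, e.g. 3HAO_HUMAN
awk '/^ID/ {print $2}' test.tmp > ID.tmp
# Text after "=" on each GN line, e.g. "HAAO;" (the trailing ";" is kept)
awk -F'=' '/^GN/ {print $2}' test.tmp > GN.tmp
# '\0' means an empty delimiter, so the kept ";" ends up separating the columns
paste -d '\0' GN.tmp ID.tmp > final.output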

The output will be a ';' separated file that you can open in any spreadsheet program.

The above code is inefficient: it reads the input twice and writes two temporary files, so it will be slow if you have a very large number of records. For a reasonable number of records, it will be just fine.
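If the extra passes ever become a problem, the same result could probably be had from a single awk call. This is only a sketch, under the assumptions that each record's ID line comes before its GN line and that the gene name is the first thing on the GN line, written as Name=...; the function name extract_pairs is made up for illustration.

```shell
#!/bin/sh
# extract_pairs: hypothetical helper, reads records on stdin and
# writes one "GN;ID" line per record that has a GN line.
extract_pairs() {
    awk '
        /^ID/ { id = $2 }              # remember the ID of the current record
        /^GN/ {
            gn = $2                    # e.g. "Name=HAAO;"
            sub(/^Name=/, "", gn)      # drop the "Name=" prefix
            sub(/;.*$/, "", gn)        # drop the trailing ";"
            print gn ";" id
        }
    '
}
```

It would be called as: extract_pairs < test.tmp > final.output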

mskcc · October 4, 2005, 3:49pm

Hi,
It didn't work for some reason. The error is:
awk: syntax error at source line 1
context is
/^ID/ {print $2} test.tmp > >>> ID <<< .tmp
awk: bailing out at source line 2
paste: ID.tmp: No such file or directory

By the way, I am using Mac OS X.

mskcc · October 4, 2005, 3:54pm

My bad! I misspelled a word. Thanks.

mskcc · October 4, 2005, 4:09pm

In the output file, the GN and ID on each line must belong to the same record, even though some records don't have a GN line. In that case, the ID has to be paired with an empty field. Thanks.
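One possible way to keep the columns aligned is to collect the ID and GN while reading a record and print the pair only when the // terminator is reached, so a record without a GN line still produces a line with an empty first field. This is a sketch, not a tested solution for the full data set: it assumes the gene name is written as Name=...; at the start of the first GN line, and the function name extract_pairs_keep_empty is made up for illustration.

```shell
#!/bin/sh
# extract_pairs_keep_empty: hypothetical helper, reads records on stdin
# and writes exactly one "GN;ID" line per record (GN may be empty).
extract_pairs_keep_empty() {
    awk '
        /^ID/ { id = $2; gn = "" }     # new record: reset the gene name
        /^GN/ && gn == "" {            # only the first GN line is used
            gn = $2
            sub(/^Name=/, "", gn)      # drop the "Name=" prefix
            sub(/;.*$/, "", gn)        # drop the trailing ";"
        }
        /^\/\// {                      # end of record: print the pair
            if (id != "") print gn ";" id
            id = ""; gn = ""
        }
        END { if (id != "") print gn ";" id }   # last record without "//"
    '
}
```

It would be called as: extract_pairs_keep_empty < test.tmp > final.output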