Splitting large file and renaming based on field

fozrun · May 8, 2012, 11:38am

I am trying to update an older program on a small cluster. It uses individual files to send jobs to each node. However, the newer database comes as one large file containing over 10,000 records, so I need to split it. It looks like this:

HMMER3/b [3.0 | March 2010]
NAME  1-cysPrx_C
ACC   PF10417.4
DESC  C-terminal domain of 1-Cys peroxiredoxin
LENG  40
ALPH  amino
RF    no
CS    yes
MAP   yes
.....more data...
          0.00103  6.88015        *  0.61958  0.77255  0.00000        *
//
HMMER3/b [3.0 | March 2010]
NAME  120_Rick_ant
ACC   PF12574.3
DESC  120 KDa Rickettsia surface antigen
LENG  255
ALPH  amino
RF    no
CS    no
MAP   yes
DATE  Tue Sep 27 11:43:56 2011
NSEQ  7
... etc..

Each record starts with HMMER3/b and ends with //

I would like each individual file named after the ACC field, such as PF10417.4 or PF10417 (whether the name keeps the part after the . doesn't matter).

Any clues?

Corona688 · May 8, 2012, 12:05pm

New enough versions of awk (gawk and mawk, for example) can conveniently be told to treat "//" followed by a newline as the record separator. After that it is just a matter of finding the "ACC" field in each record and using the field after it as the name of the file to print into.

$ cat hmmer.awk

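# Each record ends with "//" on its own line. A multi-character RS is an
# extension, so this needs an awk that supports it.
BEGIN { RS="//\n"; ORS="//\n" }

{
        # With the default FS, newlines also separate fields, so the whole
        # record can be scanned for the "ACC" tag.
        for(N=1; N<=NF; N++)
        if($N == "ACC")
        {
                FILE = $(N+1);  # the accession, e.g. PF10417.4
                printf("Send this record to %s\n", FILE);
                print > FILE;   # ORS puts the "//" terminator back
                close(FILE);    # don't keep thousands of files open
                break;
        }
}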

$ awk -f hmmer.awk data

Send this record to PF10417.4
Send this record to PF12574.3

$ cat PF10417.4

HMMER3/b [3.0 | March 2010]
NAME  1-cysPrx_C
ACC   PF10417.4
DESC  C-terminal domain of 1-Cys peroxiredoxin
LENG  40
ALPH  amino
RF    no
CS    yes
MAP   yes
.....more data...
          0.00103  6.88015        *  0.61958  0.77255  0.00000        *
//

$ cat PF12574.3

HMMER3/b [3.0 | March 2010]
NAME  120_Rick_ant
ACC   PF12574.3
DESC  120 KDa Rickettsia surface antigen
LENG  255
ALPH  amino
RF    no
CS    no
MAP   yes
DATE  Tue Sep 27 11:43:56 2011
NSEQ  7
//

$

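If plain awk gives problems, try nawk or gawk. A multi-character RS is an extension: POSIX leaves the behaviour unspecified, and some older awks only use the first character of RS, which would split the records in the wrong places.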
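
If none of the available awks handles a multi-character RS, the same split can be done line by line. This is only a rough sketch of that idea, not a tested drop-in for the real database: it buffers each record until it sees "//" alone on a line, then writes the buffer to a file named after the second field of the ACC line. The names sample.hmm, hmm_split_demo, outdir, rec, name and file are made up for the illustration, and the two records are shortened versions of the sample in the question.

```shell
# Hypothetical names: sample.hmm (input) and hmm_split_demo (output directory).
mkdir -p hmm_split_demo

# Two shortened records in the same layout as the sample in the question.
cat > sample.hmm <<'EOF'
HMMER3/b [3.0 | March 2010]
NAME  1-cysPrx_C
ACC   PF10417.4
LENG  40
//
HMMER3/b [3.0 | March 2010]
NAME  120_Rick_ant
ACC   PF12574.3
LENG  255
//
EOF

awk -v outdir=hmm_split_demo '
# Buffer every line of the current record, including the closing "//".
{ rec = rec $0 "\n" }

# Remember the accession from the first line whose first field is ACC.
$1 == "ACC" && name == "" { name = $2 }

# End of record: write the buffer out, then start over.
$0 == "//" {
    if (name != "") {
        file = outdir "/" name
        printf "%s", rec > file
        close(file)            # avoid running out of open files
    }
    rec = ""; name = ""
}
' sample.hmm

ls hmm_split_demo
```

To name the files PF10417 rather than PF10417.4, adding sub(/\.[0-9]+$/, "", name) just before the file name is built should be enough. Comparing the output of grep -c '^//$' on the input with the number of files created is a quick check that no record was dropped.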

fozrun · May 8, 2012, 12:18pm

Excellent. Thank you!