awk to split one field and print the last two fields within the split part.

yifangt · April 16, 2012, 9:41pm

Hello;
I have a file consists of 4 columns separated by tab. The problem is the third fields. Some of the them are very long but can be split by the vertical bar "|". Also some of them do not contain the string "UniProt", but I could ignore it at this moment, and sort the file afterwards. Here is the input file like (>100k rows!!!):

B_1019927    ES904790    Unknown    Unknown protein
B_1065532    JCVI_27855   JCVI_27855|gb|DW997385|gb|CD815403|gb|EV204160|gb|CD813268|gb|EX038736|UniProt|Q8RWM2    Putative uncharacterized protein At4g36940
B_1065533    JCVI_27856   JCVI_27856|gb|CD826551|gb|DY030639|gb|CX272767|gb|EV056662|gb|EX022554|gb|EX038271|UniProt|Q3E9T4    Protein At4g31430
B_1065534    JCVI_27857   JCVI_27857|gb|DY023794|gb|EE530065|gb|CX280035|UniProt|Q84WF5    Probable VAMP-like protein At1g33475
B_1065535    JCVI_27858   JCVI_27858|gb|ES941909|gb|ES940913|gb|EE514812|UniProt|Q8LBM4    Iron-sulfur assembly protein IscA-like 1, mitochondrial precursor

I want the output file with the truncated field-3, which is like following:

B_1019927    ES904790    Unknown    Unknown protein
B_1065532    JCVI_27855    UniProt    Q8RWM2     Putative uncharacterized protein At4g36940
B_1065533    JCVI_27856    UniProt    Q3E9T4     Protein At4g31430
B_1065534    JCVI_27857    UniProt    Q84WF5     Probable VAMP-like protein At1g33475
B_1065535    JCVI_27858    Uniprot    Q8LBM4     Iron-sulfur assembly protein IscA-like 1, mitochondrial precursor

What I did is by matching the "UniProt" string in the third field, split this field and only print the last two columns of the split fields, the original 3 fields ($1, $2, $4) need to be printed untouched.

Here is my code:

awk 'FS="\t" {for(i=1;i<=NF;i++) if(match($i,"UniProt")) split($i, a, "|"); {print $1, "\t", $2, "\t", $a[-2],"\t", $a[-1],"\t",$NF}}' infile.csv

Have struggled for a while, but did not work out. Appreciate your expertise to help me out. Thanks a lot! Yifang

agama · April 16, 2012, 10:54pm

Based on your sample I think this works. My assumption was that any row with pipe separated elements in the thrid field would have UniProt as the next to last in the series. If not, use a[n-1] if you want to capture that data, or add an if to test if a[n-1] is UniProt.

awk -F "\t" '
    {
        n = split( $3, a, "|" );
        if( n < 2 )
            data = sprintf( "%s\t", a[1] );
        else
            data = sprintf( "%s\t%s", "UniProt", a[n] );  #assumption that all pipe separated data contains UniProt at a[n-1]

        printf( "%s\t%s\t%s\t%s\n", $1, $2, data, $4 );
    }
' input-file >output-file

yifangt · April 17, 2012, 12:02am

Awesome!
I made the simple thing complicate. Your script is much simpler than mine. Lots to learn! Thanks a lot!
Yifang

michaelrozar17 · April 17, 2012, 4:16am

Through Sed..

sed "/UniProt/s/\(.*\) [^|]*  *\([^ ]*\)|\([^|]*\)|\([^ ]*\)  */\1 \3 \t \4\t/" inputfile > outfile

If \t gets printed as just 't' instead of tab space, remove all the \t and give required number of spaces.

Scrutinizer · April 17, 2012, 4:23am

or enter enter an actual <TAB> instead of \t by using CTRL-V <TAB>

Scrutinizer · April 17, 2012, 4:41am

To only print the UniProt lines:

awk 'sub(/.*UniProt\|/,"UniProt" OFS,$3)' FS='\t' OFS='\t' infile