Hello,
I have a file consisting of 4 tab-separated columns. The problem is the third field. Some of its values are very long but can be split on the vertical bar "|". Some of them do not contain the string "UniProt", but I can ignore those for now and sort the file afterwards. Here is what the input file looks like (>100k rows!):
B_1019927 ES904790 Unknown Unknown protein
B_1065532 JCVI_27855 JCVI_27855|gb|DW997385|gb|CD815403|gb|EV204160|gb|CD813268|gb|EX038736|UniProt|Q8RWM2 Putative uncharacterized protein At4g36940
B_1065533 JCVI_27856 JCVI_27856|gb|CD826551|gb|DY030639|gb|CX272767|gb|EV056662|gb|EX022554|gb|EX038271|UniProt|Q3E9T4 Protein At4g31430
B_1065534 JCVI_27857 JCVI_27857|gb|DY023794|gb|EE530065|gb|CX280035|UniProt|Q84WF5 Probable VAMP-like protein At1g33475
B_1065535 JCVI_27858 JCVI_27858|gb|ES941909|gb|ES940913|gb|EE514812|UniProt|Q8LBM4 Iron-sulfur assembly protein IscA-like 1, mitochondrial precursor
I want an output file in which field 3 is truncated, like the following:
B_1019927 ES904790 Unknown Unknown protein
B_1065532 JCVI_27855 UniProt Q8RWM2 Putative uncharacterized protein At4g36940
B_1065533 JCVI_27856 UniProt Q3E9T4 Protein At4g31430
B_1065534 JCVI_27857 UniProt Q84WF5 Probable VAMP-like protein At1g33475
B_1065535 JCVI_27858 UniProt Q8LBM4 Iron-sulfur assembly protein IscA-like 1, mitochondrial precursor
My approach was to match the string "UniProt" in the third field, split that field on "|", and print only the last two pieces of the split. The other three fields ($1, $2, $4) should be printed untouched.
Here is my code:
awk 'FS="\t" {for(i=1;i<=NF;i++) if(match($i,"UniProt")) split($i, a, "|"); {print $1, "\t", $2, "\t", $a[-2],"\t", $a[-1],"\t",$NF}}' infile.csv
I have struggled with this for a while but could not get it to work. I would appreciate any help. Thanks a lot! Yifang
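For anyone reading along, here is a minimal sketch of one way the described approach could be written. It is not a definitive answer. It assumes the file really is tab-separated, that "UniProt" is always the second-to-last "|"-separated piece of field 3, and that rows without it should pass through unchanged. Three things in the one-liner above likely get in the way. First, awk arrays have no negative indexing, so a[-2] is just an unset element. Second, $a[...] is a field reference rather than an array lookup. Third, FS="\t" written as a pattern is only assigned after the first record has already been split. The file names sample.tsv and sample.out and the two sample rows are hypothetical, used only so the sketch runs on its own.

```shell
# Hypothetical two-row sample (one row without UniProt, one with);
# replace sample.tsv with the real infile.csv.
printf 'B_1019927\tES904790\tUnknown\tUnknown protein\n' > sample.tsv
printf 'B_1065534\tJCVI_27857\tJCVI_27857|gb|DY023794|gb|EE530065|gb|CX280035|UniProt|Q84WF5\tProbable VAMP-like protein At1g33475\n' >> sample.tsv

awk 'BEGIN { FS = OFS = "\t" }   # set separators before the first record is read
{
    # split() returns the number of pieces, so the last two are parts[n-1] and parts[n]
    n = split($3, parts, "|")

    # Assumption: "UniProt" is always the second-to-last piece when present
    if (n >= 2 && parts[n-1] == "UniProt")
        $3 = parts[n-1] OFS parts[n]   # field 3 becomes two tab-separated columns

    # Rows without "UniProt" are printed unchanged
    print
}' sample.tsv > sample.out
```

Because $1, $2 and $4 are never reassigned, they come out untouched, and a single pass like this should be fine for a file with more than 100k rows.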