Convert rows to columns based on condition

I have a file something like this:

GN   Name=YWHAB;
RC   TISSUE=Keratinocyte;
RC   TISSUE=Thymus;
CC   -!- FUNCTION: Adapter protein implicated in the regulation of a large
CC       spectrum of both general and specialized signaling pathways
GN   Name=YWHAE;
RC   TISSUE=Liver;
RC   TISSUE=Brain;
RC   TISSUE=Heart;
CC   -!- FUNCTION: Adapter protein implicated in the regulation of a large
CC       spectrum of both general and specialized signaling pathways. Binds
CC       to a large number of partners, usually by recognition of a
CC       phosphoserine or phosphothreonine motif. Binding generally results
CC       in the modulation of the activity of the binding partner.

I want to keep the information for each entry column-wise, one row per entry. Each entry starts with GN (gene name) and ends with FUNCTION:

GN	TISSUE	FUNCTION
YWHAB	Keratinocyte;Thymus	Adapter protein implicated in the regulation of a large spectrum of both general and specialized signaling pathways
YWHAE	Liver;Brain;Heart;	Adapter protein implicated in the regulation of a large spectrum of both general and specialized signaling pathways. Binds to a large number of partners, usually by recognition of a phosphoserine or phosphothreonine motif. Binding generally results in the modulation of the activity of the binding partner.

One way to do it:

awk '
BEGIN {
	printf "%-8s %-22s %s\n","GN","TISSUE","FUNCTION"
	}
	/^GN/ {
		split($0,a,"[=;]")
		g=a[2]
		GN[g]=1
		}
	/^RC/ {
		split($0,a,"=")
		RC[g]=RC[g] a[2]}
	/^CC/ {
		sub(/-!- FUNCTION:/,x)
		split($0,a,"  +")
		CC[g]=CC[g] (CC[g]==""?"":" ") a[2]
		}
END {
	for (i in GN)
		printf "%-8s %-22s %s\n",i,RC[i],CC[i]
}' file

GN       TISSUE                 FUNCTION
YWHAB    Keratinocyte;Thymus;   Adapter protein implicated in the regulation of a large spectrum of both general and specialized signaling pathways
YWHAE    Liver;Brain;Heart;     Adapter protein implicated in the regulation of a large spectrum of both general and specialized signaling pathways. Binds to a large number of partners, usually by recognition of a phosphoserine or phosphothreonine motif. Binding generally results in the modulation of the activity of the binding partner.

Hi,
Thank you, but one thing: when I apply

cut -f1 output_file

which is supposed to print only the gene names (GN), it prints the entire content, which means the output is not tab-delimited.

awk '{print $1}' output_file
GN
YWHAB
YWHAE
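
The two commands disagree because of the delimiter: cut -f splits on tabs by default, while awk splits on any run of whitespace. The first script pads its columns with spaces, so cut finds no tab and prints each line whole. A small sketch of the difference (the file names padded.txt and tabbed.txt are made up for this illustration):

```shell
# Space-padded line, in the same printf layout the script above uses
printf '%-8s %-22s %s\n' YWHAB 'Keratinocyte;Thymus;' 'Adapter protein' > padded.txt

# Same three fields, separated by real tab characters
printf '%s\t%s\t%s\n' YWHAB 'Keratinocyte;Thymus;' 'Adapter protein' > tabbed.txt

cut -f1 padded.txt   # no tab on the line, so cut prints the whole line
cut -f1 tabbed.txt   # prints only the first field
```

So switching the printf format from padded columns to "%s\t%s\t%s\n" should be enough to make cut -f1 usable on the output.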

If you only need GN names, no need for all the above, just do:

awk -F"[=;]" '/^GN/ {print $2}' orgfile
YWHAB
YWHAE

I need all three fields separated by tabs, but when I run this on my huge data set everything somehow gets mixed up, so I thought I would copy each column into a separate file.

Can you post part of the file, or the complete file?

I have attached the file. It seems to require moderator approval, so I think it will be approved soon and you can take a look.

Try renaming the file to .txt, e.g. file.bin -> file.txt


I see several problems.

  1. The lines become very long, so they wrap on the screen. That should be no problem, since the fields are tab separated.
  2. Your data is not unique. I store everything in arrays keyed by GN name, and since the names are not unique this may mess things up.
GN   Name=HLA-A; Synonyms=HLAA;
RC   TISSUE=Blood;
CC   -!- FUNCTION: Involved in the presentation of foreign antigens to the
CC       immune system.
GN   Name=HLA-A; Synonyms=HLAA;
RC   TISSUE=Blood;
RC   TISSUE=Blood;
RC   TISSUE=Platelet;
CC   -!- FUNCTION: Involved in the presentation of foreign antigens to the
CC       immune system.

As you can see, there are two sections describing the same GN, but with different TISSUE values.
TISSUE also has duplicates within a single section.

To solve this, I could write a script that processes the file section by section instead of collecting everything in one array.
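
A minimal sketch of that section-by-section idea, assuming every record starts with a GN line and that only GN, RC and CC lines matter. The helper name emit, the variable names and the file names sample.txt and sections.tsv are made up for the example. Each record is printed as soon as the next GN line (or the end of the file) is reached, so repeated gene names are no longer merged, and the fields are tab separated so that cut -f works:

```shell
cat > sample.txt <<'EOF'
GN   Name=HLA-A; Synonyms=HLAA;
RC   TISSUE=Blood;
CC   -!- FUNCTION: Involved in the presentation of foreign antigens to the
CC       immune system.
GN   Name=HLA-A; Synonyms=HLAA;
RC   TISSUE=Blood;
RC   TISSUE=Blood;
RC   TISSUE=Platelet;
CC   -!- FUNCTION: Involved in the presentation of foreign antigens to the
CC       immune system.
EOF

awk '
# Print the record collected so far (if any) and reset the buffers
function emit() {
	if (gn != "") printf "%s\t%s\t%s\n", gn, tissue, fn
	gn = tissue = fn = ""
}
/^GN/ { emit(); split($0, a, "[=;]"); gn = a[2] }        # a new section starts here
/^RC/ { split($0, a, "="); tissue = tissue a[2] }        # append e.g. "Blood;"
/^CC/ {
	sub(/^CC +(-!- FUNCTION: *)?/, "")                   # drop the line prefix
	fn = (fn == "" ? $0 : fn " " $0)                     # join wrapped lines with a space
}
END { emit() }                                           # do not lose the last section
' sample.txt > sections.tsv
```

With this sample, cut -f1 sections.tsv prints HLA-A twice, once per section, and the second row keeps all three of its tissues.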

Which GN name are you looking for? What about Synonyms?
The tissues are not identical; do you want only the unique ones or all of them?

--ahamed

There are many isoforms, so the tissue field may differ even when the gene name is the same. I want all of them (judged by gene name alone the entries look like duplicates, but the other fields differ). Synonyms are also needed; sorry for not mentioning that earlier. It would be better if the output contains the following fields:

"GN","synonyms","TISSUE","FUNCTION"

For the sample file you have provided, please paste the expected output for the first 3 sets.

--ahamed

I made it in .xls format so that the fields are clearly separated from each other. Please have a look.

Try this

awk '
	/^GN/ {split($0,a,"[=;]");printf "\n%s|",a[2]}
	/^RC/ {split($0,a,"=");printf "%s",a[2];f=1}
	/^CC/ {if (f) {printf "|"};sub(/-!- FUNCTION:/,x);split($0,a,"  +");printf "%s ",a[2];f=0}
END {print ""}' uniprot_human_prts_with_tissue_fn

Every field is now separated by | and, within TISSUE, the values are separated by ;
What should we do with the Synonyms field? Print it, and if so, where?

First 3 lines:

YWHAB|Keratinocyte;Thymus;Skin;Colon carcinoma;Platelet;Melanoma;Leukemic T-cell;|Adapter protein implicated in the regulation of a large spectrum of both general and specialized signaling pathways. Binds to a large number of partners, usually by recognition of a phosphoserine or phosphothreonine motif. Binding generally results in the modulation of the activity of the binding partner. Negative regulator of osteogenesis. Blocks the nuclear translocation of the phosphorylated form (by AKT1) of SRPK2 and antagonizes its stimulatory effect on cyclin D1 expression resulting in blockage of neuronal apoptosis elicited by SRPK2.
YWHAE|Liver;Brain;Heart;Caudate nucleus, Heart, and Subthalamic nucleus;Placenta;Platelet;B-cell lymphoma;Histiocytic lymphoma;Brain, and Cajal-Retzius cell;Melanoma;Liver;Cervix carcinoma;|Adapter protein implicated in the regulation of a large spectrum of both general and specialized signaling pathways. Binds to a large number of partners, usually by recognition of a phosphoserine or phosphothreonine motif. Binding generally results in the modulation of the activity of the binding partner.
YWHAH|Brain;Brain;Lymph;Keratinocyte;Platelet;Platelet;Leukemic T-cell;|Adapter protein implicated in the regulation of a large spectrum of both general and specialized signaling pathways. Binds to a large number of partners, usually by recognition of a phosphoserine or phosphothreonine motif. Binding generally results in the modulation of the activity of the binding partner. Negatively regulates the kinase activity of PDPK1.

EDIT: Here the synonym is appended to the original name, like: YWHAH-YWHA1

awk '
	/^GN/ {split($0,a,"[=;]");p=(a[4])?a[2]"-"a[4]:a[2];printf "\n%s|",p}
	/^RC/ {split($0,a,"=");printf "%s",a[2];f=1}
	/^CC/ {if (f) {printf "|"};sub(/-!- FUNCTION:/,x);split($0,a,"  +");printf "%s ",a[2];f=0}
END {print ""}' uniprot_human_prts_with_tissue_fn
YWHAB|Keratinocyte;Thymus;Skin;Colon carcinoma;Platelet;Melanoma;Leukemic T-cell;|Adapter protein implicated in the regulation of a large spectrum of both general and specialized signaling pathways. Binds to a large number of partners, usually by recognition of a phosphoserine or phosphothreonine motif. Binding generally results in the modulation of the activity of the binding partner. Negative regulator of osteogenesis. Blocks the nuclear translocation of the phosphorylated form (by AKT1) of SRPK2 and antagonizes its stimulatory effect on cyclin D1 expression resulting in blockage of neuronal apoptosis elicited by SRPK2.
YWHAE|Liver;Brain;Heart;Caudate nucleus, Heart, and Subthalamic nucleus;Placenta;Platelet;B-cell lymphoma;Histiocytic lymphoma;Brain, and Cajal-Retzius cell;Melanoma;Liver;Cervix carcinoma;|Adapter protein implicated in the regulation of a large spectrum of both general and specialized signaling pathways. Binds to a large number of partners, usually by recognition of a phosphoserine or phosphothreonine motif. Binding generally results in the modulation of the activity of the binding partner.
YWHAH-YWHA1|Brain;Brain;Lymph;Keratinocyte;Platelet;Platelet;Leukemic T-cell;|Adapter protein implicated in the regulation of a large spectrum of both general and specialized signaling pathways. Binds to a large number of partners, usually by recognition of a phosphoserine or phosphothreonine motif. Binding generally results in the modulation of the activity of the binding partner. Negatively regulates the kinase activity of PDPK1.
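
If the synonyms should instead go into a column of their own, as in the four-field layout asked for earlier, the GN line can be split the same way and the name and synonyms printed as separate tab-separated fields. A rough sketch, assuming the GN line has the form Name=...; Synonyms=...; with the Synonyms part optional (gn_syn.tsv is just an example output name):

```shell
printf '%s\n' 'GN   Name=HLA-A; Synonyms=HLAA;' 'GN   Name=YWHAB;' |
awk '
/^GN/ {
	n = split($0, a, "[=;]")      # a[2] = name, a[4] = synonyms when present
	syn = (n >= 4 ? a[4] : "")    # leave the column empty when there are none
	printf "%s\t%s\n", a[2], syn
}' > gn_syn.tsv
```

An empty second field keeps the column count the same on every row, which makes later cut -f or spreadsheet imports simpler.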

That's exactly what I needed.