Convert rows to column and add header

Hi,

I need help to convert rows in input file into a table.

inputfile

192.98.1   192.98.192.98.17    VVC family                            Zorro    10
192.98.1   192.98.192.98.17    VVC family                            Ace      1
192.98.1   192.98.192.98.17    VVC family                            Bora     1
192.98.1   192.98.192.98.17    VVC family                            Sakura   5
12.A.4     12.A.4.10.30        channel2 family                       Usopun   1
7.A3.14    7.A3.14.3.1         DuanXon channel family                T-Law    1
7.A3.14    7.A3.14.3.1         DuanXon channel family                Robyn    1
7.A3.14    7.A3.14.3.1         DuanXon channel family                Zorro    1
7.A3.14    7.A3.14.3.1         DuanXon channel family                Ace      1
7.A3.14    7.A3.14.3.1         DuanXon channel family                Bora     1
7.A3.14    7.A3.14.3.1         DuanXon channel family                Sakura   1
7.A3.14    7.A3.14.3.1         DuanXon channel family                Hashir   1
8.M.14     8.M.14.1            potential receptor, channel family    Robyn    1
8.M.14     8.M.14.1.2          potential receptor, channel family    Usopun   1
8.M.14     8.M.14.1.3          potential receptor, channel family    T-Law    1
8.M.14     8.M.14.1.3          potential receptor, channel family    Zorro    2
8.M.14     8.M.14.1.3          potential receptor, channel family    Ace      4
8.M.14     8.M.14.1.3          potential receptor, channel family    Bora     1
8.M.14     8.M.14.1.3          potential receptor, channel family    Sakura   2
1.P.5      1.P.5.18.1          major intrinsic family                Ace      8
1.P.5      1.P.5.18.3          major intrinsic family                Sakura   1
1.P.5      1.P.5.6.4           major intrinsic family                T-Law    1
1.P.5      1.P.5.6.4           major intrinsic family                Robyn    6
1.P.5      1.P.5.6.4           major intrinsic family                Sakura   1

The expected output file (sample) as below:-

Fam        No                  Name                               Ace     Bora    Hashir    Robyn    Sakura    T-Law    Usopun    Zorro
192.98.1   192.98.192.98.17   VVC family                          1       1       null      null       5       null      null       10
12.A.4     12.A.4.10.30       channel2 family                     null    null    null      null      null     null      1        null
7.A3.14    7.A3.14.3.1        DuanXon channel family              1       1         1       1         1         1        null        1
8.M.14     8.M.14.1           potential receptor, channel family  null    null    null      1         null     null      null     null
8.M.14     8.M.14.1.2         potential receptor, channel family  null    null    null      null      null     null      1        null
8.M.14     8.M.14.1.3         potential receptor, channel family  4       1       null      null      2          1       null       2
1.P.5      1.P.5.18.1         major intrinsic family              8       null    null      null      null     null      null     null
1.P.5      1.P.5.18.3         major intrinsic family              null    null    null      null       1       null      null     null
1.P.5      1.P.5.6.4          major intrinsic family              null    null    null      6          1         1       null     null

I need to add header and arrange the info in inputfile by distributing the value in $5 (inputfile) accordingly. The distribution is based on $2. If there is no value in any $4 to $11 (from $5 in inputfile), those columns should be printed with "null" as above.
I tried some codes that i got from this forum plus awk online documentation but failed to get what i wanted. one of the script that i tried as follows:-

awk -F"\t" '{a[$2]=a[$2]?a[$2]FS$4$5:$2FS$4$5} END{for(i in a) print a}' inputfile

This problem is very complicated to me. Hope my description is clear and really appreciate if somebody can help me to solve this using awk. Thanks.

Would an adaption of this help?

How about this ?

 awk 'NR==FNR{a[$(NF-1)]=$(NF-1);next}
{if (FNR==1) { asort(a) ; printf "Fam\t\tNo\t\tName\t" ; for ( j in a ) { printf a[j] FS }} if ( !b[$1,$2] ) { if ( FNR>1) { for(j in a) {if ( p[a[j]] ) { printf OFS p[a[j]] } else {printf OFS "NULL"} } delete p;} printf "\n"; b[$1,$2]++; printf $1 OFS $2 OFS ;for (i=3;i<=NF-2;i++) { printf $i FS  } ; } if (FNR==1) { asort(a) } ; for ( j in a ) { if ( a[j] == $(NF-1) ) { p[a[j]]=$NF;} }} END {  for(j in a) {if ( p[a[j]] ) { printf OFS p[a[j]] } else {printf OFS "NULL" } } printf "\n";}' OFS="|" testFile testFile
1 Like

Try

Input

[akshay@nio tmp]$ cat file
192.98.1   192.98.192.98.17    VVC family                            Zorro    10
192.98.1   192.98.192.98.17    VVC family                            Ace      1
192.98.1   192.98.192.98.17    VVC family                            Bora     1
192.98.1   192.98.192.98.17    VVC family                            Sakura   5
12.A.4     12.A.4.10.30        channel2 family                       Usopun   1
7.A3.14    7.A3.14.3.1         DuanXon channel family                T-Law    1
7.A3.14    7.A3.14.3.1         DuanXon channel family                Robyn    1
7.A3.14    7.A3.14.3.1         DuanXon channel family                Zorro    1
7.A3.14    7.A3.14.3.1         DuanXon channel family                Ace      1
7.A3.14    7.A3.14.3.1         DuanXon channel family                Bora     1
7.A3.14    7.A3.14.3.1         DuanXon channel family                Sakura   1
7.A3.14    7.A3.14.3.1         DuanXon channel family                Hashir   1
8.M.14     8.M.14.1            potential receptor, channel family    Robyn    1
8.M.14     8.M.14.1.2          potential receptor, channel family    Usopun   1
8.M.14     8.M.14.1.3          potential receptor, channel family    T-Law    1
8.M.14     8.M.14.1.3          potential receptor, channel family    Zorro    2
8.M.14     8.M.14.1.3          potential receptor, channel family    Ace      4
8.M.14     8.M.14.1.3          potential receptor, channel family    Bora     1
8.M.14     8.M.14.1.3          potential receptor, channel family    Sakura   2
1.P.5      1.P.5.18.1          major intrinsic family                Ace      8
1.P.5      1.P.5.18.3          major intrinsic family                Sakura   1
1.P.5      1.P.5.6.4           major intrinsic family                T-Law    1
1.P.5      1.P.5.6.4           major intrinsic family                Robyn    6
1.P.5      1.P.5.6.4           major intrinsic family                Sakura   1

Excel version

awk '
	{
		# Field 1,2,4 and 5
		f1 = $1  
		f2 = $2 
		f4 = $(NF -1)
		f5 = $NF
				
		# This is for f3
		# We can also do this like $1 = $2 = $(NF -1) = $NF ="" ; f3 = $0
		# But there will be OFS issue, you have set OFS in END block I guess for proper formating

		split($0,d,/[^[:space:]]*/)
		for(i=3; i<=NF-2; i++)
		{
			f3 = sprintf("%s%s%s",f3,d,$i)
		}
		
		# Remove Leading and trailing space in f3
		gsub(/^[ \t]+|[ \t]+$/,"",f3)	

		# Index
		key = f1 SUBSEP f2 SUBSEP f3		

		# Uniq index
		D[key]
		
		# Hash with value
		N[key,f4] = f5

		# Uniq User
		U[f4] 
		
		# reset variable
		key = f3 = "" 
	}
	END{
		# Sort User array
		n = asorti(U,copy)		
	
		# Loop through uniq index
		for(key in D)
		{
			# Split key
			split(key,S,SUBSEP)

			# Loop through uniq user
			for(i=1;i<=n;i++)
			{
			
				# User
				u = copy

				# If header not printed then
				if(!header)
				{
					# Create header
					hdr = sprintf("%s%s%s",hdr,OFS,u)
				}			
				
				# index we are looking for
				ind = S[1] SUBSEP S[2] SUBSEP S[3] SUBSEP u

				# if key exists and use that value else string null
				val = (ind in N)? N[ind] : "null"
				
				# Save value in variable
				str = sprintf("%s%s%s",str,OFS,val)
			}
			
			# write header if header not set
			if(!header)
			{
				print "Fam","No","Name" hdr
				
				# set flag header
				header = 1
			}
			
			# Print values
			# Quote is added incase if you want to open them in excel as 
			# there are comma inside the field
			print S[1],S[2],"\""S[3]"\"" str
			str = ""
		}
	}
    ' 	OFS=","  file

Resulting

Fam,No,Name,Ace,Bora,Hashir,Robyn,Sakura,T-Law,Usopun,Zorro
12.A.4,12.A.4.10.30,"channel2 family",null,null,null,null,null,null,1,null
1.P.5,1.P.5.6.4,"major intrinsic family",null,null,null,6,1,1,null,null
1.P.5,1.P.5.18.3,"major intrinsic family",null,null,null,null,1,null,null,null
8.M.14,8.M.14.1.3,"potential receptor, channel family",4,1,null,null,2,1,null,2
8.M.14,8.M.14.1,"potential receptor, channel family",null,null,null,1,null,null,null,null
8.M.14,8.M.14.1.2,"potential receptor, channel family",null,null,null,null,null,null,1,null
7.A3.14,7.A3.14.3.1,"DuanXon channel family",1,1,1,1,1,1,null,1
1.P.5,1.P.5.18.1,"major intrinsic family",8,null,null,null,null,null,null,null
192.98.1,192.98.192.98.17,"VVC family",1,1,null,null,5,null,null,10

Fixed Width don't blame me if not opening properly on excel

awk '
	{
		split($0,d,/[^[:space:]]*/)
		for(i=1; i<=NF-2; i++)
		{
			key = sprintf("%s%s%s",key,d,$i)
		}
			D[key]
			N[key,$(NF-1)] = $NF; 
			U[$(NF-1)] 
		key = "" 
	}
	END{
		n = asorti(U,copy)
		for(key in D)
		{
			split(key,S,SUBSEP)
			for(i=1;i<=n;i++)
			{
			
				u = copy

				if(!header)
				{
					hdr = sprintf("%3s%s%3s",hdr,"\t",u)
				}
				val = ((S[1],u) in N)? N[S[1],u] : "null"
				str = sprintf("%3s%s%3s",str,"\t",val)
			}
			if(!header)
			{
				printf("%-100s%s\n",sprintf("%-15s%-25s%s","Fam","No","Name"),hdr);
				header = 1
			}
			printf("%-100s%s\n", S[1],str)
			str = ""
		}
	}
    '  file

Resulting

Fam            No                       Name                                                           	Ace	Bora	Hashir	Robyn	Sakura	T-Law	Usopun	Zorro
1.P.5      1.P.5.18.3          major intrinsic family                                                  	null	null	null	null	  1	null	null	null
8.M.14     8.M.14.1.3          potential receptor, channel family                                      	  4	  1	null	null	  2	  1	null	  2
8.M.14     8.M.14.1            potential receptor, channel family                                      	null	null	null	  1	null	null	null	null
192.98.1   192.98.192.98.17    VVC family                                                              	  1	  1	null	null	  5	null	null	 10
1.P.5      1.P.5.18.1          major intrinsic family                                                  	  8	null	null	null	null	null	null	null
1.P.5      1.P.5.6.4           major intrinsic family                                                  	null	null	null	  6	  1	  1	null	null
12.A.4     12.A.4.10.30        channel2 family                                                         	null	null	null	null	null	null	  1	null
8.M.14     8.M.14.1.2          potential receptor, channel family                                      	null	null	null	null	null	null	  1	null
7.A3.14    7.A3.14.3.1         DuanXon channel family                                                  	  1	  1	  1	  1	  1	  1	null	  1
1 Like

Hi Pravin27,

Thanks a lot! It really works perfectly on my real data. New things in the codes that interest me like asort function. I am wondering about the input file being written twice there too. Anyways, i will study your code first and come back if i don't understand it. :b:

---------- Post updated at 11:08 AM ---------- Previous update was at 11:02 AM ----------

Hi Akshay Hegde,

Your codes worked awesome!! Many new things in the codes and i am really2 appreciate your clear explanation for each step. Thanks a million!