Concatenate column values when header is Matching from multiple files

Nina2910 · September 23, 2016, 11:26pm

there can be n number of columns but the number of columns and header name will remain same in all 3 files. Files are tab Delimited.
a.txt

Name 9/1 9/2
X 1 7
y 2 8
z 3 9
a 4 10
b 5 11
c 6 12

b.xt

Name  9/1 9/2
X 13 19
y 14 20
z 15 21
a 16 22
b 17 23
c 18 24

c.txt

Name  9/1 9/2
X 25 31
y 26 32
z 27 33
a 28 34
b 29 35
c 30 36

required output

Name  9/1 9/2
X 1/13/25 7/19/31
y 2/14/26 8/20/32
z 3/15/27 9/21/33
a 4/16/28 10/22/34
b 5/17/29 11/23/35
c 6/16/30 12/24/36

I Want to merge all three file on the basis of matching header and if values are same in the row then print one value and if different then concatenate values from all three files under that column. like in file a.txt,b.txt,c.txt, row 2 has same value under column name "NAME" so print only X but column "9/1" has different values so print one header " 9/1" and under it print all values from three files e.g. 1/13/25.

join <(sort a.txt) <(sort b.txt) <(sort c.txt) >out.txt

 awk 'FNR==NR{a[$1]=$2;next} ($1 in a){ print $0, a[$1]}' a.txt b.txt c.txt

the above commands are not concatenating the values of same headers but it is printing all the rows from file a then all rows from file b then c

RavinderSingh13 · September 24, 2016, 12:14am

Hello Nina2910,

Welcome to forums, hope you will enjoy learning/sharing knowledge here. Could you please try following and let me know how it goes.

awk 'function get(a,b,array1){array1[a]=a in array1?array1[a]"/"b:b} FNR==NR{A[$1]=$2;B[$1]=$3;S[++i]=$1;next} FNR!=NR && FNR>1{get($1,$2,A);get($1,$3,B)} END{for(j=1;j<=i;j++){print S[j], A[S[j]],B[S[j]]}}' *.txt

If you have more txt files than a.txt , b.txt and c.txt then you could use following.

awk 'function get(a,b,array1){array1[a]=a in array1?array1[a]"/"b:b} FNR==NR{A[$1]=$2;B[$1]=$3;S[++i]=$1;next} FNR!=NR && FNR>1{get($1,$2,A);get($1,$3,B)} END{for(j=1;j<=i;j++){print S[j], A[S[j]],B[S[j]]}}'  a.txt  b.txt  c.txt

Output will be as follows.

Name 9/1 9/2
X 1/13/25 7/19/31
y 2/14/26 8/20/32
z 3/15/27 9/21/33
a 4/16/28 10/22/34
b 5/17/29 11/23/35
c 6/18/30 12/24/36

EDIT: Adding a non-one liner form of solution.

awk 'function get(a,b,array1){
				array1[a]=a in array1?array1[a]"/"b:b
		             } 
     FNR==NR                 {
				A[$1]=$2;
				B[$1]=$3;
				S[++i]=$1;
				next
                             } 
     FNR!=NR && FNR>1        {
				get($1,$2,A);
				get($1,$3,B)
			     } 
     END                     {
				for(j=1;j<=i;j++){
							print S[j], A[S[j]],B[S[j]]
						 }
                             }
    ' *.txt

Thanks,
R. Singh

Nina2910 · September 24, 2016, 12:29am

Thank you so much Ravinder for welcoming me and giving the solution. It worked as expected

Two requests
1) could you please let me know if each file has n number of columns for example 32 columns in files a 32 columns in file b and 32 columns in file c. column header and column number will remain same in each file.how can I get the desired result.

FIles are Tab delimited
a.txt

Name 9/1 9/2 9/3.................9/30

b.txt

 Name 9/1 9/2 9/3....................9/30

c.txt

Name 9/1 9/2 9/3......................9/30

2) can you please explain the code sorry still learning

RavinderSingh13 · September 24, 2016, 1:52am

Hello Nina,

So number of columns were not mentioned before so solution was according to it, so following may help you with dynamic number of columns but condition is Input_files should have same number of columns(based on row vice), if this condition is TRUE(which seems to be as per your post) then following may help you in same. Let's say following are the Input_files.

cat a.txt
Name 9/1 9/2
X 1 7 7 8 9 10 12 12 12 12 12 13
y 2 8 3 4 5 6 7 8 8 9 1 212 12 12
z 3 9 1 1 2
a 4 10 1
b 5 11
c 6 12 12 13 14

cat b.txt
Name 9/1 9/2
X 1 7 7 8 9 10 12 12 12 12 12 13
y 22 4 13 2 53 26 17 18 82 19 111 12 13 4
z 3 9 1 1 2
a 4 10 1
b 5 11
c 6 12 12 13 14

cat c.txt
Name 9/1 9/2
X 1 7 7 8 9 10 12 12 12 12 12 13
y 2 8 3 4 5 6 7 8 8 9 1 212 12 12
z 3 9 1 1 2
a 4 10 1
b 5 11
c 6 12 12 13 14

Then following is the code for same.

paste *.txt | awk 'function get(field){q=NF/3;for(i=field;i<=NF;i+=q){$field=i>field?$field "/" $i:$field;}} NR>1{for(j=2;j<=NF/3;j++){get(j)};for(j=NF/3+1;j<=NF;j++){$j=""};sub(/[[:space:]]+$/,X,$0);print}'

Output will be as follows.

X 1/1/1 7/7/7 7/7/7 8/8/8 9/9/9 10/10/10 12/12/12 12/12/12 12/12/12 12/12/12 12/12/12 13/13/13                          
y 2/22/2 8/4/8 3/13/3 4/2/4 5/53/5 6/26/6 7/17/7 8/18/8 8/82/8 9/19/9 1/111/1 212/12/212 12/13/12 12/4/12                              
z 3/3/3 9/9/9 1/1/1 1/1/1 2/2/2            
a 4/4/4 10/10/10 1/1/1        
b 5/5/5 11/11/11      
c 6/6/6 12/12/12 12/12/12 13/13/13 14/14/14

Also in above code wherever I have used NF/3 so it is because of 3 Input_files are there if they vary in numbers then you could change according to their number. Kindly do let me know if this helps you or if you have any issues on same.
EDIT: Adding a non-one liner form of solution too here.

paste *.txt | awk 'function get(field){
					q=NF/3;
					for(i=field;i<=NF;i+=q){
								$field=i>field?$field "/" $i:$field;
							       }
                                      } 
                   NR>1               {
					for(j=2;j<=NF/3;j++)   {
								get(j)
                                                               };
					for(j=NF/3+1;j<=NF;j++){
								$j=""
							       };
                                        sub(/[[:space:]]+$/,X,$0);
					print
 				      }
                  '

Thanks,
R. Singh

Don_Cragun · September 24, 2016, 2:06am

You could also try the following...

Note that you say that your files are <tab> delimited, but all of the sample data you have shown us uses one or two <space> characters to separate fields; not a <tab> character.

The following will work with input files with fields separated by one or more blanks (where a blank is a <space> or a <tab>). This will not work if you have input files that really do use a <tab> as a field separator and some of your field data contains a <space>. It will work with any number of input files. It will work with any number of fields in a line (as long as all files have the same number of fields). The output header is taken from the 1st line of the 1st input file. The first line of all other input files are ignored. If the name on a data line is the same as the name on the header line in the 1st input file, that data will be merged into the output header line (i.e., it is assumed that the name used in the header in the 1st input file is not used as a name on any non-header line in any of the input files). Output fields will be separated by a <tab> character. The names of the files to be processed are not built into this script, they must be supplied as command-line arguments to the following script:

#!/bin/ksh
awk '	# Use the awk utility to interpret the following script...
BEGIN {	# Set output field separator.
	OFS = "\t"
}
NR == 1 || FNR > 1 {
	# Gather data from the 1st line in the 1st file (the header is supposed
	# to be the same in all input files) and from the 2nd line on in every
	# input file...
	# If we have not seen the name found in the first field before...
	if(!($1 in name)) {
		# Add the 1st first to the list of known names, increment the
		# number of names we have seen, and note the output line number
		# where this name should appear in the output....
		name[order[++nc] = $1]
		# and initialize the data for each output field for this name
		# from the corresponding input fields on this line.
		for(i = 2; i <= NF; i++)
			d[$1, i] = $i
	} else	# And if we have seen this name before, add data to be output
		# for this name to the accumalated data we have seen before for
		# this name.
		for(i = 2; i <= NF; i++)
			d[$1, i] = d[$1, i] "/" $i
}
END {	# Now that we have hit EOF on the last input file, print the accumulated
	# output.  For each name seen...
	for(i = 1; i <= nc; i++) {
		# Print the name...
		printf("%s", order)
		# and for the remaining fields...
		for(j = 2; j <= NF; j++)
			# print the output field separator followed by the
			# accumulated data for this name and field number.
			printf("%s%s", OFS, d[order, j])
		# and after the last field has been printed, add an aoutput
		# record separator.
		print ""
	}
}' "$@"	# Terminate the awk script and use the command line arguments as the
	# list of files to be processed.

This was written and tested using a Korn shell, but will work with any shell that uses Bourne shell syntax. If you save this script in a file named merger and make it executable:

chmod +x merger

and execute it with the pathnames of your sample input files:

./merger a.txt b.txt c.txt

it produces the output:

Name	9/1	9/2
X	1/13/25	7/19/31
y	2/14/26	8/20/32
z	3/15/27	9/21/33
a	4/16/28	10/22/34
b	5/17/29	11/23/35
c	6/18/30	12/24/36

Note that the output you said you wanted on the last line of the output was:

c 6/16/30 12/24/36

which, in addition to using <space> as a field separator instead of <tab>, also has 16 as the data from the 2nd column of the last line in b.txt instead of the value 18 that was contained in that field in your sample input file.

RavinderSingh13 · September 24, 2016, 2:35am

Hello Nina2910,

So in previous code of mine(POST#4)we need to manually provide number of .txt file into code(where we are using NF/3 ). So converting code into a script and skipping pain of manually putting number of .txt files into following script, which is successfully run and tested in BASH. One more thing my code not checking about hearders of any of the Input_files.

cat script.ksh
COUNT=$(ls *.txt | wc -l)
paste *.txt | awk -vcount="$COUNT" 'function get(field){
							q=NF/count;
							for(i=field;i<=NF;i+=q){
										$field=i>field?$field "/" $i:$field;
							       		       }
                                                       } 
                   NR>1               {
					for(j=2;j<=NF/count;j++)   {
								get(j)
                                                               };
					for(j=NF/count+1;j<=NF;j++){
								$j=""
							       };
                                        sub(/[[:space:]]+$/,X,$0);
					print
 				      }
                  '

Thanks,
R. Singh

RudiC · September 24, 2016, 6:02am

For any number or order of fields, any order of lines, missing fields across files, try

awk '
FNR == 1        {for (n=split($0, T); n>1; n--) if (!(T[n] in HDK)) HDK[T[n]] = ++HDC
                 F1 = T[1]
                 next
                }

!($1 in NM)     {NM[$1]
                 NMS[++SQ] = $1
                }

                {for (i=2; i<=NF; i++) NC[$1,T] = NC[$1,T] (NC[$1,T]?"/":"") $i
                }

END             {printf "%s", F1
                 for (h in HDK) printf "\t%s", h
                 printf RS

                 for (i=1; i<=SQ; i++)  {printf "%s", NMS
                                         for (h in HDK) printf "\t%s", NC[NMS,h]
                                         printf RS
                                        } 
                }
'  file[123]
Name	9/1	9/2	9/3
X	1/25	7/19/31	13
y	2/26	8/20/32	14
z	3/27	9/33	
a	4/28	10/22/34	16
b	5/29	11/23/35	17
c	6/30	12/24/36	18
Z		21	15

The second input file has an upper case "Z" as a name, and "9/3" as a new and unique column header. Output fields are separated by <TAB> chars, the misleading output picture is due to the varying length of Col 2...

Nina2910 · September 24, 2016, 10:07pm

@Ravinder...Thank you so much I added below to get it tab delimited however I was not able get the header file. As header will be constant in all three files can get the header. Otherwise it worked perfectly fine. Thank you again you have really helped

 
 paste Allen_Free.txt Allen_Current.txt Allen_Allocated.txt | awk 'function get(field){q=NF/3;for(i=field;i<=NF;i+=q){$field=i>field?$field "/" $i:$field;}} BEGIN{FS=OFS="\t"}  NR>1 {for(j=2;j<=NF/3;j++){get(j)};for(j=NF/3+1;j<=NF;j++){$j=""};sub(/[[:space:]]+$/,X,$0);print

please help

Don_Cragun · September 24, 2016, 11:28pm

Did you try the code I suggested in post #5 in this thread?

RavinderSingh13 · September 25, 2016, 1:05am

nina2910:

@Ravinder...Thank you so much I added below to get it tab delimited however I was not able get the header file. As header will be constant in all three files can get the header. Otherwise it worked perfectly fine. Thank you again you have really helped
 
 paste Allen_Free.txt Allen_Current.txt Allen_Allocated.txt | awk 'function get(field){q=NF/3;for(i=field;i<=NF;i+=q){$field=i>field?$field "/" $i:$field;}} BEGIN{FS=OFS="\t"}  NR>1 {for(j=2;j<=NF/3;j++){get(j)};for(j=NF/3+1;j<=NF;j++){$j=""};sub(/[[:space:]]+$/,X,$0);print
 
please help

Hello Nina2910,

Request you to please answer Don's question. Glad that I could help you, you could run following script, where I have made a minor change into it to get the headers of your Input_file.

cat script.ksh
COUNT=$(ls *.txt | wc -l)
paste *.txt | awk -vcount="$COUNT" 'function get(field){
							q=NF/count;
							for(i=field;i<=NF;i+=q){
										$field=i>field?$field "/" $i:$field;
							       		       }
                                                       } 
                   NR>1               {
					for(j=2;j<=NF/count;j++)   {
								get(j)
                                                               };
					for(j=NF/count+1;j<=NF;j++){
								$j=""
							       };
                                        sub(/[[:space:]]+$/,X,$0);
					print
 				      }
                   NR==1              {
					for(j=1;j<=NF/count;j++)   {
								printf("%s\t",$j);
                                                                   }
					print X;
				      }
		                   '  FS=OFS="\t"

Thanks,
R. Singh

Nina2910 · September 25, 2016, 8:30pm

@Don ...it worked perfectly and thank you so for sparing time for me and explaining it for me. I am so sorry could not replied on it earlier. I was looking for a function or one liner so that I can use it in my script.

---------- Post updated at 08:30 PM ---------- Previous update was at 08:06 PM ----------

@Rudi ...Thank you so much but it changes the header columns order

---------- Post updated at 08:30 PM ---------- Previous update was at 08:30 PM ----------

@Ravinder thank you however the latest code didn't work

Don_Cragun · September 25, 2016, 11:33pm

You asked Ravinder for an explanation of his code, so I assumed you would want comments on how my code worked as well.

Sorry, but I don't do one-liners; I try to write code that can be read and understood. You can convert my code to an unreadable 1-liner if you want to; but if you ever need to modify it in the future and can't figure out how to do it, don't expect me to try to help you modify my code after you have made it unreadable!

I'm sorry that my code did not meet your needs either. If you needed a function instead of a complete script, you should have explained what inputs your function would be given and what the function is supposed to return to the invoking script. I guess I don't see what a function would do for you that isn't done by the script I suggested in post #5 in this thread.

Nina2910 · September 26, 2016, 12:20am

@Don ...I think I spoke too soon actually I used your code as function and it worked fine ...Thank you so much and I understand what you saying and I understood your code as well ....Thanks one again ...do you recommend any video or any book I am new to shell scripting and want to learn awk because my new profile demands me to do lots of shell scripting. Thanks once again

RavinderSingh13 · September 26, 2016, 12:31am

Hello Nina2910,

Above script provided by me worked fine for me, you could try to run it in as follows too.

cat script.ksh
COUNT=$(ls *.txt | wc -l)
paste *.txt | awk -vcount=$COUNT 'function get(field){q=NF/count;for(i=field;i<=NF;i+=q){$field=i>field?$field "/" $i:$field;}} NR>1{for(j=2;j<=NF/count;j++){get(j)};for(j=NF/count+1;j<=NF;j++){$j=""};sub(/[[:space:]]+$/,X,$0);print} NR==1{for(j=1;j<=NF/count;j++){printf("%s\t",$j)}print X;}'

Following is the explanation of above code, please do not run the following code it is only for explanation purposes I have split it.

COUNT=$(ls *.txt | wc -l) #### Creating a variable named COUNT(in shell) and it's value will be number of .txt files. If you want to hardcode files which you showed in my code previously then you could ignore this variable and could hardcode file names with paste command along with subsituting the count variable with number of files in awk code too.
paste *.txt               #### Using paste command with all files whose extension is .txt so that it will concatenate their contents as per line numbers.
|                         #### Using pipe here to send the standard output of paste command to awk command as standard input.
awk                       #### Starting awk 
-vcount=$COUNT            #### -v option is used to define ann awk's variable. So here I am making count variable which will have values of SHELL variable named COUNT's value init. This is the way where we could set a shell's variable's value to an awk's variable too.
'function get(field)      #### starting a function here, as we all know in function we could write a logic which we need to perform several times and could save our time and could make code neat and clear, so creating a function for same. Function name is get. passing a value to it called field as by name itself it is clear we are going to pass field into it.
{q=NF/count;              #### creating a variable named q whose value is NF/count where NF is number of fields and it is an awk's in-built variable which gives number of fields into any line/record. so q will have actually number of fields for a single file. Here you have a;ready mentioned that number of fields will be equal in each Input_file so I am dividing TOTAL number of fields with TOTAL number of Input_files so that could get 1 Input_file's number of fields.
for(i=field;i<=NF;i+=q)   #### starting a for loop here whose syntax will be always the usual one for(variable initilization,condition,variable decrement/increment). Similarly we are starting a variabled named i whose value will be equal to variable field(which we are passing to function) and till the value of NF(number of fields in current line/record) it will execute this for loop.
{$field=                  #### Making values of $field where $field defines, let's say we have field variable's value as 2 then $2 defines 2nd field of current line etc.
i>field                   #### checking here condition if i's value is greater than variable field.
?                         #### ? is a well known ternary operator which defines the next statements will be executing if above condition is TRUE.
$field "/" $i             #### setting value of $field into $field "/" $i now.
:                         #### : is a well known ternary operator which defines that statements which are coming next will be executed because condition showed 2 steps above is NOT TRUE.
$field;}}                 #### keeping the value of $field as same $field.
NR>1                      #### Now coming into main section where we are checking if value of NR>1 where NR is awk's built in variable which defines the number of records in a line/record, so I am making sure we are not executing further statements while line number is one which is your header line.
{for(j=2;j<=NF/count;j++) #### starting a for loop here which wil run till variable j's value if less than and equal to value of NF/count.
{get(j)};                 #### Calling function get which we created and explained above for each field of each record/line.
for(j=NF/count+1;j<=NF;j++) #### Starig a for loop whic will run till the value of variable j is less than and equal to value of NF/count+1.
{$j=""};                  #### So I am nullifying the fields, so basically what I am doing is since we need to only fields depending on 1 Input_file so I am Nullifying extra fields now, whose value already been concatinated to needed fields already above.
sub(/[[:space:]]+$/,X,$0);#### substituting the space at last to NULL, when we wil Nullify the fields then at last space will be there so removing it completly with sub which is awk's in-built functionality whose syntax is sub(/pattern/variable/,"new value",line/record/variable).
print}                    #### Finally printing the value of newly modified fields which is requirement.
NR==1{                    #### checking here condition when NR==1 means when 1st line is there then only execute further statements.
for(j=1;j<=NF/count;j++){ #### starting a for loop till variable j's value is less than or equal to value of NF/count+1.
printf("%s\t",$j)}        #### printing the value of fields here.
print X;}'                #### printing value of a NULL value variable, basically to get a new line after headers are printed by above for loop.

Thanks,
R. Singh

Nina2910 · September 26, 2016, 8:08pm

@Ravinder somehow its not working for me however I used Don's solution and it worked.