Row bind multiple csv files having different column headers

All,
I guess by now someone has already asked this kind of question, but sorry, I was unable to find it after a deep search.
Here is my request.
I have many files, of which 2 sample files are provided below.
File-1 (with A,B as column headers)

A,B
1,2

File-2 (with C, D as column headers)

C,D
4,5

I need the below output

A,B,C,D
1,2, ,
 , ,4,5

I actually got a Perl script to concatenate the files the way I want, but it shuffles the column order randomly.

I am looking for an awk-based script (preferably).

Thanks in advance
ks_reddy

Why do some of your output fields have <space> as a replacement for data that is not present (row 2, column C; row 3, column A; and row 3, column B) and other output fields have nothing as a replacement for data that is not present (row 2, column D)?

Why isn't the desired output just:

A,B,C,D
1,2,4,5

?

Are the column headers between the two files always distinct? If not, how do common column headers affect the desired output?

Are there always just one header line and one data line in each input file? If not, how is the output supposed to be ordered? All lines from File-1 followed by all lines from File2? Alternating lines from File-1 and File-2? Does something different happen if the number of lines in the two files differs?

I'm guessing that you might have common column headings somewhere that you want to merge in, but in that case, what rule would you want to apply:

  • file1 always?
  • file2 always?
  • file2 if value is bigger/smaller?
  • value only if matched?
  • the sequence for records to be added?
  • anything else?

Is this something that could be covered with the paste command, perhaps? You might need to prefix your file2 data with ,, to make the column counts match, perhaps with sed -i '2,$ s/^/,,/' file2 or some variation.

Does that help? It still doesn't get over the question of matching columns either and might not do what you want at all. It would help if we had more test data and the various options explained.
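For what it's worth, a quick sketch of what plain paste does with the two sample files: it joins corresponding lines with the delimiter, which yields the single merged row form rather than the row-bound form.

```shell
printf 'A,B\n1,2\n' > file1
printf 'C,D\n4,5\n' > file2

# paste joins corresponding lines of the two files with a comma
paste -d, file1 file2
# A,B,C,D
# 1,2,4,5
```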

Kind regards,
Robin

I suppose you could:

{
  printf "%s,%s\n" "$(head -1 file1)" "$(head -1 file2)"
  sed '1d; s/$/,,/' file1    # drop the header, pad file1 rows on the right
  sed '1d; s/^/,,/' file2    # drop the header, pad file2 rows on the left
} > output_file

It is rather unwieldy and has no logical control options in it, but I think it fits the brief. There must be more to it than this, though?
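For the record, with the header line dropped from each sed pass (the 1d), the brace group reproduces the requested layout on the sample files:

```shell
printf 'A,B\n1,2\n' > file1
printf 'C,D\n4,5\n' > file2

{
  printf "%s,%s\n" "$(head -1 file1)" "$(head -1 file2)"
  sed '1d; s/$/,,/' file1    # file1 data rows, padded on the right
  sed '1d; s/^/,,/' file2    # file2 data rows, padded on the left
} > output_file

cat output_file
# A,B,C,D
# 1,2,,
# ,,4,5
```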

Robin

Hi Don,
Thanks for going through my request. I am very sorry for my late reply.
What I provided earlier was a simple case.
Here is the desired output for my previous query.

A,B,C,D
1,2,,
,,4,5

My actual files contain a lot of columns (100+), and some of the columns have the same names.
Here are modified sample inputs from 2 files.
In the actual case I have many files.

File-1
A,B,C
1,2,3
4,5,6

File-2

A,C,D
2,3,4
0,1,2

Desired output

A,B,C,D
1,2,3,
4,5,6,
2,,3,4
0,,1,2


Hello rbattle1,
Yes, I do sometimes have common column headers between the different files that I want to row bind.
Please check my updated post on my requirements.

Thanks
ks_reddy

The following script seems to do what you want:

#!/bin/ksh
IAm=${0##*/}
if [ $# -lt 2 ]
then	printf 'Usage: %s file1 ... fileN\n' "$IAm" >&2
	exit 1
fi
awk '
BEGIN {	FS = OFS = ","
	# Gather field names from the 1st line of each input file...
	for(i = 1; i < ARGC; i++) {
		if((getline < ARGV[i]) != 1) {
			printf("*** Cannot read header from file \"%s\".\n",
			    ARGV[i])
			continue;
		}
		close(ARGV[i])
		for(j = 1; j <= NF; j++)
			if($j in H)
				continue;
			else {	H[$j] = ++Hc
				OH[Hc] = $j
				# Add to output file header.
				printf("%s%s", HOFS, $j)
				HOFS = OFS
			}
	}
	print ""
}
FNR == 1 {
	# Clear output order list and headers from the previous file.
	for(i in O)
		delete O[i]
	for(i in FH)
		delete FH[i]
	# Determine order in which input fields should be output for this file...
	for(i = 1; i <= NF; i++)
		FH[$i] = i
	for(i = 1; i <= Hc; i++)
		O[i] = (OH[i] in FH) ? FH[OH[i]] : NF + 1
	next
}
{	# Print the data lines for the current input file.
	for(i = 1; i <= Hc; i++)
		printf("%s%s", $(O[i]), (i == Hc) ? ORS : OFS)
}' "$@"

This was written and tested using a Korn shell, but should work with any shell that supports POSIX standard shell variable expansions.

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk. (I don't remember whether or not nawk on Solaris systems includes support for the awk ARGV[] array that is required for this script to work correctly.)
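For what it's worth, the same union-of-headers idea can also be sketched as a two-pass awk program. This is a hypothetical condensed variant, not the script above; the input files are simply listed twice on the command line so that pass 1 can collect the merged header before pass 2 prints the rows:

```shell
printf 'A,B,C\n1,2,3\n4,5,6\n' > file1
printf 'A,C,D\n2,3,4\n0,1,2\n' > file2

awk -F, -v OFS=, '
FNR == 1 {	# header line of the current file
	for(k in col)
		delete col[k]
	for(i = 1; i <= NF; i++)
		col[$i] = i	# column name -> position in this file
	if(pass == 1)	# pass 1: build the union of all headers
		for(i = 1; i <= NF; i++)
			if(!($i in seen)) {
				seen[$i] = 1
				hdr[++n] = $i
			}
	next
}
pass == 2 {	# pass 2: print every data row under the merged header
	if(!done++)
		for(i = 1; i <= n; i++)
			printf("%s%s", hdr[i], (i < n) ? OFS : ORS)
	for(i = 1; i <= n; i++)
		printf("%s%s", (hdr[i] in col) ? $(col[hdr[i]]) : "", (i < n) ? OFS : ORS)
}' pass=1 file1 file2 pass=2 file1 file2 > merged.csv

cat merged.csv
# A,B,C,D
# 1,2,3,
# 4,5,6,
# 2,,3,4
# 0,,1,2
```

The pass=1 / pass=2 operands are ordinary awk command-line variable assignments; they are evaluated when awk reaches them in the argument list, which is what switches the program between the two passes.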

Hi Don,
Thanks for your code.
When I try to run it like this

awk -f script.awk input1.csv input2.csv

I am getting the error below.

awk ' BEGIN { FS =OFS=","
    ^ invalid char ''' in expression

I tested this in default bash as well as in ksh, both under Cygwin.

Any clues on debugging this ?

If you run the script from a file like that, you can't have the opening awk ' and the closing ' construct in the file; the file must contain only the awk program itself. Try it without them.
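A minimal illustration of the point, using a throwaway one-line program (file names here are just examples):

```shell
# WRONG: the file still contains the shell wrapper around the program
cat > bad.awk <<'EOF'
awk 'BEGIN { print "hi" }'
EOF
if ! awk -f bad.awk 2>/dev/null; then
	echo "bad.awk is rejected"   # the stray quote is not awk syntax
fi

# RIGHT: the file contains only the awk program text
cat > good.awk <<'EOF'
BEGIN { print "hi" }
EOF
awk -f good.awk
# hi
```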
