Merge CSVs by column header

hello gurus,

Somebody must have done this before, but I couldn't find anything. Please redirect me if this has been solved already; if not, please help.

Now to the problem: I have multiple CSV files (about 1000) that I need to concatenate by column header. The final file should have the superset of all columns of the CSVs, and the corresponding missing values should be blank.

All files are in the same directory. Note the randomness and the spaces in the filenames.

File "2012 IBM 1":

A,B,C
d1,d2,d3
d4,d5,d6

File "DELL2013 FILE_2":

A,C,D
d12,d22,d32
d42,d52,d62

File "HP File 3":

B,C,D,E
d22,d122,d132,
d32,d152,d262,142

My output should look like:

A,B,C,D,E
d1,d2,d3,,
d4,d5,d6,,
d12,,d22,d32,
d42,,d52,d62,
,d22,d122,d132,
,d32,d152,d262,142

I can do this for files with common columns in the same order, but this can't incorporate new columns into the superfile:

awk 'FNR==1 && NR!=1{next;}{print}' *.csv

You need to process the 1st lines of the files 1st to get a list of the headings and then process the entire files a 2nd time to print the fields in the correct order in a file with all of the headings. Some implementations of the awk utility have a nextfile command that skips the rest of the file; others don't (and, therefore, need to read each line of each file both times the file is read). With your sample files, the difference won't be noticed; but with large files, implementations with nextfile will run faster.

You said your sample input files were named 2012 IBM 1, DELL2013 FILE_2, and HP File 3, but your sample code processes files with names ending with .csv. The following awk script assumes your sample input filenames are really 2012 IBM 1.csv, DELL2013 FILE_2.csv, and HP File 3.csv. Note that if you redirect the output of this script and place the output in the same directory in which your input files are located, you must not create the output file with a name ending in .csv.

awk '
BEGIN {	FS = OFS = ","
}
FNR == 1 {
	if(!(FILENAME in files)) {
		# This is the 1st time we have seen this file, print extra
		# heading names to the output header line and add new fields
		# from this file to headings list...
		for(i = 1; i <= NF; i++)
			if(!($i in h)) {
				printf("%s%s", nh ? "," : "", $i)
				h[$i] = ++nh
			}
		files[FILENAME]
		if(file1 == "")
			file1 = FILENAME
		# If your version of awk does not have a "nextfile" command,
		# comment out the "nextfile" line below AND uncomment every
		# line after this line containing the string "flushfile".
		nextfile
		# flushfile = 1
	} else {
		# This is the 2nd time we have seen this file.  Set the output
		# field positions for fields in this file...
		# Clear hf[]...
		split("", hf, "")
		for(i = 1; i <= NF; i++)
			hf[h[$i]] = i
		# Get output field order for this file...
		for(i = 1; i <= nh; i++) {
			of[i] = hf[i]
		}
		# If this is the 2nd time we have seen the 1st file in the list,
		# terminate the output file heading line...
		if(file1 == FILENAME)
			print ""
		# flushfile = 0
	}
	next
}
# flushfile { next }	# Flush lines 2 to the end of the file the 1st time we
			# see this file on systems that do not have an awk
			# "nextfile" command.
{	# Print contents of lines 2 through the end of the file the 2nd time we
	# read the file.
	for(i = 1; i <= nh; i++)
		printf("%s%s", of ? $of : "", (i == nh) ? ORS : OFS)
}' *.csv *.csv

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.

This has been tested on a Mac OS X system as shown above and with the flushfile lines uncommented and the nextfile line commented. Either way it produced the output you said you were trying to get. (Note, however, that if the names of the input files were different, the order of the output fields might change.)

Note, however, that there are limits to the length of the argument lists that can be handled when invoking commands like awk. If you get an "argument list too long" error, you can preprocess the 1st lines of your *.csv files to create a file containing just the headings from the combined files. Then you can modify the script to work when you only pass awk operands naming the preprocessed file and the *.csv files once instead of twice. If that argument list is still too long, you could modify the script to read a file containing the names of the files to be processed.
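For example, here is a minimal (untested) sketch of that preprocessing step, assuming a POSIX shell. The output name headers.txt is just an illustration; as noted above, don't give it a name ending in .csv if you create it in the same directory as the input files:

# The *.csv glob is expanded by the shell's "for" loop, not passed to
# an exec'd command, so it is not subject to the argument list limit.
for f in *.csv
do	head -n 1 "$f"
done > headers.txt

And if even the list of *.csv pathnames is too long to pass to awk once, something along the following lines could read the data file names from a file instead of taking them from the argument list (the name filelist, containing one pathname per line, is also just an assumption):

awk '
{	# Each input line is the name of one data file; read that file
	# with getline so it never has to appear in the argument list.
	# Note that "getline line < file" does not split fields, so code
	# that used $1 through $NF would need split(line, fld, ",").
	while((getline line < $0) > 0) {
		n = split(line, fld, ",")
		# ...process fld[1] through fld[n] here...
	}
	close($0)
}' filelist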

Hope this helps.

Note that if you tell us what OS and shell you're using when you ask for help, we can give you more personalized answers that will work on your system.


The following solution reads the input files once, storing everything in memory, and prints the result in the END section.

awk -F, '
(FNR==1) {
  # header line
  for (i=1; i<=NF; i++) {
    if (!($i in H)) {
      # H[] is for this hash-lookup and O[] for in-order lookup
      ++nh
      H[$i]=nh
      O[nh]=$i
    }
    # column mapping for this file
    map[i]=H[$i]
  }
  next
}
{
  # store fields in F array
  ++nr
  for (i=1; i<=NF; i++) {F[nr,map[i]]=$i}
}
END {
  # print header line
  for (i=1; i<=nh; i++) {printf "%s", sep O[i]; sep=FS}
  printf RS
  # print F array
  for (n=1; n<=nr; n++) {
    # line n
    sep=""
    for (i=1; i<=nh; i++) {
      printf "%s", sep F[n,i]
      sep=FS 
    }
    printf RS
  }
}
' *.csv

Hi MadeInGermany,
Good point. With about a thousand input files (and no indication of their sizes), I thought that storing all of the file data could consume a lot of memory and chose to read the files twice to reduce the memory footprint. I should have explicitly stated that when I posted my suggestion.

Depending on file naming conventions and file sizes, reducing the size of the argument list may well be more important than reducing the memory footprint.

  • Don

Thanks a lot to both of you. This will not only help me, but several others in the future.