Merge CSV files

I have a lot of CSV files collected by a script, like the ones below:

File names (some examples; there are thousands of them):

192.168.0.123_251_18796_1433144473.csv
192.168.0.123_251_18796_1433144772.csv
192.168.0.123_251_18796_1433145073.csv
192.168.0.123_251_18796_1433145372.csv

Content of each file:

192.168.0.123_251_18796_1433144473.csv:

"Date","Inbound","Outbound"
"2015-06-01 13:45:00","102061994.58","30970432.36"
"2015-06-01 13:50:00","101806973.26","31345909.05"
"2015-06-01 13:55:00","104203841.45","31944063.26"
"2015-06-01 14:00:00","87749793.77","31403776.19"
"2015-06-01 14:05:00","74034336.32","31371706.30"
"2015-06-01 14:10:00","58456052.80","30244690.38"
"2015-06-01 14:15:00","68841405.45","31038223.80"
"2015-06-01 14:20:00","91816301.42","31397426.60"
"2015-06-01 14:25:00","109179816.57","31436652.92"
"2015-06-01 14:30:00","113248200.09","31885767.39"
"2015-06-01 14:35:00","94289765.97","30544861.40"
"2015-06-01 14:40:00","102619582.08","31742486.29"
"2015-06-01 14:45:00","",""

192.168.0.123_251_18796_1433144772.csv:

"Date","Inbound","Outbound"
"2015-06-01 13:50:00","101806973.26","31345909.05"
"2015-06-01 13:55:00","104203841.45","31944063.26"
"2015-06-01 14:00:00","87749793.77","31403776.19"
"2015-06-01 14:05:00","74034336.32","31371706.30"
"2015-06-01 14:10:00","58456052.80","30244690.38"
"2015-06-01 14:15:00","68841405.45","31038223.80"
"2015-06-01 14:20:00","91816301.42","31397426.60"
"2015-06-01 14:25:00","109179816.57","31436652.92"
"2015-06-01 14:30:00","113248200.09","31885767.39"
"2015-06-01 14:35:00","94289765.97","30544861.40"
"2015-06-01 14:40:00","102619582.08","31742486.29"
"2015-06-01 14:45:00","84352207.67","30581944.65"
"2015-06-01 14:50:00","",""

192.168.0.123_251_18796_1433145073.csv:

"Date","Inbound","Outbound"
"2015-06-01 13:55:00","104203841.45","31944063.26"
"2015-06-01 14:00:00","87749793.77","31403776.19"
"2015-06-01 14:05:00","74034336.32","31371706.30"
"2015-06-01 14:10:00","58456052.80","30244690.38"
"2015-06-01 14:15:00","68841405.45","31038223.80"
"2015-06-01 14:20:00","91816301.42","31397426.60"
"2015-06-01 14:25:00","109179816.57","31436652.92"
"2015-06-01 14:30:00","113248200.09","31885767.39"
"2015-06-01 14:35:00","94289765.97","30544861.40"
"2015-06-01 14:40:00","102619582.08","31742486.29"
"2015-06-01 14:45:00","84352207.67","30581944.65"
"2015-06-01 14:50:00","93056971.42","30470065.37"
"2015-06-01 14:55:00","",""

192.168.0.123_251_18796_1433145372.csv:

"Date","Inbound","Outbound"
"2015-06-01 14:00:00","87749793.77","31403776.19"
"2015-06-01 14:05:00","74034336.32","31371706.30"
"2015-06-01 14:10:00","58456052.80","30244690.38"
"2015-06-01 14:15:00","68841405.45","31038223.80"
"2015-06-01 14:20:00","91816301.42","31397426.60"
"2015-06-01 14:25:00","109179816.57","31436652.92"
"2015-06-01 14:30:00","113248200.09","31885767.39"
"2015-06-01 14:35:00","94289765.97","30544861.40"
"2015-06-01 14:40:00","102619582.08","31742486.29"
"2015-06-01 14:45:00","84352207.67","30581944.65"
"2015-06-01 14:50:00","93056971.42","30470065.37"
"2015-06-01 14:55:00","78773346.56","29866197.88"
"2015-06-01 15:00:00","",""

Each file contains one hour of CSV data, and consecutive files are offset from each other by 5 minutes, based on the timestamps.

I want to merge all of the files into one big file, with duplicate rows removed and the remaining rows sorted by timestamp.
I think some Perl script could do this, but I have no idea how to write it.

Expected result:

"Date","Inbound","Outbound"
"2015-06-01 13:45:00","102061994.58","30970432.36"
"2015-06-01 13:50:00","101806973.26","31345909.05"
"2015-06-01 13:55:00","104203841.45","31944063.26"
"2015-06-01 14:00:00","87749793.77","31403776.19"
"2015-06-01 14:05:00","74034336.32","31371706.30"
"2015-06-01 14:10:00","58456052.80","30244690.38"
"2015-06-01 14:15:00","68841405.45","31038223.80"
"2015-06-01 14:20:00","91816301.42","31397426.60"
"2015-06-01 14:25:00","109179816.57","31436652.92"
"2015-06-01 14:30:00","113248200.09","31885767.39"
"2015-06-01 14:35:00","94289765.97","30544861.40"
"2015-06-01 14:40:00","102619582.08","31742486.29"
"2015-06-01 14:45:00","84352207.67","30581944.65"
"2015-06-01 14:50:00","93056971.42","30470065.37"
"2015-06-01 14:55:00","78773346.56","29866197.88"
"2015-06-01 15:00:00","",""

Regards and Thanks

---------- Post updated at 03:42 PM ---------- Previous update was at 03:17 PM ----------

Answering my own question.
I used this shell one-liner:

cat *.csv |  grep -v '"Date","Inbound","Outbound"' | sort -u -k1 | uniq | sed '/,\"\",\"\"/d'

It works.

But I'm open to a better solution from other folks.

TIA.

In cases like these, when I merge files, I throw in an extra field that records the filename each record came from, in case I later want to get rid of the separate files. So I would add a fourth field like "source" and put the filename in it. You don't have to include this extra field in your sort: you can sort on the date, inbound, and outbound fields (excluding source), using "," as the delimiter for the sort keys, as in the sketch below.
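
A minimal sketch of that idea (the "source" column name and the awk/sort combination are just one way to do it):

# Tag each data row with the file it came from, then sort on the
# first three comma-separated fields only (date, inbound, outbound),
# so duplicate rows from overlapping files collapse to one line.
awk -v OFS=',' '
FNR > 1 && $0 !~ /,"",""$/ {       # skip each header and the empty trailer rows
	print $0, "\"" FILENAME "\""   # append a fourth "source" field
}' *.csv |
sort -t ',' -u -k 1,3

Note that with sort -u on keys 1 through 3, which of the duplicate lines (differing only in their source field) survives is unspecified.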

If your grep has the -h option (but that is not standard grep), your solution could be reduced somewhat further still, to something like this:

grep -Fhve '"Date","Inbound","Outbound"' -e '","",""' *.csv | sort -u

If there are too many .csv files, you may run into the limit on the length of the argument list. Then this is something you could try:

find . -name "*.csv" -exec grep -Fhve '"Date","Inbound","Outbound"' -e '","",""' {} + | sort -u

--
Without the -h option you would need cat:

cat *.csv | grep -Fve '"Date","Inbound","Outbound"' -e '","",""' | sort -u

or

find . -name "*.csv" -exec cat {} + | grep -Fve '"Date","Inbound","Outbound"' -e '","",""' | sort -u

rk4k,
The output produced by your script and the scripts suggested by Scrutinizer do not match the output you said you wanted. The heading line and the last line that you showed in your expected result are not included in the output they produce.

I haven't tried a performance comparison, but the following uses fewer processes and pushes less data through the pipes, so it could be faster. With your sample input files, this code:

for file in *.csv
do	printf '%s\n' "$file"	# feed the list of file names to awk
done | awk '
{	file = $0		# each line of input is a file name
	while((getline < file) == 1) {
		# Print the header line once, straight to standard output
		# (it appears first because sort cannot write anything
		# until its input is complete).
		if(/Date/ && ! count[$0]++) {
			print
			continue
		}
		# Keep only the latest of the incomplete "","" trailer lines.
		if(/"",""/ && $0 > trailer) {
			trailer = $0
			continue
		}
		# Pipe each distinct data line to sort.
		if(! count[$0]++)
			print | "sort"
	}
	close(file)
}
END {	close("sort")	# flush the sorted data lines
	print trailer	# then append the final trailer line
}'

produces the output:

"Date","Inbound","Outbound"
"2015-06-01 13:45:00","102061994.58","30970432.36"
"2015-06-01 13:50:00","101806973.26","31345909.05"
"2015-06-01 13:55:00","104203841.45","31944063.26"
"2015-06-01 14:00:00","87749793.77","31403776.19"
"2015-06-01 14:05:00","74034336.32","31371706.30"
"2015-06-01 14:10:00","58456052.80","30244690.38"
"2015-06-01 14:15:00","68841405.45","31038223.80"
"2015-06-01 14:20:00","91816301.42","31397426.60"
"2015-06-01 14:25:00","109179816.57","31436652.92"
"2015-06-01 14:30:00","113248200.09","31885767.39"
"2015-06-01 14:35:00","94289765.97","30544861.40"
"2015-06-01 14:40:00","102619582.08","31742486.29"
"2015-06-01 14:45:00","84352207.67","30581944.65"
"2015-06-01 14:50:00","93056971.42","30470065.37"
"2015-06-01 14:55:00","78773346.56","29866197.88"
"2015-06-01 15:00:00","",""

which seems to be what you said you wanted.

And, even with thousands of .csv files in the current directory, it shouldn't have any problems with the E2BIG errors that you might get when expanding the argument list for cat *.csv or grep ... *.csv.
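
For reference, you can check how much room a given system allows for that argument list:

# Print the system's limit (in bytes) on the combined length of the
# argument list and environment passed to a new process:
getconf ARG_MAX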

Hello all,

I've found Don's code much faster at sorting the CSV files.
Sorry about the header and last lines in my first request; they're not mandatory, all I need is the CSV data.

Thanks for all the responses.

Much obliged.

Best Regards.

The number of processes does not differ much between the various non-find solutions, but I never realized that the amount of data through the pipes could matter that much. Thank you for that insight.

I did some tests with 10,000 input files, and doing the selection before the pipe was up to a factor of 5 faster (obviously depending on the amount of data that can be saved by selecting before the pipe).
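
For example, the two forms can be timed against each other like this (using the bash time keyword, which times a whole pipeline; file counts and results will vary):

# Selection after the pipe: every byte of every file travels through the pipe
time { cat ./*.csv | grep -Fve '"Date","Inbound","Outbound"' -e '","",""' | sort -u >/dev/null; }

# Selection before the pipe: grep reads the files directly, so only the
# surviving lines travel through the pipe to sort
time { grep -Fhve '"Date","Inbound","Outbound"' -e '","",""' ./*.csv | sort -u >/dev/null; }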

--
My system does not have tight argument-list length limitations, so I was able to use this script (since the header and footer do not matter):

awk '/[0-9]"$/ && !A[$0]++' *.csv

Which is a little bit faster still, since it doesn't even use a pipe. The pattern /[0-9]"$/ keeps only lines ending in a digit followed by a quote (the data rows, but not the headers or the empty trailer rows), and !A[$0]++ drops duplicates; because the files are read in name order, which here is also time order, the first occurrences already come out sorted by timestamp.