Compare 2 csv files by columns, then extract certain columns of matcing rows

Hi all, I'm pretty much a newbie to UNIX. I would appreciate any help with UNIX coding on comparing two large csv files (greater than 10 GB in size), and output a file with matching columns.

I want to compare file1 and file2 by 'id' and 'chain' columns, then extract exact matching rows' remaining columns from file2 and add them to file1's columns, and remove no match rows. Also create new entries in file1 for multiple row matches from file2

For example:

$ head file1
id,chain,offer,market,repeattrips,repeater,offerdate
86246,205,1208251,34,5,t,2013-04-24
86252,205,1197502,34,16,t,2013-03-27
12682470,18,1197502,11,0,f,2013-03-28
12996040,15,1197502,9,0,f,2013-03-25
13089312,15,1204821,9,0,f,2013-04-01
$ head file2
id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount
86246,205,7,707,1078778070,12564,2012-03-02,12,OZ,1,7.59
86246,205,63,6319,107654575,17876,2012-03-02,64,OZ,1,1.59
86246,205,97,9753,1022027929,0,2012-03-02,1,CT,1,5.99
86976,205,25,2509,107996777,31373,2012-03-02,16,OZ,1,1.99
97646,206,55,5555,107684070,32094,2012-03-02,16,OZ,2,10.38

and the desired output would be:

id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount,offer,market,repeattrips,repeater,offerdate
86246,205,7,707,1078778070,12564,2012-03-02,12,OZ,1,7.59,1208251,34,5,t,2013-04-24
86246,205,63,6319,107654575,17876,2012-03-02,64,OZ,1,1.59,1208251,34,5,t,2013-04-24
86246,205,97,9753,1022027929,0,2012-03-02,1,CT,1,5.99,1208251,34,5,t,2013-04-24

If you leave a code please explain them a little bit

Thanks

Welcome to forums.

$ cat file1
id,chain,offer,market,repeattrips,repeater,offerdate
86246,205,1208251,34,5,t,2013-04-24
86252,205,1197502,34,16,t,2013-03-27
12682470,18,1197502,11,0,f,2013-03-28
12996040,15,1197502,9,0,f,2013-03-25
13089312,15,1204821,9,0,f,2013-04-01

$ cat file2
id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount
86246,205,7,707,1078778070,12564,2012-03-02,12,OZ,1,7.59
86246,205,63,6319,107654575,17876,2012-03-02,64,OZ,1,1.59
86246,205,97,9753,1022027929,0,2012-03-02,1,CT,1,5.99
86976,205,25,2509,107996777,31373,2012-03-02,16,OZ,1,1.99
97646,206,55,5555,107684070,32094,2012-03-02,16,OZ,2,10.38
$ awk -F, 'FNR==NR{s= $1 FS $2; $1=$2="\b"; A=$0;next}(($1 FS $2) in A){print $0,A[$1 FS $2]}' OFS=',' file1 file2

Resulting

id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount,offer,market,repeattrips,repeater,offerdate
86246,205,7,707,1078778070,12564,2012-03-02,12,OZ,1,7.59,1208251,34,5,t,2013-04-24
86246,205,63,6319,107654575,17876,2012-03-02,64,OZ,1,1.59,1208251,34,5,t,2013-04-24
86246,205,97,9753,1022027929,0,2012-03-02,1,CT,1,5.99,1208251,34,5,t,2013-04-24
1 Like

Thank you Akshay. That worked.

I have another similar matching problem. I want to compare the previous outputted result file (I will call it file3) with file4 (below) by 'offer','category','brand', 'company'. Like before, I want to extract exact matching rows' remaining columns ('quantity', 'offervalue') from file4 and add them to file3 columns, and remove no match rows from file3.

$ head file4
offer,category,quantity,company,offervalue,brand
1190530,9115,1,108500080,5,93904
1194044,9909,1,107127979,1,6732
1197502,3203,1,106414464,0.75,13474
1198271,5558,1,107120272,1.5,5072
1198272,5558,1,107120272,1.5,5072
1198273,5558,1,107120272,1.5,5072
1198274,5558,1,107120272,1.5,5072
1198275,5558,1,107120272,1.5,5072
1199256,4401,1,105100050,2,13791

Thanks

There is no match except header in current example try this

awk -F, '
	FNR==NR{
		# Read file3

		# Loop through fields...
		  for(i=1;i<=NF;i++)
		   {
			if(i < 4 || ( i > 6 && i < 12 ) || i>12)
			{
				# Array A with index key offer,category,brand,company
                                key = $12 FS $4 FS $6 FS $5

				# All column except 12,4,6,5 are array element with OFS being comma
				A[key] = A[key] ? A[key] OFS $i : $i	
			}
		   }
			next
	        }
	
	# read file4, if exact match print
	(($1 FS $2 FS $6 FS $4) in A) \
		{ 
			print $0, A[$1 FS $2 FS $6 FS $4] 
		}
      ' OFS="," file3 file4
1 Like

Thanks this works with the following examples files. However, it's been 4 hours since I've started running on the actual files I have, and it still has not finished. The actual file3 I have is about 16GB, and the actual file4 is only 2KB. Is there ways to make the program faster? maybe a way to program it without the loop?

 cat file3.csv
id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount,offer,market,repeattrips,repeater,offerdate
86246,205,7,707,1078778070,12564,2012-03-02,12,OZ,1,7.59,1208251,34,5,t,2013-04-24
86246,205,63,6319,107654575,17876,2012-03-02,64,OZ,1,1.59,1208251,34,5,t,2013-04-24
86246,205,97,9753,1022027929,0,2012-03-02,1,CT,1,5.99,1208251,34,5,t,2013-04-24
86246,205,25,2509,107996777,31373,2012-03-02,16,OZ,1,1.99,1208251,34,5,t,2013-04-24
86246,205,55,5558,107684070,32094,2012-03-02,16,OZ,2,10.38,1208251,34,5,t,2013-04-24
86246,205,97,9753,1021015020,0,2012-03-02,1,CT,1,7.8,1208251,34,5,t,2013-04-24
86246,205,99,9909,107127979,6732,2012-03-02,16,OZ,1,2.49,1194044,34,5,t,2013-04-24
86246,205,59,5907,102900020,2012,2012-03-02,16,OZ,1,1.39,1208251,34,5,t,2013-04-24
86246,205,9,9909,107127979,9209,2012-03-02,4,OZ,2,1.5,1194044,34,5,t,2013-04-24
$ cat file4.csv
offer,category,quantity,company,offervalue,brand
1190530,9115,1,108500080,5,93904
1194044,9909,1,107127979,1,6732
1197502,3203,1,106414464,0.75,13474
1198271,5558,1,107120272,1.5,5072
1198272,5558,1,107120272,1.5,5072
1198273,5558,1,107120272,1.5,5072
1198274,5558,1,107120272,1.5,5072
1198275,5558,1,107120272,1.5,5072
1199256,4401,1,105100050,2,13791

and the result:

offer,category,quantity,company,offervalue,brand,id,chain,dept,date,productsize,productmeasure,purchasequantity,purchaseamount,market,repeattrips,repeater,offerdate
1194044,9909,1,107127979,1,6732,86246,205,99,2012-03-02,16,OZ,1,2.49,34,5,t,2013-04-24

Okay this will be faster, try I wasn't knowing about your file size

$ awk -F, 'FNR==NR{A[$1 FS $2 FS  $6 FS $4] = $3 OFS $5;next}(($12 FS $4 FS $6 FS $5) in A){print $0,A[$12 FS $4 FS $6 FS $5]}' OFS="," file4 file3

id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount,offer,market,repeattrips,repeater,offerdate,quantity,offervalue
86246,205,99,9909,107127979,6732,2012-03-02,16,OZ,1,2.49,1194044,34,5,t,2013-04-24,1,1

---------- Post updated at 02:59 PM ---------- Previous update was at 02:53 PM ----------

adjust {print $0, A[...]} , according to your need,

where

$0 -> full line

$1 -> column1

$2 -> column2, and so on

1 Like