Comparing two CSV files

I have two csv files and im trying to compare them. e.g.

SAMPLE DATA:

file one:

ZipCode            Name
20878             Washington
10023              Missouri
20304              Maryland

file two:

ID                 Name            City                 ZipCode
11654      Arizona               Phoenix              10001
12343     Washington             DC                   20878
12343     New Hampshire      Concord             20854
54211        Maryland           Silver Spring         20304             
54877       Missouri              St. Louis             10023

SAMPLE OUTPUT:

12343     Washington             DC                   20878
54211        Maryland           Silver Spring         20304             
54877       Missouri              St. Louis             10023

These 3 rows only display because they are the rows we were looking for from file 1. Basically filtering file2 by zipcode to display all the zipcodes from file 1 but including all the other data components of file 2.

these files have like tens of thousands of rows so I wanted to retrieve all the rows in file 2 that exist with the same zip code in file 1. I thought this would be like an awk command like :

awk*-F','*'NR==FNR{c[$1]++;next};c[$4] > 0'*file2 file1 

Am I not understanding the AWK command or typing something wrong? Comparing field 1 from file 1 to field 4 from file 2.

Hi, dan139

It would be much helpful if you were to give a few more lines from file1 and file2, in that way we can see what's the real structure of the files, which it will make a difference. Also, it would be most helpful if you were to post your expected output example.

Make sure you use the the code tag for your posts, that preserves the structure of it. My quote of your first post has an example how I code tagged your entries. Make sure it looks the same.

1 Like

Added a more detailed response. let me know if you need more clarification.

basically taking all of the zips from file 1 and getting the rows for those same zips in file 2.

Give this a try:

awk 'FNR==NR {zip[$1]; next} $NF in zip' file1 file2
FNR==NR # this condition is true only for the first file
{zip[$1]; next} # record the zips from file1 only
$NF in zip # if the zip, (last column) in file2, is found in the records, display

Output:

ID                 Name            City                 ZipCode
12343     Washington             DC                   20878
54211        Maryland           Silver Spring         20304
54877       Missouri              St. Louis             10023

Thanks for the reply. So I realized that my data in file 2, it has the Zipcode in column #2, not the last the column. would the command be:

awk 'FNR==NR {zip[$1]; next} $2 in zip' file1 file2

I tried both versions and nothing was in the output so not really sure.

That's why it is important to post accurate representative information.
To answer your last question: Yes, your modification "should" work if what you are saying is factual. However, since you are not getting any output that could only mean that what you think is not accurate, neither.

Please, post the real information for file2

You were close.

  • What are the stars for? Replace by spaces.
  • Why did you specify the comma field separator when there's not a single comma in any of your files? Remove, use default.
  • Due to the spaces in city names, $4 is not always the zip code in file2. Use $NF .
  • file1 should be used to populate the array. Reverse the files.

For your new approach with the zip code in $2 , I can't see why it shouldn't work; actually it does for me.

file 1:

ID,Zip,Address,Parent,Country
9874125,43232,"493 Marietta St",21152,'United States'
4845622,85489,"434 Beach St",21542,'United States'
9874126,43234,"368 John's Creek Way",21122,'United States'
9874122,43233,"345 Cherry Place",21152,'United States'

file2

Zip, Parent
43232,21152
43234,21122

desired output

ID,Zip,Address,Parent,Country
9874125,43232,"493 Marietta St",21152,'United States'
9874126,43234,"368 John's Creek Way",21122,'United States'

This was just a sample of the data, there's thousands of more lines.

when i run:

awk 'FNR==NR {zip[$1]; next} $2 in zip' file2 file1

i don't get anything in the output...

Also these files are separated by commas because they are CSV files.

---------- Post updated at 05:41 PM ---------- Previous update was at 05:25 PM ----------

Never mind, figured it out! Just needed to add a -F ', ' delimiter! thanks. Final command:

awk -F ',' 'FNR==NR {zip[$1]; next} $2 in zip' file2 file1

If you're trying to match on both of the fields that are present in file2 , you might want to try something more like:

awk -F ',' 'FNR==NR {zip_parent[$1,$2]; next} ($2,$4) in zip_parent' file2 file1

And, note that with a <comma> character as your field separator, you MUST use -F ',' with no space between the <comma> character and the closing single-quote character. (With -F ', ' , you are specifying a field separator that is a <comma> character followed by a <space> character.)

1 Like

Hi again -

So after reexamining the output using the following command:

awk -F ',' 'FNR==NR {zip[$1]; next} $2 in zip' file2 file1

the output has less rows than file2 which is impossible. Every row in file2 exists in file1 but with added data. For example,

file1 has 1000 rows with zipcodes to search through.
file2 has 500 zipcodes we are looking for.

the output of this command:
file3 yields only 350 zipcodes when it should yield 500 zipcodes. I know for a fact every zipcode in file2 exists in file1.

Anyone know what the problem could be?

Perhaps, you can run the opposite to diagnose the problem:

awk -F ',' 'FNR==NR {zip[$1]; next} !($2 in zip)' file2 file1

That will show the lines that did not make it, from file1. After that, you can analyze what's not according to what you think.

---------- Post updated at 04:15 PM ---------- Previous update was at 04:01 PM ----------

After re-reading your post again, I think a better test would be:

 awk -F"," 'FNR==NR {zip[$2]; next} !($1 in zip)' file1 file2

That will show the zips found in file2 that it does not have a match in file1, meaning: those are the lines it will not produce a result when you run the real program.

1 Like

Please also show us what output you get from the following awk script:

awk -F, '
NR == 1 {
	next
}
!($1 in z) {
	c++
}
{	z[$1]++
}
END {	printf("%d data lines read\n", NR - 1)
	printf("%d unique zip codes read\n", c)
	for(i in z)
		if(z > 1)
			printf("zip:%s appears %d times\n", i, z)
}' file2

As has been said many times before, if you are running this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

With what you have told us so far, I would expect the 2nd line of output to be about 350 and I would expect several lines following that listing zip codes that appear on more than one line in file2 (i.e., some zip codes have more than one parent).