Comparing Strings in 2 .csv/txt files?

chickeneaterguy · February 17, 2011, 2:56am

EDIT: My problems have been solved thanks to the help of bartus11 and pravin27

This code is just to help me learn. It serves no purpose other than that.

Here's a sample csv that I'm working with -

#listofpeeps.csv
Jackie Chan,1954,M
Chuck Norris,1930,M
Bruce Lee,1940,M

This code is used to extract strings from the above csv and output the strings meeting my specification of DOB == 1940 into another file titled 1940births.csv

awk -F, '{ if ($2 == 1940) print $1 "," $3}' listofpeeps.csv > 1940births.csv

This program will output:

#1940births.csv
Bruce Lee,M

Let's say I have another .csv file that has -

#ethnicity.csv
1,Bruce Lee,Chinese
2,Chuck Norris,Irish
3,Jackie Chan,Chinese

How would I go about comparing the "Bruce Lee" or first column in the 1940births.csv with the "Bruce Lee" or second column in the ethnicity.csv file?

bartus11 · February 17, 2011, 3:31am

What would be the output of this comparison?

chickeneaterguy · February 17, 2011, 3:33am

ethnicity or the third column of the 2nd file. I know in this situation I could just have them in the same file, but that's not the point. I am trying to learn.

pravin27 · February 17, 2011, 3:47am

try this,

awk -F"," 'NR==FNR{a[$1]++;next}a[$2]' 1940births.csv ethnicity.csv

bartus11 · February 17, 2011, 3:55am

join -t, -11 -22 -o1.1,1.2,2.3 1940births.csv ethnicity.csv

chickeneaterguy · February 17, 2011, 4:08am

Thanks.

Can both of you explain to me what these lines do? I know I can just try them out but I am new to unix and I don't know much.

bartus11 · February 17, 2011, 4:26am

join -t, -11 -22 -o1.1,1.2,2.3 1940births.csv ethnicity.csv

"join" is a standard Unix/Linux utility to join files based on common field.

-t option specifies separator of the fields (comma in this case).[]-1n selects "nth" field from first file as field to join on (in this case n=1, so first field from first file is selected).[]-2n selects "nth" field from second file as field to join on (in this case n=2, so second field from second file is selected).[]-o specifies desired output (in this case 1.1 - first field from first file, 1.2 - second field from first file and 2.3 - third field from second file).[]1940births.csv - first file, ethnicity.csv - second file

pravin27 · February 17, 2011, 4:28am

awk -F"," 'NR==FNR{a[$1]++;next}a[$2]' 1940births.csv ethnicity.csv

NR The total number of input records seen so far.
FNR The input record number in the current input file,FNR restarts the counting at the begining of each input file.

if NR==FNR then {a[$1]++;next}

# Process the first file(1940births.csv), fill the array 'a' with index as $1 of 1940births.csv

else a[$2]

' # process the second file(ethnicity.csv),if $2 of second file is present in array 'a' then print the current record for second file

1940births.csv ethnicity.csv

# input files

chickeneaterguy · February 17, 2011, 8:51pm

bartus11:

join -t, -11 -22 -o1.1,1.2,2.3 1940births.csv ethnicity.csv
"join" is a standard Unix/Linux utility to join files based on common field.

-t option specifies separator of the fields (comma in this case).[]-1n selects "nth" field from first file as field to join on (in this case n=1, so first field from first file is selected).[]-2n selects "nth" field from second file as field to join on (in this case n=2, so second field from second file is selected).[]-o specifies desired output (in this case 1.1 - first field from first file, 1.2 - second field from first file and 2.3 - third field from second file).[]1940births.csv - first file, ethnicity.csv - second file

Thanks, this was a very concise and easy-to-understand explanation. I will try using this implementation after memorizing how it works.

pravin27:

awk -F"," 'NR==FNR{a[$1]++;next}a[$2]' 1940births.csv ethnicity.csv
NR The total number of input records seen so far.
FNR The input record number in the current input file,FNR restarts the counting at the begining of each input file.
if NR==FNR then {a[$1]++;next}
# Process the first file(1940births.csv), fill the array 'a' with index as $1 of 1940births.csv
else a[$2]
' # process the second file(ethnicity.csv),if $2 of second file is present in array 'a' then print the current record for second file
1940births.csv ethnicity.csv
# input files

Thanks to you as well. You've provided me with code that is easy to understand and gets the job done. I will also try using this implementation and learning it so I can help people like me

---------- Post updated 02-17-11 at 03:51 PM ---------- Previous update was 02-16-11 at 11:32 PM ----------

@bartus11 - Is there a method that I can use for ALL lines in a file instead of just one?

bartus11 · February 18, 2011, 2:30am

Post sample input files consisting of multiple lines please.

chickeneaterguy · February 18, 2011, 2:31am

Say I have

#names.csv
Jackie Chan,1954,O+
Chuck Norris,1930,A-
Bruce Lee,1940,O+

Where the 1st column is the name, 2nd is DOB, and 3rd is blood type.

And I had a second file

#o_pos_ssn.csv
Jackie Chan,O+,123-45-6789
Bruce Lee,O+,095-34-1647
Jet Li,O+,067-127-5791

Where the 1st column is name, 2nd is blood type, and 3rd is SSN.

If I use this:

join -t, -13 -22 -o1.1,1.2,2.2,2.3 names.csv o_pos_ssn.csv > result.csv

The output will be:

#result.csv
Jackie Chan,1954,O+,123-45-6789

I want it to output ALL lines that fit the requirements and get the code below:

#result.csv
Jackie Chan,1954,O+,123-45-6789
Bruce Lee,1940,O+,095-34-1647

How do I make the join command go through all lines?

bartus11 · February 18, 2011, 2:45am

I forgot that files should be sorted. Try:

join -t, -11 -21 -o1.1,1.2,2.2,2.3 <(sort names.csv) <(sort o_pos_ssn.csv)

chickeneaterguy · February 18, 2011, 2:55am

Thanks, you're amazing!

And can you explain to me why you matched " -11 -21 " and still got it to work rather than the blood types? Confused.

bartus11 · February 18, 2011, 3:32am

Well, I just used unique key to join those files Joining on blood type would create some garbage output IMO (and is a bit of nonsense too). Out of pure curiosity:

% join -t, -13 -22 -o1.1,1.2,2.2,2.3 <(sort -t, -k3 names.csv) <(sort -t, -k2 o_pos_ssn.csv)
Bruce Lee,1940,O+,067-127-5791
Bruce Lee,1940,O+,095-34-1647
Bruce Lee,1940,O+,123-45-6789
Jackie Chan,1954,O+,067-127-5791
Jackie Chan,1954,O+,095-34-1647
Jackie Chan,1954,O+,123-45-6789