Comparing Strings in 2 .csv/txt files?

EDIT: My problems have been solved thanks to the help of bartus11 and pravin27

This code is just to help me learn. It serves no purpose other than that.

Here's a sample csv that I'm working with -

#listofpeeps.csv
Jackie Chan,1954,M
Chuck Norris,1930,M
Bruce Lee,1940,M

This code is used to extract strings from the above csv and output the strings meeting my specification of DOB == 1940 into another file titled 1940births.csv

awk -F, '{ if ($2 == 1940) print $1 "," $3}' listofpeeps.csv > 1940births.csv

This program will output:

#1940births.csv
Bruce Lee,M

Let's say I have another .csv file that has -

#ethnicity.csv
1,Bruce Lee,Chinese
2,Chuck Norris,Irish
3,Jackie Chan,Chinese

How would I go about comparing the "Bruce Lee" or first column in the 1940births.csv with the "Bruce Lee" or second column in the ethnicity.csv file?

What would be the output of this comparison?

ethnicity or the third column of the 2nd file. I know in this situation I could just have them in the same file, but that's not the point. I am trying to learn.

try this,

awk -F"," 'NR==FNR{a[$1]++;next}a[$2]' 1940births.csv ethnicity.csv
1 Like
join -t, -11 -22 -o1.1,1.2,2.3 1940births.csv ethnicity.csv
1 Like

Thanks.

Can both of you explain to me what these lines do? I know I can just try them out but I am new to unix and I don't know much.

join -t, -11 -22 -o1.1,1.2,2.3 1940births.csv ethnicity.csv

"join" is a standard Unix/Linux utility to join files based on common field.

  • -t option specifies separator of the fields (comma in this case).[]-1n selects "nth" field from first file as field to join on (in this case n=1, so first field from first file is selected).[]-2n selects "nth" field from second file as field to join on (in this case n=2, so second field from second file is selected).[]-o specifies desired output (in this case 1.1 - first field from first file, 1.2 - second field from first file and 2.3 - third field from second file).[]1940births.csv - first file, ethnicity.csv - second file
1 Like
awk -F"," 'NR==FNR{a[$1]++;next}a[$2]' 1940births.csv ethnicity.csv

NR The total number of input records seen so far.
FNR The input record number in the current input file,FNR restarts the counting at the begining of each input file.

if NR==FNR then {a[$1]++;next}

# Process the first file(1940births.csv), fill the array 'a' with index as $1 of 1940births.csv

else a[$2]

' # process the second file(ethnicity.csv),if $2 of second file is present in array 'a' then print the current record for second file

1940births.csv ethnicity.csv

# input files

1 Like

Thanks, this was a very concise and easy-to-understand explanation. I will try using this implementation after memorizing how it works.

Thanks to you as well. You've provided me with code that is easy to understand and gets the job done. I will also try using this implementation and learning it so I can help people like me :wink:

---------- Post updated 02-17-11 at 03:51 PM ---------- Previous update was 02-16-11 at 11:32 PM ----------

@bartus11 - Is there a method that I can use for ALL lines in a file instead of just one?

Post sample input files consisting of multiple lines please.

Say I have

#names.csv
Jackie Chan,1954,O+
Chuck Norris,1930,A-
Bruce Lee,1940,O+

Where the 1st column is the name, 2nd is DOB, and 3rd is blood type.

And I had a second file

#o_pos_ssn.csv
Jackie Chan,O+,123-45-6789
Bruce Lee,O+,095-34-1647
Jet Li,O+,067-127-5791

Where the 1st column is name, 2nd is blood type, and 3rd is SSN.

If I use this:

join -t, -13 -22 -o1.1,1.2,2.2,2.3 names.csv o_pos_ssn.csv > result.csv

The output will be:

#result.csv
Jackie Chan,1954,O+,123-45-6789

I want it to output ALL lines that fit the requirements and get the code below:

#result.csv
Jackie Chan,1954,O+,123-45-6789
Bruce Lee,1940,O+,095-34-1647

How do I make the join command go through all lines?

I forgot that files should be sorted. Try:

join -t, -11 -21 -o1.1,1.2,2.2,2.3 <(sort names.csv) <(sort o_pos_ssn.csv)
1 Like

Thanks, you're amazing!

And can you explain to me why you matched " -11 -21 " and still got it to work rather than the blood types? Confused.

Well, I just used unique key to join those files :slight_smile: Joining on blood type would create some garbage output IMO (and is a bit of nonsense too). Out of pure curiosity:

% join -t, -13 -22 -o1.1,1.2,2.2,2.3 <(sort -t, -k3 names.csv) <(sort -t, -k2 o_pos_ssn.csv)
Bruce Lee,1940,O+,067-127-5791
Bruce Lee,1940,O+,095-34-1647
Bruce Lee,1940,O+,123-45-6789
Jackie Chan,1954,O+,067-127-5791
Jackie Chan,1954,O+,095-34-1647
Jackie Chan,1954,O+,123-45-6789
1 Like