comm -12 based on 1 column

tiggyboo · November 30, 2010, 5:14pm

I'd like to eliminate the rows in two files that do not share a common value in the first column. Here's my tortured logic, which is far too inefficient to consider but might show what I'm trying to do (assume the files have been sorted):

cut -f1 -d '|' file1 > file1.dat
cut -f1 -d '|' file2 > file2.dat
 
comm -12 file1.dat file2.dat > same.dat
 
grep -f same.dat file1.dat > file1_finished.dat
grep -f same.dat file2.dat > file2_finished.dat

Any thoughts on how to do this more efficiently? Thanks in advance!
Al

ctsgnb · November 30, 2010, 6:15pm

man join
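
For reference, here is a minimal sketch of how join could be applied to this problem. It assumes both files are sorted on the first field under the same locale; the sample rows and the temporary key files (keys1, keys2) are made up for illustration. Joining each file against the other file's key list prints only the rows whose key appears in both.

```shell
#!/bin/sh
# Use one collation order for both sort and join.
LC_ALL=C; export LC_ALL

# Hypothetical sample data: '|'-delimited, key in field 1, already sorted.
tmp=$(mktemp -d)
printf '%s\n' 'a|1' 'b|2' 'c|3' > "$tmp/file1"
printf '%s\n' 'b|x' 'c|y' 'd|z' > "$tmp/file2"

# Unique, sorted key lists for each file.
cut -d '|' -f1 "$tmp/file1" | sort -u > "$tmp/keys1"
cut -d '|' -f1 "$tmp/file2" | sort -u > "$tmp/keys2"

# Joining a file against the other file's keys keeps only the matching rows.
# The key file contributes no extra fields, so each row is printed unchanged.
join -t '|' "$tmp/file1" "$tmp/keys2" > "$tmp/file1_finished.dat"
join -t '|' "$tmp/file2" "$tmp/keys1" > "$tmp/file2_finished.dat"
```

With this sample data, file1_finished.dat should hold the b and c rows of file1, and file2_finished.dat the b and c rows of file2.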

By the way, even if it is not exactly the same problem, you may find some inspiration from:

rdcwayx · November 30, 2010, 6:23pm

awk -F \| 'NR==FNR{a[$1]++;next} a[$1]' file2 file1 > file1_finished.dat


awk -F \| 'NR==FNR{a[$1]++;next} a[$1]' file1 file2 > file2_finished.dat
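
The two one-liners above can be read as a two-pass idiom: while awk reads the first file named on the command line, NR==FNR is true and the keys are stored; for the second file, only rows whose key was stored are printed. A commented sketch follows, where the sample data and the helper name keep_common are hypothetical. Note that if the first file is empty, NR==FNR stays true for the second file as well, so nothing is printed.

```shell
#!/bin/sh
# Hypothetical sample data: '|'-delimited, key in field 1.
work=$(mktemp -d)
printf '%s\n' 'a|1' 'b|2' 'c|3' > "$work/file1"
printf '%s\n' 'b|x' 'c|y' 'd|z' > "$work/file2"

# keep_common KEYFILE DATAFILE
# Prints the rows of DATAFILE whose first field also occurs in KEYFILE.
keep_common() {
    awk -F'|' '
        NR == FNR { seen[$1] = 1; next }  # first file: remember each key
        $1 in seen                        # second file: print rows with a known key
    ' "$1" "$2"
}

keep_common "$work/file2" "$work/file1" > "$work/file1_finished.dat"
keep_common "$work/file1" "$work/file2" > "$work/file2_finished.dat"
```

Unlike join or comm, this does not need sorted input, but it holds every key of the first file in memory.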

ctsgnb · November 30, 2010, 7:21pm

awk -F'|' '{print"^"$1FS}' f1 f2 | sort | uniq -d | grep -f - f1 >f1.done
awk -F'|' '{print"^"$1FS}' f1 f2 | sort | uniq -d | grep -f - f2 >f2.done

tiggyboo · December 3, 2010, 9:31am

Thanks folks, what I eventually ended up with was:

awk -F'|' 'NR==FNR{++a[$1];next} $1 in a' file1 file2 > first.dat
awk -F'|' 'NR==FNR{++a[$1];next} $1 in a' file2 file1 > second.dat
 
comm -13 second.dat first.dat > final.dat

I should add that the various options involving grep -f were too time-consuming given the size of the files, which I should have mentioned at the outset.

Thanks again.