Remove words from file2 that don't exist in file1

noliveira · July 12, 2012, 1:27pm

Hi

I have two lists of words, file1 and file2. I want to compare both lists and remove from file2 all the words that don't exist in file1.

How can I do this?

Many thanks

jim_mcnamara · July 12, 2012, 1:52pm

awk 'FILENAME=="file1" { arr[$0]++ }
       FILENAME=="file2" { if( $0 in arr ) {print $0}; next } ' file1 file2 > tmp.tmp
# be SURE you got what you wanted before doing the mv command
mv tmp.tmp file2
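
For anyone who wants to try the idea on throwaway data first, here is a minimal sketch. It assumes one word per line in both files, as the awk above does. The helper name keep_common and the sample words are made up for illustration, and it uses the NR == FNR idiom instead of testing FILENAME, so it does not depend on the files being called file1 and file2.

```shell
# keep_common LIST TARGET: print the lines of TARGET that also appear in LIST.
# Hypothetical helper; assumes one word per line in both files.
keep_common() {
    awk 'NR == FNR { seen[$0] = 1; next }   # first file: remember every line
         $0 in seen' "$1" "$2"              # second file: print lines seen in the first
}

tmpdir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$tmpdir/file1"
printf '%s\n' banana kiwi apple mango > "$tmpdir/file2"

# Write to a new file first; only replace file2 after checking the result.
keep_common "$tmpdir/file1" "$tmpdir/file2" > "$tmpdir/file2.new"
cat "$tmpdir/file2.new"
```

The surviving words keep the order they had in file2, which the sort-based approaches do not.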

ctsgnb · July 12, 2012, 1:53pm

# cat f1
a f g h i
j k l
# cat f2
o p q r
g z x
n b i
# comm -12 <(xargs -n1 <f1 | sort) <(xargs -n1 <f2 | sort)
g
i
#

... but OK, this solution may not be the most efficient one ...

alister · July 12, 2012, 2:07pm

You're correct about it not being optimal ;). xargs will fork/exec echo once per word in each file. Not a big deal for smaller files, but it would be an expensive solution if the dataset were large.

Regards,
Alister
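
A rough way to see the per-word cost is to count how often xargs starts its command. In this sketch, sh -c 'echo run' is only a stand-in for the default echo so that each invocation prints exactly one line; the word list is the f1 sample from above.

```shell
words='a f g h i
j k l'

# With -n1, xargs starts the command once per word (8 words here).
per_word=$(printf '%s\n' "$words" | xargs -n1 sh -c 'echo run' sh | wc -l | tr -d ' ')

# Without -n1, this small input fits in a single invocation.
batched=$(printf '%s\n' "$words" | xargs sh -c 'echo run' sh | wc -l | tr -d ' ')

echo "invocations with -n1: $per_word, without: $batched"
```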

ctsgnb · July 13, 2012, 3:36am

Ok ok

... a little better with tr :

# time comm -12 <(xargs -n1 <f1 | sort) <(xargs -n1 <f2 | sort)
g
i

real    0m0.022s
user    0m0.000s
sys     0m0.050s
# time comm -12 <(tr ' ' '\n' <f1 | sort) <(tr ' ' '\n' <f2 | sort)
g
i

real    0m0.009s
user    0m0.000s
sys     0m0.010s
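
One caveat with plain tr ' ' '\n': every space becomes a newline, so a run of several spaces leaves empty lines behind, and comm would then report the empty line as a common "word" if both files have one. A small sketch of the difference, assuming the words are separated by spaces only (the sample line is made up):

```shell
line='a  g   i'   # hypothetical input with runs of spaces

# Each space turns into its own newline, so extra spaces create empty lines.
plain=$(printf '%s\n' "$line" | tr ' ' '\n' | wc -l | tr -d ' ')

# -s squeezes repeated newlines in the output down to one.
squeezed=$(printf '%s\n' "$line" | tr -s ' ' '\n' | wc -l | tr -d ' ')

echo "lines without -s: $plain, with -s: $squeezed"
```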

If we can assume the lists already consist of a single column (as Jim's code does), the tr step can be removed.

And if the lists are already sorted, the sort step can be dropped as well ...
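
In that simplest case the whole thing reduces to a bare comm call. A minimal sketch, assuming both files hold one word per line and are sorted in the same collation order that comm uses (LC_ALL=C is set here to make that explicit; the sample words are made up):

```shell
tmpdir=$(mktemp -d)
# Hypothetical sample lists: one word per line, already sorted.
printf '%s\n' a f g h i > "$tmpdir/f1"
printf '%s\n' b g i n z > "$tmpdir/f2"

# -1 suppresses lines unique to f1, -2 suppresses lines unique to f2,
# leaving only the lines common to both.
common=$(LC_ALL=C comm -12 "$tmpdir/f1" "$tmpdir/f2")
echo "$common"
rm -r "$tmpdir"
```

Note that comm prints the common words in sorted order, so this only suits the original question if the order of file2 does not need to be preserved.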