diff 2 files > file3, but records in different order

What I really need is a script that compares two (.csv) text files line by line, each line holding a single entry, and then outputs the NON-duplicate lines to a third (.csv) text file. The problem is that the lines may be exactly the same but appear in a different order in the two files, so:

sourcefile1 contains
bob
jane
sally

sourcefile2 contains
sally
bob

output to > file3 containing
jane

I've tried using:

grep -vxf sourcefile1 sourcefile2 > file3

but I get no output. Is that because the lines are in a different order? If I do:

comm -13 sourcefile1 sourcefile2 > file3

I get the error "comm: file 1 is not in sorted order", but sorting the files first didn't seem to help. I was thinking about writing a loop, something like:

> file3                          # start with an empty output file
while read LINE
do
      found=0
      while read LINE2
      do
             if [ "$LINE2" = "$LINE" ]
             then
                   found=1
                   break
             fi
      done < sourcefile2
      # append only the names never seen in sourcefile2
      if [ "$found" -eq 0 ]
      then
            echo "$LINE" >> file3
      fi
done < sourcefile1

but I'm not sure my logic is right, and nested loops over every pair of lines seem like an inefficient way to do it. Are there better/more elegant ways to do this?

Reverse the order of the files:

grep -vxf sourcefile2 sourcefile1 > file3

Your original command printed nothing because every line of sourcefile2 also appears in sourcefile1, so there was nothing unique to report; to find jane, you need the lines of sourcefile1 that are missing from sourcefile2.

I think the best approach would be this:

grep -vxf sourcefile2 sourcefile1 > file3
grep -vxf sourcefile1 sourcefile2 >> file3

We need to do it both ways, since either file can contain unique entries.

If there's a possibility of a name containing a regular-expression metacharacter (such as a dot following an initial, for example), then to strictly match an entire line with grep you'll want to use -xF (whole-line match on fixed strings) along with whatever other options the logic of the solution demands.
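Combined with the two-way comparison above, that would look like:

grep -Fvxf sourcefile2 sourcefile1 > file3
grep -Fvxf sourcefile1 sourcefile2 >> file3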

Regards,
Alister

#!/bin/ksh

# print every name in file $1 that does not appear as a whole line in file $2
while read name
do
  # -q: quiet, -x: match the entire line, -F: treat the name as a fixed string
  if ! grep -qxF "$name" "$2"
  then
     echo "$name"
  fi
done < "$1"

./script source1 source2

What about lines in source2 that are not in source1?

Run the script above as: ./script source2 source1
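A minimal wrapper to capture both directions in file3 (assuming the script above is saved as ./script and marked executable; the file names are just the thread's examples):

#!/bin/sh
# names only in source1, then names only in source2 (hypothetical wrapper)
./script source1 source2 > file3
./script source2 source1 >> file3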

I can't see a simple solution without sorting the data. This approach needs only one sort: sort brings identical lines together, and uniq -u then suppresses every line that occurs more than once, leaving only the lines that appear in exactly one of the two files. It works (note it assumes no line is duplicated within a single file).

cat sourcefile1 sourcefile2 | sort | uniq -u > file3

Btw, and off topic: I'm sure that the UUOC advocates can find some complicated way of concatenating the files.

Certainly not a UUOC advocate, but isn't that the same as:

sort -u sourcefile1 sourcefile2 > file3

Nope. sort -u prints one instance of each duplicated line; uniq -u doesn't print duplicated lines at all (and in this case, those are exactly the unwanted records).
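A quick illustration on the combined sample data:

$ printf 'bob\njane\nsally\nsally\nbob\n' | sort -u
bob
jane
sally
$ printf 'bob\njane\nsally\nsally\nbob\n' | sort | uniq -u
jane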

I missed the -u uniq option in methyl's post. Oops, sorry :)

(at least the cat isn't necessary :D)
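That is, sort can read both files directly:

sort sourcefile1 sourcefile2 | uniq -u > file3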

Wonderful, thanks! It looks like both of these work :)

grep -vxf file1 file2 > file3
grep -vxf file2 file1 >> file3

A single grep wouldn't catch the entries that exist only in the other file. The second version looks more elegant:

cat file1 file2 | sort | uniq -u > file3

though I'm not sure which method would use less CPU/memory? I'm sorting a few million records with this, so resource usage may be an issue. Actually, I have to repeat this process for many text files in a directory, so I may want to automate looping over the files in the directory and feeding them through this process, so that at the end I get a dump of only the unique records from all the files.
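For a few million lines, the sort-based pipeline typically scales better: grep -f has to hold the whole pattern file in memory, while sort falls back to an external merge sort on temporary files when the data doesn't fit in RAM. A minimal sketch of the multi-file version, assuming all the .csv files sit in the current directory (the glob and output name here are made up for illustration); note it pools all the files together rather than comparing them pairwise:

# keep only lines that appear exactly once across every .csv file
sort ./*.csv | uniq -u > unique_records.txt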