Comparing two files

todaealas · January 15, 2013, 10:42am

I am trying to compare two files and output the differences between them.

Let's take FileA.txt and FileB.txt for example:

FileA.txt
--------
Just A Fool:Christina Aguilera feat. Blake Shelton:Lotus (Deluxe Edition)
Figure 8:Ellie Goulding:Halcyon
Lovebird:Leona Lewis:Glassheart
Try:P!nk:The Truth About Love
Die Young:Ke$ha:Warrior

FileB.txt
--------
Just A Fool:Christina Aguilera feat. Blake Shelton:Lotus
Figure 8:Ellie Goulding:Halcyon
Lovebird:Leona Lewis:Glassheart (Deluxe Edition)
Try:P!nk:The Truth About Love
Die Young:Ke$ha:Warrior

I want to compare FileA.txt (before) with FileB.txt (after) and write the lines of FileB.txt that differ from FileA.txt to FileC.txt. So FileC.txt should contain the following:

Just A Fool:Christina Aguilera feat. Blake Shelton:Lotus
Lovebird:Leona Lewis:Glassheart (Deluxe Edition)

I'm using diff to check for the difference:

diff FileA.txt FileB.txt > FileC.txt

However, I got the following results:

12c12
< Just A Fool:Christina Aguilera feat. Blake Shelton:Lotus (Deluxe Edition)
---
> Just A Fool:Christina Aguilera feat. Blake Shelton:Lotus

I have not been able to find a diff option that does what I need.

Help please?

vgersh99 · January 15, 2013, 10:55am

awk 'FNR==NR{a[$0];next} !($0 in a)' FileA.txt FileB.txt > FileC.txt

Scott · January 15, 2013, 10:57am

Also, if you have sdiff, it is a bit more useful than regular diff for this:

$ sdiff File[AB].txt | sed -n "/|/ {s/.*| *//;p;}"
Just A Fool:Christina Aguilera feat. Blake Shelton:Lotus
Lovebird:Leona Lewis:Glassheart (Deluxe Edition)

(although how you extract the required text is up to you)

todaealas · January 15, 2013, 11:21am

I tried vgersh99's and Scott's solution, and both worked amazingly. Thanks!

I really need to read up more on awk/sed, because they are amazing when it comes to almost anything bash-related.

---------- Post updated at 12:21 AM ---------- Previous update was at 12:10 AM ----------

Oh yes, could vgersh99 explain the code? It's mostly for documentation.

Don_Cragun · January 15, 2013, 1:45pm

1 awk '
2 FNR==NR{
3       a[$0]
4       next
5 }
6 !($0 in a)
7 ' FileA.txt FileB.txt > FileC.txt

This is a reformatted version of vgersh99's awk script with line numbers added for reference during this discussion. The line numbers are not part of the script and must not appear when you actually run it.

Line 1 says we are using the awk utility to evaluate a script of awk commands.

Lines 2 through 6 are the awk commands that make up the script. The script is delimited by the single quotes at the end of line 1 and start of line 7.

Line 7 names the two input files ( FileA.txt and FileB.txt ) that awk will process, and specifies that the shell running this command will redirect any output written by awk ( > ) into a file named FileC.txt .

When awk runs a script, it first processes any commands that are requested to run before processing data read from input files (but there aren't any of these in this script). Then it goes into a loop that reads the next line from the input files and processes that line by running the script. This loop repeats until all lines have been read and processed for all of the input files given. Then it processes any commands that are requested to run after all input lines have been processed (but there aren't any of these in this script either).
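
To make the three phases described above concrete, here is a small sketch. The BEGIN and END blocks are not part of vgersh99's script, and the sample file name and contents are made up for illustration.

```shell
# Hypothetical sample input, created in a scratch directory.
tmp=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$tmp/sample.txt"

out=$(awk '
BEGIN { print "start" }            # runs once, before any input is read
      { print "line: " $0 }        # runs once for every input line
END   { print "lines read: " NR }  # runs once, after the last input line
' "$tmp/sample.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

This should print "start", then one "line: ..." entry per input line, then "lines read: 3".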

In the awk script there are commands of the form:

         condition{action}

When condition evaluates to a non-zero value or to a non-empty string (depending on context), the condition is TRUE and the commands in {action} will be performed. (If condition is not present, {action} will be performed for every input line read.) If condition is present but {action} is not present, the default action is to print the current line. (Note that the contents of the current line may have been changed by statements in the script, so the current line might not be the line that was read.)
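
The three cases in the paragraph above can be tried out with a sketch like the following. The file name and its contents are assumptions made up for the example; none of these one-liners come from the script being discussed.

```shell
# Hypothetical sample input, created in a scratch directory.
tmp=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmp/fruit.txt"

# Condition only: the default action prints each line longer than 5 characters.
cond_only=$(awk 'length($0) > 5' "$tmp/fruit.txt")

# Action only: the action runs for every input line.
action_only=$(awk '{ print NR ": " $0 }' "$tmp/fruit.txt")

# The first rule rewrites the current line, so the default print of the
# second rule shows the changed text rather than what was read.
changed=$(awk '{ $0 = toupper($0) } NR == 1' "$tmp/fruit.txt")

printf '%s\n---\n%s\n---\n%s\n' "$cond_only" "$action_only" "$changed"
rm -rf "$tmp"
```

The last command should print APPLE even though the file contains apple, which is the situation the note in parentheses warns about.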

The condition on line 2 tests whether the number of lines read from the current input file ( FNR ) is equal to ( == ) the number of lines read from all input files ( NR ). The two counts can only match while awk is still reading the 1st input file, so this is a common idiom in awk saying "Execute this action for lines read from the 1st input file."
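
If you want to watch the two counters move, a sketch along these lines prints them for every line. The two file names and their contents are made up for the example.

```shell
# Hypothetical sample inputs, created in a scratch directory.
tmp=$(mktemp -d)
printf 'a1\na2\n' > "$tmp/first.txt"
printf 'b1\nb2\nb3\n' > "$tmp/second.txt"

counters=$(awk '{
    # FNR restarts at 1 for each file; NR keeps counting across files.
    where = (FNR == NR) ? "first file" : "later file"
    print $0, "FNR=" FNR, "NR=" NR, where
}' "$tmp/first.txt" "$tmp/second.txt")

printf '%s\n' "$counters"
rm -rf "$tmp"
```

One thing to keep in mind: if the 1st file is empty, FNR and NR are also equal while the 2nd file is being read, so the idiom then treats the 2nd file as if it were the 1st.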

The command on line 3 creates an element in array a indexed by the contents of the current line ( $0 ). That element ( a["contents of current line"] ) is not assigned any value; simply referencing it is enough to create it in the array.
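
Here is a small sketch of that behavior, run entirely in a BEGIN block so it needs no input files. The index "x" and the variable name state are arbitrary choices for the example.

```shell
result=$(awk 'BEGIN {
    # Testing with "in" does not create the element.
    state = ("x" in a) ? "present" : "absent"
    print "before: " state

    a["x"]    # a bare reference is enough to create the element

    state = ("x" in a) ? "present" : "absent"
    print "after: " state

    # The element exists, but its value is empty.
    print "length of stored value: " length(a["x"])
}')

printf '%s\n' "$result"
```

This is why the script only needs the array indexes: the test on line 6 asks whether an index exists, not what value is stored under it.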

The command on Line 4 says stop processing this line and restart the script for the next input line.

Line 5 marks the end of the commands in the action associated with the condition on Line 2.

The condition on Line 6 evaluates to TRUE if there is not ( ! ) an element in the array a indexed by the contents of the current line ( ($0 in a) ). Since there is no {action} for this condition, if this condition evaluates to TRUE the current line will be printed.

So, if a line in the 2nd file did not also appear in the first file, print the line.

Note, however, that this will not report any differences if the same lines appear in both files, but are in a different order. It also will not notice if identical lines appear a different number of times in the two files.
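
The sketch below shows both limitations on made-up input. It also includes a count-aware variant; that variant is my own assumption about one way to handle repeated lines, not something posted earlier in this thread, and it still ignores line order.

```shell
# Hypothetical sample inputs: same lines, different order, one extra "x".
tmp=$(mktemp -d)
printf 'x\ny\n' > "$tmp/old.txt"
printf 'y\nx\nx\n' > "$tmp/new.txt"

# Original script: the reordering and the extra "x" both go unreported.
plain=$(awk 'FNR==NR{a[$0];next} !($0 in a)' "$tmp/old.txt" "$tmp/new.txt")

# Count-aware variant: count each line of the 1st file, then print a line
# of the 2nd file only once its count from the 1st file is used up.
counted=$(awk 'FNR==NR{c[$0]++;next} c[$0]-- <= 0' "$tmp/old.txt" "$tmp/new.txt")

printf 'plain:   [%s]\ncounted: [%s]\n' "$plain" "$counted"
rm -rf "$tmp"
```

With this input the original script prints nothing, while the variant prints the one surplus x.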