Compare content between two files

Mrinal_Mondal · December 14, 2012, 1:01pm

I have two files in a Unix environment with similar content:

Example:
File1                                   File2

Milestone1                              Milestone1
Milestone2                              Milestone12
Milestone3                              Milestone13
Milestone4                              Milestone14
Milestone5                              Milestone5
Milestone6                              Milestone16

I need to compare these two files (File1 and File2) and write the result to a different file (say File3), reporting:

There are 2 matching names.
Names are:

Milestone1
Milestone5

Please help me with a solution. I need a Unix script to do the job. I have been stuck on this for a long time.

Scrutinizer · December 14, 2012, 1:10pm

Hi, try:

grep -xFf file1 file2

Mrinal_Mondal · December 14, 2012, 1:17pm

Hi Scrutinizer, thanks, but it does not write the output to a new file reporting how many records match and what the matching data are. I am working with a huge amount of data (nearly 900000 lines), so I need to extract both the number of matches and the matching data from the two files.

Scrutinizer · December 14, 2012, 1:25pm

You would need to redirect the output:

grep -xFf file1 file2 > file3

An alternative would be:

awk 'NR==FNR{A[$1]; next}$1 in A' file1 file2 > file3

Mrinal_Mondal · December 14, 2012, 1:36pm

Thanks.
I have run the code. It executed successfully without any error message and the new file was created, but the file contains no data.

I ran both commands. Sorry, I am really a novice in this area and keep asking what may be very basic questions.

Scrutinizer · December 14, 2012, 1:48pm

Could you post an anonymized representative sample of both files?

Mrinal_Mondal · December 14, 2012, 1:58pm

Sure. I have posted it in the details of my question.

Suppose I have two files. File1 contains names like Milestone1, Milestone2, Milestone3, Milestone4, Milestone5, Milestone6. File2 contains names like Milestone12, Milestone13, Milestone14, Milestone5, Milestone16.

I need the output written to another file as:

There are 2 matching files and these files are:
Milestone1
Milestone5

Scrutinizer · December 14, 2012, 2:07pm

If I run both scripts on your sample:

$ cat file1
Milestone1
Milestone2
Milestone3
Milestone4
Milestone5
Milestone6
$ cat file2
Milestone1
Milestone12
Milestone13
Milestone14
Milestone5
Milestone16
$ grep -xFf file1 file2
Milestone1
Milestone5
$ awk 'NR==FNR{A[$1]; next}$1 in A' file1 file2
Milestone1
Milestone5

Perhaps you could cut and paste this back into test input files, see if it works then, and spot the difference from your actual sample?

Vikram_Tanwar12 · December 14, 2012, 2:16pm

You can also try it like this.

First sort both files:

sort filename1 -o filename1

sort filename2 -o filename2

then try

comm -12 filename1 filename2
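If you would rather not overwrite the input files, here is a minimal sketch of the same comm idea that sorts into temporary copies. The sample data, the temporary directory, and the names sorted1 and sorted2 are only assumptions for illustration.

```shell
#!/bin/sh
# Sketch: lines common to two files via comm, leaving the inputs untouched.
# Sample data and temporary file names are made up for this example.
tmpdir=$(mktemp -d)
printf '%s\n' Milestone1 Milestone2 Milestone3 Milestone4 Milestone5 Milestone6 > "$tmpdir/filename1"
printf '%s\n' Milestone1 Milestone12 Milestone13 Milestone14 Milestone5 Milestone16 > "$tmpdir/filename2"

# comm expects sorted input; LC_ALL=C keeps sort and comm on the same collation.
LC_ALL=C sort "$tmpdir/filename1" > "$tmpdir/sorted1"
LC_ALL=C sort "$tmpdir/filename2" > "$tmpdir/sorted2"

# -12 suppresses columns 1 and 2, leaving only lines present in both files.
common=$(LC_ALL=C comm -12 "$tmpdir/sorted1" "$tmpdir/sorted2")
printf '%s\n' "$common"

rm -rf "$tmpdir"
```

Using sort -u instead of sort in both places would also collapse duplicate lines before the comparison.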

Mrinal_Mondal · December 15, 2012, 1:43am

Hi folks, thanks for your prompt replies.

Now I am getting the content that is repeated in both files, but this raises another problem.

If I run the commands given by Scrutinizer and Vikram, the output lists every repeated occurrence rather than unique names.

I believe I need to explain the scenario more precisely, so I am changing my file structure example a little.

Let me explain. If I do

$ cat File1
Milestone1
Milestone1
Milestone2
Milestone3
Milestone4
Milestone5
Milestone6

$ cat File2
Milestone1
Milestone12
Milestone13
Milestone14
Milestone5
Milestone1
Milestone16

Now if I run the code given by Scrutinizer and Vikram, I get this output:

File2:Milestone1
File2:Milestone1
File2:Milestone5

As I said, I am working with a huge amount of data, and the repeated lines will cause trouble for me.

What I need is output like this:

File2:Milestone1 , repeated 2 times
File2:Milestone5 , repeated 1 times

Is it possible to get results like that?

Scrutinizer · December 15, 2012, 3:35am

To get unique results you could just run everything through sort -u:

grep -xFf file1 file2 | sort -u

or

awk 'NR==FNR{A[$1]; next}$1 in A' file1 file2 | sort -u

--
Alternatively, one could make sure results only get printed once, for example:

awk 'NR==FNR{A[$1]++; next}$1 in A{delete A[$1]; print $1}' file1 file2

===

To get a count, you could use the -H option and run everything through | sort | uniq -c:

grep -HxFf file1 file2 | sort | uniq -c

or, if your grep has no -H option, give it a second input (here an empty line on standard input) so that it prints the file name:

echo | grep -xFf file1 file2 - | sort | uniq -c

--
To get exactly the result you specified, it would need to be something like this:

awk '
  NR==FNR{
    A[$1]++;
    next
  }
  $1 in A{
    T[$1]++
  } 
  END{
    for(i in T)printf "%s:%s , repeated %d times\n", FILENAME,i,T[i]
  }
' file1 file2

MadeInGermany · December 15, 2012, 12:58pm

Maybe you created file1 and file2 on Windows?
Then you need to convert them to Unix style first:

dos2unix file1 file1
dos2unix file2 file2

and now retry the scripts. They should work.
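To check whether carriage returns really are the problem, something like the following sketch may help. It assumes only one of the two files has DOS line endings; the sample files and their names (dosfile, unixfile, converted) are made up for illustration, and tr is used here as a stand-in in case dos2unix is not installed.

```shell
#!/bin/sh
# Sketch: detect DOS line endings and show their effect on exact matching.
# Sample files and names are made up for this example.
tmpdir=$(mktemp -d)
printf 'Milestone1\r\nMilestone5\r\n' > "$tmpdir/dosfile"
printf 'Milestone1\nMilestone5\n' > "$tmpdir/unixfile"

# Count the lines that contain a carriage return.
cr=$(printf '\r')
dos_lines=$(grep -c "$cr" "$tmpdir/dosfile" || true)
echo "lines with CR in dosfile: $dos_lines"

# With the CR still present, whole-line matching finds nothing.
before=$(grep -cxFf "$tmpdir/unixfile" "$tmpdir/dosfile" || true)

# Strip the carriage returns into a new file, then compare again.
tr -d '\r' < "$tmpdir/dosfile" > "$tmpdir/converted"
after=$(grep -cxFf "$tmpdir/unixfile" "$tmpdir/converted" || true)

echo "matches before conversion: $before"
echo "matches after conversion: $after"

rm -rf "$tmpdir"
```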

If you want both the matches and the number of matches, go with awk:

awk 'NR==FNR {A[$1]; next} $1 in A {count++; print} END {print count,"matching records"}' file1 file2 > file3
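For a report shaped like the one asked for in the first post (a count line followed by each matching name once), a sketch along these lines might do. The array names seen, hit and order are my own, the sample data is the later example with duplicates, and the wording of the header line is an assumption.

```shell
#!/bin/sh
# Sketch: write "count plus unique matching names" to file3.
# Sample data and the temporary directory are made up for this example.
tmpdir=$(mktemp -d)
printf '%s\n' Milestone1 Milestone1 Milestone2 Milestone3 Milestone4 Milestone5 Milestone6 > "$tmpdir/file1"
printf '%s\n' Milestone1 Milestone12 Milestone13 Milestone14 Milestone5 Milestone1 Milestone16 > "$tmpdir/file2"

awk '
  # First file: remember every name.
  NR == FNR { seen[$1]; next }
  # Second file: record a matching name the first time it appears.
  ($1 in seen) && !($1 in hit) { hit[$1]; order[++n] = $1 }
  END {
    printf "There are %d matching names.\nNames are:\n\n", n
    # Print in order of first appearance in the second file.
    for (i = 1; i <= n; i++) print order[i]
  }
' "$tmpdir/file1" "$tmpdir/file2" > "$tmpdir/file3"

report=$(cat "$tmpdir/file3")
printf '%s\n' "$report"

rm -rf "$tmpdir"
```

The order array is there because awk does not guarantee any particular order for a for (i in array) loop.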