Compare three files based on two fields

Guys,

I tried searching on the internet and I couldn't get the answer for this problem. I have 3 files. First 2 fields of all of them are of same type, say they come from various databases but first two fields in the 3 files means the same.

I need to verify the entries that are not present in all the 3 files.

Below are the files 1.txt, 2.txt & 3.txt respectively

2677|47876992|1|20:20:51|12/16/2012|1|1|496300|
2677|47877024|2|13:47:37|12/16/2012|1|1|008994|
2677|47877052|3|21:03:56|12/16/2012|1|1|647546|
2677|47877055|4|16:54:07|12/16/2012|1|1|133914|
2677|47877099|5|16:29:06|12/16/2012|1|1|379245|
2677|47877081|6|10:44:13|12/16/2012|1|1|014078|
2677|47877232|7|19:07:06|12/16/2012|1|1|242776|
2677|47877246|8|13:02:32|12/16/2012|1|1|623853|
2677|47877258|9|22:03:05|12/16/2012|1|1|997345|
2677|47877351|10|16:29:27|12/16/2012|1|1|792584|
 
2677|47876992|1|1|40|
2677|47877024|32|1|100|
2677|47877052|2|1|39|
2677|47877055|1|1|75|
2677|47877074|1|1|9|
2677|47877081|2|1|175|
2677|47877232|1|1|10|
2677|47877246|9|1|25|
2677|47877258|25|1|40|
2677|47877350|9|1|50|

2677|47876992|1|7000|603098|40|0|
2677|47877024|1|7000|603086|100|0|
2677|47877052|1|1700|200180|39|0|
2677|47877055|1|7000|603098|75|0|
2677|47877074|1|1700|003400|9|0|
2677|47877081|1|7000|603062|25|0|
2677|47877081|2|7000|603065|50|0|
2677|47877081|3|7000|603074|100|0|
2677|47877232|1|7000|601802|10|0|
2677|47877246|1|7000|252120|25|0|

The output should be the lines that are not in all the three files. that is like below.

2677|47877099
2677|47877258
2677|47877351
2677|47877350

It would be geat if the output has the filenames as well. :slight_smile: :slight_smile: like below

2677|47877099|1.txt|
2677|47877258|1.txt,2.txt|
2677|47877351|1.txt|
2677|47877350|2.txt|

I'm still new to unix, but I'm pretty sure you need to AWK your results. I'll try to find an example, but if your not new to unix. I'm sure you will find it quicker than me. Much of luck, hope I was of some assistance.

$ awk -F "|" 'f==1{A[$1,$2]++;B[$1,$2]=$1 FS $2 FS FILENAME}
f==2{A[$1,$2]++;B[$1,$2]=B[$1,$2]?B[$1,$2]","FILENAME:$1 FS $2 FS FILENAME}
f==3{A[$1,$2]++;B[$1,$2]=B[$1,$2]?B[$1,$2]","FILENAME:$1 FS $2 FS FILENAME}END{
for (i in A){if(A<3){print B"|"}}}
' f=1 file1 f=2 file2 f=3 file3

2677|47877074|file2,file3|
2677|47877258|file1,file2|
2677|47877099|file1|
2677|47877350|file2|
2677|47877351|file1|

EDIT:

Minimized it's length lit bit..:smiley: using function..:slight_smile:

awk -F "|" 'function define_arr() {
A[$1,$2]++;
B[$1,$2]=B[$1,$2]?B[$1,$2]","FILENAME:$1 FS $2 FS FILENAME
}
f==1{define_arr()}
f==2{define_arr()}
f==3{define_arr()}
END{for (i in A){if(A<3){print B"|"}}}' f=1 file1 f=2 file2 f=3 file3
1 Like

pamu's suggestion could be shortened a bit further without a function:

awk -F\| '
{
  i=$1 FS $2
  A++
  B=(B?B",":i FS) FILENAME
}
END{
  for (i in A) if(A<3) print B FS
}' file*

But this would not count the occurrence of entries that occur multiple times in one file but not in other files. To counteract that, one could do something like this:

awk -F\| '
{
  i=$1 FS $2
}
!D[i,FILENAME]++{
  A++
  B=(B?B",":i FS) FILENAME
}
END{
  for (i in A) if(A<3) print B FS
}' file*
(awk '{print $0""FILENAME}' a && awk '{print $0""FILENAME}' b && awk '{print $0""FILENAME}' c) | awk -F"[|]" '{
cnt[$1" "$2]++
file[$1" "$2]=sprintf("%s%s",file[$1" "$2],(file[$1" "$2])?","$NF:$NF)
}
END{
for (i in cnt){
  if(cnt < 3){
    print i" "file	
  }
}
}'