Find common lines between multiple files

bibb · January 8, 2013, 12:43pm

Hello everyone

A few years ago the user radoulov posted a fancy solution to a problem about finding common lines (gene variation names) between multiple samples (files). The code was:

awk 'END {
  for (R in rec) {
    n = split(rec[R], t, "/")
    if (n > 1) 
      dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s", rec[R], R) : \
        sprintf("\t%-20s -->\t%s", rec[R], R)
    }
  for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
    }  
  }
{  
  rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
  }' f10.lista f12.lista f13.lista f14.lista fs6.lista

The problem now is that I want to find the intersections of lines between 3, 4 and 5 files, but the program is only showing the results for 3 files.
I'm very new to AWK, so please help me modify this code to get my solution.
Thank you in advance.

DGPickett · January 8, 2013, 1:03pm

Sort each file with duplicates removed, merge the sorted results keeping duplicates, and count how many times each line occurs:

sort -m <( sort -u file1 ) <( sort -u file2 ) ... | uniq -c | sort -nr | pg
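A minimal sketch of the same idea on sample data, written with a plain loop instead of process substitution so it also runs in shells that lack `<( )`. The file names and contents are hypothetical, made up only for the illustration, and the result is captured in a variable instead of being paged through pg.

```shell
# Hypothetical sample files, created only for this illustration.
dir=$(mktemp -d)
printf 'rs1\nrs2\nrs3\nrs1\n' > "$dir/a.lista"
printf 'rs1\nrs2\n'           > "$dir/b.lista"
printf 'rs1\nrs4\n'           > "$dir/c.lista"

# sort -u removes repeats inside each file, so after merging, the count
# printed by uniq -c is the number of files that contain the line.
counts=$(for f in "$dir"/*.lista; do sort -u "$f"; done |
  sort | uniq -c | sort -nr)
printf '%s\n' "$counts"
rm -rf "$dir"
```

Here rs1 is counted 3 times even though it appears twice in a.lista, because each file is de-duplicated before the merge.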

bibb · January 8, 2013, 1:21pm

Thank you DGPickett for your answer, but what I need is to modify the given code so that it reports the intersections for 4, 5 or more files, rather than just 3.

Actually, I want this kind of result:

records found in 3 files:
.
.
.
.
records found in 4 files:
.
.
.
.
.
records found in 5 files:
.
.
.
records found in 'n' files:

but the program now is only showing this:

records found in 3 files:

I hope this clarifies any doubts.

rdrtx1 · January 8, 2013, 1:54pm

try:

awk '
! f[FILENAME]++ {fc++}                          # fc = number of files read
! b[$0,FILENAME] {a[$0]++; b[$0,FILENAME]=1}    # count each line once per file
END {
for (j=3; j<=fc; j++) {
   print "records found in " j " files:"
   for (i in a) {if (a[i]==j) print i}}
}
' file*
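To try the approach on small input, here is a sketch with four hypothetical files. It follows the same logic as the script above (count each line once per file, then print the lines grouped by that count), with the array index written out as `count[line]` and the lower bound passed in as a variable. The names `min`, `seen` and `count` are choices made for this example, not part of the original post.

```shell
# Hypothetical sample files, created only for this illustration.
dir=$(mktemp -d)
printf 'rs1\nrs2\nrs3\n' > "$dir/f1.lista"
printf 'rs1\nrs2\nrs3\n' > "$dir/f2.lista"
printf 'rs1\nrs2\nrs1\n' > "$dir/f3.lista"
printf 'rs1\nrs9\n'      > "$dir/f4.lista"

result=$(awk -v min=3 '
!files[FILENAME]++      { fc++ }          # fc = number of files with input
!seen[$0, FILENAME]++   { count[$0]++ }   # count each line once per file
END {
  for (j = min; j <= fc; j++) {
    print "records found in " j " files:"
    for (line in count)
      if (count[line] == j) print line
  }
}' "$dir"/*.lista)
printf '%s\n' "$result"
rm -rf "$dir"
```

With this input, rs2 is listed under 3 files and rs1 under 4 files; the repeated rs1 in f3.lista is not counted twice. Within one group, the order of lines is whatever order awk walks the array in, so pipe the output through sort if a stable order matters.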

bibb · January 8, 2013, 1:59pm

Thank you so much rdrtx1, it works as I wanted!

DGPickett · January 8, 2013, 2:13pm

If a line is in 5 files, it comes up prefixed with 5. You can add "grep -v '^ *1 ' |" before the final sort to drop the lines found in only 1 file.
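A sketch of that filter on three hypothetical files. The pattern used here allows any number of leading spaces before the count, on the assumption that the padding uniq -c prints differs between implementations.

```shell
# Hypothetical sample files, created only for this illustration.
dir=$(mktemp -d)
printf 'rs1\nrs2\n' > "$dir/a.lista"
printf 'rs1\nrs2\n' > "$dir/b.lista"
printf 'rs1\nrs3\n' > "$dir/c.lista"

# Same counting pipeline, with lines seen in only one file removed
# before the final numeric sort.
shared=$(for f in "$dir"/*.lista; do sort -u "$f"; done |
  sort | uniq -c | grep -v '^ *1 ' | sort -nr)
printf '%s\n' "$shared"
rm -rf "$dir"
```

Only rs1 (3 files) and rs2 (2 files) remain; rs3 is dropped. The trailing space in the pattern keeps counts such as 10 or 11 from being removed by mistake.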