Would it be possible to find the common lines between all of the files in one folder, just like comm -12 does, taking the files two at a time? I would like each outcome to be written to a different file, and the output file names could simply be numbers - 1, 2, 3, etc. All of the input file names contain dashes, usually between one and six of them; I hope that won't be a problem? The folder has 100 files or more; sometimes I have to work with as many as 200 files.
Sincerely grateful if anyone can help!
Of course it is possible... Why don't you start with some of the suggestions in your first thread in this forum, Find common lines with one file and with all of the files in another folder, and build on top of that to do what you're requesting here?
Please, when starting a thread in this forum, always tell us what operating system and shell you're using so we don't waste our time making suggestions that won't work in your environment. And, tell us whether or not our suggestions work for you (and if not) tell us clearly what worked and what didn't.
If you keep on just making requests and don't do any of the work yourself, it'll get really boring for us to waste time trying to help you, and you won't learn anything from our efforts. We want to help you learn how to do things like this on your own, not act as your unpaid programming staff.
Pretty sure, yes. More details would help, as always. How would you approach the problem, given the proposals in your other thread, adapted to the new problem?
What have you tried so far?
And please give us examples of the filenames.
The way you worded it, we can find only duplicates in two files. Once we move to the next two we would find a possibly distinct set of new duplicate lines.
If you actually want the set of lines that are duplicated anywhere, the logic looks like this:
find . -type f -name 'filenames_I_want*' -exec awk '{arr[$0]++} END {for (i in arr) if (arr[i] > 1) print i}' {} + > my_duplicated_lines
EDIT: oops we all answered at the same time....
We need lots of clarification to help....
I'm really sorry for the inconvenience caused...
This thread is my last question to you, if I can get it working I won't bother you again.
I'm using Windows 7 with the UNIX subsystem for Windows, and my shell is the C shell (csh).
And the filename examples are:
AC-FOUR-136-ZEL2-ZECO-111
AC-SEVEN-56-ZEL4-ZECO-68
AC-NINE-994-ZEL3-ZECO-811
AC-ONE-4-ZEL1-ZECO-544
AC-NINE-4-53-ZEL3-ZECO-811
AC-ELEVEN-66-788-ZEL4-ZECO-87
AC-TWO-32-7788-ZEL4-ZECO-95
AC-SIX-56-111-ZEL4-ZECO-87
AC-FOURTEEN-59-1561-ZEL2-ZECO-5
With this solution I got an error: Unmatched '
I also noticed that there is an unequal number of these { } signs; could that be a problem?
awk '{arr[0]++ } END {for(i in arr) { if(arr>1) {print i } } \
find . -type f -name 'filenames_I_want*' > my_duplicated_lines
I forgot to mention that all of the files are sorted
Edited code - please see correction above.
What do you want to compare?
File names or file contents?
Hi! I only need to compare file contents, to find common lines between all of the files in one folder. Could it be done with comm, since all of the files are sorted? I just need every outcome where a pair of files had common lines to go into a different file.
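For what it's worth, the pairwise comm idea could be sketched like this - a minimal proof of concept in POSIX sh (not csh, which the thread notes the OP uses); the file names and sample contents below are made up for illustration:

```shell
# Sketch: run comm -12 on every pair of sorted files and write each
# non-empty intersection to a numbered output file (1, 2, 3, ...).
mkdir -p /tmp/pairdemo && cd /tmp/pairdemo

# Three tiny sorted sample files (hypothetical names and contents)
printf '1\n3\n5\n' > AC-ONE-1
printf '3\n4\n5\n' > AC-TWO-2
printf '2\n3\n6\n' > AC-THREE-3

n=0
set -- AC-*                 # all input files; output names 1, 2, ... don't match
for a do                    # outer loop: the list of files, fixed at loop entry
    shift                   # drop $a so each pair is visited exactly once
    for b do                # inner loop: the files remaining after the shift
        out=$(comm -12 "$a" "$b")       # lines common to both sorted files
        if [ -n "$out" ]; then
            n=$((n + 1))
            printf '%s\n' "$out" > "$n" # numbered result file
        fi
    done
done
```

With 100-200 files this produces up to n*(n-1)/2 output files (4950 for 100 files), which may or may not be what is wanted - hence the later awk solutions in the thread that only create a file per pair that actually shares lines.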
A question to jim mcnamara . Where is the edited code with a correction?
Note that, based on her other thread, Eve uses csh on a Windows 7 system and is unwilling to use bash, ksh, or any other POSIX-conforming shell to run any script we might propose to help solve her problems.
Maybe something like this?
It simply displays duplicates as soon as they occur (==2).
awk '++cnt[$0]==2' * > outfile
The * matches all files in the current directory; adapt to your need. Note that the lines within each file must be strictly unique, otherwise a line repeated inside a single file would be reported as a duplicate, too.
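As a quick illustration of what the one-liner collects (hypothetical file names and contents): every line that occurs in at least two files ends up, once, in a single combined output file:

```shell
mkdir -p /tmp/dupdemo && cd /tmp/dupdemo

# Three tiny sample files (made-up contents); each file's lines are unique
printf 'a\nb\n' > f1
printf 'b\nc\n' > f2
printf 'a\nc\n' > f3

# Print a line the moment its overall count reaches 2
awk '++cnt[$0]==2' f1 f2 f3 > outfile
cat outfile     # -> b, a, c (each printed at its second occurrence)
```

Note the result is one merged file, not one file per pair - which explains the follow-up below about "missing outcomes".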
Hi!
Thank you for your help!
All of the files in my folder are unique.
With this code -
awk '++cnt[$0]==2' * > outfile
I made a test with four files, and with this code the outfile contained only three of the possible six outcomes, since 4 files taken two at a time always give six pairs. The three outcomes it did have were all correct, but three outcomes were missing.
Hmm, I doubt that.
Maybe you have some trailing spaces or even trailing ^M characters (MS-DOS line ends)?
You can strip them off with
awk '{ sub(/[[:space:]]+$/, "") } ++cnt[$0]==2' * > /tmp/outfile
Or I have misunderstood your requirement. In that case please post an example, e.g. four 10-line files, and the expected outcome.
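A two-file sanity check (made-up data) shows the stripping variant above matching a line despite a trailing CR (MS-DOS line end):

```shell
mkdir -p /tmp/crdemo && cd /tmp/crdemo

# f1 has an MS-DOS line end on "x" and a trailing blank on "y"
printf 'x\r\ny \n' > f1
printf 'x\nz\n' > f2

# Strip trailing whitespace (the [[:space:]] class includes \r) before counting
awk '{ sub(/[[:space:]]+$/, "") } ++cnt[$0]==2' f1 f2 > out
cat out         # -> x
```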
Hi!
Here are three examples of the contents of the files:
file1
2 78 99 129 665 765
3 88 99 543 876 988
7 45 54 99 120 987
13 23 167 334 2378 8765
15 17 18 1125 2356 6765
54 78 79 90 344 3399
111 233 788 999 3421 7654
223 299 388 455 477 566
file2
3 22 78 87 773 876
4 9 77 890 977 7655
7 8 23 854 1276 3343
33 122 665 888 997 999
54 78 79 90 344 3399
223 299 388 455 477 566
228 332 339 453 988 1299
file3
1 112 134 235 734 1123
5 35 84 98 1889 2300
7 8 23 854 1276 3343
15 17 18 1125 2356 6765
45 443 556 887 889 987
111 233 788 999 3421 7654
And the desired outcome would be three files, with the first file containing these two lines
54 78 79 90 344 3399
223 299 388 455 477 566
and the second file containing these two lines
15 17 18 1125 2356 6765
111 233 788 999 3421 7654
and the third file containing this line
7 8 23 854 1276 3343
The file names look like this
AC-FOUR-136-ZEL2-ZECO-111
AC-SEVEN-56-ZEL4-ZECO-68
AC-NINE-994-ZEL3-ZECO-811
AC-ONE-4-ZEL1-ZECO-544
AC-NINE-4-53-ZEL3-ZECO-811
AC-ELEVEN-66-788-ZEL4-ZECO-87
AC-TWO-32-7788-ZEL4-ZECO-95
AC-SIX-56-111-ZEL4-ZECO-87
AC-FOURTEEN-59-1561-ZEL2-ZECO-5
I have to work with 100-200 new files every week to find common lines two at a time. The files contain between 1,000 and 100,000 lines each. The examples above are of course a lot shorter.
This code unfortunately didn't help
awk '{ sub(/[[:space:]]+$/, "") } ++cnt[$0]==2' * > /tmp/outfile
I hope this post is helpful!
It's good to finally have some decent samples at hand for testing. This is a "proof of concept" for your problem and your data given, adapted from the (working!) solution to your previous problem. Not sure if it will work on the larger datasets mentioned.
awk '{CNT[$0]++; FN[$0] = FN[$0] FILENAME "-"} END {for (c in CNT) if (CNT[c]>1) {print c >> FN[c]; close (FN[c])}} ' file[123]
cf file?-*
---------- file1-file2-: ----------
54 78 79 90 344 3399
223 299 388 455 477 566
---------- file1-file3-: ----------
15 17 18 1125 2356 6765
111 233 788 999 3421 7654
---------- file2-file3-: ----------
7 8 23 854 1276 3343
I don't quite follow your sample files and the "desired" output.
Take file1, say - the desired result is:
54 78 79 90 344 3399
223 299 388 455 477 566
but 54 78 79 90 344 3399 is common ONLY to 2 files out of the sampled 3...
Please clarify or provide a better matching outcome.
Hi! Thank you for your help!
This code
awk '{CNT[$0]++; FN[$0] = FN[$0] FILENAME "-"} END {for (c in CNT) if (CNT[c]>1) {print c >> FN[c]; close (FN[c])}} ' file[123]
works really well if the filenames are file1, file2, file3, etc. But how should I use it if the filenames look like this:
AC-FOUR-136-ZEL2-ZECO-111
AC-SEVEN-56-ZEL4-ZECO-68
AC-NINE-994-ZEL3-ZECO-811
AC-ONE-4-ZEL1-ZECO-544
AC-NINE-4-53-ZEL3-ZECO-811
AC-ELEVEN-66-788-ZEL4-ZECO-87
AC-TWO-32-7788-ZEL4-ZECO-95
AC-SIX-56-111-ZEL4-ZECO-87
AC-FOURTEEN-59-1561-ZEL2-ZECO-5
Is this line a part of the code?
cf file?-*
Text lines in my files don't usually duplicate in more than two of the files; only sometimes is a text line present in three or four files.
How about awk ' ... ' AC-* ?
It is my own shell function for cat-ing files (with wildcards).
And what would the result be for those?
Thank you RudiC! Everything works great! Now everything is solved for me!
The following variant uses less memory, and it writes more efficiently in the END section (but uses more file handles). One can adapt the separator character in the BEGIN section:
awk 'BEGIN {sep=","} {FN[$0]=(FN[$0]=="" ? FILENAME : (FN[$0] sep FILENAME))} END {for (c in FN) if (index(FN[c],sep)) {print c > FN[c]}} ' AC-*
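For example, with file names and contents made up to mirror the earlier samples in the thread, this variant produces output files whose names join the source files with the chosen separator:

```shell
mkdir -p /tmp/sepdemo && cd /tmp/sepdemo

# Three small files, each with unique lines (hypothetical contents)
printf '15 17\n54 78\n223 299\n' > file1
printf '54 78\n223 299\n' > file2
printf '7 8\n15 17\n' > file3

# Collect, per line, the comma-separated list of files it occurs in; in END,
# write each line that occurs in more than one file to a file named after
# that list (e.g. "file1,file2").
awk 'BEGIN {sep=","}
     {FN[$0] = (FN[$0]=="" ? FILENAME : FN[$0] sep FILENAME)}
     END {for (c in FN) if (index(FN[c], sep)) print c > FN[c]}' file[123]

ls    # "file1,file2" and "file1,file3" now exist besides the inputs
```

Lines unique to one file (here "7 8") contain no separator in their file list and are skipped.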
Thank you MadeInGermany! Your code works really well too!