Would it be possible to find the common lines between all of the files in one folder, just like comm -12 does, taking the files two at a time? I would like each outcome to be written to a different file, and the output file names could simply be numbers - 1, 2, 3, etc. All of the input file names contain dashes, usually between one and six of them; I hope that won't be a problem? The folder has 100 files or more; sometimes I have to work with as many as 200 files.
Sincerely grateful if anyone can help!
Of course it is possible... Why don't you start with some of the suggestions in your first thread in this forum, Find common lines with one file and with all of the files in another folder, and build on top of that to do what you're requesting here?
Please, when starting a thread in this forum, always tell us what operating system and shell you're using so we don't waste our time making suggestions that won't work in your environment. And, tell us whether or not our suggestions work for you (and if not) tell us clearly what worked and what didn't.
If you keep on just making requests and don't do any of the work yourself, it'll get really boring for us to waste time trying to help you, and you won't learn anything from our efforts. We want to help you learn how to do things like this on your own, not act as your unpaid programming staff.
Pretty sure, yes. More details would help, as always. How would you approach the problem, given the proposals in your other thread, adapted to the new problem?
What have you tried so far?
And please give us examples of the filenames.
The way you worded it, we can find only duplicates in two files. Once we move to the next two we would find a possibly distinct set of new duplicate lines.
If you actually want the set of lines that are duplicated anywhere, the logic looks like this:
find . -type f -name 'filenames_I_want*' -exec awk '{arr[$0]++} END {for (i in arr) if (arr[i] > 1) print i}' {} + > my_duplicated_lines
EDIT: oops we all answered at the same time....
We need lots of clarification to help....
I'm really sorry for the inconvenience caused...
This thread is my last question to you, if I can get it working I won't bother you again.
I'm using Windows 7 with the UNIX subsystem for Windows, and my shell is the C shell (csh).
And the filename examples are:
AC-FOUR-136-ZEL2-ZECO-111
AC-SEVEN-56-ZEL4-ZECO-68
AC-NINE-994-ZEL3-ZECO-811
AC-ONE-4-ZEL1-ZECO-544
AC-NINE-4-53-ZEL3-ZECO-811
AC-ELEVEN-66-788-ZEL4-ZECO-87
AC-TWO-32-7788-ZEL4-ZECO-95
AC-SIX-56-111-ZEL4-ZECO-87
AC-FOURTEEN-59-1561-ZEL2-ZECO-5
With this solution I got an error: Unmatched '
I also noticed that there is an unequal number of these { } signs; could that be a problem?
awk '{arr[0]++ } END {for(i in arr) { if(arr>1) {print i } } \
find . -type f -name 'filenames_I_want*' > my_duplicated_lines
I forgot to mention that all of the files are sorted
Edited code - please see correction above.
What do you want to compare?
File names or file contents?
Hi! I only need to compare file contents, to find common lines between all of the files in one folder. Could it be done with comm, since all of the files are sorted? I just need every outcome where a pair of files had common lines to go into a different file.
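For what it's worth, the pairwise comm idea could be sketched like this - a minimal proof of concept in POSIX sh (not csh, which the thread notes the OP uses); the file names and sample contents below are made up for illustration:

```shell
# Sketch: run comm -12 on every pair of sorted files and write each
# non-empty intersection to a numbered output file (1, 2, 3, ...).
mkdir -p /tmp/pairdemo && cd /tmp/pairdemo

# Three tiny sorted sample files (hypothetical names and contents)
printf '1\n3\n5\n' > AC-ONE-1
printf '3\n4\n5\n' > AC-TWO-2
printf '2\n3\n6\n' > AC-THREE-3

n=0
set -- AC-*                 # all input files; output names 1, 2, ... don't match
for a do                    # outer loop: the list of files, fixed at loop entry
    shift                   # drop $a so each pair is visited exactly once
    for b do                # inner loop: the files remaining after the shift
        out=$(comm -12 "$a" "$b")       # lines common to both sorted files
        if [ -n "$out" ]; then
            n=$((n + 1))
            printf '%s\n' "$out" > "$n" # numbered result file
        fi
    done
done
```

With 100-200 files this produces up to n*(n-1)/2 output files (4950 for 100 files), which may or may not be what is wanted - hence the later awk solutions in the thread that only create a file per pair that actually shares lines.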
A question to jim mcnamara . Where is the edited code with a correction?
Note that, based on her other thread, Eve uses csh on a Windows 7 system and is unwilling to use bash, ksh, or any other POSIX-conforming shell to run any script we might propose to help solve her problems.
Maybe something like this?
It simply displays duplicates as soon as they occur (==2).
awk '++cnt[$0]==2' * > outfile
The * matches all files in the current directory; adapt to your need. Note that the lines within each file must be strictly unique, otherwise a line repeated inside a single file would be reported as a duplicate, too.
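As a quick illustration of what the one-liner collects (hypothetical file names and contents): every line that occurs in at least two files ends up, once, in a single combined output file:

```shell
mkdir -p /tmp/dupdemo && cd /tmp/dupdemo

# Three tiny sample files (made-up contents); each file's lines are unique
printf 'a\nb\n' > f1
printf 'b\nc\n' > f2
printf 'a\nc\n' > f3

# Print a line the moment its overall count reaches 2
awk '++cnt[$0]==2' f1 f2 f3 > outfile
cat outfile     # -> b, a, c (each printed at its second occurrence)
```

Note the result is one merged file, not one file per pair - which explains the follow-up below about "missing outcomes".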
Hi!
Thank you for your help!
All of the files in my folder are unique.
With this code -
awk '++cnt[$0]==2' * > outfile
I made a test with four files, and with this code the outfile contained only three of the possible six outcomes, since 4 files taken two at a time always give six pairs. The three outcomes it did have were all correct, but three outcomes were missing.
Hmm, I doubt that.
Maybe you have some trailing spaces or even trailing ^M characters (MS-DOS line ends)?
You can strip them off with
awk '{ sub(/[[:space:]]+$/, "") } ++cnt[$0]==2' * > /tmp/outfile
Or I have misunderstood your requirement. In that case please post an example, e.g. four 10-line files, and the expected outcome.
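A two-file sanity check (made-up data) shows the stripping variant above matching a line despite a trailing CR (MS-DOS line end):

```shell
mkdir -p /tmp/crdemo && cd /tmp/crdemo

# f1 has an MS-DOS line end on "x" and a trailing blank on "y"
printf 'x\r\ny \n' > f1
printf 'x\nz\n' > f2

# Strip trailing whitespace (the [[:space:]] class includes \r) before counting
awk '{ sub(/[[:space:]]+$/, "") } ++cnt[$0]==2' f1 f2 > out
cat out         # -> x
```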
Hi!
Here are three examples of the contents of the files:
file1
2 78 99 129 665 765
3 88 99 543 876 988
7 45 54 99 120 987
13 23 167 334 2378 8765
15 17 18 1125 2356 6765
54 78 79 90 344 3399
111 233 788 999 3421 7654
223 299 388 455 477 566
file2
3 22 78 87 773 876
4 9 77 890 977 7655
7 8 23 854 1276 3343
33 122 665 888 997 999
54 78 79 90 344 3399
223 299 388 455 477 566
228 332 339 453 988 1299
file3
1 112 134 235 734 1123
5 35 84 98 1889 2300
7 8 23 854 1276 3343
15 17 18 1125 2356 6765
45 443 556 887 889 987
111 233 788 999 3421 7654
And the desired outcome would be three files, with the first file containing these two lines
54 78 79 90 344 3399
223 299 388 455 477 566
and the second file containing these two lines
15 17 18 1125 2356 6765
111 233 788 999 3421 7654
and the third file containing this line
7 8 23 854 1276 3343
The file names look like this
AC-FOUR-136-ZEL2-ZECO-111
AC-SEVEN-56-ZEL4-ZECO-68
AC-NINE-994-ZEL3-ZECO-811
AC-ONE-4-ZEL1-ZECO-544
AC-NINE-4-53-ZEL3-ZECO-811
AC-ELEVEN-66-788-ZEL4-ZECO-87
AC-TWO-32-7788-ZEL4-ZECO-95
AC-SIX-56-111-ZEL4-ZECO-87
AC-FOURTEEN-59-1561-ZEL2-ZECO-5
I have to work with 100-200 new files every week to find common lines two at a time. The files contain between 1,000 and 100,000 lines each. The examples above are of course a lot shorter.
This code unfortunately didn't help
awk '{ sub(/[[:space:]]+$/, "") } ++cnt[$0]==2' * > /tmp/outfile
I hope this post is helpful!
It's good to finally have some decent samples at hand for testing. This is a "proof of concept" for your problem and your data given, adapted from the (working!) solution to your previous problem. Not sure if it will work on the larger datasets mentioned.
awk '{CNT[$0]++; FN[$0] = FN[$0] FILENAME "-"} END {for (c in CNT) if (CNT[c]>1) {print c >> FN[c]; close (FN[c])}} ' file[123]
cf file?-*
---------- file1-file2-: ----------
54 78 79 90 344 3399
223 299 388 455 477 566
---------- file1-file3-: ----------
15 17 18 1125 2356 6765
111 233 788 999 3421 7654
---------- file2-file3-: ----------
7 8 23 854 1276 3343
I don't quite follow your sample files and the "desired" output.
Take file1, say - the desired result is:
54 78 79 90 344 3399
223 299 388 455 477 566
but 54 78 79 90 344 3399 is common ONLY to 2 files out of the sampled 3...
Please clarify or provide a better matching outcome.
Hi! Thank you for your help!
This code
awk '{CNT[$0]++; FN[$0] = FN[$0] FILENAME "-"} END {for (c in CNT) if (CNT[c]>1) {print c >> FN[c]; close (FN[c])}} ' file[123]
works really well if the filenames are file1, file2, file3, etc. But how should I use it if the filenames look like this:
AC-FOUR-136-ZEL2-ZECO-111
AC-SEVEN-56-ZEL4-ZECO-68
AC-NINE-994-ZEL3-ZECO-811
AC-ONE-4-ZEL1-ZECO-544
AC-NINE-4-53-ZEL3-ZECO-811
AC-ELEVEN-66-788-ZEL4-ZECO-87
AC-TWO-32-7788-ZEL4-ZECO-95
AC-SIX-56-111-ZEL4-ZECO-87
AC-FOURTEEN-59-1561-ZEL2-ZECO-5
Is this line a part of the code?
cf file?-*
Text lines in my files don't usually duplicate in more than two of the files; only sometimes is a text line present in three or four files.
How about awk ' ... ' AC-* ?
It is my own shell function for cat-ing files (with wildcards).
And what would the result be for those?
Thank you RudiC! Everything works great! Now everything is solved for me!
The following variant uses less memory, and it writes more efficiently in the END section (but uses more file handles). One can adapt the separator character in the BEGIN section:
awk 'BEGIN {sep=","} {FN[$0]=(FN[$0]=="" ? FILENAME : (FN[$0] sep FILENAME))} END {for (c in FN) if (index(FN[c],sep)) {print c > FN[c]}} ' AC-*
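For example, with file names and contents made up to mirror the earlier samples in the thread, this variant produces output files whose names join the source files with the chosen separator:

```shell
mkdir -p /tmp/sepdemo && cd /tmp/sepdemo

# Three small files, each with unique lines (hypothetical contents)
printf '15 17\n54 78\n223 299\n' > file1
printf '54 78\n223 299\n' > file2
printf '7 8\n15 17\n' > file3

# Collect, per line, the comma-separated list of files it occurs in; in END,
# write each line that occurs in more than one file to a file named after
# that list (e.g. "file1,file2").
awk 'BEGIN {sep=","}
     {FN[$0] = (FN[$0]=="" ? FILENAME : FN[$0] sep FILENAME)}
     END {for (c in FN) if (index(FN[c], sep)) print c > FN[c]}' file[123]

ls    # "file1,file2" and "file1,file3" now exist besides the inputs
```

Lines unique to one file (here "7 8") contain no separator in their file list and are skipped.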
Thank you MadeInGermany! Your code works really well too!