If string matches within 2 files, delete one file.

sitney · August 31, 2009, 5:40am

I have a directory with a large number of files. I want to compare a string in each file with the string in the next n file(s). If a string in one file matches a string in any of the next n file(s), the later duplicate file(s) should be deleted. Here is sample input:

blake [ ~/scratch ]$ ls -l ???.txt
-rw-r--r--  1 blake  staff  7 Aug 31 16:23 aaa.txt
-rw-r--r--  1 blake  staff  7 Aug 31 16:23 bbb.txt
-rw-r--r--  1 blake  staff  7 Aug 31 16:23 ccc.txt
-rw-r--r--  1 blake  staff  7 Aug 31 16:23 ddd.txt
-rw-r--r--  1 blake  staff  7 Aug 31 16:24 eee.txt
-rw-r--r--  1 blake  staff  7 Aug 31 16:24 fff.txt
-rw-r--r--  1 blake  staff  7 Aug 31 16:24 ggg.txt

blake [ ~/scratch ]$ cat ???.txt
aabbcc
aabbcc
aabbcc
abcabc
abcabc
abc123
aabbcc

The desired output is as follows (assuming that I set n to look at 7 files or more):

blake [ ~/scratch ]$ ls -l ???.txt
-rw-r--r--  1 blake  staff  7 Aug 31 16:23 aaa.txt
-rw-r--r--  1 blake  staff  7 Aug 31 16:23 ddd.txt
-rw-r--r--  1 blake  staff  7 Aug 31 16:24 fff.txt

Many thanks.

methyl · August 31, 2009, 6:12pm

What constitutes "subsequent"? Is it the next file in the order given by "ls -l"? Is a "subsequent" list terminated by a file that does not contain the match characters, or can it extend to any file containing the match characters?

Usage of the word "subsequent" implies a sequence.

FlyingSquirrel · September 1, 2009, 12:53am

Sitney,

Here is something to try...
Assumptions are:

The first line in each file contains the comparison string.
Only one instance of a specific string is allowed, regardless of the number of files (that is, n is effectively unlimited).

Test files are:

# ls -l ???.txt
-rw-r--r-- 1 root root 7 2009-08-31 22:54 aaa.txt
-rw-r--r-- 1 root root 7 2009-08-31 22:54 bbb.txt
-rw-r--r-- 1 root root 7 2009-08-31 22:54 ccc.txt
-rw-r--r-- 1 root root 7 2009-08-31 22:55 ddd.txt
-rw-r--r-- 1 root root 7 2009-08-31 22:55 eee.txt
-rw-r--r-- 1 root root 7 2009-08-31 22:55 fff.txt
-rw-r--r-- 1 root root 7 2009-08-31 22:56 ggg.txt

Contents of files:

# cat ???.txt
aabbcc
aabbcc
aabbcc
abcabc
abcabc
abc123
aabbcc

Script to run:

for i in ???.txt
do
   # first line of the file is the comparison string
   c=$(head -n 1 "$i")
   printf '%s|%s\n' "$c" "$i"
done | perl -e 'my %s; while (<>) { chomp; my ($st, $fn) = split(/\|/); if (!defined($s{$st})) { $s{$st} = $fn; print "$fn\n"; } }' | xargs ls -l

Description:
For each file, print its first line, followed by a pipe symbol, followed by the filename.
The output of the for loop is piped into the perl script on standard input.
The perl script splits each line on the pipe symbol and checks whether the string is already a key in the hash. If it is not, it stores the filename in the hash under that key and prints the filename.
The printed filenames are piped to xargs, which passes them to the "ls -l" command.
Note that this only lists the files to keep; it does not delete anything.

Output is:

-rw-r--r-- 1 root root 7 2009-08-31 22:54 aaa.txt
-rw-r--r-- 1 root root 7 2009-08-31 22:55 ddd.txt
-rw-r--r-- 1 root root 7 2009-08-31 22:55 fff.txt
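
The script above only lists the files to keep. If the goal is to find the duplicates themselves, and to honour the "next n files" limit from the original question, something along these lines might work. It is only a sketch under a few assumptions: list_dups is a made-up helper name, the comparison string is the first line of each file, a window of 0 is taken to mean "no limit", and empty files are skipped and do not count as a position.

```shell
# list_dups N FILE...
# Hypothetical helper: prints each FILE whose first line equals the first
# line of an earlier kept file that is at most N positions before it.
# N=0 is assumed to mean "no limit on the look-back distance".
list_dups() {
    n=$1
    shift
    awk -v n="$n" '
        FNR == 1 {
            idx++                       # position among non-empty files
            if (($0 in kept) && (n == 0 || idx - kept[$0] <= n))
                print FILENAME          # duplicate of a kept file
            else
                kept[$0] = idx          # first (or out-of-window) occurrence
        }
    ' "$@"
}

# Dry run, showing what would be removed:
#   list_dups 7 ???.txt
# To actually delete (assumes file names contain no newlines):
#   list_dups 7 ???.txt | while IFS= read -r f; do rm -- "$f"; done
```

With the seven test files and n set to 7, this should print bbb.txt, ccc.txt, eee.txt and ggg.txt, leaving aaa.txt, ddd.txt and fff.txt in place. Running the dry-run form first is a good idea before piping anything into rm.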