Removing files with the same text but different file names

Hi All,

I have about 50,000 HTML files in a directory. The problem is that some of them are duplicates: wget crawled the same pages more than once and named each extra copy by appending .1, .2, .3 and so on. For example, if index.html was crawled several times, the copies were saved as index.html.1, index.html.2, etc. All of these index.html files contain the same text.

I browsed through some posts here and found this:

I then tried the above script on 3 similar "test" text files, each containing one word (this was just to test the code given there). It worked for the 3 text files: it reported two of them and left one out. I can then process the output file (duplicate.files) to get the file names and delete the duplicates, i.e. the files with the same text.

But when I run the same code on my HTML directory, it does not report any files with the same text, even though such duplicates do exist; I have checked this manually.

I am not sure where the problem is. I am using Linux with Bash.

Finddup - Find duplicate files by content, name

./finddup
Displays the files in the current directory that have identical content.
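
If finddup is not at hand, the same idea can be sketched in plain Bash. This is only a minimal sketch, not finddup's actual implementation: the function name list_duplicates is made up for illustration, and it assumes GNU find, sha256sum, sort and awk as found on a typical Linux system. It hashes every regular file in one directory (no recursion) and prints each file whose content matches a file that sorts earlier by name, so index.html is kept and index.html.1, index.html.2 are listed.

```bash
#!/usr/bin/env bash

# list_duplicates DIR (hypothetical helper name)
# Prints every regular file in DIR whose content is byte-identical to a
# file that sorts earlier by name. The earliest file of each group is
# treated as the original and is not printed.
# Assumption: file names contain no newlines or backslashes, because
# sha256sum escapes such names and the fixed-offset parsing would break.
list_duplicates() {
    local dir="${1:-.}"
    # One "<64 hex digits><2 separator chars><path>" line per file.
    find "$dir" -maxdepth 1 -type f -exec sha256sum {} + \
        | LC_ALL=C sort -k2 \
        | awk 'seen[$1]++ { print substr($0, 67) }'
    # sort -k2   : order by path so "index.html" comes before "index.html.1"
    # seen[$1]++ : false for the first file with a given hash, true afterwards
    # substr 67  : the path starts after the hash and the two separator chars
}
```

Running list_duplicates . > duplicate.files writes the candidates to a file that can be reviewed before anything is deleted. Note that this only finds byte-identical copies: if two crawled copies differ by even one byte, they hash differently and are not reported, which can be checked for a suspect pair with cmp index.html index.html.1.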