List duplicated files into a file

bulpak · November 19, 2015, 12:00pm

Dear All,

I have many files in a directory, named in a format similar to the one below (the groups are separated by blank lines only to make them easier to tell apart). I want to find duplicate filenames and write them into a new file, one per line. I tried several scripts, but without success.

Do you have any suggestions?

2222.00.AAA.AHE.DAT
2222.00.AAA.AHN.DAT
2222.00.AAA.AHZ.DAT

2222.01.BBB.AHE.DAT
2222.02.BBB.AHE.DAT
2222.03.BBB.AHE.DAT
2222.01.BBB.AHN.DAT
2222.02.BBB.AHN.DAT
2222.03.BBB.AHN.DAT
2222.04.BBB.AHN.DAT
2222.01.BBB.AHZ.DAT
2222.02.BBB.AHZ.DAT

2222.00.CCC.AHE.DAT
2222.00.CCC.AHN.DAT
2222.00.CCC.AHZ.DAT

2222.01.DDD.AHE.DAT
2222.02.DDD.AHE.DAT
2222.03.DDD.AHE.DAT
2222.04.DDD.AHE.DAT
2222.01.DDD.AHN.DAT
2222.02.DDD.AHN.DAT
2222.01.DDD.AHZ.DAT
2222.02.DDD.AHZ.DAT
2222.03.DDD.AHZ.DAT

The output should look like this after running the script:

2222.01.BBB.AHE.DAT
2222.02.BBB.AHE.DAT
2222.03.BBB.AHE.DAT
2222.01.BBB.AHN.DAT
2222.02.BBB.AHN.DAT
2222.03.BBB.AHN.DAT
2222.04.BBB.AHN.DAT
2222.01.BBB.AHZ.DAT
2222.02.BBB.AHZ.DAT
2222.01.DDD.AHE.DAT
2222.02.DDD.AHE.DAT
2222.03.DDD.AHE.DAT
2222.04.DDD.AHE.DAT
2222.01.DDD.AHN.DAT
2222.02.DDD.AHN.DAT
2222.01.DDD.AHZ.DAT
2222.02.DDD.AHZ.DAT
2222.03.DDD.AHZ.DAT

mjf · November 19, 2015, 12:34pm

You can try something like this:

 find . -type f | awk -F"/" '{print $NF}' | sort | uniq -d > dupfiles.txt

You may need a second sort afterwards, as the output is not in the required order.

RudiC · November 19, 2015, 1:25pm

There's not a single duplicate file name in your sample. The file system wouldn't allow it, btw.

bulpak · November 19, 2015, 5:49pm

Dear RudiC,

You are right. It seems there is no duplicate file name. But the files whose names include the BBB and DDD strings are actually parts of a single file. For me, these are duplicate files. They were created by a conversion program, which added sequence numbers to the filenames. E.g.:

The files below are parts of 2222.00.BBB.AHE.DAT:

2222.01.BBB.AHE.DAT 
2222.02.BBB.AHE.DAT 
2222.03.BBB.AHE.DAT

I want to find these kinds of files and write them into a file as a list.

Thanks

RudiC · November 19, 2015, 6:04pm

Expressing this a bit differently: the second "field" must not be zero. Would this be of some use?

while IFS="." read -r A B C D E REST; do [ 0"$B" -gt 0 ] && printf "%s.%s.%s.%s.%s\n" "$A" "$B" "$C" "$D" "$E"; done < file4
2222.01.BBB.AHE.DAT
2222.02.BBB.AHE.DAT
2222.03.BBB.AHE.DAT
2222.01.BBB.AHN.DAT
2222.02.BBB.AHN.DAT
2222.03.BBB.AHN.DAT
2222.04.BBB.AHN.DAT
2222.01.BBB.AHZ.DAT
2222.02.BBB.AHZ.DAT
2222.01.DDD.AHE.DAT
2222.02.DDD.AHE.DAT
2222.03.DDD.AHE.DAT
2222.04.DDD.AHE.DAT
2222.01.DDD.AHN.DAT
2222.02.DDD.AHN.DAT
2222.01.DDD.AHZ.DAT
2222.02.DDD.AHZ.DAT
2222.03.DDD.AHZ.DAT
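
The same filter could also be sketched in awk. This assumes, as in the loop above, that the names are dot-separated and that the second field is the sequence number; the function name nonzero_seq is only an illustrative choice, not something from the thread.

```shell
# Print only the names whose second dot-separated field is a non-zero number.
# Blank lines are skipped because their second field is empty (0 after "+ 0").
nonzero_seq() {
    awk -F. '$2 + 0 > 0' "$@"
}

# Usage (file4 is the input list used in the loop above):
#   nonzero_seq file4 > parts.txt
```

Unlike the read loop, this prints each matching line unchanged, so any fields after the fifth are kept as well.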

bulpak · November 19, 2015, 6:46pm

This code works, but is there another way that considers only the BBB.AHE, BBB.AHN and BBB.AHZ strings? Having more than one BBB.AHE file (and likewise for the others) shows that those are duplicate files.
Maybe the [ 0"$B" -gt 0 ] part of your script can be modified, but how?

Thanks again

RudiC · November 20, 2015, 4:26am

As much as I would like to help, I can't, as I don't understand what you want. Show meticulously what input becomes what output, and describe the algorithm/logic/reasoning behind it.

bulpak · November 20, 2015, 6:39am

Dear RudiC,

My actual data are shown below. I simplified them in my previous messages to make them easier to understand, but that made things a bit confusing. Sorry for that.

2015.314.07.57.59.3200.GR.GAZ.BHE.D.DAT
2015.314.07.57.59.3200.GR.GAZ.BHN.D.DAT
2015.314.07.58.00.6600.GR.GAZ.BHZ.D.DAT

2015.314.07.58.00.1000.GR.SVRC.BHE.D.DAT
2015.314.07.58.01.1200.GR.SVRC.BHN.D.DAT
2015.314.07.58.01.7400.GR.SVRC.BHZ.D.DAT

2015.314.08.02.26.0000.GR.MALT.HHE.D.DAT
2015.314.08.02.38.0000.GR.MALT.HHN.D.DAT
2015.314.08.02.59.4000.GR.MALT.HHZ.D.DAT

2015.314.08.05.24.0000.GR.GMLD.HNZ.D.DAT
2015.314.08.05.26.0000.GR.GMLD.HNE.D.DAT
2015.314.08.05.26.0000.GR.GMLD.HNN.D.DAT
2015.314.08.05.29.0000.GR.GMLD.HNZ.D.DAT
2015.314.08.05.31.0000.GR.GMLD.HNN.D.DAT
2015.314.08.05.34.0000.GR.GMLD.HNZ.D.DAT
2015.314.08.05.36.0000.GR.GMLD.HNE.D.DAT
2015.314.08.05.36.0000.GR.GMLD.HNN.D.DAT
2015.314.08.05.39.0000.GR.GMLD.HNZ.D.DAT
2015.314.08.05.41.0000.GR.GMLD.HNN.D.DAT

Here, the numbers represent the date and time. GR is the network code. GAZ, SVRC, MALT and GMLD are station names. BH?, HH? or HN? are components. The rest is not important.

As you can see, in the first three groups each file (BHE, BHN, BHZ or HHE, HHN, HHZ) contains the full data. Those are OK for me. But the last group contains more than one HNE, HNN and HNZ file. Those were split into parts by the conversion program.

My aim is to find the cases where there is more than one XXXX.HHE (or XXXX.BHE), XXXX.HHN (or XXXX.BHN) or XXXX.HHZ (or XXXX.BHZ) file, and to list those files in a file.

mjf · November 20, 2015, 11:27am

If your flavor of Unix supports uniq with the -D option (a GNU extension), this should meet your requirement: it lists all duplicate file names, ignoring the first 26 characters (everything up to and including "GR.") when comparing.

ls | sort -t'.' -k8,9 | uniq -s 26 -D
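
Where uniq -D is not available, the same grouping could be sketched in awk. This is only a sketch under the assumptions of this thread: names are dot-separated, field 8 is the station and field 9 the component. The helper name list_split_files is hypothetical.

```shell
# Read file names on stdin and print every name whose station.component
# pair (dot-separated fields 8 and 9) occurs more than once.
# Output is grouped by pair, in the order each pair was first seen.
list_split_files() {
    awk -F. '
        NF >= 9 {
            key = $8 "." $9
            if (!(key in names)) order[++n] = key   # remember first-seen order
            names[key] = names[key] $0 "\n"         # collect names per pair
            count[key]++
        }
        END {
            for (i = 1; i <= n; i++)
                if (count[order[i]] > 1)
                    printf "%s", names[order[i]]
        }'
}

# Usage:
#   ls | list_split_files > dupfiles.txt
```

Because it compares fields rather than a fixed number of characters, this does not depend on the date and time prefix always being 26 characters long.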

bulpak · November 23, 2015, 6:49am

Thanks mjf,
I tried it and it works.