Removing duplicates depending on file size

Error404 · July 8, 2013, 5:48am

Hi all,

I am working with a large number of files in a Linux environment and I am trying to filter my data. Here's what my data looks like:

Name                        Size
OLUSDN.gf.gif-1.JPEG        5 kb
LKJFDA01.gf.gif-1.JPEG      3 kb
LKJFDA01.gf.gif-2.JPEG      1 kb
LKJFDA01.gif-3.JPEG         0 kb
JLKJAIN11.gf.gif-1.JPEG     3 kb
LKJFAD.gf.gif-1.JPEG        2 kb
LKJFAD.gf.gif-4.JPEG        5 kb
LKJFAD.gf.gif-5.JPEG        7 kb

The first part of the filename (anything before the first dot) is the same in many of them.
I would like to keep only one file per name. Where the first part of the name (before the first dot) is the same, look for the largest file and keep it. My resulting data should look something like this:

Name                        Size
OLUSDN.gf.gif-1.JPEG        5 kb
LKJFDA01.gf.gif-1.JPEG      3 kb
JLKJAIN11.gf.gif-1.JPEG     3 kb
LKJFAD.gf.gif-5.JPEG        7 kb

I think `awk` can do it, but I am not sure how to handle duplicates with `awk`.
I hope this is not very complicated.
Many thanks,

krishmaths · July 8, 2013, 6:20am

I tried this awk code and it returned the expected result.

sort -t"." -k1,1 yourfile | awk 'BEGIN{max=-1} {split($1,p,"."); if (NR>1 && p[1]!=T) {print row; max=-1} if ($2+0>max) {max=$2+0; row=$0} T=p[1]} END{if (NR) print row}'

Input File:

OLUSDN.gf.gif-1.JPEG 5 kb
LKJFDA01.gf.gif-1.JPEG 3 kb
LKJFDA01.gf.gif-2.JPEG 1 kb
LKJFDA01.gif-3.JPEG 0 kb
JLKJAIN11.gf.gif-1.JPEG 3 kb
LKJFAD.gf.gif-1. 2 kb
LKJFAD.gf.gif-4. 5 kb
LKJFAD.gf.gif-5. 7 kb

Output obtained:

JLKJAIN11.gf.gif-1.JPEG 3 kb
LKJFAD.gf.gif-5. 7 kb
LKJFDA01.gf.gif-1.JPEG 3 kb
OLUSDN.gf.gif-1.JPEG 5 kb

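To exclude the header, you may add a condition

if (NR>1)

in the awk code. Note that this only helps while the header is still the first line; sort may move it, so it is safer to drop the header before sorting (for example with tail -n +2 yourfile).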
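
The same idea can also be sketched without relying on sort order, by remembering the best line per prefix in an awk array. This is only a sketch: it assumes the "name size kb" layout of the sample input above, and the function name keep_largest is made up for illustration.

```shell
# Hypothetical helper: for each prefix before the first dot, keep the line
# with the largest size. Assumes whitespace-separated "name size unit" lines.
keep_largest() {
    awk '
    {
        split($1, parts, ".")      # parts[1] is the text before the first dot
        key = parts[1]
        if (!(key in best) || $2 + 0 > max[key]) {
            max[key]  = $2 + 0     # + 0 forces a numeric comparison
            best[key] = $0
        }
    }
    END { for (key in best) print best[key] }
    ' "$@" | LC_ALL=C sort         # awk array order is unspecified, so sort the result
}
```

It can be called as keep_largest yourfile, or fed from a pipe. On equal sizes it keeps whichever line came first.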

vidyadhar85 · July 8, 2013, 6:24am

Try sort:

sort -u -t. -k1,1 filename
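
One caveat: sort -u with -k1,1 keeps one line per prefix, but it does not pick that line by size. A possible variation, assuming the "name size kb" layout from the earlier sample input, is to sort by size in descending order first and then keep the first line seen for each prefix. The function name first_per_prefix is only illustrative.

```shell
# Sketch: order lines by the numeric size column (largest first), then let awk
# print a line only the first time its prefix (text before the first dot) is seen.
first_per_prefix() {
    sort -k2,2nr "$@" | awk -F. '!seen[$1]++'
}
```

The output comes out ordered by size rather than by name; pipe it through sort again if name order matters.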

Error404 · July 8, 2013, 6:36am

Hi Krishmaths,

Thanks for the script. This is essentially what I want to do, but I still have a couple of concerns about the command. The file size isn't really in the second column ($2); I was just showing the size of each file for illustration. How do I get the actual file size involved in the command?

The second thing is about the sort command. I have multiple files in a folder, so can I use a directory instead of "yourfile" in your command?

Many Thanks,

krishmaths · July 8, 2013, 6:52am

To sort multiple files, you may pass them to sort with a wildcard, as below:

sort file*

If there is no specific pattern to match with a wildcard, you may need to find another way to provide all the filenames as arguments, or you may concatenate all the files into a single file and then sort that file.

Coming to the problem of the file size position: do all the records end with <size> kb?

If yes, then we can try to grab the number using a sed command, provided the kb suffix is fixed.
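
As a rough sketch of what such a sed command might look like, assuming every record is a name, one or more spaces, a whole number and a literal " kb" at the end of the line (the function name size_first is hypothetical):

```shell
# Sketch: move a trailing "<number> kb" to the front of each line so the size
# becomes the first whitespace-separated field. Lines that do not match the
# pattern are passed through unchanged.
size_first() {
    sed -E 's/^(.*[^ ]) +([0-9]+) kb$/\2 \1/' "$@"
}
```

With the size in front, a numeric sort or an awk comparison on $1 becomes straightforward.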

Error404 · July 8, 2013, 8:09am

So basically I am unable to work with directories in Linux? Putting them into a file and finding them again is a very long process, and especially for a Linux newbie (like me) the margin for error is huge!

The file sizes aren't listed anywhere; I just know that they differ and I want to grab the largest. So my files after an ls command would look like this:

OLUSDN.gf.gif-1.JPEG    LKJFDA01.gf.gif-1.JPEG    LKJFDA01.gf.gif-2.JPEG    LKJFDA01.gif-3.JPEG

and so on. I don't really need to see the file sizes if Linux is able to work them out internally... I'm happy with an output of file names, just like the input... in some other location.
Cheers,

---------- Post updated at 07:09 AM ---------- Previous update was at 06:47 AM ----------

I know that the

ls -al

would give me the file size as the 5th column, but I am not sure how to attach that column to the files to make it part of the name...
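
One way to pair each name with its size without parsing ls output might look like the sketch below. It assumes plain files directly inside one directory, and the function name list_sizes is just for illustration.

```shell
# Sketch: print "<size in bytes> <name>" for every regular file in a directory
# (defaults to the current directory when no argument is given).
list_sizes() {
    dir=${1:-.}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip subdirectories and other non-files
        size=$(wc -c < "$f")             # byte count read from the file itself
        # $((size)) strips any padding some wc implementations add
        printf '%s %s\n' "$((size))" "${f##*/}"
    done
}
```

Each output line then has the size in the first column and the file name after it, which is the kind of input the awk commands in this thread expect.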

gokcell · July 8, 2013, 8:23am

Hi,

I think the duplicate analysis below can help you.

[goksel@gokcell 2july]$ cat file1 
Goksel Yangin
Deneme Test
Goksel Yangin
Deneme Test
Ali Veli
Hasan Huseyin
Test 12345
Unix Linux
Linux Unix
Goksel Yangin
[goksel@gokcell 2july]$ cat file1  | sort | uniq -c
      1 Ali Veli
      2 Deneme Test
      3 Goksel Yangin
      1 Hasan Huseyin
      1 Linux Unix
      1 Test 12345
      1 Unix Linux
[goksel@gokcell 2july]$ cat file1  | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}'
Veli    Ali     DUP1
Test    Deneme  DUP2
Yangin  Goksel  DUP3
Huseyin Hasan   DUP1
Unix    Linux   DUP1
12345   Test    DUP1
Linux   Unix    DUP1
[goksel@gokcell 2july]$ cat file1  | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}' | grep "DUP1"
Veli    Ali     DUP1
Huseyin Hasan   DUP1
Unix    Linux   DUP1
12345   Test    DUP1
Linux   Unix    DUP1
[goksel@gokcell 2july]$ cat file1  | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}' | grep "DUP2"
Test    Deneme  DUP2
[goksel@gokcell 2july]$ cat file1  | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}' | grep "DUP3"
Yangin  Goksel  DUP3

Regards,
Goksel Yangin
Computer Engineer

krishmaths · July 9, 2013, 5:25am

@Error404, please try the solution below.

cd to the directory where you have the files and execute the command below. You may redirect the output to a temporary file.

ls -l | awk 'NF>=9{print $5"."$9}' | sort -t. -k2,2 | awk -F. 'BEGIN{max=-1} {if (NR>1 && $2!=T) {print row; max=-1} if ($1+0>max) {max=$1+0; row=$0} T=$2} END{if (NR) print row}'

The command first lists all the files in the directory and picks out the filename ($9) and size ($5). You may adjust this if the filename and size appear in different positions in your ls output.

The file size is output as the first field, followed by the filename. I have used "." as the output delimiter so that the file with the maximum size is easy to fetch.

I created the files below in a directory called tempdir (sizes in bytes):

LAJ.g.gif-1.JPEG            4
LAJ.g.gif-2.JPEG           12
LKJFDA01.gf.gif-1.JPEG      0
LKJFDA01.gf.gif-2.JPEG      0
LKJFDA01.gif-3.JPEG         4
OLUSDN.gf.gif-1.JPEG        0

The output was as below.

12.LAJ.g.gif-2.JPEG
4.LKJFDA01.gif-3.JPEG
0.OLUSDN.gf.gif-1.JPEG

The first field in the output is the size in bytes of the largest file whose name starts with the second field (i.e., LAJ, etc.).
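
Since the goal was to end up with the selected files in some other location, here is a rough sketch that puts the pieces together. It assumes file names without tabs or newlines; the function name copy_largest and its two arguments (source and destination directory) are made up for illustration.

```shell
# Sketch: for each prefix before the first dot, copy the largest file
# from the source directory into the destination directory.
copy_largest() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    for f in "$src"/*; do
        [ -f "$f" ] || continue                      # regular files only
        # emit "<bytes><TAB><name>" for each file
        printf '%s\t%s\n' "$(( $(wc -c < "$f") ))" "${f##*/}"
    done |
    awk -F'\t' '
    {
        split($2, parts, ".")                        # parts[1] = prefix
        key = parts[1]
        if (!(key in best) || $1 + 0 > max[key]) {
            max[key]  = $1 + 0
            best[key] = $2
        }
    }
    END { for (key in best) print best[key] }
    ' |
    while IFS= read -r name; do
        cp -- "$src/$name" "$dest/"
    done
}
```

For example, copy_largest tempdir filtered would leave one file per prefix in a directory called filtered. When two files of the same prefix have equal sizes, the first one listed is kept.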