Hi all,
I am working with a large number of files in a Linux environment and am trying to filter my data. Here is what my data looks like:
Name............................Size
OLUSDN.gf.gif-1.JPEG.......5 kb
LKJFDA01.gf.gif-1.JPEG.....3 kb
LKJFDA01.gf.gif-2.JPEG.....1 kb
LKJFDA01.gif-3.JPEG.........0 kb
JLKJAIN11.gf.gif-1.JPEG.....3 kb
LKJFAD.gf.gif-1.JPEG.........2 kb
LKJFAD.gf.gif-4.JPEG.........5 kb
LKJFAD.gf.gif-5.JPEG.........7 kb
The first part of the filename (anything before the first dot) is the same in many of them.
I would like to keep only files with unique names. When the first part of the name (before the first dot) is the same, find the largest file and keep it. My resulting data should look something like this:
Name.............................Size
OLUSDN.gf.gif-1.JPEG.......5 kb
LKJFDA01.gf.gif-1.JPEG.....3 kb
JLKJAIN11.gf.gif-1.JPEG.....3 kb
LKJFAD.gf.gif-5.JPEG.........7 kb
I think `awk` can do it, but I am not sure how to handle duplicates with `awk`.
I hope this is not very complicated.
Many thanks,
I tried this awk code and it returned the expected result. It assumes the size is the second space-separated field of each line.
sort -t. -k1,1 yourfile | awk '{split($1,p,"."); key=p[1]} NR==1 || key!=prev {if (NR>1) print row; prev=key; max=-1} $2+0>max {max=$2+0; row=$0} END{if (NR>0) print row}'
Input File:
OLUSDN.gf.gif-1.JPEG 5 kb
LKJFDA01.gf.gif-1.JPEG 3 kb
LKJFDA01.gf.gif-2.JPEG 1 kb
LKJFDA01.gif-3.JPEG 0 kb
JLKJAIN11.gf.gif-1.JPEG 3 kb
LKJFAD.gf.gif-1. 2 kb
LKJFAD.gf.gif-4. 5 kb
LKJFAD.gf.gif-5. 7 kb
Output obtained:
JLKJAIN11.gf.gif-1.JPEG 3 kb
LKJFAD.gf.gif-5. 7 kb
LKJFDA01.gf.gif-1.JPEG 3 kb
OLUSDN.gf.gif-1.JPEG 5 kb
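To exclude the header, you may add the condition
NR>1
in an awk step placed before the sort, because after sorting the header is no longer guaranteed to be the first line.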
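A variation on the awk approach above, as a minimal sketch: it keeps the best line per prefix in awk arrays, so the input does not have to be sorted first. The function name largest_per_prefix is made up for illustration, and the sketch assumes every line looks like "NAME SIZE kb" with the size in the second space-separated field.

```shell
# Hypothetical helper: print, for each filename prefix (text before the first
# dot), the input line with the largest size.
# Assumption: lines look like "NAME SIZE kb", size in field 2.
largest_per_prefix() {
    awk '
    {
        split($1, parts, ".")      # parts[1] is the prefix before the first dot
        key = parts[1]
        if (!(key in max) || $2 + 0 > max[key]) {
            max[key] = $2 + 0      # largest size seen so far for this prefix
            row[key] = $0          # the line that size came from
        }
    }
    END { for (key in row) print row[key] }
    ' "$@" | LC_ALL=C sort         # awk array order is unspecified, so sort it
}
```

It can be called as largest_per_prefix yourfile, or fed from a pipe.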
Try sort. This keeps one line per first field, though not necessarily the one with the largest size:
sort -u -t. -k1,1 filename
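If the largest entry per prefix is wanted from sort alone, one possible sketch is to order the lines by size first and then do a stable unique sort on the prefix. This relies on sort -u keeping the first line of each run of equal keys, which is how GNU sort documents it; other implementations may behave differently. The function name is hypothetical, and the size is assumed to be in the second space-separated field.

```shell
# Hypothetical helper: keep one line per prefix (text before the first dot),
# choosing the line with the largest size in field 2.
# Assumption: sort -u keeps the first line of each run of equal keys
# (documented for GNU coreutils; not guaranteed elsewhere).
keep_largest_with_sort() {
    # Step 1: order lines by size, largest first.
    # Step 2: stable unique sort on the prefix keeps the first (largest) line.
    sort -k2,2nr "$@" | LC_ALL=C sort -s -u -t. -k1,1
}
```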
Hi Krishmaths,
Thanks for the script. This is essentially what I want to do, but I still have a couple of concerns about the command. The file size isn't really in the second column ($2); I was just representing the size of each file. How do I get the file size involved in the command?
The second concern is about the sort command. I have multiple files in a folder, so can I use a directory instead of "yourfile" in your command?
Many Thanks,
To sort multiple files, you may pass them to sort with a wildcard, as below:
sort file*
If the names do not share a pattern that a wildcard can match, you will need to find another way to provide all the filenames as arguments, or you may concatenate all the files into a single file and then sort that file.
Coming to the problem of the file size position: do all the records end with <size> kb?
If yes, then we can try to grab the number using a sed command, provided kb is fixed.
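As a rough sketch of that sed idea, assuming every record really does end in "<size> kb" (the function name extract_kb is just for illustration):

```shell
# Hypothetical helper: print the number that precedes a trailing " kb".
# Assumption: each record ends in "<size> kb"; lines that do not match
# (a header, for example) are silently dropped by -n ... p.
extract_kb() {
    sed -n 's/.*[^0-9]\([0-9][0-9]*\) kb$/\1/p' "$@"
}
```

For example, extract_kb yourfile would print one size per matching line.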
So basically I am unable to work with directories in Linux? Putting the files into a single file and finding them again is a very long process, and especially for a Linux newbie like me the margin for error is huge!
The file sizes aren't listed anywhere; I just know that they differ, and I want to keep the largest. So my files after an ls command would look like this:
OLUSDN.gf.gif-1.JPEG LKJFDA01.gf.gif-1.JPEG LKJFDA01.gf.gif-2.JPEG LKJFDA01.gif-3.JPEG
and so on. I don't really need to see the file sizes if Linux can work them out internally. I'm happy with an output of file names, just like the input, in some other location.
Cheers,
I know that
ls -al
would give me the file size in the 5th column, but I am not sure how to associate that column with the files so that it becomes part of the name...
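One possible way to pair each size with its name without reading columns out of ls is to ask for the byte count of each file directly. The sketch below uses wc -c for that and then keeps the largest file per prefix, printing only names. The function name largest_files_in_dir is hypothetical, and it assumes file names contain no tabs or newlines.

```shell
# Hypothetical helper: for every regular file in a directory, emit
# "SIZE<TAB>NAME", keep the largest file per prefix (text before the first
# dot), and print only the winning names.
# Assumption: file names contain no tabs or newlines.
largest_files_in_dir() {
    dir=${1:-.}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue                   # skip directories and the like
        size=$(wc -c < "$f")                      # size in bytes
        printf '%s\t%s\n' "$((size))" "${f##*/}"  # $((size)) drops any padding
    done |
    awk -F'\t' '
    {
        split($2, parts, "."); key = parts[1]
        if (!(key in max) || $1 + 0 > max[key]) { max[key] = $1 + 0; name[key] = $2 }
    }
    END { for (key in name) print name[key] }
    ' | LC_ALL=C sort
}
```

Run as largest_files_in_dir /path/to/folder; with no argument it looks at the current directory.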
Hi,
I think the duplicate analysis below can help you.
[goksel@gokcell 2july]$ cat file1
Goksel Yangin
Deneme Test
Goksel Yangin
Deneme Test
Ali Veli
Hasan Huseyin
Test 12345
Unix Linux
Linux Unix
Goksel Yangin
[goksel@gokcell 2july]$ cat file1 | sort | uniq -c
1 Ali Veli
2 Deneme Test
3 Goksel Yangin
1 Hasan Huseyin
1 Linux Unix
1 Test 12345
1 Unix Linux
[goksel@gokcell 2july]$ cat file1 | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}'
Veli Ali DUP1
Test Deneme DUP2
Yangin Goksel DUP3
Huseyin Hasan DUP1
Unix Linux DUP1
12345 Test DUP1
Linux Unix DUP1
[goksel@gokcell 2july]$ cat file1 | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}' | grep "DUP1"
Veli Ali DUP1
Huseyin Hasan DUP1
Unix Linux DUP1
12345 Test DUP1
Linux Unix DUP1
[goksel@gokcell 2july]$ cat file1 | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}' | grep "DUP2"
Test Deneme DUP2
[goksel@gokcell 2july]$ cat file1 | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}' | grep "DUP3"
Yangin Goksel DUP3
Regards,
Goksel Yangin
Computer Engineer
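As a side note to the counts above, uniq can also separate repeated lines from one-off lines directly, without the DUP labels and grep. A small sketch (the two function names are made up for illustration):

```shell
# Hypothetical helpers built on standard uniq options:
#   uniq -d prints one copy of each line that occurs more than once
#   uniq -u prints only the lines that occur exactly once
# Both need sorted input, hence the sort in front.
duplicated_lines() { sort "$@" | uniq -d; }
unique_lines()     { sort "$@" | uniq -u; }
```

With the file1 shown above, duplicated_lines file1 should list the lines counted 2 and 3 by uniq -c, and unique_lines file1 the lines counted 1.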
@Error404, please try the solution below.
cd to the directory where you have the files and execute the command below. You may redirect the output to a temporary file.
ls -l | awk 'NF>=9 {print $5 "." $9}' | sort -t. -k2,2 | awk -F. 'NR==1 || $2!=T {if (NR>1) print row; T=$2; max=-1} $1+0>max {max=$1+0; row=$0} END{if (NR>0) print row}'
The command first lists all the files in the directory and picks out the size ($5) and the filename ($9). You may adjust these if the size and filename appear in different positions on your system.
The file size is output as the first field, followed by the filename. I have used "." as the output delimiter so that the file with the maximum size is easy to pick out.
I created the files below (name and size in bytes) in a directory called tempdir:
LAJ.g.gif-1.JPEG 4
LAJ.g.gif-2.JPEG 12
LKJFDA01.gf.gif-1.JPEG 0
LKJFDA01.gf.gif-2.JPEG 0
LKJFDA01.gif-3.JPEG 4
OLUSDN.gf.gif-1.JPEG 0
The output was as below.
12.LAJ.g.gif-2.JPEG
4.LKJFDA01.gif-3.JPEG
0.OLUSDN.gf.gif-1.JPEG
The first field in the output is the maximum size, in bytes, among the files whose names start with the 2nd field (i.e., LAJ, etc.).
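Since the goal was to end up with the chosen files in some other location, here is one possible follow-up sketch. It reads lines in the size.filename format shown above, drops the size, and copies each file from the current directory. The function name copy_selected and the destination argument are assumptions for illustration, and file names are assumed to contain no newlines.

```shell
# Hypothetical helper: read "SIZE.FILENAME" lines on stdin and copy each
# named file from the current directory into the directory given as $1.
# Assumption: file names contain no newlines.
copy_selected() {
    dest=$1
    mkdir -p "$dest" || return 1
    cut -d. -f2- |                       # drop the leading size field
    while IFS= read -r name; do
        if [ -f "$name" ]; then
            cp -- "$name" "$dest"/
        fi
    done
}
```

The output of the command above could be piped straight into it, for example with copy_selected ../filtered as the last stage of the pipeline.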