Conditional delete

Hi Friends,

I have some files like

20180720_1812.tar.gz
20180720_1912.tar.gz 
20180720_2012.tar.gz 
20180720_2112.tar.gz 
20180721_0012.tar.gz 
20180721_0112.tar.gz 
20180721_0212.tar.gz  
20180721_0312.tar.gz 

in a directory, and so on. These files get created every 3 hours, where the first part of the file name is the date and the second part is the time.

However, as they occupy more and more disk space, we need to delete old files (which we are doing manually now) so that only the last 30 days of backup files remain; any file older than that should be deleted.

However, we have a challenge: even after clearing files older than 30 days, a lot of the disk is still filled. The idea now is to keep only one file for any given day, namely the one with the last timestamp on that day, and delete the rest of the files for that day. That way I retain one backup per day and free up some space.

For example, for 2018-07-20 I can retain 20180720_2112.tar.gz, assuming this is the last backup for that day, and delete the other 3 files. That way I will have at least one backup file for the day and free up some space by deleting the rest of the copies for that day:

20180720_1812.tar.gz
20180720_1912.tar.gz 
20180720_2012.tar.gz 
20180720_2112.tar.gz

Any idea how I can do this conditionally? I appreciate any help.

How about

ls 2018* | sort -ur -k1,1 -t_ | cut -d_ -f1 | while read TS; do echo rm $(ls -r $TS* | tail -n +2); done
rm 20180721_0212.tar.gz 20180721_0112.tar.gz 20180721_0012.tar.gz
rm 20180720_2012.tar.gz 20180720_1912.tar.gz 20180720_1812.tar.gz

Remove the echo when happy with the proposed result.

1 Like

Thank you RudiC, let me try. My apologies for not including code tags.

Hi Rudic,

It's working fine, thank you. Now I have a slightly more complicated requirement...

We have a mount point called /backup (300 GB) under which these files are placed every 3 hours.

Now, if disk usage of /backup crosses 60% (for example, if 90% of that 300 GB is filled up), then our logic/program should delete files, starting from the oldest date, only until usage is brought back down to 60%.

for example,
if the /backup mount reaches 90%, it has to be brought down to 60%, which is the threshold. Now let us assume the files range from

20180701_0112.tar.gz to 20180831_2112.tar.gz

The solution should start from 2018-07-01 and delete the files for that day except the one with the last timestamp (your proposed solution is already doing this), and it should continue only up to the point where disk usage is back at 60%. If, for example, deleting up to 2018-07-03 is enough to bring usage down to 60%, then the logic should exit. The next time disk usage again crosses 60% and we run the command, it should start from

2018-07-04 

since

2018-07-01,
2018-07-02, and
2018-07-03 already have only one file per day.

Sorry if my explanation is not clear or the requirement complicated. If possible, please help. The idea is not to delete files for all dates, but to stop once disk usage comes down to 60%.

I would be tempted to try a slightly simpler pipeline than what RudiC suggested:

ls -1 2[0-9][0-9][0-9][01][0-9][0-3][0-9]_[0-2][0-9][0-5][0-9].tar.gz |
awk -F_ '
$1 == last {
	print "echo rm " file
	file = $0
	next
}
{	last = $1
	file = $0
}' | sh

If the above prints a list of the rm commands you want to run, remove the echo from the script and run it again.

Note that you should also tell us what operating system and shell you're using when you start a thread in the Shell Programming and Scripting forum so we don't suggest things that can't work in your environment. If you are using a Solaris/SunOS system, change awk in the above script to /usr/xpg4/bin/awk or nawk.

If I create the files you named in post #1 in a directory and run the above script in that same directory, the output produced is:

rm 20180720_1812.tar.gz
rm 20180720_1912.tar.gz
rm 20180720_2012.tar.gz
rm 20180721_0012.tar.gz
rm 20180721_0112.tar.gz
rm 20180721_0212.tar.gz

On most systems you can omit the -1 option (that is the digit one; not the letter ell), but on some old systems the ls utility doesn't produce one name per line of output when output is directed to a pipe (as required by the standards).

Obviously, you can add a df on your source filesystem and check for the desired level of free space before or after each file is deleted or each time the date changes. Since your requirements aren't clear as to when this testing should be performed, I'll leave that as an exercise for the reader. (The output format produced by df also varies somewhat depending on what options you use and what operating system you're using. And, I'm not going to try to guess what OS you're using.)

1 Like

Hi Don Cragun,

Thanks for your reply. Apologies, I shall try to describe the requirement better. We are using RHEL 7.4 as the OS.

as for "Since your requirements aren't clear as to when this testing should be performed, I'll leave that as an exercise for the reader"

We get an alert from the network team that a particular node has high disk utilisation; at that point we manually log into that box and perform this housekeeping (deleting files to free up space) to bring usage down to 60%. There is no need for the program to run automatically when disk usage is high. The only thing is, when we get the alert and execute the solution, it should delete files starting from the earliest date (based on file names, in ascending order). For example, if we have files from 20180701 to 20180831, it has to start with 20180701: keep only the last copy for that day and delete the rest, then check with df whether disk usage has come down to 60%. If not, continue with the next date, 20180702 (keep only the last copy, delete the rest), and check df again; if still not below 60%, take 20180703, and so on; once df shows 60%, exit the program. After some days, if we again get a disk-space notification and run the program, it should start from 20180704 and keep deleting until usage equals 60%.
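The manual procedure described above might be sketched in shell roughly as follows. This is an untested outline; BACKUP_DIR, THRESHOLD, disk_used_pct and prune_oldest_days are illustrative names, not anything from this thread, and xargs -r is a GNU extension (fine on RHEL):

```shell
#!/bin/sh
# Rough sketch: walk the dates oldest-first, keep only the newest file per
# day, and stop as soon as df reports usage at or below the threshold.
BACKUP_DIR=/backup
THRESHOLD=60

# Capacity is field 5 on line 2 of POSIX "df -P" output
disk_used_pct() {
    df -P "$BACKUP_DIR" | awk 'NR == 2 {sub(/%/, "", $5); print $5}'
}

# Body runs in a subshell so the cd does not affect the caller
prune_oldest_days() (
    cd "$BACKUP_DIR" || exit 1
    for day in $(ls 2*.tar.gz 2>/dev/null | cut -d_ -f1 | sort -u); do
        [ "$(disk_used_pct)" -le "$THRESHOLD" ] && break
        # The newest file of the day survives; "echo" guards rm for a dry run
        ls -r "${day}"_*.tar.gz | tail -n +2 | xargs -r echo rm
    done
)

[ -d "$BACKUP_DIR" ] && prune_oldest_days
```

Drop the echo in front of rm once the dry-run output looks right.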

Apologies if it's still not clear...

Talking about simplifying, why not

ls -r1 2[0-9][0-9][0-9][01][0-9][0-3][0-9]_[0-2][0-9][0-5][0-9].tar.gz | awk -F_ 'T[$1]++ {print "echo rm " $0}'  | sh 
1 Like

Hi onenessboy,
You can start by showing us the complete, exact output produced by the command:

df -P /backup

If the output from the above command doesn't complain about an unknown -P option, the percentage of space used on the filesystem containing /backup should be in field #5 on line #2 of the output from the above command.

If it does complain about an unknown -P option, show us the complete, exact output from the command:

df /backup

so we can figure out which field and line we need to examine to determine if you have reached your goal.
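To illustrate, pulling that number out of df -P style output with awk could look like this (the sample text is made up in the df -P layout, and used_pct is just an illustrative variable name):

```shell
# Parse the percent-used figure (field 5 on line 2) out of df -P style output
sample='Filesystem     1024-blocks      Used  Available Capacity Mounted on
/dev/xvdf        515928320 299040832  192733092      61% /backup'

used_pct=$(printf '%s\n' "$sample" | awk 'NR == 2 {sub(/%/, "", $5); print $5}')
echo "$used_pct"    # prints 61
```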

Hi RudiC,
Why not:

ls -r1 2[0-9][0-9][0-9][01][0-9][0-3][0-9]_[0-2][0-9][0-5][0-9].tar.gz | awk -F_ 'T[$1]++ {print "echo rm " $0}'  | sh

Because that will remove old files from the most recent date first. And onenessboy wants to remove old backup files from the oldest date first. And he wants to add code to exit the script when the capacity on that filesystem drops below 60% after each date change when one or more backup files were removed for any given date.

2 Likes

The

2[0-9][0-9][0-9][01][0-9][0-3][0-9]_[0-2][0-9][0-5][0-9].tar.gz

has a Y3K problem and is less readable than

????????_????.tar.gz

The latter is certainly precise enough.

Hi Don Cragun,

I tried your command; it does not complain about the -P option. The output is below:

[root@ip-11-66-77-99 ~]$ df -P /backup
Filesystem     1024-blocks      Used       Available     Capacity   Mounted on
/dev/xvdf        515928320  299040832    192733092         61%     /backup

Didn't you say the disk was 300GB? Looks more like 500GB...?

EDIT: Does your OS provide the stat command?

Hi RudiC,

Apologies; since we had issues with disk space, the team added more space to it recently, and I did not notice. The original design was to have 300 GB. Each tar file is around 2.6 GB, hence all the hassle here.

However, the objective of the solution is the same: to remove files as I explained in the previous posts.

Yes, the stat command is working. I ran man stat and it showed the help page.

Assuming the stat command on your system allows for the -f ( --file-system ) option, and shamelessly stealing from Don Cragun's earlier post, and NOT being able to ultimately test this on my system, I'd propose this (printing some meaningful numbers for debug purposes) and ask you to comment back:

{ cd /backup; stat -fc"%b %a %S" .; stat -c"%n %b %B" 2018*.tar.gz | sort; } | awk '
NR == 1         {print Needed = ($1 * PCT - $2) * $3 
                 next
                }

                {split ($1, T, "_")
                 if (T[1] == KnownTime) print "echo rm " LastFile
                 KnownTime = T[1]
                 LastFile = $1
                 print SUM += $2 * $3
                 if (SUM >= Needed) exit
                }
 ' PCT=0.4

Once happy with what is delivered, we can pipe this into an sh command to be executed.

1 Like

Hi RudiC,

Thank you very much for the snippet; I shall try it on a different test machine. Sorry for the dumb question: which line specifies 60% as the threshold? I want to test with a sample percentage as the threshold, because on the test machine I have only 10 GB of these tar files.

PCT=0.4 defines the necessary free space; this minus the actually available space yields the total size of the files to be deleted.
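As a worked example of that arithmetic (the numbers are taken from the df -P output posted earlier in the thread; the variable names are illustrative): the script's Needed value is (total blocks × PCT − available blocks) × block size, i.e. the number of bytes that must be deleted:

```shell
# Needed = (total_blocks * PCT - available_blocks) * block_size, in bytes
awk 'BEGIN {
    total = 515928320      # 1K-blocks on /backup, from the df -P output
    avail = 192733092      # available 1K-blocks
    S     = 1024           # block size in bytes
    PCT   = 0.4            # keep 40% free, i.e. at most 60% used
    printf "bytes to delete: %.0f\n", (total * PCT - avail) * S
}'
# prints: bytes to delete: 13965553664  (about 13 GiB)
```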

1 Like

Thank you

Shall test it

Is your data perhaps more compressible?
Then perhaps a simple switch in your source compression command can save a lot more.

This is just a wild guess, perhaps to avoid more coding.

Regards
Peasant.

Hi RudiC,

I have tested your code. Do you see whether your wonderful solution is on the right track to reach my goal? It was a lot of pain to duplicate files and rename them correctly to make sure usage crosses 70% of disk space (as this is a test VM). :rolleyes: By the way, if everything looks good, where do I need to place the actual rm command? :slight_smile:

[asp@abcsd34 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        40G   30G   11G  74% /
devtmpfs        1.9G     0  1.9G   0% /dev
tmpfs           1.9G     0  1.9G   0% /dev/shm
tmpfs           1.9G   18M  1.9G   1% /run
tmpfs           1.9G     0  1.9G   0% /sys/fs/cgroup
tmpfs           377M     0  377M   0% /run/user/1000
[asp@abcsd34 ~]$ ls
20180701_0012.tar.gz  20180705_1412.tar.gz  20180710_2112.tar.gz  20180716_0012.tar.gz  20180721_1412.tar.gz  20180726_2112.tar.gz  20180801_0012.tar.gz  20180805_0712.tar.gz
20180701_0712.tar.gz  20180705_2112.tar.gz  20180711_0012.tar.gz  20180716_1412.tar.gz  20180721_2112.tar.gz  20180727_0012.tar.gz  20180801_0712.tar.gz  20180805_1412.tar.gz
20180701_1412.tar.gz  20180706_0012.tar.gz  20180711_1412.tar.gz  20180716_2112.tar.gz  20180722_0012.tar.gz  20180727_1412.tar.gz  20180801_1412.tar.gz  20180805_2112.tar.gz
20180701_2112.tar.gz  20180706_1412.tar.gz  20180711_2112.tar.gz  20180717_0012.tar.gz  20180722_1412.tar.gz  20180727_2112.tar.gz  20180801_2112.tar.gz  20180806_0012.tar.gz
20180702_0012.tar.gz  20180706_2112.tar.gz  20180712_0012.tar.gz  20180717_1412.tar.gz  20180722_2112.tar.gz  20180728_0012.tar.gz  20180802_0012.tar.gz  20180806_0712.tar.gz
20180702_0712.tar.gz  20180707_0012.tar.gz  20180712_1412.tar.gz  20180717_2112.tar.gz  20180723_0012.tar.gz  20180728_2112.tar.gz  20180802_0712.tar.gz  20180806_1412.tar.gz
20180702_1412.tar.gz  20180707_1412.tar.gz  20180712_2112.tar.gz  20180718_0012.tar.gz  20180723_1412.tar.gz  20180729_0012.tar.gz  20180802_1412.tar.gz  20180806_2112.tar.gz
20180702_2112.tar.gz  20180707_2112.tar.gz  20180713_0012.tar.gz  20180718_1412.tar.gz  20180723_2112.tar.gz  20180729_1412.tar.gz  20180802_2112.tar.gz  20180807_0012.tar.gz
20180703_0012.tar.gz  20180708_0012.tar.gz  20180713_1412.tar.gz  20180718_2112.tar.gz  20180724_0012.tar.gz  20180729_2112.tar.gz  20180803_0012.tar.gz  abspacetest.sh
20180703_0712.tar.gz  20180708_1412.tar.gz  20180713_2112.tar.gz  20180719_0012.tar.gz  20180724_1412.tar.gz  20180730_0012.tar.gz  20180803_0712.tar.gz  
20180703_1412.tar.gz  20180708_2112.tar.gz  20180714_0012.tar.gz  20180719_1412.tar.gz  20180724_2112.tar.gz  20180730_1412.tar.gz  20180803_1412.tar.gz
20180703_2112.tar.gz  20180709_0012.tar.gz  20180714_1412.tar.gz  20180719_2112.tar.gz  20180725_0012.tar.gz  20180730_2112.tar.gz  20180803_2112.tar.gz
20180704_0012.tar.gz  20180709_1412.tar.gz  20180714_2112.tar.gz  20180720_0012.tar.gz  20180725_1412.tar.gz  20180731_0012.tar.gz  20180804_0012.tar.gz
20180704_1412.tar.gz  20180709_2112.tar.gz  20180715_0012.tar.gz  20180720_1412.tar.gz  20180725_2112.tar.gz  20180731_0712.tar.gz  20180804_0712.tar.gz
20180704_2112.tar.gz  20180710_0012.tar.gz  20180715_1412.tar.gz  20180720_2112.tar.gz  20180726_0012.tar.gz  20180731_1412.tar.gz  20180804_1412.tar.gz
20180705_0012.tar.gz  20180710_1412.tar.gz  20180715_2112.tar.gz  20180721_0012.tar.gz  20180726_1412.tar.gz  20180731_2112.tar.gz  20180805_0012.tar.gz
[asp@abcsd34 ~]$ cat abspacetest.sh
{ cd /home/asp; stat -fc"%b %a %S" .; stat -c"%n %b %B" 2018*.tar.gz | sort; } | awk '
NR == 1         {print Needed = ($1 * PCT - $2) * $3
                 next
                }

                {split ($1, T, "_")
                 if (T[1] == KnownTime) print "echo rm " LastFile
                 KnownTime = T[1]
                 LastFile = $1
                 print SUM += $2 * $3
                 if (SUM >= Needed) exit
                }
 ' PCT=0.4
[asp@abcsd34 ~]$ sh abspacetest.sh
5.63358e+09
251428864
echo rm 20180701_0012.tar.gz
502857728
echo rm 20180701_0712.tar.gz
754286592
echo rm 20180701_1412.tar.gz
1005715456
1257144320
echo rm 20180702_0012.tar.gz
1508573184
echo rm 20180702_0712.tar.gz
1760002048
echo rm 20180702_1412.tar.gz
2011430912
2262859776
echo rm 20180703_0012.tar.gz
2514288640
echo rm 20180703_0712.tar.gz
2765717504
echo rm 20180703_1412.tar.gz
3017146368
3268575232
echo rm 20180704_0012.tar.gz
3520004096
echo rm 20180704_1412.tar.gz
3771432960
4022861824
echo rm 20180705_0012.tar.gz
4274290688
echo rm 20180705_1412.tar.gz
4525719552
4777148416
echo rm 20180706_0012.tar.gz
5028577280
echo rm 20180706_1412.tar.gz
5280006144
5531435008
echo rm 20180707_0012.tar.gz
5782863872
[asp@abcsd34 ~]$

So after every echo, the printed line is the cumulative size of the data to be removed? File-wise, I think it is showing the first 3 files of each day (based on names) marked for deletion; that's good. I think I am almost at what I need. Thank you very much; awaiting your comments.

Unfortunately, your upfront df -h shows neither the file system you want to "clean" nor the files' sizes. But we can see that (what I think are) the sizes are summed up, starting with the oldest, leaving out the youngest per day, and exiting when the target is reached or topped. What concerns me is that, obviously, the sizes of the files left out are summed up nevertheless (this is what I could not test over here). Please try this modification and report back:

                if (T[1] == KnownTime)  {print "echo rm " LastFile

                                         print SUM += $2 * $3
                                         if (SUM >= Needed) exit
                                        }

Hi RudiC,

Here is the changed script; I ran it and the output is below:

[asp@abcsd34 ~]$ sh abc.sh
5.6336e+09
[asp@abcsd34 ~]$ cat abc.sh
{ cd /home/asp; stat -fc"%b %a %S" .; stat -c"%n %b %B" 2018*.tar.gz | sort; } | awk '
NR == 1         {print Needed = ($1 * PCT - $2) * $3
                 next
                }

                {split ($1, T, "_")
                 if (T[1] == KnownTime)
                  {print "echo rm " LastFile
                   print SUM += $2 * $3
                   if (SUM >= Needed) exit
                  }
                 }
' PCT=0.4
[asp@abcsd34 ~]$ sh abc.sh
5.6336e+09
[asp@abcsd34 ~]$

Sorry to have been imprecise - you should have left

 KnownTime = T[1]
 LastFile = $1

where it was, just move the two lines (SUM & exit) into the if branch...
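Assembled for the record, the corrected script from the posts above would then read as follows (a sketch only, with the debug prints dropped and the rm commands still guarded by echo; untested here, and the stat -f format options are GNU coreutils, as used earlier in the thread):

```shell
# awk program: line 1 is the stat -f output (total blocks, available blocks,
# block size); remaining lines are "name blocks blocksize", sorted by name.
# SUM and the exit test now live inside the if branch, so only files that
# are actually marked for deletion count towards the target.
prune='
NR == 1 {Needed = ($1 * PCT - $2) * $3
         next
        }
        {split ($1, T, "_")
         if (T[1] == KnownTime) {print "echo rm " LastFile
                                 SUM += $2 * $3
                                 if (SUM >= Needed) exit
                                }
         KnownTime = T[1]
         LastFile = $1
        }'

{ cd /backup && stat -fc"%b %a %S" . && stat -c"%n %b %B" 2018*.tar.gz | sort; } |
awk "$prune" PCT=0.4    # pipe into sh once the rm list looks right
```

Note the newest file per day is never printed, and the script stops as soon as the marked files add up to the Needed byte count.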

1 Like