How to find all files other than last two dates per month and year?

Makarand_Dodmis · April 28, 2014, 11:59am

Hi All,
Let's say the following files are in a directory:

-rwxr-xr-x   1 user  userg         1596 Mar 19 2012 a.txt
-rwxr-xr-x   1 user  userg         1596 Mar 19 2012 b.txt
-rwxr-xr-x   1 user  userg         1596 Mar 22 2012 c.txt
-rwxr-xr-x   1 user  userg         1596 Mar 24 2012 d.txt
-rwxr-xr-x   1 user  userg         1596 Mar 25 2012 e.txt
-rwxr-xr-x   1 user  userg         1596 Mar 25 2012 eio.txt
-rwxr-xr-x   1 user  userg         1596 Mar 27 2012 ee.txt
-rwxr-xr-x   1 user  userg         1596 Feb 12 2012 f.txt
-rwxr-xr-x   1 user  userg         1596 Feb 12 2012 g.txt
-rwxr-xr-x   1 user  userg         1596 Feb 22 2012 h.txt
-rwxr-xr-x   1 user  userg         1596 Feb 23 2012 i.txt
-rwxr-xr-x   1 user  userg         1596 Feb 28 2012 j.txt
-rwxr-xr-x   1 user  userg         1596 Feb 28 2012 jj.txt
-rwxr-xr-x   1 user  userg         1596 Apr 02 2013 k.txt
-rwxr-xr-x   1 user  userg         1596 Apr 11 2013 l.txt
-rwxr-xr-x   1 user  userg         1596 Apr 11 2013 m.txt
-rwxr-xr-x   1 user  userg         1596 Apr 23 2013 n.txt
-rwxr-xr-x   1 user  userg         1596 Apr 27 2013 o.txt
-rwxr-xr-x   1 user  userg         1596 Apr 29 2013 oo.txt

For Mar 2012, the last two dates are Mar 25 and Mar 27.

For Feb 2012, the last two dates are Feb 23 and Feb 28.

For Apr 2013, the last two dates are Apr 27 and Apr 29.

So I want all files in the directory other than those dated on the days listed above.

output should be:

-rwxr-xr-x   1 user  userg         1596 Mar 19 2012 a.txt
-rwxr-xr-x   1 user  userg         1596 Mar 19 2012 b.txt
-rwxr-xr-x   1 user  userg         1596 Mar 22 2012 c.txt
-rwxr-xr-x   1 user  userg         1596 Mar 24 2012 d.txt
-rwxr-xr-x   1 user  userg         1596 Feb 12 2012 f.txt
-rwxr-xr-x   1 user  userg         1596 Feb 12 2012 g.txt
-rwxr-xr-x   1 user  userg         1596 Feb 22 2012 h.txt
-rwxr-xr-x   1 user  userg         1596 Apr 02 2013 k.txt
-rwxr-xr-x   1 user  userg         1596 Apr 11 2013 l.txt
-rwxr-xr-x   1 user  userg         1596 Apr 11 2013 m.txt
-rwxr-xr-x   1 user  userg         1596 Apr 23 2013 n.txt

---------- Post updated at 10:59 AM ---------- Previous update was at 09:02 AM ----------

guys Any help??

blackrageous · April 28, 2014, 12:34pm

ls -l | egrep -v "Mar 2[57]|Feb 2[38]|Apr 2[79]"

Did not test this.

Makarand_Dodmis · April 29, 2014, 4:23am

I don't want anything hardcoded; the directory can contain files from any month and any year.

Don_Cragun · April 29, 2014, 4:37am

You have asked several similar questions and been given code that solved those problems. Based on what you learned from those previous examples, what have you tried?

We are happy to help you learn how to use the tools that are available on UNIX and Linux systems, but we are not here to act as your unpaid programming staff.

vbe · April 29, 2014, 11:41am

My 2 cents:
Since you have changed the request so many times, it is now clear to me that any solution will be somewhat complex. 23 years ago I would have solved this by writing a COBOL program (don't laugh, I wrote a French conjugation program in COBOL for my diploma; the biggest part was the algorithm, which used only 2 arrays of endings for all verbs ending in -er ...).
Oracle taught me to always start with the table that has fewer records.
So one way of speeding up the whole process would be to split the task into smaller ones: make lists of files per month and year; processing these lists could then be parallelised.
Processing one list:
Sort by date and read the last file's date, then compare it with each previous file's date until the date changes, saving the file names. At the date change start again (still saving the file names), then stop at the next date change.
You now have all the files that fall on the last 2 dates of the month: a grep -v with this new list gives the files you want.
Repeat that for every month list you have.
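
The steps above can be sketched roughly as follows. This is only a sketch under assumptions, not a tested production script: keep_older is a made-up helper name, the input is ls -l output in which field 8 is always a year (true once files are older than about six months), file names contain no whitespace, and mktemp -d is available.

```shell
#!/bin/sh
# keep_older: hypothetical helper, not from the thread.
# Reads "ls -l" lines on stdin and prints those that are NOT on the
# two latest dates of their month. Assumes field 8 is a year.
keep_older() {
    work=$(mktemp -d) || return 1

    # Step 1: one list per year and month, e.g. "$work/2012.Mar".
    # NF >= 9 skips the "total" line that ls -l prints first.
    awk -v dir="$work" 'NF >= 9 { f = dir "/" $8 "." $6; print >> f; close(f) }'

    for list in "$work"/*.*; do
        [ -f "$list" ] || continue
        # Step 2: the two latest distinct days of this month.
        awk '{ print $7 + 0 }' "$list" | sort -rnu | head -n 2 > "$work/days"
        # Step 3: drop every line that falls on one of those days.
        awk 'NR == FNR { skip[$1]; next }
             { day = $7 + 0 }
             !(day in skip)' "$work/days" "$list"
    done

    rm -rf "$work"
}
```

Something like ls -l | keep_older should then list the purge candidates. The month lists are processed one after another here; they are independent of each other, so they could also be handled in parallel.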

Makarand_Dodmis · April 29, 2014, 11:42am

Until now, I only get the expected result with the script below:

nawk '$6!=m{m=$6; c=0} {if($7!=d){if(c++>n)print b p; b=x} else b=b p ORS} {p=$0; d=$7}' n=2

This was created by Scrutinizer, but frankly speaking I am not able to reuse it for this thread.

I also currently have a solution for this thread, but it takes 40-45 minutes and it uses for loops.

So a better solution would be welcome.

vbe · April 29, 2014, 11:57am

I saw your solution, and it takes time because it compares dates across all the files...
I don't know what box this is running on, but using month lists can speed things up quite a lot because each list has fewer files, and more than one month can be processed at a time. Because you don't want the last 2 files of a month but the files on the 2 latest dates, you will have to find the last date and compare; once you have that date, also find the other files with that date...
I did write a script for you just before Easter, but I went on vacation and someone removed it...

Don_Cragun · April 29, 2014, 12:09pm

So please show us your working solution.

Makarand_Dodmis · April 29, 2014, 12:19pm

function criteria_purge
{
    tradecheck=`pwd`
    echo "Files eligible for purging...."
    $1 | while read dm_file
    do

        day=`perl -MPOSIX -le 'print strftime "%d", localtime((lstat)[9]) for @ARGV' "$dm_file"`
        year=`perl -MPOSIX -le 'print strftime "%Y ", localtime((lstat)[9]) for @ARGV' "$dm_file"`
        mon=`perl -MPOSIX -le 'print strftime "%b", localtime((lstat)[9]) for @ARGV' "$dm_file"`
        dday=`expr $day + 0`
        cntr=0
        cntr2=0
        flag1=0
        flag2=0

        ls -ltr * | nawk -v mon=$mon -v year=$year '{if ($8 == year && $6 == mon) {print $9}}' | while read inner_file
        do
            imfile=`find $tradecheck -name $inner_file`
            i_day=`perl -MPOSIX -le 'print strftime "%d", localtime((lstat)[9]) for @ARGV' "$imfile"`
            i_dday=`expr $i_day + 0`

            purge_last "$dday" "$i_dday"
        done
        if [ $flag1 -eq 0 ] && [ $flag2 -eq 0 ]; then
            echo $dm_file $day $mon $year
        fi
    done
}
function purge_last
{
    if [ $1 -ge $2 ]; then
        flag1=1
    else
        if [ $cntr2 -eq 0 ]; then
            check_day=$2
            cntr2=`expr $cntr2 + 1`
        fi

        if [ ! $check_day -eq $2 ]; then
            cntr=`expr $cntr + 1`
        fi
    fi
    if [ $cntr -eq 0 ]; then
        flag2=1
    else
        flag1=0
        flag2=0
    fi
}
# purge the files which are 90 days older
cd /purge_dir
var3="find . ! -name -prune -type f -mtime +90"
criteria_purge "$var3"

Also, I am new to Unix, so I need some time and guidance to learn awk and perl.

Don_Cragun · April 30, 2014, 2:35am

Your script is slow because it is invoking several utilities ( perl (multiple times), awk , and find ) for each file it is processing.

And although it is invoking perl three times to get the month, day, and year for each file and again for each file that it is being compared to, the awk statement that is looking for a match on the month and year is still using the ls timestamp or year field to compare against the year field for the current file. Therefore, it is not listing all of the files eligible for purging that are in months that contain days that are 90 to 180 days ago. For example, in a directory that contains the files:

-rw-r--r--  2 dwc  staff     0 Oct 31 12:00 z.txt
-rw-r--r--  2 dwc  staff     0 Oct 30 13:00 z10.2.txt
-rw-r--r--  2 dwc  staff     0 Oct 30 12:00 z10.txt
-rw-r--r--  2 dwc  staff     0 Oct  1  2013 b.txt

your script will not list b.txt as a purge candidate.
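
To see why b.txt is missed, here is a small sketch of the month/year test from the inner loop, run against the listing above. The wrapper name same_month is hypothetical and used only for this illustration; the awk condition is the one from your script.

```shell
#!/bin/sh
# same_month: hypothetical wrapper around the condition used in the
# inner loop. Usage: same_month MON YEAR < ls-l-lines
same_month() {
    awk -v mon="$1" -v year="$2" '$8 == year && $6 == mon { print $9 }'
}

# For the three recent files field 8 is a time such as 12:00, not a
# year, so they never compare equal to 2013.
same_month Oct 2013 <<'EOF'
-rw-r--r--  2 dwc  staff     0 Oct 31 12:00 z.txt
-rw-r--r--  2 dwc  staff     0 Oct 30 13:00 z10.2.txt
-rw-r--r--  2 dwc  staff     0 Oct 30 12:00 z10.txt
-rw-r--r--  2 dwc  staff     0 Oct  1  2013 b.txt
EOF
# prints only: b.txt
```

Since b.txt looks like the only October 2013 file, it is treated as being on one of the last two dates of its month and is not reported.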

---------

One of your find statements:

find . ! -name -prune -type f -mtime +90

is weird. Are you really trying to exclude a file named -prune ? Were you, perhaps, trying to exclude files in subdirectories instead? That would be:

find . ! -name . -prune -type f -mtime +90

but it still won't work because you have another find statement nested inside the loop that doesn't ignore subdirectories. So, assuming that /purge_dir doesn't contain any subdirectories, you just need:

find . -type f -mtime +90

(Note that you can process directories with subdirectories as long as there aren't any files in the subdirectories with the same names as files in /purge_dir if you make the change suggested above.)
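
The difference between the two find forms can be checked on a throwaway directory. This is a minimal sketch: the directory and file names are made up, mktemp -d is assumed to be available, and -mtime +90 is left out because the test files are brand new.

```shell
#!/bin/sh
# Throwaway tree: one file at the top level, one in a subdirectory.
demo=$(mktemp -d) || exit 1
mkdir "$demo/sub"
: > "$demo/top.txt"
: > "$demo/sub/nested.txt"

# Plain find descends into sub and lists both files.
all=$(cd "$demo" && find . -type f | sort)

# "! -name . -prune" prunes everything except the starting point ".",
# so find never descends into sub and lists only ./top.txt.
top_only=$(cd "$demo" && find . ! -name . -prune -type f | sort)

printf 'recursive:\n%s\ncurrent directory only:\n%s\n' "$all" "$top_only"
rm -rf "$demo"
```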

Assuming that you are using a Solaris system (since your script contains nawk instead of awk ) and that you're using an old Bourne shell (rather than ksh or bash, since you're using the `command` form of command substitution rather than $(command) ), the following should work for you. In a test on a small directory with one subdirectory containing the files:

ls -lR
total 24
-rwxr-xr-x   1 dwc  staff  1512 Apr 29 16:07 Makarand.sh
-rw-r--r--   2 dwc  staff     0 Feb 21  2012 a.txt
-rw-r--r--   3 dwc  staff     0 Oct  1  2013 b.txt
-rw-r--r--   3 dwc  staff     0 Mar 19  2012 c.txt
-rw-r--r--   3 dwc  staff     0 Mar 21  2012 d.txt
-rw-r--r--   3 dwc  staff     0 Apr 12 01:02 e.txt
-rw-r--r--   3 dwc  staff     0 Mar 22  2012 f.txt
-rw-r--r--   3 dwc  staff     0 Apr 21 03:04 g.txt
-rw-r--r--   3 dwc  staff     0 Mar 24  2012 h.txt
-rw-r--r--   3 dwc  staff     0 Apr 22 05:06 i.txt
-rw-r--r--   2 dwc  staff     0 Feb 27  2012 j.txt
-rw-r--r--   2 dwc  staff     0 Feb 23  2012 k.txt
-rw-r--r--   3 dwc  staff     0 Apr 23 07:08 m.txt
-rw-r--r--   3 dwc  staff     0 Apr 27 09:10 n.txt
-rw-r--r--   1 dwc  staff  2636 Apr 29 10:01 problem
-rw-r--r--   2 dwc  staff     0 Feb 12  2012 q.txt
-rw-r--r--   2 dwc  staff     0 Feb 22  2012 s.txt
drwxr-xr-x  16 dwc  staff   544 Apr 29 13:32 sub
-rwxr-xr-x   1 dwc  staff   832 Apr 29 16:43 tester
-rw-r--r--   3 dwc  staff     0 Mar  1  2013 y.txt
-rw-r--r--   3 dwc  staff     0 Oct 31 12:00 z.txt
-rw-r--r--   3 dwc  staff     0 Oct 30 13:00 z10.2.txt
-rw-r--r--   3 dwc  staff     0 Oct 30 12:00 z10.txt

./sub:
total 0
-rw-r--r--  3 dwc  staff  0 Oct  1  2013 b.txt
-rw-r--r--  3 dwc  staff  0 Mar 19  2012 c.txt
-rw-r--r--  3 dwc  staff  0 Mar 21  2012 d.txt
-rw-r--r--  3 dwc  staff  0 Apr 12 01:02 e.txt
-rw-r--r--  3 dwc  staff  0 Mar 22  2012 f.txt
-rw-r--r--  3 dwc  staff  0 Apr 21 03:04 g.txt
-rw-r--r--  3 dwc  staff  0 Mar 24  2012 h.txt
-rw-r--r--  3 dwc  staff  0 Apr 22 05:06 i.txt
-rw-r--r--  3 dwc  staff  0 Apr 23 07:08 m.txt
-rw-r--r--  3 dwc  staff  0 Apr 27 09:10 n.txt
-rw-r--r--  3 dwc  staff  0 Mar  1  2013 y.txt
-rw-r--r--  3 dwc  staff  0 Oct 31 12:00 z.txt
-rw-r--r--  3 dwc  staff  0 Oct 30 13:00 z10.2.txt
-rw-r--r--  3 dwc  staff  0 Oct 30 12:00 z10.txt

the script:

#!/bin/sh
criteria_purge() {	# POSIX function syntax; "function" and "print" are ksh-only
	echo "Files eligible for purging...."
	ls -lt `$1` | /usr/xpg4/bin/awk -v cy=`date +%Y` '
	BEGIN {	y["Jan"] = y["Feb"] = y["Mar"] = cy
		y["Apr"] = y["May"] = y["Jun"] = cy
		y["Jul"] = y["Aug"] = y["Sep"] = cy - 1
		y["Oct"] = y["Nov"] = y["Dec"] = cy - 1
	}
	NF > 8 {if(length($8) == 4)	# Do we have a year or a timestamp?
			yr = $8		#   year
		else	yr = y[$6]	#   timestamp
	}
	lmo != $6 || lyr != yr {
		dim = ld = 0
		lmo = $6
		lyr = yr
	}
	ld != $7 {
		dim++
		ld = $7
	}
	dim > 2 {
		printf("%s %s %s %s\n", $9, $7, $6, yr)
	}'
}

cd /purge_dir
# Uncomment one, and only one, of the following definitions for var3.
# Use following line to process files in current directory and subdirectories.
# var3="find . -type f -mtime +90"
# Use to process files in current directory only.
var3="find . ! -name . -prune -type f -mtime +90"
criteria_purge "$var3"

produces the output:

./b.txt 1 Oct 2013
./d.txt 21 Mar 2012
./c.txt 19 Mar 2012
./s.txt 22 Feb 2012
./a.txt 21 Feb 2012
./q.txt 12 Feb 2012

in about 0.02 seconds on an old MacBook Pro laptop, while your script (modified to use the same setting for var3) produces the output:

./a.txt 21 Feb 2012
./q.txt 12 Feb 2012
./s.txt 22 Feb 2012

in about 3.51 seconds.

If I switch the setting of var3 from:

var3="find . ! -name . -prune -type f -mtime +90"

to:

var3="find . -type f -mtime +90"

in both scripts, your script produces the output:

Files eligible for purging....
./a.txt 21 Feb 2012
./q.txt 12 Feb 2012
./s.txt 22 Feb 2012

in about 5.84 seconds, while the script above produces the output:

./b.txt 1 Oct 2013
./sub/b.txt 1 Oct 2013
./d.txt 21 Mar 2012
./sub/d.txt 21 Mar 2012
./c.txt 19 Mar 2012
./sub/c.txt 19 Mar 2012
./s.txt 22 Feb 2012
./a.txt 21 Feb 2012
./q.txt 12 Feb 2012

still in about 0.02 seconds. I believe the above script is producing the desired output.

However, the order of the output from the above script is sorted in decreasing date order instead of being sorted in increasing alphanumeric filename order. If you want the script above to print the results in alphanumeric order, change the line:

}'

at the end of the awk script to:

	}' | sort

Doing that will add about another 0.01 seconds running time for the sample data shown.

If the argument list given to ls is too long, we can work on an alternative, but it won't be quite as fast.
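
If the argument list ever does get too long, one possible alternative (a sketch only, not the script used in this thread) is to let find run ls in batches with -exec ... {} + and to make the awk part independent of input order, since the batches are no longer sorted as a whole. The name older_than_last_two is hypothetical. The sketch assumes field 8 is always a year, which holds with -mtime +180, and that file names contain no whitespace.

```shell
#!/bin/sh
# older_than_last_two: hypothetical filter, not from the thread.
# Reads "ls -l" lines in ANY order and prints those that are not on
# the two latest dates of their year and month.
older_than_last_two() {
    awk '
    NF >= 9 {
        n++
        key = $8 " " $6                # year and month, e.g. "2012 Mar"
        day = $7 + 0
        line[n] = $0; k[n] = key; d[n] = day
        if (!(key in top1) || day > top1[key]) {
            if (key in top1)
                top2[key] = top1[key]  # old maximum becomes runner-up
            top1[key] = day
        } else if (day < top1[key] && (!(key in top2) || day > top2[key])) {
            top2[key] = day
        }
    }
    END {
        for (i = 1; i <= n; i++) {
            if (d[i] == top1[k[i]])
                continue               # latest date of the month
            if ((k[i] in top2) && d[i] == top2[k[i]])
                continue               # second latest date
            print line[i]
        }
    }'
}
```

It could be called as find . -type f -mtime +180 -exec ls -l {} + | older_than_last_two. All lines are held in memory until the end of input, which should be fine for a few thousand files but is part of why this is not quite as fast.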

Makarand_Dodmis · April 30, 2014, 5:26am

Thanks, Don, for your comments.

1) I want to consider subdirectories.
2) I want to delete all files older than 180 days, so it is fine that b.txt is not in the list. The find command had 90 because testing was still going on and I forgot to change it; it should be 180.
3) Yes, the subdirectories contain around 2000 files, hence it is taking 40-45 mins.
4) I am using Solaris + ksh.

Don_Cragun · April 30, 2014, 2:07pm

So did you try my suggestion with:

var3="find . -type f -mtime +180"

How long did it take? Or, did you hit an arg max limit on the ls -lt ?

What were you trying to do in:

find . ! -name -prune -type f -mtime +90

with the ! -name -prune operands?

In the future when you present problems like this, mention that you're working on a file hierarchy (rather than just files in a single directory). Knowing what we're trying to do makes life easier for all of us and will get you suggestions that apply to your situation MUCH faster.

Makarand_Dodmis · May 1, 2014, 5:39am

I have changed it to:

find . -type f -mtime +180

but it does not save much, as it is called only once in the script.

I am not hitting the arg max limit on the ls -lt.

Don_Cragun · May 1, 2014, 1:07pm

The question was how long does this script take:

#!/bin/ksh
function criteria_purge {
	print "Files eligible for purging...."
	ls -lt $($1) | /usr/xpg4/bin/awk -v cy=`date +%Y` '
	BEGIN {	y["Jan"] = y["Feb"] = y["Mar"] = cy
		y["Apr"] = y["May"] = y["Jun"] = cy
		y["Jul"] = y["Aug"] = y["Sep"] = cy - 1
		y["Oct"] = y["Nov"] = y["Dec"] = cy - 1
	}
	NF > 8 {if(length($8) == 4)	# Do we have a year or a timestamp?
			yr = $8		#   year
		else	yr = y[$6]	#   timestamp
	}
	lmo != $6 || lyr != yr {
		dim = ld = 0
		lmo = $6
		lyr = yr
	}
	ld != $7 {
		dim++
		ld = $7
	}
	dim > 2 {
		printf("%s %s %s %s\n", $9, $7, $6, yr)
	}'
}

cd /purge_dir
criteria_purge "find . -type f -mtime +180"

Makarand_Dodmis · May 2, 2014, 4:25am

Thanks a ton Don!!!
I like your attitude of fighting until the problem gets solved.

Previously it took 35 mins and now it takes 0.9 secs.
I should say at least 10 thanks, but there is no such button on the forum...
Hats off!

alister · May 2, 2014, 4:42am

I agree. Don's patience is extraordinary. I'm tempted to let the enterprise reap what they've sown.

Regards,
Alister

Scrutinizer · May 2, 2014, 4:55am

FWIW, an adaptation to my previous script...

nawk '$6!=m{m=$6; c=0} {if($7!=p) {A[++c]=$0; if(c>2)print A[c-n]} else A[c]=A[c] ORS $0} {p=$7}' n=2

Mind you, parsing ls output like this may not be reliable, since ls prints a time instead of the year for files less than 6 months old.
In the POSIX locale, at least $6 and $7 are stable, so it may be best to use:

LANG=C ls -ltr
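
If recent files have to be handled too, one way to cope with the time-instead-of-year column is to infer the year from the month. The following is only a sketch: with_year is a hypothetical filter name, and the rule it applies (a month later than the current one must belong to the previous year) is an assumption that relies on ls printing a time only for files from roughly the last six months.

```shell
#!/bin/sh
# with_year: hypothetical filter, not from the thread.
# Rewrites field 8 of "ls -l" lines to a year when it holds a time.
# Optional arguments: current year and current month number
# (they default to today's date).
with_year() {
    awk -v cy="${1:-$(date +%Y)}" -v cm="${2:-$(date +%m)}" '
    BEGIN {
        n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
        for (i = 1; i <= n; i++)
            num[name[i]] = i
    }
    $8 ~ /:/ {
        # A time means the file is recent: a month later than the
        # current one can only belong to the previous year.
        # Note: assigning to $8 collapses the column spacing.
        $8 = (num[$6] > cm + 0) ? cy - 1 : cy
    }
    { print }'
}
```

It could be used as LC_ALL=C ls -l | with_year, so that later stages can always treat field 8 as a year. LC_ALL is used here because it takes precedence over LANG and the other LC_* variables.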