remember processed files

sidorenko · September 24, 2009, 4:50pm

Hello dear community!

I have the following task: there is a directory with approximately 2,000 files. I have to write a script that randomly extracts 200 files on its first run. On the second run it should extract another 200 files, but those files must not overlap with the ones extracted during the first run. So I have to remember the names (or perhaps the inodes) of the files already extracted. What do you think is the best way to do that?

So far my plan is to create a new file holding a list of the inodes of the already extracted files. On subsequent runs, my script will check whether the inodes of the randomly chosen files are already in that list. What do you think of this approach? Are there other, perhaps more elegant, ways to remember (or mark) which files have already been extracted?
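The record-keeping idea described in this post could look roughly like the sketch below. It records file names rather than inodes, which is a simplification. The function name `pick_new` and the record file are hypothetical, and the sketch assumes GNU `grep` and `shuf`, file names without newlines, and a record file kept outside the source directory.

```shell
#!/bin/sh
# Sketch only: pick_new and the "seen" record file are made-up names.
# Assumes GNU grep and shuf, and file names without newlines.

# pick_new SRC_DIR SEEN_LIST [COUNT]
# Prints up to COUNT random file names from SRC_DIR that are not yet
# recorded in SEEN_LIST, then appends them to SEEN_LIST.
pick_new() {
    src=$1 seen=$2 count=${3:-200}
    touch "$seen"

    # -v: drop matching lines, -x: whole-line match, -F: fixed strings,
    # -f: read the names to exclude from the record file.
    new=$(ls -1 "$src" | grep -vxFf "$seen" | shuf -n "$count")

    # Nothing left to pick: leave the record untouched.
    [ -n "$new" ] || return 0

    # Print the chosen names and remember them for later runs.
    printf '%s\n' "$new" | tee -a "$seen"
}
```

A call such as `pick_new /path/to/files /path/to/seen.list 200` (both paths are placeholders) would print a fresh batch on every run until the directory is exhausted.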

peterro · September 24, 2009, 5:21pm

Don't have code for you but you could just take a list of all files and randomize them. Then take the first 200, then next 200, etc.
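A minimal sketch of this shuffle-once idea, assuming GNU `shuf` is available and file names contain no newlines. The function name `next_batch` and the state files `shuffled.list` and `offset.txt` are made up for illustration, and the state directory is assumed to be separate from the directory being processed.

```shell
#!/bin/sh
# Sketch only: next_batch, shuffled.list and offset.txt are made-up names.
# Assumes GNU shuf and file names without newlines.

# next_batch SRC_DIR STATE_DIR [BATCH_SIZE]
# Prints the next BATCH_SIZE file names from a list that is shuffled once.
next_batch() {
    src=$1 state=$2 size=${3:-200}
    list=$state/shuffled.list
    offset_file=$state/offset.txt

    # First run: shuffle the directory listing once and remember it.
    if [ ! -f "$list" ]; then
        ls -1 "$src" | shuf > "$list"
        echo 1 > "$offset_file"
    fi

    from=$(cat "$offset_file")
    till=$((from + size - 1))

    # Print the current slice, then advance the offset for the next run.
    sed -n "${from},${till}p" "$list"
    echo $((till + 1)) > "$offset_file"
}
```

Because the order is fixed on the first run, later batches cannot repeat earlier ones. Files added to the directory afterwards are not picked up unless the list is rebuilt.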

varontron · September 24, 2009, 5:25pm

Can you mv the files to a new directory?

Or can you cp the files to a new directory and then diff the 'ls -1' output of the two directories?
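The mv variant might be sketched as follows: files that have been handled simply leave the source directory, so no separate record is needed. The function name `take_batch` is hypothetical. The sketch assumes GNU `shuf`, a source directory containing only regular files with newline-free names, and a destination directory outside the source directory.

```shell
#!/bin/sh
# Sketch only: take_batch is a made-up name.
# Assumes GNU shuf and a source directory holding only regular files.

# take_batch SRC_DIR DONE_DIR [BATCH_SIZE]
# Moves up to BATCH_SIZE random files from SRC_DIR to DONE_DIR and
# prints the name of each file that was moved.
take_batch() {
    src=$1 done_dir=$2 size=${3:-200}
    mkdir -p "$done_dir"

    ls -1 "$src" | shuf -n "$size" | while IFS= read -r name; do
        # "--" keeps names that start with "-" from being read as options.
        mv -- "$src/$name" "$done_dir/" && printf '%s\n' "$name"
    done
}
```

Whatever remains in the source directory is, by construction, still unprocessed.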

edidataguy · September 24, 2009, 11:09pm

What you need is not to remember file names.
You only need to keep a count of the files already processed.
Try this.

 
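#-- Work in a separate directory, away from the original files
cd /dum/dumma/here/ || exit 1

if [ ! -r counter.txt ]; then
    echo "1" > counter.txt
fi

typeset -i from=$(<counter.txt)
typeset -i till=$((from + 199))

#-- If you want, you can merge this line with "| sed",
#-- but keeping the list in a file has its own advantages.
#-- /path/to/org/files is a placeholder for the directory with the original files;
#-- a plain "ls -1" here would list counter.txt and dummyy.txt themselves.
ls -1 /path/to/org/files > dummyy.txt

#-- Pass lines from..till to the processing script, then save where to resume
sed -n "${from},${till}p" dummyy.txt | do_some_thing.sh
echo $((till + 1)) > counter.txt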

dr.house · September 25, 2009, 12:47am

I'd create a numbered list of the files (see code below), then use those numbers for random selection, and finally remove list entries as the corresponding files are extracted ...

ls -1 "$FOLDER" | nl -n ln > files.list
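One possible reading of the remaining steps is sketched below: entries are drawn at random from the numbered list and then deleted from it, so they cannot be drawn again. The function names `make_list` and `draw` are made up for illustration. The sketch assumes GNU `nl`, `shuf` and `grep`, file names without newlines, and a list file stored outside the folder being listed.

```shell
#!/bin/sh
# Sketch only: make_list and draw are made-up names.
# Assumes GNU nl, shuf and grep, and file names without newlines.

# make_list FOLDER LIST
# Writes one "number<TAB>name" line per file in FOLDER to LIST.
make_list() {
    ls -1 "$1" | nl -n ln > "$2"
}

# draw LIST [COUNT]
# Prints up to COUNT randomly chosen names and removes them from LIST.
draw() {
    list=$1 count=${2:-200}
    picked=$list.picked

    # Pick random entries; the line number makes every entry unique.
    shuf -n "$count" "$list" > "$picked"

    # Keep only the entries that were not picked (exact whole-line match).
    grep -vxFf "$picked" "$list" > "$list.tmp"
    mv "$list.tmp" "$list"

    # Strip the number column and print just the file names.
    cut -f2- "$picked"
    rm -f "$picked"
}
```

The list shrinks with every call, so an empty list means every file has been handed out.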

sidorenko · September 25, 2009, 2:25am

Thank you very much for your ideas. They were very useful to me. Indeed, creating a randomized list of files once and then working from it is much more efficient than randomizing the files every time my script runs. Thanks