print out missing files in a sequence

avatar_007 · March 30, 2012, 7:44pm

Hello all,

I have several directories with a sequence of files like this

IM-0001-0001.dcm
IM-0001-0002.dcm
IM-0001-0003.dcm
IM-0001-0004.dcm
IM-0001-0005.dcm

I would like to print out the name of the file that is missing.

I currently do this in the following inefficient way, and I am wondering if you could suggest a better way that also works across multiple directories.

ls -1 *.dcm | awk -F"-" '{print $3}' > ori.txt

[]$ cat ori.txt 
0001.dcm
0002.dcm
0004.dcm
0005.dcm

Create another list with all files that are supposed to be there

[]$ cat main.txt 
0001.dcm
0002.dcm
0003.dcm
0004.dcm
0005.dcm

[]$ diff ori.txt main.txt 
2a3
> 0003.dcm

It would be good if I could display the full name of the missing file.

Thanks,

Corona688 · March 30, 2012, 8:32pm

The trouble with detecting holes in sequences is, how do you detect a hole at the beginning, or the end? Unless you really do know what files are supposed to be there, you're going to be reduced to guessing in some situations no matter what.

Will there ever be more than one sequence in this folder, or just the one?
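If the expected range really is known in advance, there is no need to guess: you can generate each expected name and test whether it exists. Here is a minimal sketch of that idea, not a tested production script. The function name list_missing and the hard-coded IM-0001-%04d.dcm pattern are assumptions made for illustration, so adjust them to the real naming scheme.

```shell
# Hypothetical helper: print the full path of every missing file
# in DIR for counters FIRST..LAST (inclusive).
# Usage: list_missing DIR FIRST LAST
# Pass FIRST and LAST as plain decimals (1, not 0001); numbers with
# leading zeros may be read as octal by the shell and by printf.
list_missing() {
    dir=$1
    i=$2
    last=$3
    while [ "$i" -le "$last" ]; do
        # Assumed naming scheme: IM-0001-NNNN.dcm, zero-padded to 4 digits.
        name=$(printf 'IM-0001-%04d.dcm' "$i")
        [ -e "$dir/$name" ] || printf '%s\n' "$dir/$name"
        i=$((i + 1))
    done
}
```

For several directories, something like `for d in */; do list_missing "${d%/}" 1 5; done` would run it once per directory. Unlike sequence detection, this also reports holes at the beginning and the end of the range.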

Corona688 · March 30, 2012, 9:19pm

This can detect some kinds of sequences. It assumes anything with digits and an extension is part of a sequence, and tells different sequences apart from the string before the last set of digits and the extension. It doesn't need the files in sorted order.

$ cat missing.awk

X=match($0, /[0-9]+\.[^.]*$/) {
        Y=match($0, /\.[^.]*$/);
        PFIX=substr($0, 1, X-1); # IM-0001- (awk strings start at index 1; a start of 0 drops a character in some awks)
        EXT=substr($0, Y);        # .dcm
        VAL=substr($0, X, Y-X); # 0003

        # To check if the number of digits is changing.
        DIGITS[PFIX,EXT,length(VAL)]++;

        # The +0 is to guarantee a numeric sort, not alphabetic, so "01" < "2".
        # Test membership with "in" so that a legitimate minimum of 0 is not mistaken for "unset".
        if(!((PFIX,EXT) in SMIN) || (SMIN[PFIX,EXT]>(VAL+0))) SMIN[PFIX,EXT]=VAL+0;
        if(!((PFIX,EXT) in SMAX) || (SMAX[PFIX,EXT]<(VAL+0))) SMAX[PFIX,EXT]=VAL+0;
        F[PFIX,EXT,VAL]=1;
}

END {
        for(X in SMAX)
        {

                split(X, A, SUBSEP);
                PFIX=A[1];      EXT=A[2];

                DC=0;
                DMAX=0;
                for(Z in DIGITS)
                {
                        split(Z, A, SUBSEP);
                        if((A[1] != PFIX) || (A[2] != EXT)) continue;
                        if(A[3] > DMAX) DMAX=A[3];
                        DC++;
                }

                if(DC == 1)     CMDSTR="%0" DMAX "d"
                else            CMDSTR="%d"

                for(N=SMIN[X]+0; N<=(SMAX[X]+0); N++)
                {
                        VAL=sprintf(CMDSTR, N);
                        if(!F[PFIX,EXT,VAL])
                                print "Missing", PFIX VAL EXT;
                }
        }
}

$ touch IM-0001-{0001..0005}.dcm file-{8..15}.dat
$ rm IM-0001-0003.dcm file-9.dat file-11.dat
$ ls | awk -f missing.awk
Missing file-9.dat
Missing file-11.dat
Missing IM-0001-0003.dcm

$

Scrutinizer · March 31, 2012, 4:10am

Alternatively try this less general approach:

printf "%s\n" *.dcm | awk -F'[-.]' '$3>p+1{for(i=p+1;i<$3;i++){s=$0; sub($3"."$4,sprintf("%04d",i)"."$4,s); print s}}{p=$3}'

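This assumes that all files have a fixed-length, zero-padded counter in the third field, that they have an extension in the fourth field, and that all fields (and field separators) other than the third field are identical. The fixed-length zero padding also ensures that the wildcard expansion, which sorts lexically, delivers the files in numeric order. Note that a gap at the end of the sequence cannot be detected this way.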
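Since the original question mentioned multiple directories, here is one way the one-liner above might be wrapped to run in each of them and print the directory along with the missing name. This is only a sketch under the same assumptions as the one-liner. The function name missing_in_dirs and the awk variable dir are made up for illustration.

```shell
# Hypothetical wrapper: run the gap check in every directory given
# as an argument and prefix each missing file with its directory.
missing_in_dirs() {
    for d in "$@"; do
        (
            cd "$d" || exit
            set -- *.dcm
            # If the glob matched nothing, "$1" is the literal pattern.
            [ -e "$1" ] || exit 0
            printf '%s\n' "$@" |
            awk -v dir="$d" -F'[-.]' '
                # A jump of more than 1 in the counter means a gap.
                $3 > p + 1 {
                    for (i = p + 1; i < $3; i++) {
                        s = $0
                        sub($3 "." $4, sprintf("%04d", i) "." $4, s)
                        print dir "/" s
                    }
                }
                { p = $3 }'
        )
    done
}
```

It could be called as `missing_in_dirs dir1 dir2` or `missing_in_dirs */`. The subshell keeps the `cd` from affecting the calling shell.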

avatar_007 · April 2, 2012, 2:26pm

Thanks a lot for your help guys.

Scrutinizer: It works great. I will use code tags from next time on.