Finding consecutive numbers in version names on a txt file

fox1212 · March 11, 2011, 6:06am

Hi all.

I have a directory which contains files that can be versioned. All the files are named according to a pattern like this:

TEXTSTRING1-001.EXTENSION
TEXTSTRING2-001.EXTENSION
TEXTSTRING3-001.EXTENSION
...
TEXTSTRINGn-001.EXTENSION

If a file is versioned, a file called

TEXTSTRINGn-002.EXTENSION

would be created, then a 003, etc...

Working from a sorted list of the files in a txt file, I want to extract the names of the files that have been versioned, so I can move the older ones to another dir. That is, for a given line:

TEXTSTRINGn-j.EXTENSION

if the next line is
TEXTSTRINGn-(j+1).EXTENSION

then return the first of the two lines.

Could you help me with it? I don't know perl well enough to get it right, but I guess it's just a simple script.

Regards.

panyam · March 11, 2011, 10:10am

Please post a sample input and the expected output.
That will help us understand what you need.

fox1212 · March 12, 2011, 6:44am

Ok, a sample. Say this is the text file resulting from the sorted list of the directory:

abcd-001.pdf
dfcd-003.pdf
grdr-001.pdf
grdr-002.pdf
hhgt-001.pdf
htfr-001.pdf
htfr-002.pdf
htfr-003.pdf
htfr-004.pdf
jkdyr-001.pdf
tasd-018.pdf
tasd-019.pdf
yrdf-003.pdf
zgdr-001.pdf

As you can see, the file grdr-xxx has been versioned once, so there are the 001 and 002 versions, and htfr-xxx has been versioned three times, so we have 001 to 004.
The file tasd has also been versioned, and we have only the 018 and 019 versions because 001 to 017 were already deleted in previous runs.

What I want is to extract from a file like this the old versions. That is, from the sample above, I want to return the following entries:

grdr-001.pdf
htfr-001.pdf
htfr-002.pdf
htfr-003.pdf
tasd-018.pdf

so I can move or delete them and keep only the last version.

binlib · March 12, 2011, 6:54pm

Assuming there is only one dash and one dot in the file name, then

awk '$1==p{print f}{f=$0;p=$1}' FS='[-.]'

fox1212 · March 13, 2011, 2:41pm

Let's see if I read this right, as the code is a bit obscure to me:

$1==p{print f}  # Print f if $1 is equal to p
{f=$0;p=$1}  # Then store the entire line in f and up to the first dash in p
FS='[-.]'  # Last, select dash and dot as character separators.

Is it so? If it is, then a couple of things: I want to be sure that the numbers are consecutive, and there is more than one dash in the files, but the number is always the last three digits before the dot.
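To make both concerns concrete, here is a small sketch that wraps binlib's one-liner in a shell function and feeds it two tiny inputs. The function name same_prefix and the sample names are made up for illustration, and -F is used instead of the trailing FS= assignment, which should be equivalent here.

```shell
# same_prefix: the one-liner from above wrapped in a function (reads names on stdin).
# It prints the previous line whenever the text before the first dash repeats.
same_prefix() {
    awk -F'[-.]' '$1 == p { print f } { f = $0; p = $1 }'
}

# 001 and 003 are not consecutive, yet abcd-001.pdf is still reported.
printf '%s\n' abcd-001.pdf abcd-003.pdf | same_prefix

# With a second dash, $1 is only "E4", so two different files look like one.
printf '%s\n' E4-Z61-001.jpg E4-Z62-001.jpg | same_prefix
```

Both calls print their first input line, which is why the later posts add a check on the number and compare more of the name.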

I'll try to work on this tomorrow, as I now have an idea of where to begin, and I'll post the result if I get it working. I'll have to:

Separate the name in three fields:
khdhdh-ywwwnds-dhs-001.ext

The first would be khdhdh-ywwwnds-dhs
The second 001
the third the extension

Then compare them the way binlib suggests.

Ok, tomorrow at work I'll give it a try and see what I come up with.
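One possible way to do the three-way split described above, as a sketch only: the function name split_name is hypothetical, and it assumes the version is the run of digits between the last dash and the last dot, as stated in this post.

```shell
# split_name: print base name, version number and extension, tab-separated.
# Assumption: the version is the run of digits between the last dash and the last dot.
split_name() {
    awk '{
        # Locate the trailing "-digits.ext" part of the name.
        if (match($0, /-[0-9]+\.[^.]*$/)) {
            base = substr($0, 1, RSTART - 1)     # everything before that dash
            tail = substr($0, RSTART + 1)        # e.g. "001.ext"
            dot  = index(tail, ".")
            printf "%s\t%s\t%s\n", base, substr(tail, 1, dot - 1), substr(tail, dot + 1)
        }
    }'
}

printf '%s\n' 'khdhdh-ywwwnds-dhs-001.ext' | split_name
```

For the name above this should print khdhdh-ywwwnds-dhs, 001 and ext separated by tabs. Lines that do not end in -digits.ext are silently skipped.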

rdcwayx · March 13, 2011, 8:57pm

find . -type f -name "*.pdf" |sort -rn |awk -F \-  'a[$1]++'

Chubler_XL · March 14, 2011, 1:06am

awk -F'[-.]' '$1==p&&$(NF-1)==n {print f}; {n=$(NF-1)+1; p=$1;f=$0}' infile

fox1212 · March 14, 2011, 7:36am

Thanks, Chubler_XL. This works nearly perfectly, except that there's more than one dash in the entries.

This would be one line:
E4-Z61-7393486-001.jpg

But I can just compare $1, $2 and $3.

---------- Post updated at 12:36 PM ---------- Previous update was at 12:16 PM ----------

Finally, and for lack of a better version, this is what I've used, and it works:

nawk 'BEGIN {FS="[-.]"} $(NF-1)==n&&$1==a&&$2==b&&$3==c {print f}; {n=$(NF-1)+1; a=$1;b=$2;c=$3;f=$0}' kk

Thank you very much for your help, guys.

Chubler_XL · March 14, 2011, 6:35pm

This might be a little more robust as it supports any number of dashes:

nawk 'BEGIN {FS="[-.]"} {b=$0; sub(/\-[0-9][0-9]*.[^.]*/, "", b);} $(NF-1)==n&&b==a {print f}; {n=$(NF-1)+1; a=b;f=$0;}' kk

fox1212 · March 17, 2011, 7:21am

chubler_xl:

This might be a little more robust as it supports any number of dashes:
nawk 'BEGIN {FS="[-.]"} {b=$0; sub(/\-[0-9][0-9]*.[^.]*/, "", b);} $(NF-1)==n&&b==a {print f}; {n=$(NF-1)+1; a=b;f=$0;}' kk

Hmmm, I don't get your substitution. Let's see:
\- -> A literal dash (escaped with a backslash)
[0-9] -> A digit
[0-9]* -> Zero or more digits
. -> Any character
[^.]* -> Zero or more characters that are not a dot

And then you clear it? Am I misunderstanding this? When I try the sub in the test name I gave, it just turns:
E4-Z61-7393486-001.jpg
into:
E4-Z61.jpg
by removing the substring:
-7393486-001

I believe you might have wanted to remove from "b" the substring -001? Maybe then we could use as a regexp:

/\-[0-9][0-9][0-9]\./
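A quick way to check what each pattern removes is to run sub() on the test name. This is only a sketch: strip_with is a made-up helper, and the literal dot is written as [.] because a backslash inside an awk -v value is itself treated as an escape.

```shell
# strip_with: apply sub() with the given pattern (passed as a string) to each input line.
strip_with() {
    awk -v re="$1" '{ b = $0; sub(re, "", b); print b }'
}

name='E4-Z61-7393486-001.jpg'

# Pattern as posted, with the unescaped dot matching any character:
printf '%s\n' "$name" | strip_with '-[0-9][0-9]*.[^.]*'      # prints E4-Z61.jpg

# Three digits followed by a literal dot:
printf '%s\n' "$name" | strip_with '-[0-9][0-9][0-9][.]'     # prints E4-Z61-7393486jpg
```

The second pattern drops the dot along with -001 and keeps the extension. That looks odd, but since "b" is only used as a comparison key it should still tell different base names apart.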

Chubler_XL · March 17, 2011, 8:00pm

Thanks, I actually wanted to remove the substring -001.jpg from "b", so the correct code is as follows (the only change is the escaped dot, \., after the digits):

nawk 'BEGIN {FS="[-.]"} {b=$0; sub(/\-[0-9][0-9]*\.[^.]*/, "", b);} $(NF-1)==n&&b==a {print f}; {n=$(NF-1)+1; a=b;f=$0;}' kk
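For anyone who finds the one-liner hard to read, here is the same idea spread over several lines, as a sketch rather than a drop-in replacement. The function name old_versions and the variable names are mine, plain awk is assumed instead of nawk, and the $ anchor on the pattern is an addition that ties the match to the end of the name.

```shell
# old_versions: read a sorted file list on stdin and print every name whose
# successor has the same base name and the next version number.
old_versions() {
    awk -F'[-.]' '
    {
        base = $0
        sub(/-[0-9][0-9]*\.[^.]*$/, "", base)   # drop the trailing "-NNN.ext"
        ver = $(NF - 1)                          # digits just before the extension
    }
    # Same base as the previous line and exactly one version higher:
    base == prev_base && ver == next_ver { print prev_line }
    # Remember this line for the comparison with the next one.
    { prev_base = base; next_ver = ver + 1; prev_line = $0 }
    '
}

printf '%s\n' E4-Z61-7393486-001.jpg E4-Z61-7393486-002.jpg \
              E4-Z61-7393487-003.jpg E4-Z61-7393487-005.jpg | old_versions
```

The demo call should print only E4-Z61-7393486-001.jpg: the 7393487 pair is skipped because 003 and 005 are not consecutive. With a real list the call would be something like old_versions < kk.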