Issue in handling multiple files with the same name pattern :-(

Dear all,
My work is completely stuck because of the following issue; kindly help me.
The task is as follows:
I have a set of files matching this pattern:

-rw-rw-r-- 1 emily emily 119 Jun 11 10:45 vgtree_5_1_pfs.root
-rw-rw-r-- 1 emily emily 145 Jun 11 10:46 vgtree_5_3_pfs.root
-rw-rw-r-- 1 emily emily  20 Jun 11 10:45 vgtree_75_1_pfs.root
-rw-rw-r-- 1 emily emily  73 Jun 11 10:45 vgtree_75_3_pfs.root
-rw-rw-r-- 1 emily emily  41 Jun 11 10:45 vgtree_75_2_pfs.root
-rw-rw-r-- 1 emily emily   8 Jun 11 10:46 vgtree_3_2_pls.root
-rw-rw-r-- 1 emily emily  28 Jun 11 10:46 vgtree_2_3_pfs.root
-rw-rw-r-- 1 emily emily  75 Jun 11 10:46 vgtree_3_3_pfs.root

As you can see, some files repeat, i.e. the pattern vgtree_5_* occurs more
than once. The following files repeat:

-rw-rw-r-- 1 emily emily 119 Jun 11 10:45 vgtree_5_1_pfs.root
-rw-rw-r-- 1 emily emily 145 Jun 11 10:46 vgtree_5_3_pfs.root

Similarly, the files matching vgtree_75_* repeat.

What I want is to write a separate text file containing only the file names: the non-repeating ones, plus, where a pattern repeats, the file with the maximum size (those files were marked in red in the original post). The desired selection is shown here:

-rw-rw-r-- 1 emily emily 145 Jun 11 10:46 vgtree_5_3_pfs.root
-rw-rw-r-- 1 emily emily  73 Jun 11 10:45 vgtree_75_3_pfs.root
-rw-rw-r-- 1 emily emily  75 Jun 11 10:46 vgtree_3_3_pfs.root

Greetings,
emily

Try

ls -l | sort -nrk5 | awk '{split($NF,A,"_");if(!X[A[1],A[2]]++){print}}'
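The idea: sort the listing by the size column (field 5) in descending order, then let awk print only the first line it sees for each `vgtree_<id>` prefix, which is necessarily the largest. A self-contained sketch of the same pipeline, using a temporary directory and sample sizes taken from the listing above:

```shell
# Create sample files whose sizes match the listing, then run the pipeline.
cd "$(mktemp -d)"
for spec in 5_1:119 5_3:145 75_1:20 75_3:73 75_2:41; do
    head -c "${spec#*:}" /dev/zero > "vgtree_${spec%:*}_pfs.root"
done

# Largest first; awk keys on (A[1], A[2]) = ("vgtree", id) and prints
# only the first (i.e. biggest) file name of each group.
ls -l vgtree_* | sort -nrk5 |
awk '{split($NF,A,"_"); if (!X[A[1],A[2]]++) print $NF}' > keep.txt
cat keep.txt
```

`cat keep.txt` prints `vgtree_5_3_pfs.root` and `vgtree_75_3_pfs.root`, the largest member of each group.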

Here is a shell scripting approach.

DIR=/your/directory
PATTERNS=/tmp/available_patterns.txt
UNIQUE_PATTERNS=/tmp/unique_patterns.txt
NON_REPEATING=/tmp/non_repeating_files.dat
REPEATING=/tmp/repeating_files.dat

cd "$DIR" || exit 1

#Get all available file patterns or prefixes before the second "_"
for FILENAME in *
do
 PATTERN=$(echo "$FILENAME" | awk -F"_" '{print $1"_"$2}')
 echo "$PATTERN" >> "$PATTERNS"
done

#Get the unique patterns. Either sort -u or command uniq would work
sort -u "$PATTERNS" > "$UNIQUE_PATTERNS"

#For each unique pattern, count its occurrences and split the files
for PATTERN in `cat "$UNIQUE_PATTERNS"`
do
 OCCURS=$(ls "$PATTERN"_* | wc -l)	#Anchor with _ so vgtree_5 cannot match vgtree_55
 if [ "$OCCURS" -eq 1 ]	#Move the file to non_repeating
 then
	ls "$PATTERN"_* >> "$NON_REPEATING"
 else	#Else sort by file size and keep only the max sized file name
	ls -l "$PATTERN"_* | sort -nrk5,5 | head -1 | awk '{print $NF}' >> "$REPEATING"
 fi
done

Note: This is untested and loops through the directory twice
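The two passes above can also be collapsed into a single awk pass that tracks the largest size seen per prefix. A minimal, self-contained sketch (sample files are created in a temporary directory; the output file name is hypothetical):

```shell
# Build the output list directly: one pass over `ls -l`, keeping the
# largest file name per vgtree_<id> prefix in an associative array.
cd "$(mktemp -d)"
head -c 119 /dev/zero > vgtree_5_1_pfs.root
head -c 145 /dev/zero > vgtree_5_3_pfs.root
head -c 20  /dev/zero > vgtree_75_1_pfs.root
head -c 73  /dev/zero > vgtree_75_3_pfs.root

ls -l vgtree_* | awk '{
    split($NF, A, "_"); key = A[1] "_" A[2]
    if ($5 > size[key]) { size[key] = $5; name[key] = $NF }
}
END { for (k in name) print name[k] }' > non_repeating_files.dat
sort non_repeating_files.dat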


I would go with Pamu's solution :)

Simple and effective.


Thanks Pamu, it worked like a charm! :)

---------- Post updated at 04:20 AM ---------- Previous update was at 02:14 AM ----------

Dear Pamu,
I just realized I am still accepting some extra files (actually, I deal with hundreds of such files :( ).
And this time it is a little tricky as well.
Given the following files:

-rw-rw-r-- 1 emily emily 119 Jun 11 10:45 vgtree_5_1_pfs.root
-rw-rw-r-- 1 emily emily 145 Jun 11 10:46 vgtree_5_3_pfs.root
-rw-rw-r-- 1 emily emily  20 Jun 11 10:45 vgtree_75_1_pfs.root
-rw-rw-r-- 1 emily emily  73 Jun 11 10:45 vgtree_75_3_pfs.root
-rw-rw-r-- 1 emily emily  41 Jun 11 10:45 vgtree_75_2_pfs.root
-rw-rw-r-- 1 emily emily   8 Jun 11 10:46 vgtree_3_2_pls.root
-rw-rw-r-- 1 emily emily  28 Jun 11 10:46 vgtree_2_3_pfs.root
-rw-rw-r-- 1 emily emily  75 Jun 11 10:46 vgtree_3_3_pfs.root

I selected the files of interest to me, which your command line does:

-rw-rw-r-- 1 emily emily 145 Jun 11 10:46 vgtree_5_3_pfs.root
-rw-rw-r-- 1 emily emily  73 Jun 11 10:45 vgtree_75_3_pfs.root
-rw-rw-r-- 1 emily emily  75 Jun 11 10:46 vgtree_3_3_pfs.root

Now, again, I have to cross-check this file ID (i.e. 5, 75 and 3) against another available text file. The text file looks like this:

crab:  ExitCodes Summary
 >>>>>>>>> 396 Jobs with Wrapper Exit Code : 0 
	 List of jobs: 1-8,13-66,68,70-81,86-95,97-126,128-166,168-185,187-195,197,200-246,248-261,266-305,307-309,311-326,328-336,340-349,351-352,354-367,369-395,397-411,413-429 
	See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:  ExitCodes Summary
 >>>>>>>>> 1 Jobs with Wrapper Exit Code : 8021 
	 List of jobs: 127 
	See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:  ExitCodes Summary
 >>>>>>>>> 1 Jobs with Wrapper Exit Code : 50115 
	 List of jobs: 96 
	See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:   429 Total Jobs 
 >>>>>>>>> 399 Jobs Retrieved 
	List of jobs Retrieved: 1-8,13-66,68,70-81,86-166,168-185,187-195,197,200-246,248-261,266-309,311-326,328-336,340-349,351-352,354-367,369-395,397-411,413-429 
 >>>>>>>>> 1 Jobs Cancelled by user 
	List of jobs Cancelled by user: 327 
 >>>>>>>>> 29 Jobs Cancelled 
	List of jobs Cancelled: 9-12,67,69,75, 82-85,167,186,196,198-199,247,262-265,310,337-339,350,353,368,396,412 

Now, I need to compare the file ID against the job numbers in the "List of jobs Cancelled" section (marked in red in the original post); if an ID matches, I should discard that file.
For example, 75 appears in the cancelled jobs list, so after this comparison I should discard vgtree_75_3_pfs.root.
I hope that is clear.

Thanks
emily
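A sketch of the requested cross-check, assuming the file-name and crab-log formats shown above. The only non-obvious step is expanding ranges like `9-12` into individual job IDs; the input file names (`crab.log`, `selected.txt`) and which "Cancelled" list counts are assumptions:

```shell
# Self-contained demo: a toy crab log and a toy list of selected files.
cd "$(mktemp -d)"
cat > crab.log <<'EOF'
 >>>>>>>>> 29 Jobs Cancelled 
	List of jobs Cancelled: 9-12,67,75
EOF
printf '%s\n' vgtree_5_3_pfs.root vgtree_75_3_pfs.root > selected.txt

# Expand "9-12,67,75" into one job ID per line.
grep 'List of jobs Cancelled:' crab.log |
sed 's/.*: //' | tr ',' '\n' |
awk -F- 'NF==2 {for (i=$1; i<=$2; i++) print i; next} {print $1}' > cancelled.txt

# Keep only files whose ID (field 2 of vgtree_<id>_<n>_pfs.root)
# is not in the cancelled list.
awk -F_ 'NR==FNR {bad[$1]; next} !($2 in bad)' cancelled.txt selected.txt > kept.txt
cat kept.txt
```

With this sample data, `kept.txt` contains only `vgtree_5_3_pfs.root`: ID 75 is in the cancelled list, so `vgtree_75_3_pfs.root` is discarded, matching emily's example.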