Process 2 lists at the same time and move files.

I have this while loop that works but is quite slow. I'm hoping there might be a faster or better way to find what I'm looking for.

I have 2 lists of numbers and want to find only the files whose names contain a value from each list.

Each list has about 100 values.

while read lineA 
    do echo $lineA
    while read lineB
        do cp /archive/$year$month/storage/*$lineA[NIT]??????$lineB*#$year$month$day* /foundfiles/ 
    done < list1.txt
done < list2.txt

Thank you.

Some possible improvements

If the files are large, consider:

the mv command, to move the file into the /foundfiles/ directory, or

the ln command, to create a symlink in /foundfiles/ back to the original file.

Example:
change:

cp /archive/$year$month/storage/*$lineA[NIT]??????$lineB*#$year$month$day* /foundfiles/
# to:
ls /archive/$year$month/storage/*$lineA[NIT]??????$lineB*#$year$month$day* |
while read -r fname
do
   mv "$fname" /foundfiles/
done

The mv command is only a cheap rename when the /foundfiles/ directory and the source directory are on the same filesystem; across filesystems it has to copy the data anyway.

Also, your cp command assumes that a matching file exists for every combination of those variables.

With all those * metacharacters you may also be matching a great many files. Reducing I/O is a big win for performance, all the more so if all 10000 or so of those possible patterns match.
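To make the symlink idea concrete, here is a minimal sketch. It runs in a scratch directory made with mktemp rather than under /archive, and the two file names are made up for illustration, so treat the layout and names as assumptions. It also shows one way to skip a pattern that matches nothing instead of letting the command fail on it.

```shell
#!/bin/sh
# Scratch layout standing in for /archive/$year$month/storage and /foundfiles.
work=$(mktemp -d)
src="$work/storage"
dest="$work/foundfiles"
mkdir -p "$src" "$dest"

# Hypothetical file names: one that matches the pattern, one that does not.
: > "$src/a1111111N0000002222222b#20151111.dat"
: > "$src/a1111111X0000002222222b#20151111.dat"

lineA=1111111 lineB=2222222
year=2015 month=11 day=11

for f in "$src"/*"$lineA"[NIT]??????"$lineB"*"#$year$month$day"*
do
    # With no match the shell leaves the pattern as a literal string;
    # skip it instead of letting ln (or cp) fail on a nonexistent name.
    [ -e "$f" ] || continue
    ln -s "$f" "$dest/"
done

ls "$dest"
```

Only the file with N after the first number ends up linked in the destination. Swapping ln -s for mv or cp keeps the same guard.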

You seem to be assuming that all those files with all possible value combinations exist. Should that not be the case, running cp for each and ignoring the error message is a huge waste of resources. And, although it will most probably be cached/buffered by the system, you are reading list1 about a hundred times. How about

  • reading list1 and list2 once into memory
  • running cp only for existing files?
Try:
cd /archive/$year$month/storage
ls | awk '
FNR==1          {FC++}
FC<=2           {A[FC,FNR] = $0
                 C[FC] = FNR
                 next
                }
                {for (i=1; i<=C[1]; i++)
                   for (j=1; j<=C[2]; j++)
                     if ($0 ~ A[1,i]"[NIT]......"A[2,j]"#") print "cp", $0, "/foundfiles"
                }
' SUBSEP="\t" path/to/list1 path/to/list2 - | sh

Please adapt the match regex to your needs!

You could further improve performance by passing ls a pattern that excludes unsuitable files. And you could break out of the loops as soon as a file has matched.
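A rough sketch of the early exit, using made-up list values and file names fed in through printf instead of a real ls (all of them are assumptions for the demo). The found flag stops the outer loop and break stops the inner one, so each name is printed at most once and no further combinations are tested after a hit.

```shell
#!/bin/sh
work=$(mktemp -d)

# Hypothetical list contents and file names, for illustration only.
printf '%s\n' 1111111 3333333 > "$work/list1"
printf '%s\n' 2222222 4444444 > "$work/list2"

printf '%s\n' \
    'a1111111N0000002222222#20151111.dat' \
    'a3333333T0000004444444#20151111.dat' \
    'a1111111X0000002222222#20151111.dat' \
    'a5555555N0000002222222#20151111.dat' |
awk '
FNR == 1 { fc++ }                   # count input files as they start
fc == 1  { a[++na] = $0; next }     # first file: values from list1
fc == 2  { b[++nb] = $0; next }     # second file: values from list2
{
    found = 0
    for (i = 1; i <= na && !found; i++)
        for (j = 1; j <= nb; j++)
            if ($0 ~ a[i] "[NIT]......" b[j] "#") {
                print
                found = 1           # ends the outer loop
                break               # ends the inner loop
            }
}
' "$work/list1" "$work/list2" - > "$work/matches"

cat "$work/matches"
```

In the real script the printf would be the ls, ideally with a pattern such as the date part so that awk sees fewer names to begin with.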

Thank you for the assistance.

In the end I decided to dumb down my approach and keep it simple.

Here is what I did.

#Build full list of all files from the day in question
cd /archive/$year$month/storage/
ls *#$year$month$day* > /home/login/scripts/fulllist.txt

#Find all files that match list1 and output to new file
while read line
        do awk 'substr($0,35,7)=='"$line"' {print}' /home/login/scripts/fulllist.txt >> /home/login/scripts/fulllist2.txt
done < /home/login/scripts/list1.txt

#Find all files that match list2 from the list created above
while read line
        do awk 'substr($0,49,7)=='"$line"' {print}' /home/login/scripts/fulllist2.txt >> /home/login/scripts/fulllist3.txt
done < /home/login/scripts/list2.txt

#copy all files to temp folder
while read line ; do
        cp /archive/$year$month/storage/$line /foundfiles/
done < /home/login/scripts/fulllist3.txt

time ./program
real 0m6.327s
user 0m4.822s
sys 0m1.336s

The other methods I tried took much longer to process; my original method was taking many hours.

The following does two loops within awk,
and uses pipes between the different parts:

#Build full list of all files from the day in question and send to pipe
cd /archive/$year$month/storage/
ls *"#$year$month$day"* |

#Find all files that match list1 and send to pipe
awk '
FILENAME!="-" { nums[++nmax]=$0; next } 
{ for (n in nums) if (substr($0,35,7)==n) print }
' /home/login/scripts/list1.txt - |

#Find all files that match list2 and send to pipe
awk '
FILENAME!="-" { nums[++nmax]=$0; next } 
{ for (n in nums) if (substr($0,49,7)==n) print }
' /home/login/scripts/list2.txt - |

#copy all files to temp folder
while read line ; do
        cp "/archive/$year$month/storage/$line" /foundfiles/
done

Just for my own sanity, could you please try the following with your data and let me know how the runtime compares to your script? I get the feeling it should be faster, but maybe all of the processing time is being spent copying files and the time used determining which files to copy doesn't matter...

#!/bin/ksh
year="2015"	# Replace with your desired year.
month="11"	# Replace with your desired month.
day="11"	# Replace with your desired day.

#Get full list of all files from the day in question
cd /archive/$year$month/storage/
ls *"#$year$month$day"* |

#Find all files that match list1 and list2...
awk '
FNR == 1 {
	f++
}
f == 1 {list1[$0]
	next
}
f == 2 {list2[$0]
	next
}
substr($0, 35, 7) in list1 && substr($0, 49, 7) in list2
' /home/login/scripts/list1.txt /home/login/scripts/list2.txt - |

#copy all files to temp folder
while read -r line
do	cp "$line" /foundfiles/
done

Doh! I wanted the loop in awk and overlooked the smart lookup (substr($0,35,7) in nums)
Thanks for showing it!

I'm glad to help. Since no samples were provided for the list file contents or the actual file names, and the variables were not defined, I couldn't test either of our suggestions.

I was a little confused by your script. Shouldn't the lines like:

for (n in nums) if (substr($0,49,7)==n) print }

have been something more like:

for (n in nums) if (substr($0,49,7)==nums[n]) print }

???

Oh, even a bug :o. And yes, it was untested.
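For anyone reading along, a small sketch of the difference, with made-up values and a substr position chosen to fit the toy names (both are assumptions). In for (n in nums) the variable n walks the indexes (1, 2, ...), so the stored values have to be fetched as nums[n]. Storing the values as the indexes themselves lets a single in test replace the loop.

```shell
#!/bin/sh
work=$(mktemp -d)

# Hypothetical data: two wanted numbers and three candidate names.
printf '%s\n' 1111111 3333333 > "$work/list1"
printf '%s\n' 'xx1111111yy' 'xx2222222yy' 'xx3333333yy' > "$work/names"

# Values stored under numeric indexes: compare against nums[n], not n.
awk '
NR == FNR { nums[++nmax] = $0; next }
{ for (n in nums) if (substr($0, 3, 7) == nums[n]) print }
' "$work/list1" "$work/names" > "$work/loop.out"

# Values stored as the array indexes: one "in" test, no loop.
awk '
NR == FNR { seen[$0]; next }
substr($0, 3, 7) in seen
' "$work/list1" "$work/names" > "$work/lookup.out"

cat "$work/lookup.out"
cmp "$work/loop.out" "$work/lookup.out" && echo "same result"
```

Both variants pick the same two names here; the lookup version simply does one array test per name instead of one comparison per list value.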

I too would be interested in a performance comparison of the solutions provided. Thanks for posting it!

Here is the time :

real 0m0.798s
user 0m0.266s
sys 0m0.418s

The folder being searched has 41k files, and it moved 248 of them. List 1 has 168 values, list 2 has 124.

Very fast.