Process 2 lists at the same time and move files.

whegra · November 10, 2015, 11:28am

I have this while loop that works but is quite slow. I'm hoping there might be a faster or better way to find what I'm looking for.

I have 2 lists of numbers, and I only want the files whose names contain a value from both lists.

Each list has about 100 values.

while read lineA 
    do echo $lineA
    while read lineB
        do cp /archive/$year$month/storage/*$lineA[NIT]??????$lineB*#$year$month$day* /foundfiles/ 
    done < list1.txt
done < list2.txt

Thank you.

jim_mcnamara · November 10, 2015, 12:03pm

Some possible improvements

If the files are large, consider:

- the mv command to move the file into the /foundfiles/ directory
- the ln command to create a symlink in /foundfiles/ pointing back to the original file

Example: change

cp /archive/$year$month/storage/*$lineA[NIT]??????$lineB*#$year$month$day* /foundfiles/
# to:
ls /archive/$year$month/storage/*$lineA[NIT]??????$lineB*#$year$month$day* |
while read -r fname
do
   mv "$fname" /foundfiles/
done

The mv command is only a cheap rename when the /foundfiles/ directory and the source directory are on the same filesystem; across filesystems it copies the data and then deletes the original.
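The symlink option mentioned above could be sketched as follows. This is only an illustration: the function name link_matches and its two directory arguments are hypothetical, and it assumes the matching file names arrive one per line on standard input. The source directory should be given as an absolute path so that the links still resolve from the destination directory.

```shell
# Hypothetical helper: link_matches SRC_DIR DEST_DIR
# Reads file names (one per line) from stdin and creates a symlink
# in DEST_DIR that points back at each file in SRC_DIR.
link_matches() {
    src=$1
    dest=$2
    while IFS= read -r fname
    do
        # Skip names that do not exist instead of creating dangling links.
        [ -e "$src/$fname" ] || continue
        ln -s "$src/$fname" "$dest/"
    done
}
```

A call might look like: ls | link_matches "$PWD" /foundfiles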

Also, your cp command assumes that a matching file exists for every combination of those variables.

I think that, with all those * metacharacters, you may be matching a great many files, so reducing I/O is a big win for performance. That assumes all 10000 or so possible patterns match.
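One way to avoid running cp for a pattern that matches nothing is to let the shell expand the pattern first and test the result. The sketch below assumes a POSIX-style shell in which an unmatched pattern is passed through literally; the function name copy_if_matched is hypothetical.

```shell
# Hypothetical helper: copy_if_matched DEST FILE...
# Call it with an unquoted glob. When nothing matches, the shell hands over
# the pattern itself, so the -e test on the first argument fails.
copy_if_matched() {
    dest=$1
    shift
    [ -e "$1" ] || return 0      # pattern matched nothing: do not run cp
    cp -- "$@" "$dest"
}
```

In the original loop this would be called as: copy_if_matched /foundfiles/ /archive/$year$month/storage/*$lineA[NIT]??????$lineB*#$year$month$day*
It still expands one pattern per combination, so it removes the failed cp calls but not the repeated directory scans.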

RudiC · November 10, 2015, 4:59pm

You seem to be assuming that all those files with all possible value combinations exist. Should that not be the case, running cp for each and ignoring the error message is a huge waste of resources. And, although it will most probably be cached/buffered by the system, you are reading list1 about a hundred times. How about

- reading list1 and list2 once into memory
- running cp only for existing files?

Try

cd /archive/$year$month/storage
ls | awk '
FNR==1          {FC++}
FC<=2           {A[FC,FNR] = $0
                 C[FC] = FNR
                 next
                }
                {for (i=1; i<=C[1]; i++)
                   for (j=1; j<=C[2]; j++)
                     if ($0 ~ A[1,i]"[NIT]......"A[2,j]"#") print "cp", $0, "/foundfiles"
                }
' SUBSEP="\t" path/to/list1 path/to/list2 - | sh

Please adapt the match regex to your needs!

---------- Post updated at 22:59 ---------- Previous update was at 22:50 ----------

You could further improve performance by passing ls a pattern that excludes unsuitable files. You could also break out of the loops as soon as a file has matched.
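Leaving the loops on the first hit could look like the sketch below. It is a variation on the awk program above, not a tested replacement: the wrapper name match_names is hypothetical, it prints the matching names instead of cp commands, and it assumes neither list file is empty (an empty file would throw off the file counter).

```shell
# Hypothetical wrapper: match_names LIST1 LIST2, candidate names on stdin.
match_names() {
    awk '
    FNR == 1 { FC++ }                 # count input files as they start
    FC <= 2  { A[FC, FNR] = $0        # remember every value of both lists
               C[FC] = FNR
               next
             }
             { for (i = 1; i <= C[1]; i++)
                 for (j = 1; j <= C[2]; j++)
                   if ($0 ~ A[1, i] "[NIT]......" A[2, j] "#") {
                     print            # emit the name once ...
                     next             # ... and skip the remaining combinations
                   }
             }
    ' "$1" "$2" -
}
```

A call might look like: ls *"#$year$month$day"* | match_names path/to/list1 path/to/list2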

whegra · November 10, 2015, 7:34pm

Thank you for the assistance.

In the end I decided to dumb down my approach and keep it simple.

Here is what I did.

#Build full list of all files from the day in question
cd /archive/$year$month/storage/
ls *#$year$month$day* > /home/login/scripts/fulllist.txt

#Find all files that match list1 and output to new file
while read line
        do awk 'substr($0,35,7)=='"$line"' {print}' /home/login/scripts/fulllist.txt >> /home/login/scripts/fulllist2.txt
done < /home/login/scripts/list1.txt

#Find all files that match list2 from the list created above
while read line
        do awk 'substr($0,49,7)=='"$line"' {print}' /home/login/scripts/fulllist2.txt >> /home/login/scripts/fulllist3.txt
done < /home/login/scripts/list2.txt

#copy all files to temp folder
while read line ; do
        cp /archive/$year$month/storage/$line /foundfiles/
done < /home/login/scripts/fulllist3.txt

time ./program
real 0m6.327s
user 0m4.822s
sys 0m1.336s

The other methods I tried took far longer to process; my original method was taking many hours.
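The awk calls above splice each list value into the program text, where awk reads it as a numeric constant, so a value with a leading zero would not compare equal to the text in the file name. A hedged alternative is to pass the value in with -v, which keeps the comparison a string comparison against the substr result. The function name filter_by_field is hypothetical, and the sketch assumes the list values are plain digits with no backslashes.

```shell
# Hypothetical helper: filter_by_field START LENGTH VALUE, names on stdin.
# Prints the lines whose LENGTH characters starting at START equal VALUE.
filter_by_field() {
    awk -v start="$1" -v len="$2" -v val="$3" 'substr($0, start, len) == val'
}
```

Inside the first loop it would be used as: filter_by_field 35 7 "$line" < /home/login/scripts/fulllist.txt >> /home/login/scripts/fulllist2.txt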

MadeInGermany · November 12, 2015, 2:30pm

The following does two loops within awk,
and uses pipes between the different parts:

#Build full list of all files from the day in question and send to pipe
cd /archive/$year$month/storage/
ls *"#$year$month$day"* |

#Find all files that match list1 and send to pipe
awk '
FILENAME!="-" { nums[++nmax]=$0; next } 
{ for (n in nums) if (substr($0,35,7)==n) print }
' /home/login/scripts/list1.txt - |

#Find all files that match list2 and send to pipe
awk '
FILENAME!="-" { nums[++nmax]=$0; next } 
{ for (n in nums) if (substr($0,49,7)==n) print }
' /home/login/scripts/list2.txt - |

#copy all files to temp folder
while read line ; do
        cp "/archive/$year$month/storage/$line" /foundfiles/
done

Don_Cragun · November 12, 2015, 3:32pm

Just for my own sanity, could you please try the following with your data and let me know how the runtime compares to your script? I get the feeling it should be faster, but maybe all of the processing time is being spent copying files and the time used determining which files to copy doesn't matter...

#!/bin/ksh
year="2015"	# Replace with your desired year.
month="11"	# Replace with your desired month.
day="11"	# Replace with your desired day.

#Get full list of all files from the day in question
cd /archive/$year$month/storage/
ls *"#$year$month$day"* |

#Find all files that match list1 and list2...
awk '
FNR == 1 {
	f++
}
f == 1 {list1[$0]
	next
}
f == 2 {list2[$0]
	next
}
substr($0, 35, 7) in list1 && substr($0, 49, 7) in list2
' /home/login/scripts/list1.txt /home/login/scripts/list2.txt - |

#copy all files to temp folder
while read -r line
do	cp "$line" /foundfiles/
done

MadeInGermany · November 12, 2015, 3:47pm

Doh! I wanted the loop in awk and overlooked the smart lookup (substr($0,35,7) in nums)
Thanks for showing it!

Don_Cragun · November 12, 2015, 4:03pm

I'm glad to help. Since no samples were provided for the list file contents or the actual file names, and the variables were not defined, I couldn't test either of our suggestions.

I was a little confused by your script. Shouldn't the lines like:

{ for (n in nums) if (substr($0,49,7)==n) print }

have been something more like:

{ for (n in nums) if (substr($0,49,7)==nums[n]) print }

???

MadeInGermany · November 12, 2015, 4:12pm

Oh, even a bug :o. And yes, it was untested.
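The difference is easy to see in isolation. In awk, for (n in nums) walks the array indices, not the stored values, while storing the values as indices makes the in operator usable as a direct lookup. The sketch below uses made-up sample values and a hypothetical function name, demo_lookup.

```shell
# Hypothetical demo: compares the three ways of looking up a value.
demo_lookup() {
    printf '%s\n' 1234567 7654321 | awk '
    { nums[++nmax] = $0     # values stored under indices 1, 2, ...
      set[$0]               # values stored as the indices themselves
    }
    END {
        key = "7654321"
        for (n in nums) if (n == key) by_index = 1        # compares the indices with key
        for (n in nums) if (nums[n] == key) by_value = 1  # compares the stored values
        print "by index:", by_index + 0
        print "by value:", by_value + 0
        print "in set:", (key in set) + 0
    }'
}
```

Running demo_lookup reports 0 for the index comparison and 1 for the other two, which is why the loop with ==n never printed anything useful.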

RudiC · November 12, 2015, 4:42pm

I, too, would be interested in a performance comparison of the solutions provided. Thanks for posting it!

whegra · November 16, 2015, 9:58am

don cragun:

Here is the timing for your script:

real 0m0.798s
user 0m0.266s
sys 0m0.418s

The folder being searched has 41k files, and 248 of them were moved. List 1 has 168 values and list 2 has 124.

Very fast.