Picking up files conditionally

Saanvi1 · August 25, 2015, 2:08pm

Hi
I have a scenario:
I have a directory, say DIR1 (no subdirectories), with a few files in it, as listed below:

app-cnd-imp-20150820.txt
app-cxyzm-imp-20150820.txt
app-petco-imp-20150820.txt
app-mobility-imp-20150820.txt
app-mobility-imp-20150821.txt
app-mobility-imp-20150822.txt
app-cellular-imp-20150824.txt

I have pseudocode in a script, something like the below, that grabs each filename above and passes it to another script.

 
for filename in `ls -tr app-*-imp-*.txt`
do
  ksh Script2.sh ${filename} &   # Script2.sh runs in parallel, consumes the file, and removes it from the directory once processing completes.
done
  
 wait
  
 .....
 ......

Issue: I need to pass the filename to Script2.sh as shown above. But if the same file arrives more than once with different dates (for example: app-mobility-imp-*.txt), I need to process those files one at a time, in successive passes of the FOR loop. So if files with the same name exist more than once, I need to process the earliest one first, and only one of them per pass.

For example:
In the first pass of the FOR loop, I want to pass the filenames below to Script2.sh:

app-cnd-imp-20150820.txt
app-cxyzm-imp-20150820.txt
app-petco-imp-20150820.txt
app-mobility-imp-20150820.txt
app-cellular-imp-20150824.txt

Once a file is processed by Script2.sh, it gets deleted by Script2.sh, so in the next pass of the FOR loop the files above will no longer be in the directory.

In the second pass of the FOR loop, I want to pass the filename below to Script2.sh:
app-mobility-imp-20150821.txt

In the third pass of the FOR loop, I want to pass the filename below to Script2.sh:
app-mobility-imp-20150822.txt

I would really appreciate your help and guidance.
Thanks

Corona688 · August 25, 2015, 2:43pm

for filename in `ls -tr app-*-imp-*.txt`
do
  ksh Script2.sh ${filename} &   # Script2.sh runs in parallel, consumes the file, and removes it from the directory once processing completes.
done

This is a useless use of backticks.
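A minimal sketch of the glob form, assuming the shell's alphabetical glob order is acceptable in place of the modification-time order that ls -tr gives. The scratch directory and the seen variable exist only for the demo, and the echo stands in for the real ksh Script2.sh call.

```shell
# Demo setup (hypothetical): a scratch directory with two sample files.
demo_dir=$(mktemp -d)
: > "$demo_dir/app-cnd-imp-20150820.txt"
: > "$demo_dir/app-mobility-imp-20150820.txt"

seen=""
# The shell expands the pattern itself, so no ls and no backticks are needed.
for filename in "$demo_dir"/app-*-imp-*.txt
do
    # If nothing matches, the pattern stays literal; skip that case.
    [ -e "$filename" ] || continue
    echo "would run: ksh Script2.sh ${filename##*/}"
    seen="$seen ${filename##*/}"
done
```

Quoting the variable, as in "$filename", also keeps names with unusual characters intact, which the backtick form does not.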

As for separating the dates, how about listing them all, extracting the dates, and running sort -u:

ls app-*-imp-*.txt | sed 's/[^0-9]//g' | sort -u | while read DATE
do
        for FILE in *"${DATE}.txt"
        do
                echo "Processing $FILE"
        done
done

As an aside, running 30 simultaneous processes does not mean your machine or disk can handle 30 simultaneous processes.

Saanvi1 · August 25, 2015, 3:23pm

Script2.sh actually kicks off external software that identifies the process based on the first part of the filename, the part before the date.

 
For example, app-mobility-imp will kick off the mobility app process.

If we process the files sequentially as suggested, it may run for days.
I am looking for logic that identifies the files with the same name and runs only those one by one, while the remaining ones run in parallel in the first pass of the loop.

Thanks

Corona688 · August 25, 2015, 3:42pm

What kind of load do they put on the machine/disk/network? Overloading them will waste more time, not less.

Does having to process those files sequentially mean you have to wait for everything to stop, before you launch more? Otherwise, one of your background ones might finish in-between files A, B, C, and D.

Saanvi1 · August 25, 2015, 4:07pm

It is one of the ETL servers that loads these files into databases. Each file pertains to a different load process, so passing the filenames in parallel loads various tables simultaneously, based on filename.
Currently it is working fine in production without any issues. But occasionally we have now started receiving multiple files (not many) with the same name but different dates, and kicking them off together executes the same ETL code several times at once, causing it to fail. I want to avoid those failures while keeping parallel execution for the files that occur once; where there is more than one file with the same name, they need to be loaded sequentially, one after the other.

 
 1.    app-cnd-imp-20150820.txt
 2.    app-cxyzm-imp-20150820.txt
 3.    app-petco-imp-20150820.txt
 4.    app-mobility-imp-20150820.txt
 5.    app-mobility-imp-20150821.txt
 6.    app-mobility-imp-20150822.txt
 7.    app-cellular-imp-20150824.txt

So in the file list above, I can run files 1, 2, 3, 4 and 7 in one pass of the loop and wait for completion, then file 5 in the second pass and wait for completion, then file 6 in the third pass, because files 4, 5 and 6 pertain to the same ETL code and the load will fail if they run together.
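The pass-by-pass behaviour described above could be sketched roughly like this. It is only an illustration under some assumptions: process_file is a hypothetical stand-in for "ksh Script2.sh FILE" (it logs the name and deletes the file), the scratch directory and log file are made up for the demo, and file names are assumed to end in -YYYYMMDD.txt so that glob order puts the earliest date first.

```shell
# Demo setup (hypothetical): scratch directory with sample files.
work=$(mktemp -d)
log="$work/passes.log"
for f in app-cnd-imp-20150820.txt app-mobility-imp-20150820.txt \
         app-mobility-imp-20150821.txt app-mobility-imp-20150822.txt \
         app-cellular-imp-20150824.txt
do
    : > "$work/$f"
done

process_file() {   # hypothetical stand-in for: ksh Script2.sh "$1"
    echo "pass $2: ${1##*/}" >> "$log"
    rm -f "$1"     # Script2.sh is described as removing the file when done
}

pass=0
# Keep making passes while any matching file is left.
while ls "$work"/app-*-imp-*.txt >/dev/null 2>&1
do
    pass=$((pass + 1))
    last_prefix=""
    for f in "$work"/app-*-imp-*.txt
    do
        prefix=${f%-*}    # strip the trailing -YYYYMMDD.txt
        # Glob order keeps same-prefix files adjacent, so only the first
        # (earliest) file of each prefix is launched in this pass.
        [ "$prefix" = "$last_prefix" ] && continue
        last_prefix=$prefix
        process_file "$f" "$pass" &
    done
    wait    # the whole pass must finish before the next one starts
done
```

One trade-off of this shape is that every pass waits for its slowest job, even when the remaining files belong to a prefix that finished long ago.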

Thanks

Scrutinizer · August 25, 2015, 4:07pm

Another option might be something like this, provided the file names do not contain spaces:

ls app-*-imp-*.txt | awk '{$NF="*.txt"}!A[$0]++' FS=- OFS=- |
while read subpattern
do
  for i in $subpattern
  do
    echo "processing file $i"
  done &
done
wait

What happens is that for each sub-pattern a for loop is run in the background. Each sub-pattern expands to one or more related files in alphabetical order, which is the right order because of the way the files are named. So different files are processed in parallel, but if more than one file matches a sub-pattern, those files are processed sequentially.
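To make that behaviour easier to check, here is a small variant sketch of the same idea: one background chain per prefix, with files inside a chain handled in order. The names process_file, work and log are hypothetical and exist only for the demo; process_file stands in for "ksh Script2.sh FILE", and sed replaces the awk step purely to keep the example portable.

```shell
# Demo setup (hypothetical): scratch directory with sample files.
work=$(mktemp -d)
log="$work/run.log"
for f in app-cnd-imp-20150820.txt app-mobility-imp-20150820.txt \
         app-mobility-imp-20150821.txt app-cellular-imp-20150824.txt
do
    : > "$work/$f"
done

process_file() {   # hypothetical stand-in for: ksh Script2.sh "$1"
    echo "${1##*/}" >> "$log"
    rm -f "$1"
}

# One entry per distinct prefix (the part of the name before the date).
prefixes=$(cd "$work" && ls app-*-imp-*.txt | sed 's/-[0-9]*\.txt$//' | sort -u)

for prefix in $prefixes
do
    (
        # Files sharing a prefix run one after another, earliest date
        # first, because YYYYMMDD names sort chronologically.
        for f in "$work/$prefix"-*.txt
        do
            process_file "$f"
        done
    ) &    # different prefixes run side by side
done
wait
```

Unlike a pass-based loop, a chain here moves on to its next file as soon as the previous one finishes, without waiting for unrelated prefixes.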

Saanvi1 · August 25, 2015, 4:22pm

Thanks Scrutinizer. Let me try that out

---------- Post updated at 03:22 PM ---------- Previous update was at 03:12 PM ----------

Hi,
I tried the script below:

 
#!/bin/ksh
ls app-*-imp-*.txt | awk '{$NF="*.txt"}!A[$0]++' FS=- OFS=- |
while read pattern
do
  for i in $pattern
  do
    echo "processing file $i"
  done &
done

I am getting the error below. I am on a Sun Solaris box.

 
 awk: syntax error near line 1
awk: bailing out near line 1

Scrutinizer · August 25, 2015, 4:23pm

On Solaris, use /usr/xpg4/bin/awk rather than awk.
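A sketch of how a script might pick the awk binary, so the same file runs on Solaris and elsewhere. The AWK and patterns variable names are made up for this example, and the sample file names are fed in with printf instead of ls to keep it self-contained.

```shell
# Prefer the XPG4 (POSIX) awk where it exists, as on Solaris;
# otherwise fall back to whatever awk is on PATH.
if [ -x /usr/xpg4/bin/awk ]; then
    AWK=/usr/xpg4/bin/awk
else
    AWK=awk
fi

# Same awk program as in the earlier post: replace the last
# dash-separated field with *.txt and print each result only once.
patterns=$(
    printf '%s\n' app-cnd-imp-20150820.txt \
                  app-mobility-imp-20150820.txt \
                  app-mobility-imp-20150821.txt |
    "$AWK" '{$NF="*.txt"}!A[$0]++' FS=- OFS=-
)
echo "$patterns"
```

With the three sample names this should print two sub-patterns, app-cnd-imp-*.txt and app-mobility-imp-*.txt.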