Help speeding up script

This is my first experience writing a Unix shell script. I've created the following script. It does what I want it to do, but I need it to be a lot faster. Is there any way to speed it up?

cat 'Tax_Provision_Sample.dat' | sort | while read p; do fn=`echo $p|cut -d~ -f2,4,3,8,9`; echo $p >> "$fn.txt"; done

Thank you.

Hi

You could use sort directly instead of cat.
Can't say anything else without the content of Tax_Provision_Sample.dat.

hth

Test this one's run time (it uses bashisms):

IFS="~"; while read -a p; do echo "${p
[*]}" >> "${p[1]}${p[3]}${p[2]}${p[7]}${p[8]}.txt"; done <Tax_Provision_Sample.dat

You may want to save the old IFS beforehand. If need be, you can pipe sort's output into the loop.
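
For instance, a sketch combining both suggestions (same array indices as above; sort's output is piped straight into the loop):

# Save IFS, pipe sort's output into the loop, restore IFS afterwards.
OLD_IFS=$IFS
IFS="~"
sort Tax_Provision_Sample.dat | while read -a p; do
    # "${p[*]}" rejoins the fields with the first character of IFS (~)
    echo "${p[*]}" >> "${p[1]}${p[3]}${p[2]}${p[7]}${p[8]}.txt"
done
IFS=$OLD_IFS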


It's because you're running external programs for each line: every iteration forks a subshell and a cut process just to build the file name.
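
You can see that cost for yourself by timing the per-line forking in isolation (a rough sketch; the file name is taken from the post above):

# Forks a subshell plus a cut process for every single input line:
time sh -c 'while read p; do fn=$(echo "$p" | cut -d~ -f2,4,3,8,9); done < Tax_Provision_Sample.dat'
# One awk process handles the whole file:
time awk -F~ '{f = $2 FS $4 FS $3 FS $8 FS $9}' Tax_Provision_Sample.dat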

Here is an awk solution:

sort Tax_Provision_Sample.dat | awk -F~ '{f=$2 FS $4 FS $3 FS $8 FS $9 ".txt"; print $0 >> f; close(f);}'

Wow. That was a lot faster. Thanks!

You don't need to save IFS if you run it in a subshell:

(IFS="~"; while read -a p; do echo "${p
[*]}" >> "${p[1]}${p[3]}${p[2]}${p[7]}${p[8]}.txt"; done) <Tax_Provision_Sample.dat
sort Tax_Provision_Sample.dat | (...)

Note that each file will only contain one line of the original file, namely the last occurrence of a particular combination of those fields. If there can be more than one line with that combination you need to use >> instead of >, but then you would need to empty the files beforehand...
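
Since the output names are derived from the data itself, one way to empty them before a re-run is to derive the same names again and remove the old files first (a sketch; it assumes no derived name contains a newline):

# Rebuild every output file name from the data, then delete stale copies:
awk -F~ '{print $2 FS $4 FS $3 FS $8 FS $9 ".txt"}' Tax_Provision_Sample.dat |
sort -u |
while IFS= read -r f; do rm -f "$f"; done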


I promise I edited it twice, the second time to do that. :smiley: I don't know what happened! :confused:

Good point. Like this:

sort Tax_Provision_Sample.dat | awk -F~ '{f=$2 FS $4 FS $3 FS $8 FS $9 ".txt"; if (NR==1) {print > f; close(f);} else {print >> f; close(f)}}'

The simple approach

sort Tax_Provision_Sample.dat | awk -F~ '{f=$2 FS $4 FS $3 FS $8 FS $9 ".txt"; print > f}'

is fastest but there is the risk of running out of file descriptors.
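
If you want to check the headroom, the shell's per-process limit on open files gives an upper bound (awk's own limit may be lower still; the 4096 is only an example value):

# Show the current soft limit on open file descriptors:
ulimit -n
# Raise the soft limit for this shell, up to the hard limit:
ulimit -n 4096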

NR==1 likely isn't a good test, since the file can change with each line. Perhaps a[f]++ to test whether it's already in an array of used filenames. And yes, some awks only allow 10 file descriptors!


Here comes the implementation:

sort Tax_Provision_Sample.dat | awk -F~ '{f=$2 FS $4 FS $3 FS $8 FS $9 ".txt"; if (f in a) {print >> f} else {a[f]; print > f}; close(f);}'

Something else to note, perhaps, is that with

cut -d~ -f2,4,3,8,9

the output of column 3 will come before column 4, so a direct translation to awk would be:

awk -F~ '{f=$2 FS $3 FS $4 FS $8 FS $9....'
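
A quick throwaway example shows cut's behaviour: it always emits the selected fields in file order, whatever order the -f list gives.

echo 'a~b~c~d' | cut -d~ -f3,2    # prints b~c, not c~b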

True, but was it intended? To be answered by JohnN6.

The original script did result in 3 coming before 4. Having it the other way is preferable, but I figured at the time that it was something I could live with.

You could also try:-

Orig_IFS="$IFS"
IFS=\~
sort Tax_Provision_Sample.dat | while read p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
do
   echo "$p2~$p4~$p3~$p8~$p9"
done > $fn.txt
IFS="$Orig_IFS"

I'm not great with awk, so this is simpler to read, but there may well be a trade-off on performance.

Just another option, although you may already have better.

Robin

Robin, where does fn get set and shouldn't that be changed inside the loop?

I was just using the output file name from the originally supplied code. I cannot say how it was originally set.

What do you think of the performance issues for a large input? Would my code be horribly slower? If so, then the original poster must make the decision of speed over clarity (assuming that my suggestion is clear, and I'm not sure it is).

Regards,
Robin


Hang on, no, I have totally misread the supplied attempt.

If I re-read it, the code is generating multiple output files based on the input records and writing the whole line to the appropriate file.

No, forget my suggestion, totally wrong.

I can't think of a way to remove the "open/append, write, close" operations when writing to multiple files, unless we force it another way by working out what files there could possibly be and then getting the records for each required output file in turn, as sketched below. That would just generate more headaches than it solves, and for a large input file it could still be quite slow.
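
For completeness, a sketch of that two-pass idea (one full scan of the input per output file, so it only pays off when there are few distinct keys):

# Pass 1: list the distinct keys. Pass 2: one scan of the file per key.
awk -F~ '{print $2 FS $4 FS $3 FS $8 FS $9}' Tax_Provision_Sample.dat | sort -u |
while IFS= read -r key; do
    awk -F~ -v k="$key" '($2 FS $4 FS $3 FS $8 FS $9) == k' Tax_Provision_Sample.dat |
        sort > "$key.txt"
done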


I am a fool :o



Robin

Don't we all have that from time to time :wink:

In the shell, the open/append, write, and close operations are performed implicitly, scoped by the redirection of the file descriptor. RudiC already gave a suggestion using an array.
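
To illustrate that scoping with a single output file (one_file.txt is a hypothetical name; this doesn't transfer directly when every line targets a different file):

# Redirection inside the loop body: the file is opened, appended to
# and closed once per line.
while IFS= read -r line; do
    echo "$line" >> one_file.txt
done < Tax_Provision_Sample.dat

# Redirection on the loop itself: the file is opened once and stays open.
while IFS= read -r line; do
    echo "$line"
done < Tax_Provision_Sample.dat > one_file.txt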

With your multi-variable approach:

sort Tax_Provision_Sample.dat | 
while IFS="~" read p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
do
  printf "%s\n" "$p2~$p4~$p3~$p8~$p9" >> $fn.txt
done

The files would need to be empty beforehand...

And where does fn get set now? :confused:
I think you mean

#!/bin/bash
declare -A AF
sort Tax_Provision_Sample.dat | 
while IFS="~" read p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
do
  fn="$p2~$p4~$p3~$p8~$p9"
  if [ -z "${AF[$fn]}" ]; then
    > "$fn.txt"
    AF[$fn]=1
  fi
  echo "$p1~$p2~$p3~$p4~$p5~$p6~$p7~$p8~$p9~$p10" >> $fn.txt
done

I have added some bash-4 code that will empty each output file when it is first encountered.
Omit that code if you have bash-3 (and delete/empty the files before you run the script).
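
If you are stuck on bash-3, one workaround is to track the seen names in a plain delimited string instead of an associative array (a sketch; it assumes no file name contains a | character):

seen="|"
sort Tax_Provision_Sample.dat |
while IFS="~" read p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
do
  fn="$p2~$p4~$p3~$p8~$p9"
  case $seen in
    *"|$fn|"*) ;;                        # already emptied this file
    *) > "$fn.txt"; seen="$seen$fn|" ;;  # first occurrence: truncate it
  esac
  echo "$p1~$p2~$p3~$p4~$p5~$p6~$p7~$p8~$p9~$p10" >> "$fn.txt"
done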


Yes, something like that :slight_smile:. I was trying to show the open/append, write, close operations in the shell... and forgot about $fn...