Help speeding up script

This is my first experience writing a Unix shell script. I've created the following script. It does what I want it to do, but I need it to be a lot faster. Is there any way to speed it up?

cat 'Tax_Provision_Sample.dat' | sort | while read p; do fn=`echo $p|cut -d~ -f2,4,3,8,9`; echo $p >> "$fn.txt"; done

Thank you.

Hi

You could use sort directly instead of cat.
Can't say anything else without the content of Tax_Provision_Sample.dat.

hth

Test this one's run time (it uses bashisms):

IFS="~"; while read -a p; do echo "${p
[*]}" >> "${p[1]}${p[3]}${p[2]}${p[7]}${p[8]}.txt"; done <Tax_Provision_Sample.dat

You may want to save the old IFS beforehand. If need be, you can pipe sort's output into the loop.
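
For instance, a sketch combining both suggestions (same array indices as above; sort's output is piped straight into the loop):

# Save IFS, pipe sort's output into the loop, restore IFS afterwards.
OLD_IFS=$IFS
IFS="~"
sort Tax_Provision_Sample.dat | while read -a p; do
    # "${p[*]}" rejoins the fields with the first character of IFS (~)
    echo "${p[*]}" >> "${p[1]}${p[3]}${p[2]}${p[7]}${p[8]}.txt"
done
IFS=$OLD_IFS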


It's because you're running external programs for each line: every iteration forks a subshell and a cut process just to build the file name.
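
You can see that cost for yourself by timing the per-line forking in isolation (a rough sketch; the file name is taken from the post above):

# Forks a subshell plus a cut process for every single input line:
time sh -c 'while read p; do fn=$(echo "$p" | cut -d~ -f2,4,3,8,9); done < Tax_Provision_Sample.dat'
# One awk process handles the whole file:
time awk -F~ '{f = $2 FS $4 FS $3 FS $8 FS $9}' Tax_Provision_Sample.dat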

Here is an awk solution:

sort Tax_Provision_Sample.dat | awk -F~ '{f=$2 FS $4 FS $3 FS $8 FS $9 ".txt"; print $0 >> f; close(f);}'

Wow. That was a lot faster. Thanks!

You don't need to save IFS if you run it in a subshell:

(IFS="~"; while read -a p; do echo "${p
[*]}" >> "${p[1]}${p[3]}${p[2]}${p[7]}${p[8]}.txt"; done) <Tax_Provision_Sample.dat
sort Tax_Provision_Sample.dat | (...)

Note that each file will only contain one line of the original file, namely the last occurrence of a particular combination of those fields. If there can be more than one line with that combination you need to use >> instead of >, but then you would need to empty the files beforehand...
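
Since the output names are derived from the data itself, one way to empty them before a re-run is to derive the same names again and remove the old files first (a sketch; it assumes no derived name contains a newline):

# Rebuild every output file name from the data, then delete stale copies:
awk -F~ '{print $2 FS $4 FS $3 FS $8 FS $9 ".txt"}' Tax_Provision_Sample.dat |
sort -u |
while IFS= read -r f; do rm -f "$f"; done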


I promise I edited it twice, the second time to do that. :smiley: I don't know what happened! :confused:

Good point. Like this:

sort Tax_Provision_Sample.dat | awk -F~ '{f=$2 FS $4 FS $3 FS $8 FS $9 ".txt"; if (NR==1) {print > f; close(f);} else {print >> f; close(f)}}'

The simple approach

sort Tax_Provision_Sample.dat | awk -F~ '{f=$2 FS $4 FS $3 FS $8 FS $9 ".txt"; print > f}'

is fastest but there is the risk of running out of file descriptors.
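
If you want to check the headroom, the shell's per-process limit on open files gives an upper bound (awk's own limit may be lower still; the 4096 is only an example value):

# Show the current soft limit on open file descriptors:
ulimit -n
# Raise the soft limit for this shell, up to the hard limit:
ulimit -n 4096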

NR==1 likely isn't a good test, since the file can change with each line. Perhaps a[f]++ to test whether it's already in an array of used filenames. And yes, some awks only allow 10 file descriptors!


Here comes the implementation:

sort Tax_Provision_Sample.dat | awk -F~ '{f=$2 FS $4 FS $3 FS $8 FS $9 ".txt"; if (f in a) {print >> f} else {a[f]; print > f}; close(f);}'

Something else to note, perhaps, is that with

cut -d~ -f2,4,3,8,9

the output of column 3 will come before column 4, so a direct translation to awk would be:

awk -F~ '{f=$2 FS $3 FS $4 FS $8 FS $9....'
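
A quick throwaway example shows cut's behaviour: it always emits the selected fields in file order, whatever order the -f list gives.

echo 'a~b~c~d' | cut -d~ -f3,2    # prints b~c, not c~b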

True, but was it intended? To be answered by JohnN6.

The original script did result in 3 coming before 4. Having it the other way is preferable, but I figured at the time that it was something I could live with.

You could also try:-

Orig_IFS="$IFS"
IFS=\~
sort Tax_Provision_Sample.dat | while read p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
do
   echo "$p2~$p4~$p3~$p8~$p9"
done > $fn.txt
IFS="$Orig_IFS"

I'm not great with awk, so this is simpler to read, but there may well be a trade-off on performance.

Just another option, although you may already have better.

Robin

Robin, where does fn get set and shouldn't that be changed inside the loop?

I was just using the output file name from the originally supplied code. I cannot say how it was originally set.

What do you think of the performance issues for a large input? Would my code be horribly slower? If so, then the original poster must make the decision of speed over clarity (assuming that my suggestion is clear, and I'm not sure it is).

Regards,
Robin


Hang on, no, I have totally misread the supplied attempt.

If I re-read it, the code is generating multiple output files based on the input records and writing the whole line to the appropriate file.

No, forget my suggestion, totally wrong.

I can't think of a way to remove the "open/append, write, close" operations when writing to multiple files, unless we force it another way by working out what files there could possibly be and then getting the records for each required output file in turn, as sketched below. That would just generate more headaches than it solves, and for a large input file it could still be quite slow.
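
For completeness, a sketch of that two-pass idea (one full scan of the input per output file, so it only pays off when there are few distinct keys):

# Pass 1: list the distinct keys. Pass 2: one scan of the file per key.
awk -F~ '{print $2 FS $4 FS $3 FS $8 FS $9}' Tax_Provision_Sample.dat | sort -u |
while IFS= read -r key; do
    awk -F~ -v k="$key" '($2 FS $4 FS $3 FS $8 FS $9) == k' Tax_Provision_Sample.dat |
        sort > "$key.txt"
done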


I am a fool :o



Robin

Don't we all have that from time to time :wink:

In the shell, the open/append, write, and close operations are performed implicitly, scoped by the redirection of the file descriptor. RudiC already gave a suggestion using an array.
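
To illustrate that scoping with a single output file (one_file.txt is a hypothetical name; this doesn't transfer directly when every line targets a different file):

# Redirection inside the loop body: the file is opened, appended to
# and closed once per line.
while IFS= read -r line; do
    echo "$line" >> one_file.txt
done < Tax_Provision_Sample.dat

# Redirection on the loop itself: the file is opened once and stays open.
while IFS= read -r line; do
    echo "$line"
done < Tax_Provision_Sample.dat > one_file.txt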

With your multi-variable approach:

sort Tax_Provision_Sample.dat | 
while IFS="~" read p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
do
  printf "%s\n" "$p2~$p4~$p3~$p8~$p9" >> $fn.txt
done

The files would need to be empty beforehand...

And where does fn get set now? :confused:
I think you mean

#!/bin/bash
declare -A AF
sort Tax_Provision_Sample.dat | 
while IFS="~" read p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
do
  fn="$p2~$p4~$p3~$p8~$p9"
  if [ -z "${AF[$fn]}" ]; then
    > "$fn.txt"
    AF[$fn]=1
  fi
  echo "$p1~$p2~$p3~$p4~$p5~$p6~$p7~$p8~$p9~$p10" >> $fn.txt
done

I have added some bash-4 code that will empty each output file when it is first encountered.
Omit that code if you have bash-3 (and delete/empty the files before you run the script).
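
If you are stuck on bash-3, one workaround is to track the seen names in a plain delimited string instead of an associative array (a sketch; it assumes no file name contains a | character):

seen="|"
sort Tax_Provision_Sample.dat |
while IFS="~" read p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
do
  fn="$p2~$p4~$p3~$p8~$p9"
  case $seen in
    *"|$fn|"*) ;;                        # already emptied this file
    *) > "$fn.txt"; seen="$seen$fn|" ;;  # first occurrence: truncate it
  esac
  echo "$p1~$p2~$p3~$p4~$p5~$p6~$p7~$p8~$p9~$p10" >> "$fn.txt"
done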


Yes, something like that :slight_smile:. I was trying to show the open/append, write, close operations in the shell... and forgot about $fn...