Script Optimization - large delimited file, for loop with many greps

Since there are approximately 75K gsfiles and hundreds of stfiles per gsfile, this script can take hours. How can I rewrite this script, so that it's much faster? I'm not as familiar with perl but I'm open to all suggestions.

ls file.list>$split
for gsfile in `cat $split`;
do
  csplit -ks -n 6 -f $gsfile.ST $gsfile /^ST/ {100000} 2>>$diagnostic
  for stfile in `ls $gsfile.ST*|sort -n`;
  do
    delim=`LC_ALL=C grep "^GS" $gsfile|cut -c3` 2>>$diagnostic
    gscode=`LC_ALL=C grep "^GS" $gsfile|cut -d "$delim" -f3` 2>>$diagnostic
    supcd=`LC_ALL=C grep "^N1.SU" $stfile|cut -d "$delim" -f5|head -1` 2>>$diagnostic
    sellcd=`LC_ALL=C grep "^N1.SE" $stfile|cut -d "$delim" -f5|head -1` 2>>$diagnostic
    firponum=`LC_ALL=C grep "^IT1" $stfile|cut -d "$delim" -f10|head -1` 2>>$diagnostic
    invtl=`LC_ALL=C grep "^TDS" $stfile|cut -d "$delim" -f2|tr -cd '[:digit:]'` 2>>$diagnostic
    #I have about ten more greps here
    echo "$gscode,$supcd,$sellcd,$firponum,$invtl">>$detail_file
    rm -f $stfile 2>>$diagnostic
  done
done

Here's an example of an input file. The delimiters can be any non-word character.

 
gsfile_1
GS*IN*TPU*TPM*110303*0634*65433*X*002000 
ST*810*0001  
N1*SU*TPUNAME*92*TPUCD21 
N1*SE*SELNAME*92*789 
IT1*1*8*EA*909234.12**BP*PARTNUM123*PO*PONUM342342*PL*526 
IT1*2*3*EA*53342.65**BP*PARTNUM456*PO*PONUM31131*PL*528 
TDS*32424214  
SE*7*0001
ST*810*0002  
N1*SU*TPUNAME*92*TPUCD43 
N1*SE*SELNAME*92*543 
DTM*011*110302 
IT1*1*10*EA*909234.12**BP*PARTNUM575*PO*PONUM1253123*PL*001  
IT1*2*15*EA*53342.65**BP*PARTNUM483*PO*PONUM646456*PL*002 
TDS*989248095 
SE*8*0002 
GE*2*65433
gsfile_2
GS~IN~TPT~TPM~110302~2055~2321123~X~003010~
ST~810~000027324~
N1~SU~TPMNAME~92~TPUCD87
N1~SE~SELMNAME~92~23234
IT1~001~3450~EA~1234.67~~BP~PARTNUM6546-048~PO~PONUM99484~PL~235~
TDS~425961150~
SE~6~2321123~
GE~1~3201~

output should look like this ...
TPU,TPUCD21,789,PONUM342342,32424214
TPU,TPUCD43,543,PONUM1253123,989248095
TPT,TPUCD87,23234,PONUM99484,425961150

I hope this isn't too long! I'm new and not yet familiar with the forum posting style. Thanks so much for your help.

foo | grep | cut | sed | really | long | pipe | chain is never efficient, and you're doing this on almost every line. You've also got a lot of useless use of backticks, and useless use of cat. Whenever you have 'for file in `cat foo`', you could've done

while read file
do
...
done < foo

much more efficiently. You can also do

 while stuff ; do ... ; done 2>filename

to redirect stderr once for the whole loop instead of doing a special redirection for each and every individual command.

You can also set LC_ALL once instead of doing so for each and every individual command.
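
Putting those together, a minimal sketch of the outer loop (reusing $split and $diagnostic from your script):

export LC_ALL=C
while read gsfile
do
  csplit -ks -n 6 -f "$gsfile.ST" "$gsfile" '/^ST/' '{100000}'
  # ...the per-ST work goes here...
done < "$split" 2>>"$diagnostic"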

In your defense, you've been forced to deal with input data that looks like line noise! :eek: I don't entirely understand what you're doing. Why are you csplitting on {100000} and /^ST/ ? Are two non-word characters in a row, **, supposed to imply a blank field between them? Finally, what is your system, and what is your shell? That will have a big effect on the tools available to you.

I've started writing a solution in awk.


Thanks Corona,

I'm using Korn Shell on Microsoft Windows Services for UNIX 3.5, which supports:
Sun Microsystems Solaris versions 7 and 8
Red Hat Linux version 8.0
IBM AIX version 5L 5.2
Hewlett-Packard HP-UX version 11i

Thanks for the tip about the backticks, the stderr redirect, and the while read ... I'll change that.

Yeah, the file is cumbersome :(. As for the splitting, each /^ST/ is a new group; I had taken the {100000} to be the max number of times to execute the csplit.

Yes, two non-word characters in a row is a blank field

Hope this clarifies the structure of the file ... the initial file is approx 3 million lines

GS*IN*TPU*TPM*110303*0634*65433*X*002000 #there are approx 75K GS to GE groups
  ST*810*0001  #potentially thousands of ST to SE groups, I have to relate the ST/SE group to the GS line
     N1*SU*TPUNAME*92*TPUCD21 
     N1*SE*SELNAME*92*789 
     IT1*1*8*EA*909234.12**BP*PARTNUM123*PO*PONUM342342*PL*526 
     IT1*2*3*EA*53342.65**BP*PARTNUM456*PO*PONUM31131*PL*528 
     TDS*32424214  
  SE*7*0001
     ST*810*0002  
       N1*SU*TPUNAME*92*TPUCD43 
       N1*SE*SELNAME*92*543 
       DTM*011*110302 
       IT1*1*10*EA*909234.12**BP*PARTNUM575*PO*PONUM1253123*PL*001  
       IT1*2*15*EA*53342.65**BP*PARTNUM483*PO*PONUM646456*PL*002 
       TDS*989248095 
     SE*8*0002 
GE*2*65433
GS~IN~TPT~TPM~110302~2055~2321123~X~003010~
   ST~810~000027324~
     N1~SU~TPMNAME~92~TPUCD87
     N1~SE~SELMNAME~92~23234
     IT1~001~3450~EA~1234.67~~BP~PARTNUM6546-048~PO~PONUM99484~PL~235~
    TDS~425961150~
   SE~6~2321123~
GE~1~3201~

How about this:

#!/bin/awk -f
# This section gets run only once, before anything's read.
# using it for variable setup.
BEGIN {
        # Don't have to check what the delimiter is, just split on
        # any single character that's not a-z, A-Z, 0-9, _
        FS="[^a-zA-Z0-9_]"
        # Print separated by commas
        OFS=","
}

# Each of the following expressions gets executed once for every
# line that matches the regex.

# Sometimes this one's column 11, sometimes it's column 12, since
# characters like . or - inside earlier fields also count as splits here
/^IT1/  {       if(!FIRPONUM)
                {
                        FIRPONUM=$11
                        if(!(FIRPONUM ~ /^PONUM/))
                                FIRPONUM=$12;
                }
        }
# Matching these lines is easy
/^TDS/  {       INVTL=$2        }
/^N1.SE/{       SELLCD=$5       }
/^N1.SU/{       SUPCD=$5        }
/^GS/   {       GSCODE=$3       }
# Print on this only once we've read FIRPONUM
/^ST/   {
                if(FIRPONUM)
                        print GSCODE,SUPCD,SELLCD,FIRPONUM,INVTL;

                FIRPONUM=""
        }

# Have to print once on exit or we'll lose the last record
END {   print GSCODE,SUPCD,SELLCD,FIRPONUM,INVTL;       }

It's not complete, since your example isn't either, but it's much more efficient than a grep | cut for every line, and it might be enough to get you started.

---------- Post updated at 03:46 PM ---------- Previous update was at 03:44 PM ----------

Blech. Poor imitation of a korn shell.

And since you're not actually running UNIX, my awk script of course can't run as a script like I intended. Small difference, though. Just run it like awk -f script.awk inputfile

---------- Post updated at 03:52 PM ---------- Previous update was at 03:46 PM ----------

Whoa, is your data actually indented like that? That changes things.


No, it isn't indented ... I just indented it to point out the relationship between the groups

ok ... you've given me something to chew on here, this is a great start, I'm going to start rewriting

would I call this awk script from within my ksh script?

Thanks Corona!

---------- Post updated at 03:09 PM ---------- Previous update was at 03:08 PM ----------

I was trying to be brief ... if I can make my example more complete, please let me know

Yes. You could dump everything I wrote into a text file named script.awk (name unimportant), then run awk on that file in your ksh script with awk -f script.awk datafile

Or you could embed the entire thing into your ksh script like

<datafile awk 'BEGIN { a=b; c=d; }

/^WTF/ { stuff }

...

...

...

'

If your shell supports multi-line strings, that is.
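
For example, a trimmed sketch embedding a couple of the rules from the earlier script directly in a ksh line:

awk 'BEGIN { FS="[^a-zA-Z0-9_]"; OFS="," }
/^GS/  { GSCODE=$3 }
/^TDS/ { INVTL=$2 }
END    { print GSCODE, INVTL }' datafile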

I'll be happy to help with troubles you have improving it but it's probably best for you to match it to your needs. I'm not as likely to notice if things go just slightly wrong.


Thanks a lot Corona, I really appreciate your help ... I have a few other parsing issues but solving this piece helps me a great deal ... I knew there was a better way than grep|cut etc.

I just started scripting by stringing commands together and I'm noticing more and more that's the wrong approach

I'm going to try your awk now

Just noticed this:

PARTNUM6546-048

If this is supposed to be one record, you'll need to change FS slightly.

# won't accept -
# FS="[^a-zA-Z0-9_]"
# should accept -
FS="[^a-zA-Z0-9_\\-]"

I don't have anything useful to add over what Corona gave you, but I am very interested to hear what the results are on your time savings. :slight_smile:

Something tells me it will be significant. My guesstimate is that it's roughly 150 MB of raw data (3M lines * 50 chars per line). I've run similar jobs on terabytes' worth of data and it only took an hour or so.

You brought up a great point, Corona; there are ".", " ", and "-" characters in user-entered fields that I don't want awk to use as delimiters ... I haven't yet been able to figure out the syntax for including these characters in the FS statement.

I'm experimenting so that I can understand HOW awk works, so that I can use the same type of logic for some of the other parsing tasks I have in my original script

I'm still working on the IF statements (to grab just the first occurrence)

But I really like this solution ... I agree this will save hours of processing time

---------- Post updated at 06:24 PM ---------- Previous update was at 06:15 PM ----------

It works great BTW!

Just add them, and if they don't work, put a \\ in front of them. It's easy enough to test them with echo "a-b-c-d-e-f" | awk 'BEGIN { FS="..." } { print $1,$2,$3; }'

If you followed the logic of my example, you're already over the main hurdle: understanding that awk has its own built-in "while" loop that reads data record by record based on RS (usually newline), then splits each record into tokens based on FS (usually whitespace). You can solve quite a few tasks just by carefully adjusting its input and output settings.
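
A toy illustration (not from your data) of how those two knobs interact — records split on ":", fields on space:

printf 'a b:c d:' | awk 'BEGIN { RS=":"; FS=" " } NF { print NR, $1, $2 }'
# prints
# 1 a b
# 2 c d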

But you can also cavalierly ignore them whenever you please and just deal with whole lines via $0, print arbitrary anything with printf, read an extra line into $0 with getline, etc.

Other mindblowers:

  • N gives you a variable, $N gives you the field that variable's value points to. Say you did N=4; V=$N , that's effectively V=$4 This makes it easy to get the last field (NF is the number of fields, $NF is the last field), loop over fields for(N=1; N<=NF; N++) printf("%s\n", $N); etc. $0 is the entire record (usually line).
  • You can actually modify the $1,$2,... special variables! And the value of $0 will change itself to match. And vice versa, so you can, say, do substitutions inside $0 and end up with different tokens. Or do one little change to $3 then print the entire modified line just by $3=toupper($3); print;
  • Arrays can have strings as indexes, making it easy to do word counting: { for(N=1; N<=NF; N++) words[$N]++ } END { for(key in words) printf("%s %d\n", key, words[key]) } (a runnable version follows below).
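
Here's that word-count fragment as a self-contained one-liner you can paste straight into the shell:

echo "the cat and the dog and the bird" | awk '
{ for(N=1; N<=NF; N++) words[$N]++ }
END { for(key in words) printf("%s %d\n", key, words[key]) }'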

...and lots more. My own understanding of awk is far from complete.

Thanks again Corona ... I'm still having trouble with excluding - as a delimiter in

FS="[^a-zA-Z0-9\\_\\.\\/\\-\\ ]"
 
I've tried \\-, \-, -, '-'

---------- Post updated at 01:56 PM ---------- Previous update was at 01:52 PM ----------

Thanks again for your help ... the difference in processing time is dramatic ... from almost 6 hours down to about 20 minutes ... for over 3 million lines across roughly 35K files

---------- Post updated at 02:21 PM ---------- Previous update was at 01:56 PM ----------

I have a slightly different issue with delimiters. I've been isolating and replacing the delimiters with grep, cut, and sed, but I know there is a much better way to do this with awk or even perl.

In my file the token separator can be any non-word character; the line terminator will also be a non-word character (the rule dictates that these two characters must be different). To add to the issue, user-entered fields can include non-word characters, as long as those characters aren't the ones used for the token/line separators.

Example record
ISA�3213�part-number�address~GS�56756�control{number~ST�09898~
 
Into 
ISA|3213|part-number|address
GS|56756|control{number
ST|09898

I need to replace the token separator � with a more standard delimiter like |
I need to replace the line terminator ~ with a line feed (0D0A)

However, � and ~ change from record to record and there are thousands of them in a file. I've csplit the files such that each file has the same token/field separator and line terminator.

Let me know if I can explain this more clearly!
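
(Side note: if you can isolate the two characters for a given chunk — each csplit chunk uses a single pair, per the post above — tr translates single characters literally, so a minimal sketch could be the line below. sep, term, and chunk are hypothetical variables holding the separator, the terminator, and the file name; beware the few characters tr itself treats specially, like \ and a leading -.)

tr "$term" '\n' < "$chunk" | tr "$sep" '|' > "$chunk.cleaned"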

To exclude - as a delimiter, don't put it in there. (things like 0-9 don't count.)
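
If the bracket syntax is the sticking point: inside a bracket expression, - is literal when it's the first or last character, and . / and the space character need no escaping at all, so something like this sketch (untested on your awk) should keep all of them from acting as delimiters:

FS="[^a-zA-Z0-9_./ -]"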

Is there or is there not a newline already where ~ is?

How did you csplit them?

there is no newline where ~ is ... I csplit by the record header "ISA"

... most are on a newline
... but if the file has less than three lines, I know that the record has this issue
... "ISA" always has a non-word character before and after it
... the line terminator is always in position 107 (in my file)

in the shell version of my script, I used an if and sed to put each ISA on its own line and then csplit the file by ISA

I'm not sure how to cleverly add a newline where the ~ is

Could you post what that section actually looks like, then? I'm tempted to say 'screw it' and write a filter in C that translates this dog's breakfast into sanely-separated newline-delimited data, but I'm not sure you have a compiler.

Here are just a few lines, and for simplicity's sake they are all the same record type. I can attach a file if that makes things clearer.

From this ...

        ISA^00^          ^00^          ^ZZ^ABCD TPU       ^ZZ^FGI  TPM       ^110423^1817^U^00401^000139971^0^P^@*GS^FA^ABCD TPU^FGI  TPM^110423^1817^139971^X^002000*ST^997^0001*AK1^PO^916*AK2^830^000053942*AK5^A*AK9^A^1^1^1*SE^6^0001*GE^1^139971*IEA^1^000139971*                                                                 ISA�00�          �00�          �ZZ�ABCD TPU       �ZZ�FGI  TPM       �110423�1817�U�00401�000139972�0�P�@|GS�FA�ABCD TPU�FGI  TPM�110423�1817�139972�X�002000|ST�997�0001|AK1�PO�916|AK2�830�000053943|AK5�A|AK9�A�1�1�1|SE�6�0001|GE�1�139972|IEA�1�000139972|                                                                 ISA|00|          |00|          |ZZ|ABCD TPU       |ZZ|FGI  TPM       |110423|1817|U|00401|000139973|0|P|@*GS|FA|ABCD TPU|FGI  TPM|110423|1817|139973|X|002000*ST|997|0001*AK1|PO|916*AK2|830|000053944*AK5|A*AK9|A|1|1|1*SE|6|0001*GE|1|139973*IEA|1|000139973*

To this ...

ISA^00^          ^00^          ^ZZ^ABCD TPU       ^ZZ^FGI  TPM       ^110423^1817^U^00401^000139971^0^P^@
GS^FA^ABCD TPU^FGI{TPM^110423^1817^139971^X^002000
ST^997^0001
AK1^PO^916
AK2^830^000053942
AK5^A
AK9^A^1^1^1
SE^6^0001
GE^1^139971
IEA^1^000139971
ISA�00�          �00�          �ZZ�ABCD TPU       �ZZ�FGI  TPM       �110423�1817�U�00401�000139972�0�P�@
GS�FA�ABCD-TPU�FGI  TPM�110423�1817�139972�X�002000
ST�997�0001
AK1�PO�916
AK2�830�000053943
AK5�A
AK9�A�1�1�1
SE�6�0001
GE�1�139972
IEA�1�000139972
ISA|00|          |00|          |ZZ|ABCD TPU       |ZZ|FGI  TPM       |110423|1817|U|00401|000139973|0|P|@
GS|FA|ABCD,TPU|FGI  TPM|110423|1817|139973|X|002000
ST|997|0001
AK1|PO|916
AK2|830|000053944
AK5|A
AK9|A|1|1|1

---------- Post updated at 05:56 PM ---------- Previous update was at 05:50 PM ----------

It can be beastly ... there are commercial applications that do this kind of thing, but politics and red tape ... it's much faster for me to write a script

although my skills to do so need work ... again, I can't say it enough, thanks for your help

Yes, but do you have a C compiler? You're pretty much going to need to parse this character-by-character to get the delimiters right, and that's not efficient in any language but C.

no Corona688, I don't have a C compiler :frowning:

If it helps, the token separator (pos 4) and the line terminator (pos 107) are always in the same positions in the first line (ISA) ... the ISA line is fixed length.

This is how I'd do it in my inefficient command-line way ... however, sed doesn't work well with some delimiters

#!/usr/bin/ksh
perl -p0e 's/[\W_]ISA[\W_]/\n$&/g' orig_file>all_isas
csplit -ks -n 6 -f isa_each all_isas /^ISA/ {100000}
ls -1 isa_each*>filelist

for file in `ls -1 isa_each*`;
do
  # read the two delimiter characters from this file's ISA line
  delimseg=`grep "^ISA" $file|cut -c4`
  delimline=`grep "^ISA" $file|cut -c107`
  # double quotes so the variables actually expand; "x0D" is a stand-in
  # for the carriage return I really want here
  sed "s/$delimseg/~/g" $file | sed "s/$delimline/x0D/g" >$file.cleaned
done
 
and then I'd run your awk script

I was able to use your code to work with the first scenario but I don't understand awk enough to make it work in this scenario
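
For the record, the character-by-character pass Corona688 describes doesn't have to be in C. Here's a minimal awk sketch of the same idea, assuming (per the posts above) the token separator sits at position 4 and the line terminator at position 107 of each chunk's ISA line; the isa_each* file names follow the earlier script:

for file in isa_each*
do
  awk 'NR==1 { sep=substr($0,4,1); term=substr($0,107,1) }
  {
    out=""
    for(i=1; i<=length($0); i++) {
      c=substr($0,i,1)
      if(c==sep) out=out "|"          # token separator -> |
      else if(c==term) out=out "\n"   # line terminator -> newline
      else out=out c
    }
    # each record is usually one long physical line; the terminator
    # supplies the newlines in the output
    printf("%s", out)
  }' "$file" > "$file.cleaned"
done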