Improve script - slow process with big files

jiam912 · January 26, 2017, 3:00am

Gents,

Could you please help me make this script faster? It works perfectly, but with big files it takes a long time to finish the job.

The problem seems to be in the while loop; that is where the script spends most of its time.

If you can find a better way to do it, that would be great.

Input file = 16.txt
output files = 16.ss01 16.xx01
rawsps = script

Attached. ( information ).

Thanks for your help.

joker · January 26, 2017, 3:17am

Hi jiam,

thanks for not inserting the files into this forum because it is too much.

Furthermore, it's annoying for me to have to download something and install an unpacking program before I can read the files. May I politely ask you to use a pasting service instead? Maybe this one:

New paste - Fedora Project Pastebin

Regarding the script:

Even though I'm not acquainted with csh, these are my recommendations:

You have a lot of external awk calls in your loop. That's one reason your program is so slow. For example:

        set lineinfo = `cat info_records.list|head -$i|tail -1`
        set tap = `echo $lineinfo|awk '{print $1;}'`
        set rec = `echo $lineinfo|awk '{print $2;}'`
        set lin = `echo $lineinfo|awk '{print $3;}'`
        set pnt = `echo $lineinfo|awk '{print $4;}'`
        set spx = `echo $lineinfo|awk '{print $5;}'`
        set spy = `echo $lineinfo|awk '{print $6;}'`
        set spz = `echo $lineinfo|awk '{print $7;}'`
        set tim = `echo $lineinfo|awk '{print $8;}'`
        set lfr = `echo $lineinfo|awk '{print $9;}'`
        set lto = `echo $lineinfo|awk '{print $10;}'`

If you create one awk-program it will be a lot faster.
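
As a rough illustration of that idea, here is a minimal sketch in sh where a single awk process reads every record, so nothing is started per line. The sample records and the function names make_sample and summarise_records are made up for this example; only the field order (tap, rec, lin, pnt, ..., lfr, lto) is taken from the variable list above.

```shell
# Hypothetical stand-in for info_records.list: one record per line, 10 fields.
make_sample() {
    printf '%s\n' \
        '4 1001 67609 30835 240038.1 2786615.9 373.8 1154321000 4 149' \
        '4 1002 67609 30841 240113.1 2786612.8 373.7 1154321060 150 295'
}

# One awk process handles all records. awk has already split each line
# into $1..$10, so no head/tail/echo/awk chain is needed per field.
summarise_records() {
    awk '{
        rec = $2; lin = $3; pnt = $4
        lfr = $9; lto = $10
        printf "rec=%s line=%s point=%s lines=%d-%d\n", rec, lin, pnt, lfr, lto
    }'
}

make_sample | summarise_records
```

With the real data, the input would come from info_records.list instead of make_sample, and the per-record work would go inside the awk action.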

Regards,
stomp

jiam912 · January 26, 2017, 3:51am

Hello Stomp.

Thanks for your answer,

Could you please help me write the awk program?

RudiC · January 26, 2017, 6:25am

To expand on what stomp already said: your script seems to read the input file n+1 times. It reads it once to detect the number of reports (= n) and their respective locations (one sed and one awk invocation), then uses a shell loop to "extract" and analyse each single report, invoking awk 13 times and sed 4 times per iteration (i.e. 130 awk and 40 sed invocations for the sample file with 10 reports).

No surprise this will be somewhat lengthy on large files with many reports...

jiam912 · January 26, 2017, 6:33am

Hi RudiC,

Sometimes I have files with more than 20000 lines; in that case it takes a long time to finish the job.
Kindly, could you please help me to improve it.
Appreciate your help.

rbatte1 · January 26, 2017, 6:34am

The csh shell has many known issues and I would strongly recommend that you don't use it.

I've not read the whole thing, but assuming that you are working through the file, line by line and this design is actually reading the whole file each and every time round your loop, calling 10 processes to split up your line, how about this construct instead:-

while read tap rec lin pnt spx spy spz tim lfr lto unused
do
   whatever_you_need_here
done < info_records.list

Okay, so this is sh / ksh / bash, but it is far neater and has far lower overheads. I've not got a true csh available and the manual page I have uses all sorts of bash phrasing so it is only imitating some csh scripting so I can't really test anything I write in csh.
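
To make the while read construct above concrete, here is a small self-contained sketch. The two records and the function name list_points are invented for illustration; in the real script the loop would read from info_records.list as shown above.

```shell
# Hypothetical two-record stand-in for info_records.list.
records='4 1001 67609 30835 240038.1 2786615.9 373.8 1154321000 4 149
4 1002 67609 30841 240113.1 2786612.8 373.7 1154321060 150 295'

# read splits each line on whitespace into the named variables; "unused"
# soaks up any extra trailing fields. No external process runs per line.
list_points() {
    while read -r tap rec lin pnt spx spy spz tim lfr lto unused
    do
        echo "$lin/$pnt spans input lines $lfr to $lto"
    done
}

printf '%s\n' "$records" | list_points
```

The whole file is read once, and the ten variables are filled by the shell itself rather than by ten awk calls.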

You might be better with this csh mangle instead of calling awk all over the place:-

foreach line ( "`cat input_file`" )
   set parsed = ($line)
   set tap = $parsed[1]
   set rec = $parsed[2]
   set lin = $parsed[3]
   :
   :
end

This reads the file once, and for each line it splits up the record to the variables you want without calling external commands (except the cat that I can't find a way to remove)

Overall though, it is worth the effort to convert to sh based scripts. I hope your code is not riddled with goto statements like I have had to decipher before. That poor programming can leave serious headaches in re-designing.

I hope that this helps,
Robin

RudiC · January 26, 2017, 8:58am

You might want to try this one. Due to the input file structure, it must be read twice - once to identify the respective reports, another time to extract the data and produce the output files.
As you can see below, the X output exactly matches your sample output. The S file doesn't, as I don't understand your date/time function and thus can't replicate it. No more shell loops, no sed, and just two awk invocations; I'd guess it should save serious amounts of time. Please report back.

awk -F: '
BEGIN   {FMT = "%d %d %d %d %11.1f %11.1f %11.1f %s %010d %010d\n"
         for (n = split ("Tape_Nb:File_Nb:Line_Name:Point_Number:Cog_Easting:Cog_Northing:Cog_Elevation:Tb_GPS_Time", IX); n>0; n--) SRCH[IX[n]]
        }

$1 == "Observer_Report "        {if (flag)      printf FMT,     OUT[IX[1]], OUT[IX[2]], OUT[IX[3]], OUT[IX[4]],
                                                                OUT[IX[5]], OUT[IX[6]], OUT[IX[7]], OUT[IX[8]], from, to
                                 delete OUT
                                 from = NR
                                 flag = 1
                                }

                {gsub (/[ \t]/, _)
                 to = NR
                }

$1 in SRCH      {OUT[$1] = $2
                 if ($1 ~ /Tb_GPS_Time/)        OUT[$1] = substr($2,2,16)
                }

END     {printf FMT,    OUT[IX[1]], OUT[IX[2]], OUT[IX[3]], OUT[IX[4]],
                        OUT[IX[5]], OUT[IX[6]], OUT[IX[7]], OUT[IX[8]], from, to
        }
' /tmp/16.txt |
awk -F[:-\(] '
BEGIN           {HD1 = "H26 5678901234567890123456789012345678901234567890123456789012345678901234567890"
                 HD2 = "H26      1         2         3         4         5         6         7          "
                }
NR == 1         {print HD1 RS HD2 > XFILE
                 print HD1 RS HD2 > SFILE
                 }

FNR == NR       {OR[NR] = $0
                 MX = NR
                 next
                }
FNR > NXTREP ||
FNR == 1        {n = split (OR[++OCNT], T, " ")
                 NXTREP = T[n] + 0
                 printf "S%10.2f%10.2f%3d1                     %9.1f%10.1f%6.1f%09d\n", T[3], T[4], 1, T[5], T[6], T[7], T[8] > SFILE
                }

                {sub (/^[ \t]*/, _)
                 sub (/ *: */, ":")
                }


$1 ~ /^Live_Seis/       {DATA = 1
                         sub (/Live_Seis[^:]*:/, _)
                        }
/[^0-9:() -]/           {DATA = 0
                        }
DATA                    {printf "X%6d%8d11%10.2f%10.2f%1d%5d%5d1%10.2f%10.2f%10.2f1\n", T[1], T[2], T[3], T[4], 1, $4, $5, $1, $2, $3 > XFILE 
                        }
' XFILE="xfile" SFILE="sfile" - /tmp/16.txt

diff xfile /tmp/16.xx01   # no output from diff -> no difference!
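
The second awk above reads two inputs in sequence: the report list on stdin (the "-" argument) and then the data file, telling them apart with FNR == NR. The sketch below shows just that idiom on toy data; the index format, the file contents and the names make_index, make_data and label_lines are assumptions for illustration, not the real report layout.

```shell
# Hypothetical index: one line per report, "name first_line last_line".
make_index() {
    printf '%s\n' 'A 1 2' 'B 3 5'
}

# Hypothetical data file with five lines.
make_data() {
    printf '%s\n' a1 a2 b1 b2 b3
}

# awk reads "-" (stdin, the index) first. While FNR == NR we are still in
# that first input, so each index line is stored and skipped. In the second
# input FNR restarts at 1, and the stored ranges tell which report each
# data line belongs to.
label_lines() {
    awk '
    FNR == NR                   {name[NR] = $1; last[NR] = $3; next}
    FNR == 1 || FNR > last[cur] {cur++}
                                {print name[cur] ": " $0}
    ' - "$1"
}

tmp=$(mktemp)
make_data > "$tmp"
make_index | label_lines "$tmp"
rm -f "$tmp"
```

This is why the input file only needs to be read twice in total, regardless of how many reports it contains.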

jiam912 · January 26, 2017, 10:02am

Dear RudiC,

Thanks a lot for this great job.

I think I am missing something, because when I use the code I get the following output for sfile:

H265678901234567890123456789012345678901234567890123456789012345678901234567890
H26      1         2         3         4         5         6         7          
S      0.00      0.00  11                           0.0       0.0   0.0000000004
S      0.00      0.00  11                           0.0       0.0   0.0000000150
S      0.00      0.00  11                           0.0       0.0   0.0000000296
S      0.00      0.00  11                           0.0       0.0   0.0000000442
S      0.00      0.00  11                           0.0       0.0   0.0000000588
S      0.00      0.00  11                           0.0       0.0   0.0000000734
S      0.00      0.00  11                           0.0       0.0   0.0000000880
S      0.00      0.00  11                           0.0       0.0   0.0000001026
S      0.00      0.00  11                           0.0       0.0   0.0000001172
S      0.00      0.00  11                           0.0       0.0   0.0000001318

So I don't have the data for column 2 and the others.

Could you please send me the output you got?

Thanks and regards

RudiC · January 26, 2017, 10:20am

This is what I get for SFILE:

H26 5678901234567890123456789012345678901234567890123456789012345678901234567890
H26      1         2         3         4         5         6         7          
S  67609.00  30835.00  11                      240038.1 2786615.9 373.82147483647
S  67609.00  30841.00  11                      240113.1 2786612.8 373.72147483647
S  67607.00  30841.00  11                      240111.7 2786588.4 373.92147483647
S  67605.00  30841.00  11                      240111.1 2786562.3 374.32147483647
S  67603.00  30841.00  11                      240116.1 2786537.1 374.42147483647
S  67609.00  30851.00  11                      240237.3 2786613.9 373.32147483647
S  67609.00  30491.00  11                      235736.9 2786612.1 368.72147483647
S  67607.00  30491.00  11                      235734.3 2786587.1 369.32147483647
S  67605.00  30491.00  11                      235737.1 2786561.2 368.72147483647
S  67603.00  30491.00  11                      235738.4 2786539.5 367.92147483647

Except for the last column which is the difficult date/time info, it is identical to your sample output. Did you test with your sample file from post#1?

jiam912 · January 26, 2017, 2:48pm

Dear RudiC,

Yes, I used the same sample file, but I really don't understand where the issue is. I also tried converting it to Unix format, but it still does not work.

RudiC · January 26, 2017, 3:28pm

After some cogitating about GPS -> UTC date/time conversion, I could replicate the date/time column in your S file using GNU date 8.25 (although I still don't understand what you're after here). Both output files now are identical to the ones you attached in post#1. Try:

awk -F: '
BEGIN                   {FMT = "date +\"%d %d %d %d %11.1f %11.1f %11.1f 0%%d%%H%%M%%S %010d %010d\" -d@%s\n"
                         for (n = split ("Tape_Nb:File_Nb:Line_Name:Point_Number:Cog_Easting:Cog_Northing:Cog_Elevation:Tb_GPS_Time", IX); n>0; n--) SRCH[IX[n]]
                        }

$1 ~ /^Observer_Report/ {if (flag)      printf FMT,     OUT[IX[1]], OUT[IX[2]], OUT[IX[3]], OUT[IX[4]],
                                                        OUT[IX[5]], OUT[IX[6]], OUT[IX[7]], from, to, OUT[IX[8]] + 315961200 + 10783    # epoch = GPS + 6.1.1980 + 3h - 17 sec
                         delete OUT
                         from = NR
                         flag = 1
                        }

                        {gsub (/[ \t]/, _)
                         to = NR
                        }

$1 in SRCH              {OUT[$1] = $2
                        }
$1 == IX[8]             {OUT[$1] = substr($2,1,10)
                        }

END                     {printf FMT,    OUT[IX[1]], OUT[IX[2]], OUT[IX[3]], OUT[IX[4]],
                                        OUT[IX[5]], OUT[IX[6]], OUT[IX[7]], from, to, OUT[IX[8]] + 315961200 + 10783                    # epoch = GPS + 6.1.1980 + 3h - 17 sec
                        }
' /tmp/16.txt |

sh |

awk -F[:-\(] '
BEGIN                   {HD1 = "H26 5678901234567890123456789012345678901234567890123456789012345678901234567890"
                         HD2 = "H26      1         2         3         4         5         6         7          "
                        }
NR == 1                 {print HD1 RS HD2 > XFILE
                         print HD1 RS HD2 > SFILE
                         }

FNR == NR               {OR[NR] = $0
                         MX = NR
                         next
                        }
FNR > NXTREP ||
FNR == 1                {n = split (OR[++OCNT], T, " ")
                         NXTREP = T[n] + 0
                         printf "S%10.2f%10.2f%3d1                     %9.1f%10.1f%6.1f%09d\n", T[3], T[4], 1, T[5], T[6], T[7], T[8] > SFILE
                        }

                        {sub (/^[ \t]*/, _)
                         sub (/ *: */, ":")
                        }


$1 ~ /^Live_Seis/       {DATA = 1
                         sub (/Live_Seis[^:]*:/, _)
                        }
/[^0-9:() -]/           {DATA = 0
                        }
DATA                    {printf "X%6d%8d11%10.2f%10.2f%1d%5d%5d1%10.2f%10.2f%10.2f1\n", T[1], T[2], T[3], T[4], 1, $4, $5, $1, $2, $3 > XFILE 
                        }
' XFILE="xfile" SFILE="sfile" - /tmp/16.txt

diff xfile /tmp/16.xx01    # no diff = identical! 
diff sfile /tmp/16.ss01    # no diff = identical!
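
The GPS-to-epoch arithmetic in the first awk can be checked in isolation. The sketch below is only an illustration under stated assumptions: the GPS epoch (1980-01-06 00:00:00 UTC) is Unix second 315964800, the GPS-UTC offset is taken as 17 s as in the comment above (that value applies to data recorded between July 2015 and the end of 2016), and the +3h zone shift is copied from the same comment. The function and variable names are made up, and gps_to_ddhhmmss needs GNU date.

```shell
GPS_EPOCH_UNIX=315964800    # 1980-01-06 00:00:00 UTC in Unix seconds
LEAP_SECONDS=17             # assumed GPS-UTC offset for the recording date
UTC_OFFSET_HOURS=3          # assumed local zone, from the "+ 3h" comment

# Convert whole GPS seconds to Unix (UTC) seconds.
gps_to_unix() {
    echo $(( $1 + GPS_EPOCH_UNIX - LEAP_SECONDS ))
}

# Format as DDHHMMSS in the assumed fixed zone. Using -u plus an explicit
# shift keeps the result independent of the TZ of the machine.
# Requires GNU date for "-d @SECONDS".
gps_to_ddhhmmss() {
    date -u -d "@$(( $(gps_to_unix "$1") + UTC_OFFSET_HOURS * 3600 ))" +%d%H%M%S
}

gps_to_unix 0        # 315964783
```

The sum 315961200 + 10783 used in the script is 7200 s larger than the value gps_to_unix adds, so the remaining hour of the "+ 3h" presumably comes from date printing local time; if so, the script's output may depend on the time zone of the machine it runs on.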

jiam912 · January 27, 2017, 4:06am

Dear RudiC,

Thanks a lot for your help. It works perfectly now.

I have modified the code a little to get the correct value for the point index,

OUT[IX[8]]

and I had to remove the tabs to make the code work.

Here is the latest version:

            read -p " " jd 

sed -i -e "s/[[:space:]]\+/ /g" $jd.txt 

awk -F: '
BEGIN                   {FMT = "date +\"%d %d %d %d %11.1f %11.1f %11.1f %d 0%%d%%H%%M%%S %010d %010d\" -d@%s\n"
                         for (n = split ("Tape_Nb:File_Nb:Line_Name:Point_Number:Cog_Easting:Cog_Northing:Cog_Elevation:Point_Index:Tb_GPS_Time", IX); n>0; n--) SRCH[IX[n]]
                        }

$1 ~ /^Observer_Report/ {if (flag)      printf FMT,     OUT[IX[1]], OUT[IX[2]], OUT[IX[3]], OUT[IX[4]],
                                                        OUT[IX[5]], OUT[IX[6]], OUT[IX[7]], OUT[IX[8]], from, to, OUT[IX[9]] + 315961200 + 10783    # epoch = GPS + 6.1.1980 + 3h - 17 sec
                         delete OUT
                         from = NR
                         flag = 1
                        }

                        {gsub (/[       ]/, _)
                         to = NR
                        }

$1 in SRCH              {OUT[$1] = $2
                        }
$1 == IX[9]             {OUT[$1] = substr($2,1,10)
                        }

END                     {printf FMT,    OUT[IX[1]], OUT[IX[2]], OUT[IX[3]], OUT[IX[4]],
                                        OUT[IX[5]], OUT[IX[6]], OUT[IX[7]], OUT[IX[8]], from, to, OUT[IX[9]] + 315961200 + 10783                    # epoch = GPS + 6.1.1980 + 3h - 17 sec
                        }
' $jd.txt |

sh |

awk -F[:-\(] '
BEGIN                   {HD1 = "H26 5678901234567890123456789012345678901234567890123456789012345678901234567890"
                         HD2 = "H26      1         2         3         4         5         6         7          "
                        }
NR == 1                 {print HD1 RS HD2 > XFILE
                         print HD1 RS HD2 > SFILE
                         }

FNR == NR               {OR[NR] = $0
                         MX = NR
                         next
                        }
FNR > NXTREP ||
FNR == 1                {n = split (OR[++OCNT], T, " ")
                         NXTREP = T[n] + 0
                         printf "S%10.2f%10.2f%3d1                     %9.1f%10.1f%6.1f%09d\n", T[3], T[4], T[8], T[5], T[6], T[7], T[9] > SFILE
                        }

                        {sub (/^[       ]*/, _)
                         sub (/ *: */, ":")
                        }


$1 ~ /^Live_Seis/       {DATA = 1
                         sub (/Live_Seis[^:]*:/, _)
                        }
/[^0-9:() -]/           {DATA = 0
                        }
DATA                    {printf "X%6d%8d11%10.2f%10.2f%1d%5d%5d1%10.2f%10.2f%10.2f1\n", T[1], T[2], T[3], T[4], T[8], $4, $5, $1, $2, $3 > XFILE 
                        }
' XFILE="$jd.x" SFILE="$jd.s" - $jd.txt

Appreciate your help

RudiC · January 27, 2017, 4:43am

The sed should not be necessary: the bracket expression in the gsub call contained a space and a <TAB>, so it should remove both. Perhaps the <TAB> got lost in transfer.
Why do you use the GPS date/time stamp and its (OK, not too) complicated transformation to UTC, if the clear text date/time is available in the "Date" record?
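
One way to keep the <TAB> from getting lost when code is pasted is to spell it as \t inside the bracket expression. A minimal sketch; the function name strip_blanks and the sample line are made up for illustration:

```shell
# Strip every space and TAB from each line. Writing the TAB as \t keeps it
# from being silently turned into spaces when the code is copied around.
strip_blanks() {
    awk '{gsub(/[ \t]/, ""); print}'
}

printf 'Line_Name \t:\t 67609\n' | strip_blanks    # Line_Name:67609
```

With the TAB written this way, the separate sed pass over the input file should no longer be needed.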

jiam912 · January 27, 2017, 10:02am

Dear RudiC,
I will check why I have problems with the tabs.
I use the GPS time to UTC conversion only to be more precise. You are right that the date/time is already in the file; precision is the only reason why I use the GPS time.
Thanks a lot for your help.