Performance Issue - Shell Script

imrandec85 · July 28, 2016, 5:20am

Hi,

I am a beginner in shell scripting. I have written a script to parse files with a large number of lines, each containing multiple comma-separated strings.
But the script is very slow: it took more than 30 minutes to parse a 120MB file (523564 lines). The script is below.

#!/bin/sh
start=`date +%s`
for FILE in $*
do
`/usr/bin/dos2unix -q $FILE`
for i in $(/bin/cat $FILE); do
        counter=1;
        islng=true;
        seq=1;
        for j in $(echo $i|/bin/sed "s/,/ /g")
        do
                if [[ "$counter" = "1" ]]
                then
                        fm=$j;
                elif [[ "$counter" = "2" ]]
                then
                        to=$j;
                else
                        if [[ "$islng" = "true" ]]
                        then
                                islng=false;
                                lng=$j;
                        else
                                islng=true;
                                lat=$(echo $j|/bin/sed "s/^M//g");
                                echo $fm"|"$to"|"$seq"|"$lng"|"$lat
                                #seq=`expr $seq + 1`;
                                (( seq++ ))
                        fi
                fi
                #counter=`expr $counter + 1`;
                (( counter++ ))
        done
done
done
end=`date +%s`
runtime=$((end-start))
echo $runtime

Input File Example:
---------------------

[oracle@IE1FUX004 crh]$ cat af5_nosr1.int
4888,4891,19.2076,-34.23549,19.2049,-34.23539
4855,4891,19.2026,-34.23579
4888,4893,19.2135,-34.23559,19.2145,-34.23559,19.2152,-34.23559,19.2164,-34.23549,19.2182,-34.23529,19.2191,-34.23519
4706,4893,19.2199,-34.24119,19.2197,-34.24049,19.2195,-34.23989,19.2193,-34.23919,19.2189,-34.23849,19.2189,-34.23809,19.2189,-34.23729,19.2189,-34.23629,19.2189,-34.23619,19.219,-34.23589,19.2192,-34.23569,19.2195,-34.23539
4897,4916,19.256,-34.23519,19.2552,-34.23529,19.254,-34.23519,19.2524,-34.23479,19.25,-34.23429,19.2495,-34.23409,19.2489,-34.23399,19.2479,-34.23369,19.2458,-34.23319,19.2439,-34.23269,19.242,-34.23219,19.2407,-34.23189,19.24,-34.23179,19.2394,-34.23179,19.2388,-34.23179,19.2384,-34.23189,19.2379,-34.23209,19.2374,-34.23229,19.2365,-34.23279,19.2356,-34.23329,19.2348,-34.23369,19.2342,-34.23389,19.2334,-34.23399

Note: 1. The example file above has 5 lines.
2. Each line begins with 2 integer (non-decimal) numbers.

Expected Output(considering only first 3 lines above):
-------------------------------------------------------

4888|4891|1|19.2076|-34.23549
4888|4891|2|19.2049|-34.23539
4855|4891|1|19.2026|-34.23579
4888|4893|1|19.2135|-34.23559
4888|4893|2|19.2145|-34.23559
4888|4893|3|19.2152|-34.23559
4888|4893|4|19.2164|-34.23549
4888|4893|5|19.2182|-34.23529
4888|4893|6|19.2191|-34.23519

The actual files are much larger than 120MB, so I need to fix this. Please suggest!

Thanks,
Imran.

RudiC · July 28, 2016, 6:06am

It is no surprise that the script is slow on large files: it is overcomplicated, duplicates part of its work, and calls external commands (cat, sed) where shell builtins would do.
Does it have to be executed by sh, or is a more advanced shell (bash, ksh) available? Did you consider a text-processing tool such as awk?
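
To illustrate the point about builtins, here is a minimal sketch that stays within plain sh and splits each line with IFS instead of starting sed once per line and per field. The function name parse_pairs is hypothetical, and the sketch assumes every line has the layout shown in the question (two IDs followed by value pairs). It is not tuned or measured; a text-processing tool would probably still be faster.

```shell
# Hypothetical helper: reads comma-separated lines on stdin and prints
# one "from|to|seq|lng|lat" record per value pair, using only builtins
# inside the loop (printf is a builtin in most shells).
parse_pairs() {
    cr=$(printf '\r')                      # literal carriage return
    while IFS= read -r line || [ -n "$line" ]; do
        line=${line%"$cr"}                 # drop a trailing CR (DOS line ending)
        old_ifs=$IFS; IFS=,
        set -f                             # no filename expansion while splitting
        set -- $line                       # split the line on commas into $1, $2, ...
        set +f
        IFS=$old_ifs
        [ "$#" -ge 4 ] || continue         # need two IDs plus at least one pair
        fm=$1 to=$2 seq=1
        shift 2
        while [ "$#" -ge 2 ]; do
            printf '%s|%s|%s|%s|%s\n' "$fm" "$to" "$seq" "$1" "$2"
            seq=$((seq + 1))
            shift 2
        done
    done
}
```

It could be called as parse_pairs < af5_nosr1.int. Because the CR is stripped inside the loop, a separate dos2unix pass should not be needed.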

balajesuri · July 28, 2016, 6:06am

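#!/usr/bin/perl
use strict;
use warnings;

# Input file name taken from the example in the question.
my $file = "af5_nosr1.int";
open(my $fh, "<", $file) or die "Cannot open $file: $!";
while (my $line = <$fh>) {
    $line =~ s/\r?\n$//;    # strip an LF or CRLF line ending
    my @elements = split(/,/, $line);
    my $seq = 1;
    # Fields 0 and 1 are the two IDs; the rest are value pairs.
    # Stop before the last index so $i+1 is always a valid field.
    for (my $i = 2; $i < $#elements; $i += 2) {
        print "$elements[0]|$elements[1]|$seq|$elements[$i]|$elements[$i+1]\n";
        $seq++;
    }
}
close($fh);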

RudiC · July 28, 2016, 6:10am

How about

awk '{SQ=0; for (i=3; i<=NF; i+=2) print $1, $2, ++SQ, $i, $(i+1)}' FS=, OFS="|" file
4888|4891|1|19.2076|-34.23549
4888|4891|2|19.2049|-34.23539
4855|4891|1|19.2026|-34.23579
4888|4893|1|19.2135|-34.23559
4888|4893|2|19.2145|-34.23559
4888|4893|3|19.2152|-34.23559
4888|4893|4|19.2164|-34.23549
4888|4893|5|19.2182|-34.23529
4888|4893|6|19.2191|-34.23519
4706|4893|1|19.2199|-34.24119
4706|4893|2|19.2197|-34.24049
.
.
.

Should there be problems with DOS line terminators (<CR>, \r, 0x0D), remove them by adding sub (/\r$/, ""); in front of the SQ=0 statement.
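
Put together, the one-liner with the CR removal added could look like the sketch below. The wrapper name to_pairs is only there for illustration; the awk program is the one above with the sub() call placed in front of SQ=0.

```shell
# Hypothetical wrapper around the awk one-liner; file names are passed through.
to_pairs() {
    # sub() removes a trailing CR from the record, which also re-splits the fields
    awk '{sub(/\r$/, ""); SQ=0; for (i=3; i<=NF; i+=2) print $1, $2, ++SQ, $i, $(i+1)}' FS=, OFS="|" "$@"
}
```

Usage would be to_pairs af5_nosr1.int > output.txt, with output.txt being a placeholder name.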

imrandec85 · July 28, 2016, 6:44am

I was sure that awk could be used here, but I did not know how.

Thanks for the response.