Help to improve speed of text processing script

Hi everyone,

You should know that I'm relatively new to shell scripting, so my solution is probably a little awkward.

Here is the script:

#!/bin/bash

live_dir=/var/lib/pokerhands/live

for limit in `find $live_dir/ -type d  | sed -e s#$live_dir/##`; do
    cat $live_dir/$limit/* > $limit
    
    declare -a lines
    OIFS="$IFS"
    IFS=$'\n'
    set -f   # cf. help set
    lines=($(< "$limit"))
    set +f
    IFS="$OIFS"
    
    i=0
    count=0

    while [ $i -le ${#lines[@]} ]; do

        count=$[count+1]
        touch test_$count.txt
        
        while [ `ls -al test_$count.txt | awk '{print $5}'` -le 1048576 -a $i -le ${#lines[@]} ]; do
            i=$[i+1]
            if [ `expr "${lines[${i}]}" : '#Game No.*'` != 0 ]; then
                while [ `expr "${lines[${i}]}" : '.*wins.*'` = 0 ]; do
                    i=$[i+1]
                    echo "${lines[${i}]}" >> test_$count.txt
                done
                echo "" >> test_$count.txt
            fi
        done
        
    done
done

This script splits an input file into ~1 MB parts without breaking apart the data blocks.

The data blocks of the input file look something like this:

#Game No : 8273167998 
***** Hand History for Game 8273167998 *****
$100 USD NL Texas Hold'em - Saturday, July 25, 11:34:58 EDT 2009
Table Deep Stack #1459548 (No DP) (Real Money)
Seat 6 is the button
Total number of players : 6 
Seat 5: Ducilator ( $128.60 USD )
Seat 4: EvilAdj ( $145.66 USD )
Seat 3: Ice81111 ( $78.60 USD )
Seat 6: RicsterM ( $292.48 USD )
Seat 1: Techno1990 ( $141.06 USD )
Seat 2: pdiloop ( $100 USD )
Techno1990 posts small blind [$0.50 USD].
pdiloop posts big blind [$1 USD].
** Dealing down cards **
Ice81111 folds
EvilAdj folds
Ducilator raises [$4 USD]
RicsterM folds
Techno1990 folds
pdiloop folds
Ducilator does not show cards.
Ducilator wins $5.50 USD

It works so far, but the problem is the speed: for a ~20 MB input file it runs for several hours.

What makes it so slow?

Can anyone help me to improve the speed?

From the code it looks like you ARE an experienced scripter.
But did you try the basics?
Run your script in debug mode:
ksh -x
It could be any of those commands.
See which line takes the most time. Simple.

Also, why do you need the "sed" in the first loop?
At first glance it looks like you strip the directory prefix only to add it back in the "cat".
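
For example, a minimal sketch (my assumption: a single level of limit sub-directories, as in your layout) that drops both the backticked find and the sed:

for dir in "$live_dir"/*/; do
    limit=$(basename "$dir")      # e.g. "100NL" -- hypothetical directory name
    cat "$dir"* > "$limit"        # concatenate that limit's hand files
done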

What are the conditions to split the file? There are probably other approaches to speed up the process.

Regards

This is the price you pay when you don't follow the rules and use temp files and useless commands ... time :cool:

Please post a sample data file and the required output, and we can suggest a different approach.

Yeah, I realized that, so I'm just asking for some hints toward a better, faster solution :slight_smile:

Please note that I'm relatively new to shell scripting. In fact, this is only my second script.

I have a large input text file (approx. 20 MB) and want to split it into 1 MB output files. The input file consists of data blocks that must not be broken apart during the split (I posted a short sample of such a block in my first post).

Yeah, that's a good idea. I attached my test environment to this post.
It has the following structure:

testenv/
|-- output        <-output folder
|   |-- output_1.txt    
|   |-- output_2.txt    <-output files
|   |-- output_3.txt
|   `-- output_4.txt
|-- input.txt        <-input file
`-- split.sh        <-script file

I reduced the script to the essentials and set the output file size to 100 KB to demonstrate the principle and make it easier to see what I want to do.

In this example I let the script run for ~5 minutes; in that time it processed ~400 KB of the 25 MB input and produced these 4 files.

Thanks in advance for your help :slight_smile:

Just run your script as requested earlier.

ksh -x your.sh

You can find out for yourself which line is taking the most time.

You know the movie The Matrix? That's the speed at which the characters fly across my screen when I use "ksh -x". That really doesn't help me much :frowning:

I think the problem is that I run the "expr" command on every single line of the input file.
That means every iteration takes ~0.03 s; on an input file with 800,000 lines the whole process takes ~7 hours.
Each iteration would have to take ~0.0003 s to get an acceptable result.

Is there perhaps a faster command than "expr" that can do the same (regexp) matching?

Since you have reminded me of the movie 'The Matrix', I am all charged up to answer your question, to an extent at least :wink:

${#lines[@]}

If this value is not modified inside the loop, assign it to a variable once and reuse that, instead of recomputing it on every iteration.
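
A minimal sketch of that change (the variable name is just an example):

lines_tot=${#lines[@]}        # computed once, before the loops
i=0
while [ "$i" -lt "$lines_tot" ]; do
    i=$((i + 1))              # loop body goes here
done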

echo "${lines[${i}]}" >> test_$count.txt
done
echo "" >> test_$count.txt

Writing to a file inside the loop body greatly reduces the performance of the script; what happens for every write call is ...

open file
write data
close file

Ideally what should be done is

open file
write data
write data
.
.
.
close file

Instead, redirect the output at the outer block, something like

while [ condition ]
do
# check and do some processing
done > $output_file

With this method there are roughly n + 2 file operations for 'n' units of write,

instead of

while [ condition ]
do
# check and do some processing
# write to a file  > $output_file
done

With this method there are 3n calls for 'n' units of write, which scales badly as 'n' grows ...
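
A rough way to see the difference yourself (just a sketch; the 100,000 iterations and the file names are placeholders):

# re-opens the file on every single iteration
time for ((i = 0; i < 100000; i++)); do
    echo "line $i" >> out_slow.txt
done

# opens the file once for the whole loop
time for ((i = 0; i < 100000; i++)); do
    echo "line $i"
done > out_fast.txt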

lorus,

why are we trying to write a split script?! Let the split command do the job:

split -db 1m InFile OutFile

Output: it creates files like
OutFile00
OutFile01
...

Hehe, that's good to know. I just have to mention something about The Matrix in every post to get your help :smiley:

Your suggestions make absolute sense, so I rewrote it like this:

#!/bin/bash

declare -a lines
OIFS="$IFS"
IFS=$'\n'
set -f   # cf. help set
lines=($(< "input.txt"))
set +f
IFS="$OIFS"

splitsize=102400
i=0
count=0
lines_tot=${#lines[@]}

while [ $i -le $lines_tot ]; do
    count=$[count+1]
    touch output/output_$count.txt
    
    while [ `ls -al output/output_$count.txt | awk '{print $5}'` -le $splitsize -a $i -le ${#lines[@]} ]; do
        i=$[i+1]
        if [ `expr "${lines[${i}]}" : '#Game No.*'` != 0 ]; then
            while [ `expr "${lines[${i}]}" : '.*wins.*'` = 0 ]; do
                i=$[i+1]
                echo "${lines[${i}]}"
            done
            echo ""
        fi
    done >> output/output_$count.txt
    
done

But the file open/close overhead doesn't seem to be the time thief.

A simple

while [ $i -le $lines_tot ]; do

    i=$[i+1]
    echo "${lines[${i}]}" >> output_test.txt

done

processes the whole file in just a few seconds.

So the time thief must be the "expr" command inside the inner loop. Is there an equivalent command that is faster?
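
For what it's worth, bash has built-in pattern matching that can do these tests without forking an external expr for every line. A minimal sketch (the variable is hypothetical):

line="#Game No : 8273167998"

# built-in regex test (bash 3+), no external process
if [[ $line =~ ^#Game\ No ]]; then
    echo "block start"
fi

# portable glob test with case, also no external process
case $line in
    *wins*) echo "block end" ;;
esac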

Because "split" cuts at static points and that would destroy the structure of my file, doesn't it?

Good one ! :slight_smile:

This will definitely have an impact, and the benefit will scale accordingly with larger files.

What exactly is the operation performed? Can you please give an example?

My input file consists of blocks like the following:

#Game No : 8273167998 
***** Hand History for Game 8273167998 *****
$100 USD NL Texas Hold'em - Saturday, July 25, 11:34:58 EDT 2009
Table Deep Stack #1459548 (No DP) (Real Money)
Seat 6 is the button
Total number of players : 6 
Seat 5: Ducilator ( $128.60 USD )
Seat 4: EvilAdj ( $145.66 USD )
Seat 3: Ice81111 ( $78.60 USD )
Seat 6: RicsterM ( $292.48 USD )
Seat 1: Techno1990 ( $141.06 USD )
Seat 2: pdiloop ( $100 USD )
Techno1990 posts small blind [$0.50 USD].
pdiloop posts big blind [$1 USD].
** Dealing down cards **
Ice81111 folds
EvilAdj folds
Ducilator raises [$4 USD]
RicsterM folds
Techno1990 folds
pdiloop folds
Ducilator does not show cards.
Ducilator wins $5.50 USD


First I search for the start of a block with this expression: '#Game No.*'

if [ `expr "${lines[${i}]}" : '#Game No.*'` != 0 ]; then

Then I output all following lines until a line matches this expression: '.*wins.*'

while [ `expr "${lines[${i}]}" : '.*wins.*'` = 0 ]; do
    i=$[i+1]
    echo "${lines[${i}]}"
done

The loop around this checks whether the current output file size has reached the split limit:

while [ `ls -al output/output_$count.txt | awk '{print $5}'` -le $splitsize -a $i -le ${#lines[@]} ]; do
       ...
done >> output/output_$count.txt

So the `expr "${lines[${i}]}" : '.*wins.*'` command is executed for every single line of the input file; that's ~800,000 times.
At 0.03 s per iteration that is roughly 800,000 × 0.03 s = 24,000 s, i.e. the ~7 hours for the whole process.

Have you considered csplit? Assuming the average size of a block is 300 bytes, that's 1,024,000 / 300 ≈ 3,413 blocks per 1 MB:

csplit -k myinputfilename '/^#Game/-1' '{3413}'
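
(Note: by default csplit names its pieces xx00, xx01, ...; a hypothetical variation with a custom prefix:)

csplit -k -f hands_ myinputfilename '/^#Game/-1' '{3413}'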

I forgot to say that the length of each block varies.

The posted one is just an example.

Sorry, I could not dig much into your script.

So I wrote a quick one for your case that splits into blocks on the fly, scanning each line without having to store any data or predetermine anything.

I hope this improves performance. If you have some time, please post some stats.

#! /opt/third-party/bin/perl

use strict;
use warnings;

my ($input_file) = @ARGV;
open(my $lfh, '<', $input_file) or die "Unable to open file:$input_file <$!>\n";

my $start = 0;
my $curr_file_number = 1;
my $rfh;
my $data;
while ( $data = <$lfh> ) {
    print $rfh $data if ( $start == 1 );

    if ( $data =~ /#Game No :/ ) {
        my $running_file_name = "tyrant_" . $curr_file_number;
        open($rfh, '>', $running_file_name) or die "Unable to open file : $running_file_name <$!>\n";
        $start = 1;
        print $rfh $data;
        next;
    }
    if ( $data =~ / wins / ) {
        close($rfh) or die "Unable to close file <$!>\n";
        $start = 0;
        $curr_file_number++;
    }
}

close($lfh);

Ah, a Perl script. I've tried to avoid learning Perl, but if it's significantly faster than the bash tools, maybe there's no way around it?

# ./psplit.sh input.txt
Unable to close file <Bad file descriptor>

But it does generate some files:

# ls -l
total 24140
-rwxrwxrwx 1 root root 24608349 2009-07-26 16:30 input.txt
drwsrwsrwt 2 root root      101 2009-07-27 11:20 output
-rwxr-xr-x 1 root root      708 2009-07-27 13:29 psplit.sh
-rwxrwxrwx 1 root root      596 2009-07-27 11:11 split.sh
-rwxr-xr-x 1 root root      265 2009-07-27 11:34 split_test.sh
-rw-r--r-- 1 root root      683 2009-07-27 13:32 tyrant_1
-rw-r--r-- 1 root root      825 2009-07-27 13:32 tyrant_10
-rw-r--r-- 1 root root      609 2009-07-27 13:32 tyrant_11
-rw-r--r-- 1 root root      605 2009-07-27 13:32 tyrant_12
-rw-r--r-- 1 root root      777 2009-07-27 13:32 tyrant_13
-rw-r--r-- 1 root root     1001 2009-07-27 13:32 tyrant_14
-rw-r--r-- 1 root root      695 2009-07-27 13:32 tyrant_15
-rw-r--r-- 1 root root      747 2009-07-27 13:32 tyrant_16
-rw-r--r-- 1 root root      848 2009-07-27 13:32 tyrant_17
-rw-r--r-- 1 root root      631 2009-07-27 13:32 tyrant_18
-rw-r--r-- 1 root root      664 2009-07-27 13:32 tyrant_19
-rw-r--r-- 1 root root      767 2009-07-27 13:32 tyrant_2
-rw-r--r-- 1 root root      804 2009-07-27 13:32 tyrant_20
-rw-r--r-- 1 root root      655 2009-07-27 13:32 tyrant_21
-rw-r--r-- 1 root root      784 2009-07-27 13:32 tyrant_22
-rw-r--r-- 1 root root      628 2009-07-27 13:32 tyrant_23
-rw-r--r-- 1 root root     1040 2009-07-27 13:32 tyrant_24
-rw-r--r-- 1 root root      813 2009-07-27 13:32 tyrant_3
-rw-r--r-- 1 root root      810 2009-07-27 13:32 tyrant_4
-rw-r--r-- 1 root root      679 2009-07-27 13:32 tyrant_5
-rw-r--r-- 1 root root     1078 2009-07-27 13:32 tyrant_6
-rw-r--r-- 1 root root      949 2009-07-27 13:32 tyrant_7
-rw-r--r-- 1 root root      962 2009-07-27 13:32 tyrant_8
-rw-r--r-- 1 root root      810 2009-07-27 13:32 tyrant_9

Each file has one block inside. What I wanted was to split the input file into 1 MB files, each containing as many blocks as fit into 1 MB.

Try this and play around with the number (1000000) to get the desired size:

awk 'BEGIN{c=1}
/Hand History/{f=1;if(size>1000000){close("output_" c);size=0;c++}}
/Game #/{print "" > "output_" c;f=0}
f{print > "output_" c; size+=length}'  file

Regards

Wow, impressive, big thanks ... now I'm going to figure out that syntax :slight_smile:

If this works, then fine; otherwise what Madhan wrote sounds good.
All you have to do is make a small change to his code.

 
if ( $data =~ / wins / ) {
        close($rfh) or die "Unable to close file <$!>\n";
        $start = 0;
        $curr_file_number++;
    }

Like you, I have been avoiding Perl for a long time, so check my code and correct it if needed.

 
if ( $data =~ / wins / ) {
    $group++;
    if ( $group == 100 ) {
        close($rfh) or die "Unable to close file <$!>\n";
        $start = 0;
        $group = 0;
        $curr_file_number++;
    }
}

Make sure $group is defined at the top of the script.
All I am trying to do is put 100 blocks into one output file instead of one block per file.
Change the 100 to your requirement.
I wonder which approach will be faster.
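
The same count-based grouping could also be done directly in awk; a sketch, not from this thread (the chunk_ prefix and the 100-blocks-per-file figure are placeholders):

awk '/#Game No/ { if (blocks % 100 == 0) { if (out) close(out); out = "chunk_" (++files) }
                  blocks++ }
     out        { print > out }' input.txt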

:b: I just spun your solution up for my environment:

# time awk 'BEGIN{c=1}
/Hand History/{f=1;if(size>1000000){close(file);size=0;c++}}
/Game #/{file="file_" c;print > file ;f=0}
f{print > file; size+=length}'  input.txt

real    0m29.823s
user    0m4.481s
sys     0m23.953s