You should know that I'm relatively new to shell scripting, so my solution is probably a little awkward.
Here is the script:
#!/bin/bash
live_dir=/var/lib/pokerhands/live

for limit in `find $live_dir/ -type d | sed -e s#$live_dir/##`; do
    cat $live_dir/$limit/* > $limit

    declare -a lines
    OIFS="$IFS"
    IFS=$'\n'
    set -f # cf. help set
    lines=($(< "$limit"))
    set +f
    IFS="$OIFS"

    i=0
    count=0
    while [ $i -le ${#lines[@]} ]; do
        count=$((count+1))
        touch test_$count.txt
        while [ `ls -al test_$count.txt | awk '{print $5}'` -le 1048576 -a $i -le ${#lines[@]} ]; do
            i=$((i+1))
            if [ `expr "${lines[$i]}" : '#Game No.*'` != 0 ]; then
                while [ `expr "${lines[$i]}" : '.*wins.*'` = 0 ]; do
                    i=$((i+1))
                    echo "${lines[$i]}" >> test_$count.txt
                done
                echo "" >> test_$count.txt
            fi
        done
    done
done
This script splits an input file into ~1MB parts without breaking up the data blocks.
The data blocks in the input file look something like this:
#Game No : 8273167998
***** Hand History for Game 8273167998 *****
$100 USD NL Texas Hold'em - Saturday, July 25, 11:34:58 EDT 2009
Table Deep Stack #1459548 (No DP) (Real Money)
Seat 6 is the button
Total number of players : 6
Seat 5: Ducilator ( $128.60 USD )
Seat 4: EvilAdj ( $145.66 USD )
Seat 3: Ice81111 ( $78.60 USD )
Seat 6: RicsterM ( $292.48 USD )
Seat 1: Techno1990 ( $141.06 USD )
Seat 2: pdiloop ( $100 USD )
Techno1990 posts small blind [$0.50 USD].
pdiloop posts big blind [$1 USD].
** Dealing down cards **
Ice81111 folds
EvilAdj folds
Ducilator raises [$4 USD]
RicsterM folds
Techno1990 folds
pdiloop folds
Ducilator does not show cards.
Ducilator wins $5.50 USD
It works so far, but the problem is the speed: for a ~20MB input file it runs for several hours.
From the code it looks like you ARE an experienced scripter.
But did you try the basics?
Run your script with tracing enabled:
ksh -x
The bottleneck could be any of those commands.
See which line takes the most time. Simple.
Also, why do you need the "sed" in the for line?
At first glance, it looks like you are stripping off the directory prefix only to add it back again at the "cat".
Yeah, I realized that, which is why I'm asking for hints on a better solution that runs faster.
Please note that I'm relatively new to shell scripting. In fact, this is my second attempt at a script.
I have a large input text file (approx. 20MB) and want to split it into 1MB output files. The input file consists of data blocks that must not be broken apart during the split. (I posted a short sample of such a data block in my first post.)
Yeah, that's a good idea. I attached my test environment to this post.
It has the following structure.
I reduced the script to the essentials and set the output file size to 100KB to demonstrate the principle and make it easier to understand what I want to do.
In this example I let the script run for ~5 minutes; in that time it processed ~400KB of the 25MB input and produced these 4 files.
You know the movie The Matrix? That's how fast the characters fly across my screen when I use "ksh -x". That really doesn't help me much.
I think the problem is that I run the "expr" command on every single line of the input file.
That means every iteration takes ~0.03s. On an input file with 800,000 lines, the whole process takes ~7 hours.
Each iteration would have to take ~0.0003s to get an acceptable runtime.
Is there a faster command than "expr" that can do the same thing (regexp)?
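For what it's worth, bash's built-in pattern test runs inside the shell, while every "expr" call forks a new external process, which is usually what dominates in loops like this. A minimal comparison sketch (the sample line is taken from the hand history above):

```shell
line='Ducilator wins $5.50 USD'   # sample line from the hand history

# slow path: expr forks an external process for every single test
if [ `expr "$line" : '.*wins.*'` != 0 ]; then
    echo "expr matched"
fi

# fast path: bash's built-in pattern test, no fork at all
if [[ $line == *wins* ]]; then
    echo "builtin matched"
fi
```

With ~800,000 iterations, dropping one fork per line is already a large saving, though a single-pass tool (awk, perl) avoids the shell loop entirely.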
My input file consists of blocks like the following:
#Game No : 8273167998
***** Hand History for Game 8273167998 *****
$100 USD NL Texas Hold'em - Saturday, July 25, 11:34:58 EDT 2009
Table Deep Stack #1459548 (No DP) (Real Money)
Seat 6 is the button
Total number of players : 6
Seat 5: Ducilator ( $128.60 USD )
Seat 4: EvilAdj ( $145.66 USD )
Seat 3: Ice81111 ( $78.60 USD )
Seat 6: RicsterM ( $292.48 USD )
Seat 1: Techno1990 ( $141.06 USD )
Seat 2: pdiloop ( $100 USD )
Techno1990 posts small blind [$0.50 USD].
pdiloop posts big blind [$1 USD].
** Dealing down cards **
Ice81111 folds
EvilAdj folds
Ducilator raises [$4 USD]
RicsterM folds
Techno1990 folds
pdiloop folds
Ducilator does not show cards.
Ducilator wins $5.50 USD
First I search for the start of a block with this expression: '#Game No.*'
if [ `expr "${lines[${i}]}" : '#Game No.*'` != 0 ]; then
Then I output all following lines until a line matches this expression: '.*wins.*'
while [ `expr "${lines[$i]}" : '.*wins.*'` = 0 ]; do
    i=$((i+1))
    echo "${lines[$i]}"
done
The loop around this checks whether the current output file has reached the split size limit:
while [ `ls -al output/output_$count.txt | awk '{print $5}'` -le $splitsize -a $i -le ${#lines[@]} ]; do
    ...
done >> output/output_$count.txt
So the `expr "${lines[${i}]}" : '.*wins.*'` command is executed for every single line of the input file: that's ~800,000 times.
At ~0.03s per iteration, that means ~7 hours for the whole process.
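As a sketch of the single-pass alternative: one awk run can do both the block detection and the size check, starting a new output file only when a "#Game No" line arrives and the current chunk is already full, so blocks are never split. The file names, the tiny sample input, and the 20-byte demo limit are illustrative; for the real data the limit would be 1048576:

```shell
# tiny illustrative input; the real file would be the ~20MB hand history
printf '%s\n' \
    '#Game No : 1' 'some lines' 'A wins $1' \
    '#Game No : 2' 'more lines' 'B wins $2' > hands.txt

# rotate output files only at block starts, so blocks stay intact
awk -v limit=20 '
    /#Game No/ && written >= limit {    # new block and chunk is full:
        close(out); count++; written = 0
    }
    {
        out = "output_" (count + 1) ".txt"
        print > out
        written += length($0) + 1       # +1 for the newline
    }
' hands.txt
```

This reads the file once and forks no processes per line, so it should finish a 20MB input in seconds rather than hours.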
So I wrote a quick one for your case: it splits into blocks on the fly, scanning each line without having to store any data or precompute anything.
I hope this improves performance. If you have some time, please post some stats.
#! /opt/third-party/bin/perl
use strict;
use warnings;

my ($input_file) = @ARGV;
open(my $lfh, '<', $input_file) or die "Unable to open file:$input_file <$!>\n";

my $start = 0;
my $curr_file_number = 1;
my $rfh;
my $data;

while ( $data = <$lfh> ) {
    print $rfh $data if ( $start == 1 );
    if ( $data =~ /#Game No :/ ) {
        my $running_file_name = "tyrant_" . $curr_file_number;
        open($rfh, '>', $running_file_name) or die "Unable to open file : $running_file_name <$!>\n";
        $start = 1;
        print $rfh $data;
        next;
    }
    if ( $data =~ / wins / ) {
        close($rfh) or die "Unable to close file <$!>\n";
        $start = 0;
        $curr_file_number++;
    }
}
close($lfh);
If this works, then fine; otherwise, what Madhan wrote sounds good.
All you have to do is make a small change to his script.
if ( $data =~ / wins / ) {
    close($rfh) or die "Unable to close file <$!>\n";
    $start = 0;
    $curr_file_number++;
}
Like you, I have also been avoiding Perl for a long time.
So check my code and correct it if needed.
if ( $data =~ / wins / ) {
    $group++;
    if ( $group == 100 ) {
        close($rfh) or die "Unable to close file <$!>\n";
        $start = 0;
        $group = 0;
        $curr_file_number++;
    }
}
Make sure $group is defined at the top of the script.
All I am trying to do is merge 100 hand blocks into one output file.
Change the 100 to your requirement.
I wonder which approach will be faster.
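For comparison, the same "N hands per output file" idea can also be sketched in awk, counting the "wins" lines and rotating the output file after every N complete blocks. The sample input, the file names, and per_file=2 are illustrative; in practice per_file would be 100:

```shell
# tiny illustrative input: three complete hand blocks
printf '%s\n' \
    '#Game No : 1' 'A wins $1' \
    '#Game No : 2' 'B wins $2' \
    '#Game No : 3' 'C wins $3' > hands.txt

awk -v per_file=2 '
    { print > ("group_" (file + 1) ".txt") }
    / wins / {
        if (++hands % per_file == 0) {   # every per_file-th hand:
            close("group_" (file + 1) ".txt")
            file++                       # start the next output file
        }
    }
' hands.txt
```

Since the rotation happens only on a "wins" line, each output file always ends on a complete block.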