Hello,
I need to collect some statistical results from a series of files that are being generated by other software. The files are tab delimited. Each file contains up to 4 sets of statistics. Each set starts with a line naming the statistic set, followed by 5 lines of values. It looks like this:
train statistics
r2 0.7834
MeAE 0.36
MdAE 0.33
SE 0.34
n 400
...
...
...
test statistics
r2 0.7042
MeAE 0.39
MdAE 0.32
SE 0.41
n 400
There is more data on each line, but that is not an issue. There can also be up to 4 sets that need to be retrieved.
What I would generally do here is something like,
#!/bin/sh
# stat file being processed
input_file='inputfile.txt'
# where to write the output
logfile='logfile.txt'
# statistic set we are looking for
current_stat='train statistics'
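awk -v st_label="$current_stat" '
# while the save flag is set, store the current line
F == 1 { line_array[++a_count] = $0; line_count++ }
# after the 5th saved line, print the set and reset everything
line_count == 5 { for(i=1; i<=a_count; i++) print line_array[i];
delete line_array;
a_count = 0;
F = 0;
line_count = 0 }
# label line found, start saving at the next line
$0 ~ st_label { F = 1; line_count = 0 }
' "$input_file" > "$logfile"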
This finds the line containing whatever was passed in as $current_stat and starts saving lines at the next line. After the 5th line has been saved, the saved array is printed and the array, save flag, and counters are reset. Of course, if we are only looking for one set of data to print, the reinitialization is not necessary and we could exit there instead.
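For the single-set case, the early-exit variant mentioned above could be sketched roughly as follows. The file names sample_stats.txt and sample_log.txt and the sample contents are made up here so that the example runs on its own. It prints each line as it is read instead of saving it in an array, which is a simplification and not what the script above does.

```shell
#!/bin/sh
# Hypothetical sample files, created only so this sketch is self-contained
input_file='sample_stats.txt'
logfile='sample_log.txt'
printf 'train statistics\nr2\t0.7834\nMeAE\t0.36\nMdAE\t0.33\nSE\t0.34\nn\t400\nother line\ntest statistics\nr2\t0.7042\nMeAE\t0.39\nMdAE\t0.32\nSE\t0.41\nn\t400\n' > "$input_file"

# statistic set we are looking for
current_stat='train statistics'

awk -v st_label="$current_stat" '
    # while the flag is set, print the line and quit after the 5th one
    F == 1 { print; if (++line_count == 5) exit }
    # label line found, start printing at the next line
    $0 ~ st_label { F = 1 }
' "$input_file" > "$logfile"

cat "$logfile"
```

With the sample input this should leave only the 5 value lines of the train set in the log file, and awk stops reading as soon as the 5th line has been printed.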
My question is about the best way to capture several sets in one pass through the file. My thought was to put the labels for what I wanted to find in an array in bash and then call awk with the array instead of a single variable. I would then look for each array element in succession until all had been found. I thought that would look like,
#!/bin/bash
# stat file being processed
input_file='inputfile.txt'
# where to write the output
logfile='logfile.txt'
# labels for the 4 sets we are looking for
LABELS=("train statistics" "test statistics" "validate statistics" "ival statistics")
awk -v st_arr="${LABELS[*]}" ' BEGIN { a_pos = 0 }
F == 1 { line_array[++a_count] = $0; line_count++ }
line_count == 5 { for(i=1; i<=a_count; i++) print line_array[i];
delete line_array;
a_count = 0;
F = 0;
line_count = 0;
a_pos = 0 }
$0 ~ st_arr[a_pos] { F = 1; line_count = 0 }
' "$input_file" > "$logfile"
This was intended to start st_arr at 0 and look for whatever value was there. This doesn't work and gives an error, attempt to use scalar `st_arr' as an array. I think I have the syntax correct for passing an array to awk but it doesn't seem to have worked. Do I need to translate the bash array into an awk array on the BEGIN line? Is this just not the right way to do this?
I would probably just save everything I captured in a single array and print it at the end instead of printing after each set is recovered. Even if the above works, I'm not sure how to avoid an array boundary error with st_arr[] since I think that the above would increment it past its size.
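To make the BEGIN idea concrete, here is a rough sketch of one way it might go, not necessarily the best one. It assumes the labels are handed to awk as a single '|'-delimited string (awk -v only assigns a scalar string), that '|' never appears inside a label, and that the sets occur in the file in the same order as the labels. The names st_list, n_labels, and saved, the sample file names, and the sample contents are all made up for illustration. The label is matched with index() as a plain substring rather than as a regular expression.

```shell
#!/bin/sh
# Hypothetical sample files, created only so this sketch is self-contained
input_file='sample_stats_multi.txt'
logfile='sample_log_multi.txt'
printf 'train statistics\nr2\t0.7834\nMeAE\t0.36\nMdAE\t0.33\nSE\t0.34\nn\t400\nother line\ntest statistics\nr2\t0.7042\nMeAE\t0.39\nMdAE\t0.32\nSE\t0.41\nn\t400\n' > "$input_file"

# Labels joined into one string; from a bash array this could be built with
#   labels=$(IFS='|'; echo "${LABELS[*]}")
labels='train statistics|test statistics|validate statistics|ival statistics'

awk -v st_list="$labels" '
    # turn the delimited string into an awk array (split() indexes from 1)
    BEGIN { n_labels = split(st_list, st_arr, "|"); a_pos = 1 }
    # while the flag is set, save the line; after 5 lines move to the next label
    F == 1 { saved[++s_count] = $0
             if (++line_count == 5) { F = 0; line_count = 0; a_pos++ }
             next }
    # only test for a label while there are labels left to find
    a_pos <= n_labels && index($0, st_arr[a_pos]) > 0 { F = 1 }
    # print everything that was captured, in file order
    END { for (i = 1; i <= s_count; i++) print saved[i] }
' "$input_file" > "$logfile"

cat "$logfile"
```

The a_pos <= n_labels guard is what keeps the lookup inside the array. awk does not raise a boundary error for a missing element; it returns an empty string, so without the guard an unguarded regex match against that empty string would match every line. Labels that never appear in the file (validate and ival in the sample) are simply never matched.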
Thanks,
LMHmedchem