Pass an array to awk to sequentially look for a list of items in a file

Hello,

I need to collect some statistical results from a series of files that are being generated by other software. The files are tab delimited. There are 4 different sets of statistics in each file where there is a line indicating what the statistic set is, followed by 5 lines of values. It looks like this,

train statistics
r2	0.7834
MeAE	0.36
MdAE	0.33
SE	0.34
n	400
...
...
...
test statistics
r2	0.7042
MeAE	0.39
MdAE	0.32
SE	0.41
n	400

There is more data on each line, but that is not an issue. There can also be up to 4 sets that need to be retrieved.

What I would generally do here is something like,

#!/bin/sh

# stat file being processed
input_file='inputfile.txt'
# where to write the output
logfile='logfile.txt'
# statistic set we are looking for
current_stat='train statistics'

cat $input_file | \
awk -v st_label="$current_stat" '          F == 1 { line_array[++a_count] = $0; line_count++ }
                                  line_count == 5 { for(i=1; i<=a_count; i++) print line_array[i];
                                                    delete line_array;
                                                    a_count = 0;
                                                    F = 0;
                                                    line_count = 0 }
                                    $0 ~ st_label { F = 1; line_count = 0 }
                                 ' > $logfile

This would find the line containing whatever was passed in as $current_stat and start saving lines at the next line. After the 5th line has been saved, the saved array is printed and the array, save flag, and counters are reset. Of course, if we are only looking for one set of data to print, the reinitialization is not necessary and we could exit there instead.
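For the single-set case, a minimal sketch of the exit-instead-of-reset variant (the sample lines below are made up to match the format shown above; no array is needed since the lines can be printed as they arrive):

```shell
# Sketch: print the 5 lines after the label directly, then exit.
# The printf input is a made-up stand-in for the real stats file.
printf 'train statistics\nr2\t0.7834\nMeAE\t0.36\nMdAE\t0.33\nSE\t0.34\nn\t400\n' | \
awk -v st_label='train statistics' '
    F             { print; if (++c == 5) exit }   # print saved lines, quit after 5
    $0 ~ st_label { F = 1 }                       # start saving at the next line
'
```

Because the `F` rule appears before the label rule, the label line itself is not printed; only the five lines that follow it are.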

My question is about the best way to capture several sets in one pass through the file. My thought was to put the labels for what I wanted to find in an array in bash and then call awk with the array instead of a single variable. I would then look for each array element in succession until all had been found. I thought that would look like,

#!/bin/sh

# stat file being processed
input_file='inputfile.txt'
# where to write the output
logfile='logfile.txt'
# labels for the 4 sets we are looking for
LABELS=("train statistics" "test statistics" "validate statistics" "ival statistics")

cat $input_file | \
awk -v st_arr="${LABELS[*]}" '              BEGIN { a_pos = 0 }
                                           F == 1 { line_array[++a_count] = $0; line_count++ }
                                  line_count == 5 { for(i=1; i<=a_count; i++) print line_array[i];
                                                    delete line_array;
                                                    a_count = 0;
                                                    F = 0;
                                                    line_count = 0;
                                                    a_pos = 0 }
                               $0 ~ st_arr[a_pos] { F = 1; line_count = 0 }
                             ' > $logfile

This was intended to start st_arr at 0 and look for whatever value was there. This doesn't work and gives an error: attempt to use scalar `st_arr' as an array. I think I have the syntax correct for passing an array to awk but it doesn't seem to have worked. Do I need to translate the bash array into an awk array on the BEGIN line? Is this just not the right way to do this?

I would probably just save everything I captured in a single array and print it at the end instead of printing after each set is recovered. Even if the above works, I'm not sure how to avoid an array boundary error with st_arr[] since I think that the above would increment it past its size.

Thanks,

LMHmedchem

What's the end objective?
For each label/stat combo, print out the sum (min/max/avg/???) of each stat?
What're you trying to do?

Dozens of these files are generated and I need to pull out some of the statistics and put them into a form where I can look at all of the results in one file.

There will probably be a header row that will be entered when the logfile is created, and then a new row will be appended for each file processed.

filename    train_r2  train_MeAE  train_MdAE  train_SE  train_n  test_r2  test_MeAE  test_MdAE  test_SE  test_n
file_1.txt  0.7834    0.36        0.33        0.34      400      0.7042   0.39       0.32       0.41     400
file_2.txt  etc

I could do this with a separate call to awk for each set I need (train, test, etc), but that doesn't make any sense. I could pass in a separate label variable for each set I need, but that starts to look very messy after a while. It seems like passing in an array makes the most sense. It looks like awk is treating the passed-in array as a single string, so I was wondering if the problem is that I need to parse what awk thinks is a string into an array, or if I am incorrect in the way I am passing in the array.
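A minimal sketch of the string-to-array conversion (awk's -v option always passes a scalar string, so split() in a BEGIN block is the usual way to turn a delimited string into a true awk array; variable names here are illustrative):

```shell
# Sketch: split() breaks the -v string on commas into label_array,
# and returns the number of elements created.
LABELS='train statistics,test statistics,validate statistics,ival statistics'

awk -v var="$LABELS" 'BEGIN {
    n = split(var, label_array, ",")        # n is the element count
    for (i = 1; i <= n; i++) print i ": " label_array[i]
}'
```

Since the program is only a BEGIN block, awk exits without reading any input; this prints the four labels numbered 1 through 4, which also gives a natural upper bound to compare an index against.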

If this is just a dumb idea for a solution, I would like to know about that as well.

LMHmedchem

something to start with and improve upon...
assuming all the statistics are displayed in the same order in a file AND across the files:
awk -f lmh.awk myFiles where lmh.awk is:

FNR==1 {if (stat) print file OFS stat; file=FILENAME;stat=""; next}
!/statistics/{
   stat=(!stat)? $2:stat OFS $2
}
END {
   print file OFS stat
}

Thank you for the reply but I don't see in the above how my stats will be found in the huge stats output file when there is no notation of how to find what I am looking for. The stats I need are the second field of the first 5 lines following "train statistics", etc. I don't see how you can find what I am looking for without "train statistics" being in there somewhere. I could post an actual file if that would help.

This version works more or less. It prints the stats I need to the logfile.

#!/bin/sh

# name of file being processed
STATS_FILE=$1
# file we are writing to
LOGFILE=$1

# 4 sets of labels we are looking for
LABELS='train_statistics,test_statistics,validate_statistics,ival_statistics'

cat $STATS_FILE | \
awk -v var="$LABELS" '                 BEGIN { split(var,label_array,","); pos = 1 }
                                      F == 1 { line_array[++a_count] = $2; line_count++ }
                             line_count == 5 { for(i=1; i<=a_count; i++) print line_array[i];
                                               delete line_array;
                                               a_count = 0;  
                                               F = 0;
                                               line_count = 0;
                                               pos++ }
                       $0 ~ label_array[pos] { F = 1; line_count = 0; }
                     ' > $LOGFILE

I need to format the output a bit better and trap for if pos is larger than the size of label_array . I suspect that I also don't need both a_count and line_count .
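A sketch of that bounds trap: split() returns the number of elements, and guarding the match with pos <= n matters because label_array[pos] past the end is the empty string, and `$0 ~ ""` matches every line. The labels and two-line sets below are made up for brevity:

```shell
# Sketch: the pos <= n guard stops matching once all labels are consumed;
# the stray trailing line is ignored instead of matching the empty pattern.
labels='alpha,beta'
printf 'alpha\na\t1\nb\t2\nbeta\nc\t3\nd\t4\nstray\tX\n' | \
awk -v var="$labels" '
    BEGIN { n = split(var, label_array, ","); pos = 1 }
    F     { print $2; if (++line_count == 2) { F = 0; line_count = 0; pos++ } }
    pos <= n && $0 ~ label_array[pos] { F = 1 }
'
```

This prints the second field of the two lines after each label; without the guard, the stray line after the last set would start a bogus capture.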

Is there anything badly wrong with this approach?

LMHmedchem

In post #1, the labels in your sample input files were "train statistics" and "test statistics". In your latest code you have labels with underscores instead of spaces. You'll have to be sure your stats files and your labels match.

In post #3, you included a header line in your output; I don't see anything in your latest code that prints that header. And, unless each of your stats files contains all of the sets of statistics and includes them in the same order, what you have shown us will end up printing data for different sets of statistics under each other in the output with no indication of which set they came from. Do each of your stats files contain statistics from all of the possible sets of statistics and are each of those sets of statistics present in the same order in each stats file?

In your latest code you have:

# name of file being processed
STATS_FILE=$1
# file we are writing to
LOGFILE=$1

which sets both your input file and your output file to be the same input parameter. I'm about 99% sure that isn't what you want.

The cat in your code isn't helping and seems to be working against you. I think you're going to want to end up with something more like:

... ... ...

# file we are writing to
LOGFILE=$1
# The rest of the operands are stats files we need to read.
shift

awk -v labels="$LABELS" '
    ... ... ...
' "$@" > "$LOGFILE"

Do you want your output fields to be tab separated, or do you want your output fields to be in aligned columns? Note that since your field headers vary in width from 6 characters (e.g., test_n ) to more than 9 characters (e.g., train_MdAE ), and your statistics data all fit in fewer than 8 characters, the two choices are mutually exclusive. (I.e., you can't have both.)


Making some assumptions on your data structure (until Don Cragun's questions have been fully and finally answered), and making up my own sample data files, I have come up with

awk '
BEGIN           {LAB="train statistics|test statistics|validate statistics|ival statistics"
                }

FNR == 1        {if (NR != 1)   {printf "filename"
                                 for (i=1; i<=CNT; i++) printf OFS"%s", HD[i]
                                 printf ORS
                                 HDDONE = 1
                                }
                 printf "%s", FN
                 for (i=1; i<=CNT; i++) printf OFS"%s", VAL[HD[i]]
                 printf ORS
                 split ("", VAL)
                }

$0 ~ LAB        {PH = $1
                 FN = FILENAME
                 for (i=1; i<=5; i++)   {getline
                                         IX = PH "_" $1
                                         VAL[IX] = $2
                                         if (! HDDONE) HD[++CNT] = IX 
                                        }
                } 

END             {printf "%s", FN
                 for (i=1; i<=CNT; i++) printf OFS"%s", VAL[HD[i]]
                 printf ORS
                }

' OFS="\t" file[34] | column -t 

filename  train_r2  train_MeAE  train_MdAE  train_SE  train_n  test_r2  test_MeAE  test_MdAE  test_SE  test_n  ival_r2  ival_MeAE  ival_MdAE  ival_SE  ival_n  validate_r2  validate_MeAE  validate_MdAE  validate_SE  validate_n
file3     0.7834    0.36        0.33        0.34      400      0.7042   0.39       0.32       0.41     400     0.7834   0.36       0.33       0.34     400     0.7042       0.39           0.32           0.41         400
file4     0.7834    0.36        0.33        0.34      400      0.7042   0.39       0.32       0.41     400     0.7834   0.36       0.33       0.34     400     0.7042       0.39            0.32           0.41         400

Give it a try and report back.


I temporarily removed the spaces from both the test input files and the shell code to eliminate the space as a possible source of problems. I try to avoid having spaces within a single field of a delimited text file but it cannot always be avoided. I have it working now with the space.

The header is created when the logfile is created. I am using this code as a function that gets called when each output file has been created. The code appends the statistics from the processed file to the existing logfile. Entries for all of the statistics exist in every output file in the same order. The values may all be 0.0 if a given set of statistics were not calculated, but the entry will be there.

Yes, I have fixed that.

I have typically used cat to pipe input to awk when reading input from a file. I have used something like what you suggest, but I didn't remember to this time. You are suggesting basically,

awk 'awk stuff' $INPUT_FILE > $OUTPUT_FILE

which is how I would use awk on the command line but for some reason don't in scripts.

The output will be tab delimited. I did aligned columns in my sample post because I think that tab can be hard to read in plain text.

This is what I have at the moment. This has been revised to reflect your suggestion.

#!/bin/sh

# retrieve statistics from a final stats output file and write to log
function extract_and_log_stats () {

   # path to stats file being processed
   STATS_FILE_TO_READ_Fc=$1
   # path to logfile where entry will be written
   OUTPUT_PATH_Fc=$2
   # name of stats file being processed (minus the path) to be entered in logfile
   STATS_FILE_Fc=$3

   # 4 sets of labels we are looking for
   LABELS_Fc='train statistics,test statistics,validate statistics,ival statistics'

   # process the stats file looking in turn for each value in LABELS
   awk -v var="$LABELS_Fc" \
       -v filename="$STATS_FILE_Fc" \
       -v OFS='\t' '                BEGIN { split(var,label_array,","); pos = 1 }
                                   F == 1 { line_array[++a_count] = $2; line_count++ }
                          line_count == 5 { F = 0; line_count = 0; pos++ }
                                 pos == 5 { printf "%s\t",filename;
                                             for(i=1; i<a_count; i++) printf "%s\t",line_array[i];
                                            printf "%s\n",line_array[a_count];
                                            exit }
                    $0 ~ label_array[pos] { F = 1; line_count = 0 }
                   ' $STATS_FILE_TO_READ_Fc >> $OUTPUT_PATH_Fc

}

This function is called with the path to the stats file being processed, the path to the logfile where the stats are written, and the name of the stats file being processed (to be entered in the log as well). It is called once for each stats file produced and appends an entry to an existing logfile.

This seems to work fine. As you pointed out, this expects the stats in LABELS to exist and be found in the same order. I have attached a sample of the stats files I am processing in case that is useful.

LMHmedchem

OK then. It looks like RudiC guessed correctly on everything you were trying to do and on your stat file format. And, as long as you give it at least two stat files to process, it looks to me like his code produces the output you want. If you just give it one input file, however, it won't print any headers.

By moving some of this code into a function, as shown below, and processing command line arguments as I suggested before, you seem to get what you want:

#!/bin/bash

if [ $# -lt 2 ]
then	printf 'Usage: %s output_file stat_file...\n' "${0##*/}" >&2
	exit 1
fi

# where to write the output
logfile=$1
# remaining operands are stat files to be read directly by awk...
shift

awk '
BEGIN           {LAB="train statistics|test statistics|validate statistics|ival statistics"
                }

function prec()	{if(! HDDONE)	{printf "filename"
                                 for (i=1; i<=CNT; i++) printf OFS"%s", HD[i]
                                 printf ORS
                                 HDDONE = 1
                                }
                 printf "%s", FN
                 for (i=1; i<=CNT; i++) printf OFS"%s", VAL[HD[i]]
                 printf ORS
                 split ("", VAL)
                }

FNR == 1        {if (NR != 1) prec()
		 FN = FILENAME
		}

$0 ~ LAB        {PH = $1
                 for (i=1; i<=5; i++)   {getline
                                         IX = PH "_" $1
                                         VAL[IX] = $2
                                         if (! HDDONE) HD[++CNT] = IX 
                                        }
                } 

END             {prec()
                }

' OFS="\t" "$@" > "$logfile"

And, if your system has the column utility, you can use it to print the log file this creates (which contains <tab> separated fields) into aligned text using the command:

column -t logfile.txt

where logfile.txt is the name of the output you supplied to the above script as its first operand. And if we do that with the sample file you uploaded with post #8 after invoking the script above with:

script_name logfile.txt S-mae_0.3810_V-mae_0.4956_all_B30_E800_EC503_S1v1_30.15.1.txt

and then run:

column -t logfile.txt

the output we get is:

filename                                                       train_r2  train_MeAE  train_MdAE  train_SE  train_n  test_r2  test_MeAE  test_MdAE  test_SE  test_n  validate_r2  validate_MeAE  validate_MdAE  validate_SE  validate_n  ival_r2  ival_MeAE  ival_MdAE  ival_SE  ival_n
S-mae_0.3810_V-mae_0.4956_all_B30_E800_EC503_S1v1_30.15.1.txt  0.8320    0.3215      0.2784      0.3068    400      0.7183   0.3810     0.2922     0.4129   400     0.5309       0.4956         0.4186         0.5013       400         0.0000   0.0000     0.0000     0.0000   0

Note that if you call this script with more than one stat file, it still only invokes awk once but will process all of the stat files you feed it.
