Hi there, I'm camor and I'm trying to process huge files with bash scripting and awk.
I've got a dataset folder containing 10 files (16 million rows each, about 600 MB per file), and a sorted file with all the keys inside.
For example:
a sample_1 200
a.b sample_2 10
a sample_3 10
a sample_1 10
a sample_3 67
a sample_1 1
a.b sample_2 10
a sample_4 20
I would like to obtain an output like this, with one column per dataset file (0 where the key does not appear in that file) and the row total as the last column:
a sample_1 200 10 1 211
a.b sample_2 10 0 10 20
a sample_3 10 67 0 77
a sample_4 0 0 20 20
I created a bash script like the one below. It produces a sorted file with all of my keys (it's very large, about 62 million records), slices that file into pieces, and passes each piece to my awk script.
#! /bin/bash
clear
FILENAME=<result>
BASEPATH=<base_path>
mkdir -p "$BASEPATH/processed/slice"
# keep only the two key fields of every dataset row
cat "$BASEPATH"/dataset/* | cut -d' ' -f1,2 > "$BASEPATH/processed/aggr"
# unique, sorted key list (about 62 million records)
sort -u -k2 "$BASEPATH/processed/aggr" > "$BASEPATH/processed/sorted"
# cut the key list into 1,000,000-line slices
split -d -l 1000000 "$BASEPATH/processed/sorted" "$BASEPATH/processed/slice/slice-"
echo "$(date "+START PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S")"
# run the awk aggregation once per slice, reading the whole dataset each time
for filename in "$BASEPATH"/processed/slice/*; do
    awk -v filename="$filename" -f algorithm.awk "$BASEPATH"/dataset/* >> "$BASEPATH/processed/$FILENAME"
done
echo "$(date "+END PROCESSING DATE: %d/%m/%y - TIME: %H:%M:%S")"
rm "$BASEPATH/processed/aggr"
rm "$BASEPATH/processed/sorted"
rm -rf "$BASEPATH/processed/slice"
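Just to make it explicit, for the first slice the loop runs a command like this (slice-00 is simply the first name produced by split -d):

awk -v filename="$BASEPATH/processed/slice/slice-00" -f algorithm.awk "$BASEPATH"/dataset/* >> "$BASEPATH/processed/$FILENAME"

so awk re-reads the whole dataset (about 160 million lines) once for each of the ~62 slices.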
Here is my AWK script:
BEGIN{
    # preload every key of the current slice and initialise one "0" slot per dataset file
    while(getline < filename){
        key=$1" "$2;
        sources[key];
        for(i=1;i<11;i++){
            keys[key"-"i] = "0";
        }
    }
    close(filename);
}
{
    # FNR resets to 1 on every new input file, so this keeps a running file index
    if(FNR==1){
        ARGIND++;
    }
    key=$1" "$2;
    # store the value of this key for the current dataset file
    keys[key"-"ARGIND] = $3
}
END{
    # for every key of the slice, print the value found in each file plus their sum
    for (s in sources) {
        sum = 0
        printf "%s", s
        for (j=1;j<11;j++) {
            printf "%s%s", OFS, keys[s"-"j]
            sum += keys[s"-"j]
        }
        print " "sum
    }
}
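If I read my own script correctly, BEGIN creates 10 entries per slice key (1,000,000 keys x 10 = 10 million entries), but the main block then stores a value for every key it meets in the dataset, whether or not that key belongs to the current slice. For the sample rows above that means entries like these (assuming the three sample_1 rows come from dataset files 1, 2 and 3):

sources["a sample_1"]
keys["a sample_1-1"] = "200"
keys["a sample_1-2"] = "10"
keys["a sample_1-3"] = "1"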
I've figured out that my bottleneck comes from iterating over the dataset folder as awk input (10 files with 16,000,000 lines each). Everything works on a small set of data, but with the real data the RAM (30 GB) gets saturated. I think the problem is "dataset/*" as the awk input.
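One change I've considered (just a sketch, I haven't tried it on the real data) is to only store values for keys that actually belong to the current slice, by guarding the assignment in the main block:

{
    if(FNR==1){
        ARGIND++;
    }
    key=$1" "$2;
    # hypothetical guard: ignore keys that are not part of this slice
    if (key in sources) {
        keys[key"-"ARGIND] = $3
    }
}

That should keep the keys array bounded by the slice size, but awk would still read all ~160 million dataset lines once per slice, so I'm not sure it's the right approach.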
Does anyone have any suggestions or advice? Thank you.