Hello everybody:
I want rearrange about 5 million rows (with 300 columns) into groups.
Data looks like the following: where there were various experiments (column 2) conducted at different locations (column headers in top row column 4 onwards) in different years (column 1) using instruments (column 3). The numbers in the matrix ( row 2 onwards, column 4 onwards) indicate how many instances of experiments were successful.
What I want is to rearrange the rows
Input
345 A346 347 348 349 350 351 352
2014 Exp1 IBM 24 45 22
2014 Exp2 LEN 23 32 34
2014 Exp3 LEN 2 34 34
2014 Exp4 IBM 34 44 43
2014 Exp5 IBM 2 45 51 45
2014 Exp6 IBM 34 23 54
2014 Exp7 IBM 23 23 24
2014 Exp8 IBM 34 45 56
2014 Exp9 LEN 24 45 45
2014 Exp10 LEN 43 45 32
2015 Exp11 IBM 34 55 33 34
2015 Exp12 IBM 1 33 4 5
2015 Exp13 IBM 43 55 34 43
2015 Exp14 IBM 45 32 43 4
2015 Exp15 IBM 23 4 5
2015 Exp16 IBM 32 34 43
2015 Exp17 IBM 32 34 46
2015 Exp18 LEN 32 54 67
2015 Exp19 SCL 56 6 4 45 56
2015 Exp20 LEN 67 56 76
2015 Exp21 LEN 45 56 65
2015 Exp22 SCL 45 55 54
2015 Exp23 SCL 4 55 45
What I would like to have are the rows rearranged into groups such that
1) Within the same year
2) Using the same instrument
create groups such that,
each group has at least 3 locations in common, each of which has at least 20 successful experiments.
Requested Output
345 A346 347 348 349 350 351 352
1 2014 Exp1 IBM 24 45 22
1 2014 Exp4 IBM 34 44 43
1 2014 Exp7 IBM 23 23 24
2 2014 Exp2 LEN 23 32 34
2 2014 Exp9 LEN 24 45 45
2 2014 Exp10 LEN 43 45 32
3 2014 Exp5 IBM 2 45 51 45
3 2014 Exp6 IBM 34 23 54
3 2014 Exp8 IBM 34 45 56
4 2015 Exp11 IBM 34 55 33 34
4 2015 Exp13 IBM 43 55 34 43
4 2015 Exp14 IBM 45 32 43 4
5 2015 Exp16 IBM 32 34 43
5 2015 Exp17 IBM 32 34 46
6 2015 Exp18 LEN 32 54 67
6 2015 Exp20 LEN 67 56 76
6 2015 Exp21 LEN 45 56 65
7 2015 Exp19 SCL 56 6 4 45 56
7 2015 Exp22 SCL 45 55 54
2014 Exp3 LEN 2 34 34
2015 Exp12 IBM 1 33 4 5
2015 Exp15 IBM 23 4 5
2015 Exp23 SCL 4 55 45
The group number assignment is in the first column, it doesn't matter to me what number is assigned to which group as long as the members are assigned to the correct group. Also if experiments cant be assigned to groups, they should be left without a group number. Assignment should be with the group with maximum common locations, if equal then it can be assigned to multiple groups.
Please assist, I am learning some very basic awk, this seems beyond me..
I tried this but I dont think this will get anywhere near to what I require
awk ' NR>1{ for (i=4;i<=NF;i++) if ($i!="") arr1[$1,$2,$3]=$i ; next }
$1,$2,$3 in arr1 {
for (j=1;j<length(arr1);j++))
{if (arr1[j] > 20)
group++;
END {
for (j in n) {
print group, arr1[j]
}
}' input input