I need to dedup a file (let's call it - Test_File.dat) by the first field and then concatenate one value within it....I guess you could say I need to normalize the file.
I need it to be:
aaa|bbb|ccc|123;456;789|ddd
abc|def|ghi|123;456;789|jkm
I've been looking into commands like "awk" and "sort", but I'm above my pay grade.
Do you need to preserve sort order, or could you start by sorting the file? That would make it a little bit less challenging, and reduce memory needs. The other way is also doable, but not quite as elegant.
sort file |
awk 'BEGIN { FS = OFS = "|" }
{ key=$1 OFS $2 OFS $3 OFS $5;
if (key == prev) { four=four (four ? ";" : "") $4 }
else { if (four) print prev, four; four = $4 }
prev = key }
END { if (four) print prev, four }'
I cheated and rearranged the field order for simplicity. Maybe you can figure out how to move the fourth field back to the fourth position.
Tested on Ubuntu mawk; if you have another awk, minor changes may be required.
x is an associative array with keys based on your logical primary key $1, $2 and $3 and values/elements based on the value of the fourth filed.
The if expression ? then : else statement is composing your elements in the desired format (if array[key] exists -> then append $4 to its value).
y[$1FS$2FS$3] = $5
Another associative array with the same keys and elements based on the last field.
END { for (k in x)
print k, x[k], y[k] }
After reading the entire input for every key in the x array,
print the desired values: the x key, the x value and the y value.