Join and merge multiple files with duplicate key and fill void columns

yjacknewton · November 5, 2019, 3:23am

Hi guys,

I have many files that I want to merge:

file1.csv:

1|abc
1|def
2|ghi
2|jkl
3|mno
3|pqr

file2.csv:

1|123|jojo
1|NULL|bibi
3|789|zaza
3|012|NULL

file3.csv:

2|1a2b|1994|US
3|3c4e|1995|MG
4|5f6g|1996|FR

My desired output is like this, conserve the duplicate key and fill it with the data of anothers files, and the void columns is filling by NULL:

1|abc|123|jojo|NULL|NULL|NULL
1|def|123|jojo|NULL|NULL|NULL
1|abc|NULL|bibi|NULL|NULL|NULL
1|def|NULL|bibi|NULL|NULL|NULL
2|ghi|NULL|NULL|1a2b|1994|US
2|jkl|NULL|NULL|1a2b|1994|US
3|mno|789|zaza|3c4e|1995|MG
3|pqr|789|zaza|3c4e|1995|MG
3|mno|012|NULL|3c4e|1995|MG
3|pqr|012|NULL|3c4e|1995|MG
4|NULL|NULL|NULL|5f6g|1996|FR

How to do that with awk, join or shell scripts?

RudiC · November 5, 2019, 3:34am

Any attempts / ideas / thoughts from your side?

The fourth line of your desired output doesn't seem to be consistent.

yjacknewton · November 5, 2019, 3:42am

Yes, I forgot to write "NULL"; from this column,
I have already seen many post in forum but nothing match with my desire

yjacknewton · November 5, 2019, 6:44am

The only algo to preserve duplicate data that I found is this:
awk 'BEGIN{FS=OFS="|"}NR==FNR{a[$1]=$0;next;}{print $0,a[$1]}' test* ;

Does someone have some idea?
But result is not appropiate...

1|123|jojo|1|def
1|NULL|bibi|1|def
3|789|zaza|3|pqr
3|012|NULL|3|pqr
2|1a2b|1994|US|2|jkl
3|3c4e|1995|MG|3|pqr
4|5f6g|1996|FR|

Scrutinizer · November 5, 2019, 1:12pm

Hi, try:

join -t \| -a 1 -a 2 -e 'NULL' -o 0 1.2 1.3 1.4 2.2 2.3 2.4 \
<(join -t \| -a 1 -a 2 -e 'NULL' -o 0 1.2 2.2 2.3 file1 file2) file3

yjacknewton · November 6, 2019, 1:33am

I tried this code and here is the result:

1|abc|123|jojo|NULL|NULL|NULL
1|abc|NULL|bibi|NULL|NULL|NULL
1|def|123|jojo|NULL|NULL|NULL
1|def|NULL|bibi|NULL|NULL|NULL
2|ghi|NULL|NULL|1a2b|1994|US
2|jkl|NULL|NULL|1a2b|1994|US
2|sds|NULL|NULL|1a2b|1994|US
3|mno|789|zaza|3c4e|1995|MG
3|mno|012|NULL|3c4e|1995|MG
3|pqr|789|zaza|3c4e|1995|MG
3|pqr|012|NULL|3c4e|1995|MG
4|NULL|NULL|NULL|5f6g|1996|FR

Can someone give a solution by using awk?