Merge rows based on replicate ID

giuliangiuseppe · May 20, 2016, 6:51am

Dear All,
I was wondering if you could help me with an issue.
I would like to merge rows based on column 1.

input file:

b1 ggg b2 fff NA NA hhh NA NA NA NA NA
a1 xxx a2 yyy NA NA zzz NA NA NA NA NA
a1 xxx NA NA a3 ttt NA ggg NA NA NA NA

output file:

b1 ggg b2 fff NA NA hhh NA NA NA NA NA
a1 xxx a2 yyy a3 ttt zzz ggg NA NA NA NA

Basically, if two rows have the same ID in column 1 (and there aren't more than two equal replicate IDs), I would like to replace each NA value in the first replicate row with the value in the same column of the second replicate.
If both values are NA, or both are the same value, leave them as they are. Just replace the NA value with the non-NA value in the same column within replicates (same column 1).

I think the explanation is poor, but the example should be clear.

Please let me know if you need further details.

Thank you as always for your help.

Giuliano

rdrtx1 · May 20, 2016, 7:37am

awk '
{a[$1]=$1; f[$1]=NF; for (i=2; i<=NF; i++) if ($i != "NA") b[$1 FS i]=$i} # load id, field-count, and id columns arrays
END {
   for (i in a) {                                                         # loop thru id array (order is unspecified)
      l=i;                                                                # initialize output string
      for (j=2; j<=f[i]; j++) {                                           # loop thru id columns array
         l=l FS (((i FS j) in b) ? b[i FS j] : "NA");                     # fill output string with output values from id columns array
      }
      print l;                                                            # print output string
   }
}' infile
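
Note that "for (i in a)" visits the IDs in an unspecified order, so the merged rows may not come out in input order. Below is a minimal sketch of an order-preserving variant. It assumes that the first non-NA value seen for an ID and column should win, which is how the original request reads. The array names order, nf and val are illustrative only.

```shell
# Sample input from the first post, written to the file name used in the thread.
printf '%s\n' \
  'b1 ggg b2 fff NA NA hhh NA NA NA NA NA' \
  'a1 xxx a2 yyy NA NA zzz NA NA NA NA NA' \
  'a1 xxx NA NA a3 ttt NA ggg NA NA NA NA' > infile

merged=$(awk '
!($1 in nf) { order[++n] = $1 }              # remember IDs in first-seen order
{
   if (NF > nf[$1]) nf[$1] = NF               # widest row seen for this ID
   for (i = 2; i <= NF; i++)
      if ($i != "NA" && !(($1, i) in val))    # assumption: first non-NA value wins
         val[$1, i] = $i
}
END {
   for (k = 1; k <= n; k++) {                 # walk IDs in input order
      id = order[k]
      line = id
      for (j = 2; j <= nf[id]; j++)
         line = line FS (((id, j) in val) ? val[id, j] : "NA")
      print line
   }
}' infile)

printf '%s\n' "$merged"
```

On the sample input this prints the b1 row first and the merged a1 row second, matching the requested output file.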

giuliangiuseppe · May 20, 2016, 7:54am

That works perfectly!
thanks!

G

Don_Cragun · May 20, 2016, 3:47pm

Hi Giuliano,
The code rdrtx1 suggested looks like it will work fine for input files similar to the sample you provided, but it doesn't take into account the part of your description that said "and there aren't more than two equal replicate IDs".

What should happen if there are three or more lines in your input file that have the same string in column 1?

MadeInGermany · May 21, 2016, 2:15pm

Don, Giuliano means there are never more than two lines with the same ID in column 1.
And even if there were, rdrtx1's solution would handle it well.
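
A small sketch of the three-replicate case, using the same array-per-ID idea on a made-up three-line input (the file name three_rows and the values are hypothetical):

```shell
# Hypothetical input: three rows share the ID a1, each with one non-NA column.
printf '%s\n' \
  'a1 xxx NA NA' \
  'a1 NA yyy NA' \
  'a1 NA NA zzz' > three_rows

three=$(awk '
{ nf[$1] = NF; for (i = 2; i <= NF; i++) if ($i != "NA") b[$1 FS i] = $i }
END {
   for (id in nf) {                           # one output row per distinct ID
      l = id
      for (j = 2; j <= nf[id]; j++)
         l = l FS (((id FS j) in b) ? b[id FS j] : "NA")
      print l
   }
}' three_rows)

printf '%s\n' "$three"    # prints: a1 xxx yyy zzz
```

All three rows collapse into one. Where two replicates carry different non-NA values in the same column, this approach keeps the value from the later line.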
--
If the duplicate IDs are always on adjacent lines, the following saves some memory, because it buffers only one line at a time:

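awk '
# P holds the fields of the buffered (previous) line
function printP(){ o=P[1]; for (i=2; i<=NF; i++) o=(o FS P[i]); print o }
function updateP(){ for (i=2; i<=NF; i++) if (P[i]=="NA") P[i]=$i }
NR>1 {
  if ($1==P[1]) {   # same ID as the buffered line: fill its NA fields
    updateP()
    next
  }
  printP()          # new ID: print the buffered line
}
{ split ($0,P) }    # buffer the current line
END { printP() }
' infile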
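
The adjacency requirement matters. Here is a sketch, assuming a hypothetical file named scattered in which the two a1 lines are separated by the b1 line. The shell function merge_adjacent is an illustrative wrapper around the same one-line-buffer idea, not part of the script above. Sorting on column 1 first brings the replicates together, at the cost of changing the row order.

```shell
# Hypothetical input: the two a1 rows are NOT adjacent.
printf '%s\n' \
  'a1 xxx a2 yyy NA NA zzz NA NA NA NA NA' \
  'b1 ggg b2 fff NA NA hhh NA NA NA NA NA' \
  'a1 xxx NA NA a3 ttt NA ggg NA NA NA NA' > scattered

merge_adjacent() {
   awk '
   function printP(   i, o) { o = P[1]; for (i = 2; i <= n; i++) o = o FS P[i]; print o }
   NR > 1 && $1 == P[1] {                     # same ID as the buffered row
      for (i = 2; i <= NF; i++) if (P[i] == "NA") P[i] = $i
      next
   }
   NR > 1 { printP() }                        # new ID: flush the buffered row
   { n = split($0, P) }                       # buffer the current row
   END { if (NR) printP() }
   ' "$@"
}

unsorted=$(merge_adjacent scattered)                     # nothing merges: still 3 rows
sorted=$(LC_ALL=C sort -k1,1 scattered | merge_adjacent) # replicates adjacent: 2 rows

printf '%s\n' "$unsorted"
printf '%s\n' "$sorted"
```

Without the sort, all three rows are printed unchanged. With it, the a1 rows are merged and printed before the b1 row.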

giuliangiuseppe · May 21, 2016, 3:51pm

Yes, my specific case never has more than 2 replicates.
Thank you for your help!

Best