join on a file with multiple lines, fields

crimper · December 14, 2008, 12:51am

I've looked at the join command which is able to perform what I need on two rows with a common field, however if I have more than two rows I need to join all of them.

Thus I have one file with multiple rows to be joined on an index number:

1 randomtext1
2 rtext2
2 rtext3
3 rtext4
3 rtext5
3 rtext6
.
.
.

I need this:

1 randomtext1
2 rtext2 rtext3
3 rtext4 rtext5 rtext6
.
.
.

where the repeated index number may have an arbitrary number of rtext's to be pivoted to columns for that index number. The number suffixes on rtext are only there for clarity.

I've considered using uniq to extracting the repeated text and using the results from that to remove that specific line in awk, then just reiterate this process in a script and use 'join', but I'm convinced there is a complete, easier solution.

updates: I have progressed to a solution with gawk... now I am having a problem with losing the order of rows as columns.

BEGIN {
FS=" ";
maxFLD=2;

getline var_tempnew
split(var_tempnew, array_temp, " ")
}
{

if ($1 == array_temp[1]){

	do
	{
	$0 = $0 " " array_temp[2]
	getline var_tempnew
	split(var_tempnew, array_temp, " ")
	} while ($1 == array_temp[1]) 
	print $0
	
} else {
    print var_tempnew
	var_tempnew = $0
	split(var_tempnew, array_temp, " ")
}
}
END {
print var_tempnew
}

radoulov · December 14, 2008, 7:22am

Given you have GNU AWK, you may try something like this:

WHINY_USERS=unix.com awk '
END {
  for (i in _) 
    printf "%d %s\n", i, _
  }
{
  idx = sprintf("%.10d", $1)
  _[idx] = _[idx] ? _[idx] FS $2 : $2
  }' infile

Or you want to preserve the original order of $1 which is not numeric?

ivhb · December 14, 2008, 12:21pm

if you try to preserve the original order, you can try this

awk '{
  if (cnt[$1] ++ == 0)
    idx[j++] = $1
  rst[$1] = $2 " " rst[$1]
}
END {
  for (i = 0; i < j; i ++)
    print idx, rst[idx]
}'

crimper · December 14, 2008, 5:49pm

Both solutions seems to work, however there are caveats for each one given I wasn't very specific on the requirements.
new data set:

10a textile
b10b wtf
b20b omg
b20b woot
20b teasdf ha
20b tesdf2 he
30c woot1
30c woot2

radoulov solutions works, though it assumes integers. I like the simplicity so I've tried modifying this code to work for non-integer indices, and it seems to lose the row order significantly(row to column transposed order is OK though):

END {
  for (i in _) 
    printf "%s\n", _
  }
{
  _[$1] = _[$1] ? _[$1] FS $0 : $0
}

This implies a problem with the string as array indices, but ivhb's solution uses the same feature.

{
  if (cnt[$1] ++ == 0)
  {
    idx[j++] = $1
     rst[$1] = $0
  } else {
     rst[$1] = rst[$1] FS $0 
  }
}
END {
  for (i = 0; i < j; i ++)
    print rst[idx]
}

This works OK as modified.

Final script FTW:

{
  if (cnt[$1] ++ == 0)
  {
    idx[j++] = $1
     rst[$1] = $0
  } else {
    vindex = $1
     sub(/^[[:alnum:]_]+ \y/, "") 
     rst[vindex] = rst[vindex] FS $0 
  }
}
END {
  for (i = 0; i < j; i ++)
    print rst[idx]
    #print idx, rst[idx]
}

The only problem remains is that the space is assumed as the FS in the regexp, is there a way to escape the FS in a regexp?

summer_cherry · December 15, 2008, 1:20am

awk '{_[$1]=sprintf("%s %s",_[$1],$2)}
END{
for(i in _)
      print i" "_
}' filename

radoulov · December 15, 2008, 3:50am

What should be the desired output, given the sample data above?