Merging rows in awk

Hello,

I have a data format as follows:

Ind1 0 1 2
Ind1 0 2 1
Ind2 1 1 0 
Ind2 2 2 0

I want to use AWK to have this output:

Ind1 00 12 21
Ind2 12 12 00 

That is to merge each two rows with the same row names.

Thank you very much in advance for your help.

What about searching this forum and trying it yourself? This type of request is abundant here!

Indeed, but my real file has about 1000 rows and more than 400,000 columns. I have found this:

{
    col2[$1] = col2[$1] $2
    col3[$1] = col3[$1] $3
    col4[$1] = col4[$1] $4
}

END {

    for ( i in col2 )
    {
        print i " " col2[i] " " col3[i] " " col4[i]
    }
}

but I need to extend the loop over all fields, something like

for ( i = 2; i <= NF; i++ )

I need some help for that, thank you very much.

Try:

awk '
{       if(NF > m) m = NF
        if(!((1,$1) in o)) {
                o[1,$1] = $1
                oo[++oc] = $1
        }
        for(i = 2; i <= NF; i++) o[i,$1] = o[i,$1] $i
}
END {   for(i = 1; i <= oc; i++) {
                for(j = 1; j <= m; j++)
                        printf("%s%s", o[j, oo[i]], j == m ? "\n" : " ")
        }
}' input

On Solaris systems, use /usr/xpg4/bin/awk or nawk, instead of awk.
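A quick sanity check on the four-line sample (the script is repeated here so the snippet is self-contained):

```shell
cd "$(mktemp -d)"            # scratch directory so nothing gets clobbered
printf 'Ind1 0 1 2\nInd1 0 2 1\nInd2 1 1 0\nInd2 2 2 0\n' > input

awk '
{       if(NF > m) m = NF
        if(!((1,$1) in o)) {
                o[1,$1] = $1            # remember the row name once
                oo[++oc] = $1           # and its first-seen order
        }
        for(i = 2; i <= NF; i++) o[i,$1] = o[i,$1] $i
}
END {   for(i = 1; i <= oc; i++) {
                for(j = 1; j <= m; j++)
                        printf("%s%s", o[j, oo[i]], j == m ? "\n" : " ")
        }
}' input
# prints:
# Ind1 00 12 21
# Ind2 12 12 00
```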

Standard awk only supports 100 fields per line, so you're going to need gawk or mawk.

If performance is an issue and lines to be merged are always adjacent, the following should use a lot less resources:

awk '
function pl() {
    printf "%s",a;
    for(i=2;i in V;i++) printf " %s",V[i]
    printf "\n"
}
a!=$1 {
    if(a)pl()
    a=$1
    split("",V,",")
}
{
   for(i=2;i<=NF;i++) V[i]=V[i] $i
   next
}
END {pl()}' infile

try also:

awk '{
  a[$1]=$1; for (i=2; i<=NF; i++) {ar[$1,i]=ar[$1,i] $i; if (NF>mx) mx=NF}
}
END {
  for (b in a) {
    printf b " ";
    for (c=2; c<=mx; c++) printf ar[b,c] " ";
    print "";
  }
}' infile

Is it always two adjacent rows? Do they always have the same field count? Try this:

$ awk '{getline nx; j=split (nx, Ax); for (i=2;i<=j;i++) $i=$i Ax[i]}1' file

or, even shorter:

$ awk '{getline nx; for (i=2;i<=split(nx,Ax);i++) $i=$i Ax[i]}1' file
Ind1 00 12 21
Ind2 12 12 00

Don't know if it will work for 400000 columns, though ...

Ouch! I missed the 400,000 columns note. But I don't see anything in the POSIX Standards or the Single UNIX Specifications that allows implementations to limit the number of fields in a line. And if the input file is sorted, it is grossly inefficient to read the entire file (at least 800,000,000 bytes) into memory rather than sorting it first and using your method. But, of course, you can't use the standard sort utility to sort a file whose lines are at least 800,000 bytes long.

All of the standard utilities that work on text files (including awk, the editors, grep, and sort) are only defined to work on text files (which limits a line to LINE_MAX bytes per line). LINE_MAX can be as small as 2,048. I don't think I've ever used a system with LINE_MAX greater than 20,480.

The only text processing utilities in the standards that are required to work on files that would be text files if line lengths were unlimited are: cut, fold, paste, and the shell. And, for the shell it is only the length of command lines that are unlimited (the shell built-in utilities that read and write files, such as read and printf, are only defined to work if the input or output is a text file).

It would be possible to use cut to create thousands (or tens of thousands or hundreds of thousands, depending on expected field widths after merging lines) of text files that can be processed with awk, then use cut again to get rid of the first field in each file except the first one, and then use paste to put the results back together. But, having created this file with some lines that are at least 1.2 MB long (400,000 fields * (2 bytes/joined field + 1 byte separating fields)), there isn't much you can do with it.
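A toy version of that cut/awk/paste pipeline, shrunk to two column chunks on the four-column sample. All file names here are invented for the sketch, and a real run would generate the cut field ranges and chunk names in a loop:

```shell
cd "$(mktemp -d)"                        # scratch area for the sketch
printf 'Ind1 0 1 2\nInd1 0 2 1\nInd2 1 1 0\nInd2 2 2 0\n' > wide.txt

# split the columns into chunks, each keeping the name column
cut -d ' ' -f 1-2   wide.txt > chunk1    # name + field 2
cut -d ' ' -f 1,3-4 wide.txt > chunk2    # name + fields 3 and 4

# merge each pair of adjacent rows within every chunk
merge='{getline nx; j=split(nx, Ax); for (i=2;i<=j;i++) $i=$i Ax[i]}1'
awk "$merge" chunk1 > merged1
awk "$merge" chunk2 > merged2

# drop the duplicated name column from every chunk but the first,
# then paste the pieces back side by side
cut -d ' ' -f 2- merged2 > merged2.body
paste -d ' ' merged1 merged2.body
# prints:
# Ind1 00 12 21
# Ind2 12 12 00
```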

FWIW, I ran this on various systems:

getconf LINE_MAX; { for ((i=1; i<=400000; i++)); do printf "$i "; done ; echo ;} | awk '{$1=$1x}1' OFS='\n' | wc -l
OSX 10.8:
2048
   400000

CentOS 6.3:
2048
   400000

AIX7: 
2048
   400000

Solaris 10:
2048
/usr/xpg4/bin/awk: line 0 (NR=1): Record too long (LIMIT: 19999 bytes)
       0

HPUX 11i:
2048
awk: Input line 1 2 3 4 5 6 7 8 9 10 cannot be longer than 3,000 bytes.
 The source line number is 1.
0