Merging columns based on one or more columns in two files

genehunter · August 30, 2012, 9:31pm

I have two files.
FileA.txt

30910   rs7468327
36587   rs10814410
91857   rs9408752
105797  rs1133715
146659  rs2262038
152695  rs2810979
181843  rs3008128
182129  rs3008131
192118  rs3008170

FileB.txt

30910 1.9415219673 0
36431 1.3351312477 0.0107191428
36587 1.3169171182 0.0109274233
37123 1.3181466012 0.0116332908
38515 1.1211025231 0.0134681509
44551 1.5498135416 0.0202351257
47327 1.5694610726 0.0245374081
48265 1.5556343019 0.0260095626
68775 1.5538580867 0.0579156221

I want to merge the columns of the two files, matching lines on column 1.
I would also like to know whether I can merge them when more than one column has to match between the two files.

agama · August 30, 2012, 10:31pm

If file1 isn't too large, then this should work:

# single pass across each file, but requires the entire first file
# to be held in memory which might not be realistic.
# order is preserved based on file2
awk '
    NR == FNR { cache[$1] = $0; next; }
    $1 in cache {
        printf( "%s", cache[$1] );
        $1 = "";
        print;
    }
' file1 file2 >output

If file1 is large (i.e. it's not practical to cache it in memory), then this is one way. It may not be the most efficient, but it should work. The output is sorted by field 1.

# multiple passes across the data, but memory requirement is eliminated
# order of file2 is not preserved.
(
    sed 's/^/a /' file1
    sed 's/^/b /' file2
) | sort -k 2n,2 -k 1,1 | awk '
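    $1 == "a" {             # line from file1: remember its key and contents
        x = $2;
        $1 = "";            # drop the "a" tag (leaves a leading space)
        cache = $0;
        next;
    }
    $2 == x {               # line from file2 whose key matches the cached line
        $1 = $2 = "";       # drop the "b" tag and the duplicate key
        printf( "%s%s\n", substr( cache, 2 ), $0 );
    }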
'

You could do this without the seds, and depend on the number of columns to determine if an unmatched pair exists, but this works without having to know the exact layout of either file, other than the desired column to compare.

Yes, multiple columns can be used to match.
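
As a rough sketch of the multi-column case, the in-memory approach can index the cache on two fields at once. This assumes the first two columns of each file form the key. The files keyA.txt and keyB.txt and their contents are made up for illustration and are not the poster's data.

```shell
# Hypothetical sample data: columns 1 and 2 together form the key.
tmpdir=$(mktemp -d)
printf '%s\n' '1 30910 rs7468327' '1 36587 rs10814410' '2 30910 rs9408752' > "$tmpdir/keyA.txt"
printf '%s\n' '1 30910 1.94 0' '2 30910 1.33 0.01' '2 36587 1.31 0.02' > "$tmpdir/keyB.txt"

# Cache the first file under the composite key ($1,$2), then for each line
# of the second file with the same key, print the cached line followed by
# the remaining (non-key) columns.
merged=$(awk '
    NR == FNR { cache[$1, $2] = $0; next }
    ($1, $2) in cache {
        out = cache[$1, $2]
        for (i = 3; i <= NF; i++)
            out = out " " $i
        print out
    }
' "$tmpdir/keyA.txt" "$tmpdir/keyB.txt")

printf '%s\n' "$merged"
rm -rf "$tmpdir"
```

Only lines whose two key columns both match are printed, so here the third line of keyB.txt is dropped. Adding more fields to the subscript extends the key in the same way.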

leafei · August 30, 2012, 11:13pm

The title leads me to join:

join FileA.txt FileB.txt
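
One caveat: join expects both inputs to be sorted on the join field in its own collation order, not numerically, and the sample files above are in numeric order. A minimal sketch that sorts both files the same way first, using throwaway copies of a few rows from the question (the temporary directory and the .sorted names are assumptions for illustration):

```shell
# Throwaway copies of a few sample rows from the question.
tmpdir=$(mktemp -d)
printf '%s\n' '30910   rs7468327' '36587   rs10814410' '105797  rs1133715' > "$tmpdir/FileA.txt"
printf '%s\n' '30910 1.9415219673 0' '36431 1.3351312477 0.0107191428' '36587 1.3169171182 0.0109274233' > "$tmpdir/FileB.txt"

# Sort both files on field 1 in the same (byte-wise) collation that join
# will use, then join on the first field of each.
LC_ALL=C sort -k 1,1 "$tmpdir/FileA.txt" > "$tmpdir/A.sorted"
LC_ALL=C sort -k 1,1 "$tmpdir/FileB.txt" > "$tmpdir/B.sorted"
joined=$(LC_ALL=C join "$tmpdir/A.sorted" "$tmpdir/B.sorted")

printf '%s\n' "$joined"
rm -rf "$tmpdir"
```

The output comes out in the sorted key order rather than the original file order, and lines without a partner in the other file are dropped by default.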