Merge 4 bim files by keeping only the overlapping variants (unique rs values)

fondan · January 4, 2020, 11:17am

Dear community, I am facing a problem and kindly ask for your help:

I have 4 different data sets coming from 3 different types of array.

In each file, column 1 is the chromosome, column 2 is the SNP id, etc. Let's say I have the following (bim) datasets:

x2014:

1       rs3094315       0       752566  G       A
1       rs3131972       0       752721  G       A

... 550,000 more rows

x2016:

0       200610-10       0       0       G       A
0       200610-108      0       0       G       A

...

x2017

0       200610-10       0       0       G       A
0       200610-108      0       0       G       A

...

x2018:

0       200610-10       0       0       G       A
0       200610-108      0       0       G       A

.....more 550K rows

How can I merge all files together, without having any duplicate values based on the 2nd column (rs_id)?

nezabudka · January 5, 2020, 12:34am

Hi
So your files look like this exactly?
cat file

x2014:  1       rs3094315       0       752566  G       A
        1       rs3131972       0       752721  G       A
        ...
x2016:  0       200610-10       0       0       G       A
        0       200610-108      0       0       G       A
        ...

or maybe
cat file

x2014:
1       rs3094315       0       752566  G       A
1       rs3131972       0       752721  G       A       
...     
x2016:
0       200610-10       0       0       G       A
0       200610-108      0       0       G       A       
...

nezabudka · January 5, 2020, 3:49pm

Or maybe x2014, x2016, x2017 and x2018 are the file names?
Five hundred thousand rows is a drop in the bucket for an AWK array:

awk '!T[$2]++' x201[4678] > ONE_FILE
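In case the one-liner looks cryptic, here is a rough long-form sketch of the same idea. The file names demo_a.bim, demo_b.bim and demo_merged.bim and their contents are made up for the demo; only the awk logic matters. A line is printed the first time its 2nd field is seen, so when an ID occurs in several files, the row from the earliest file on the command line is the one that is kept.

```shell
# Hypothetical sample data: two tiny .bim-style files that share one ID.
printf '1\trs3094315\t0\t752566\tG\tA\n1\trs3131972\t0\t752721\tG\tA\n' > demo_a.bim
printf '1\trs3131972\t0\t752721\tG\tA\n0\t200610-10\t0\t0\tG\tA\n' > demo_b.bim

# Long form of: awk '!T[$2]++'
awk '{
    if (!($2 in seen)) print   # first time this ID appears: keep the line
    seen[$2] = 1               # remember the ID for later lines and files
}' demo_a.bim demo_b.bim > demo_merged.bim

cat demo_merged.bim
```

The glob x201[4678] expands in sorted order, so with the real files the rows from x2014 take precedence over the later ones.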

fondan · January 7, 2020, 4:21am

Thank you for your replies. I know that it may seem easy, but I am a beginner with Bash.

@nezabudka

Yes, x2014, x2016, etc. are the file names! They look like the 2nd one:

cat x2014
1       rs3094315       0       752566  G       A
1       rs3131972       0       752721  G       A

etc..

nezabudka · January 7, 2020, 5:29am

Ok
Then this is just what the doctor ordered. The command from post #3 keeps only the first line for each value of field 2 and drops the later duplicates.
If you need to deduplicate on the entire line instead, use:

awk '!T[$0]++' x201[4678] > ONE_FILE
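Note that both commands above keep every ID that occurs in any of the files. The thread title mentions keeping only the overlapping variants, so if the goal is really the IDs present in every file, something along these lines might work. This is only a sketch: the demo file names and rs IDs are invented, and it assumes no input file is empty. Each ID is counted once per file, and at the end the rows from the first file are printed for the IDs that were found in all of them.

```shell
# Hypothetical sample data: three small files, only rs2 and rs3 occur in all of them.
printf '1\trs1\t0\t100\tG\tA\n1\trs2\t0\t200\tG\tA\n1\trs3\t0\t300\tG\tA\n' > demo_a.bim
printf '1\trs2\t0\t200\tG\tA\n1\trs3\t0\t300\tG\tA\n1\trs4\t0\t400\tG\tA\n' > demo_b.bim
printf '1\trs3\t0\t300\tG\tA\n1\trs2\t0\t200\tG\tA\n1\trs5\t0\t500\tG\tA\n' > demo_c.bim

awk '
FNR == 1 { nfiles++ }                  # first line of a new input file
!(($2, nfiles) in counted) {           # count each ID at most once per file
    counted[$2, nfiles] = 1
    hits[$2]++
}
nfiles == 1 && !($2 in line) {         # remember first-file rows in their order
    line[$2] = $0
    order[++n] = $2
}
END {
    for (i = 1; i <= n; i++)
        if (hits[order[i]] == nfiles)  # ID was seen in every file
            print line[order[i]]
}' demo_a.bim demo_b.bim demo_c.bim > demo_common.bim

cat demo_common.bim
```

With the real data the file list would be x201[4678] instead of the demo files.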