Hi guys,
I have very basic linux experience so I need some help with a problem.
I have 3 files from which I want to extract columns based on common fields between them.
File1:
--- rs74078040 NA 51288690 T G 461652 0.99223 0.53611 3
--- rs77209296 NA 51303525 T G 461843 0.98973 0.60837 3
--- rs7985170 NA 51304203 C T 461852 0.99215 0.541 3
File2:
id rsid chromosome pos allele_A allele_B index average_call info
--- rs74078040 NA 51288690 T G 461652 0.99852 0.52725 2
--- rs77209296 NA 51303525 T G 461843 0.99992 0.90562 2
--- rs7985170 NA 51304203 C T 461852 0.99994 0.93283 2
--- rs145780716 NA 19032912 G A 2054321 0.9988 0.026123 2
File3:
id rsid chromosome pos allele_A allele_B index average_call info
--- rs74078040 NA 51288690 T G 461652 0.956 0.8558 1
--- rs77209296 NA 51303525 T G 461843 0.9843 0.8989 1
--- rs7985170 NA 51304203 C T 461852 0.9323 0.6754 1
--- rs145780716 NA 19032912 G A 2054321 0.9098 0.7632 1
I want to take all the rows from File1 and say columns 1-7 (which are the same in all the files) and then add column 8 from File 2 and columns 9 and 10 from File3.
Note that File1 doesn't have a header but the other files do. Also, File1 has less lines than the other files which have the same number of lines.
The output would look like:
--- rs74078040 NA 51288690 T G 461652 0.99852 0.8558 1
--- rs77209296 NA 51303525 T G 461843 0.99992 0.8989 1
--- rs7985170 NA 51304203 C T 461852 0.99994 0.6754 1
Hi guruprasadpr,
Many thanks for your reply. I have tried the command you suggest but it tells me: "join: File 2 is not in sorted order" which I think probably comes from the header line because column 4 is in sorted order.
Any suggestions how to get around that problem? I don't want to make a copy of the files without headers because they are quite big and our disk space is limited.
Hi pamu, I wouldn't mind using more than one step even though in reality I have 5 files instead of 3.
Could you broadly explain the awk syntax please? My awk knowledge is almost zero.
Thanks for your reply!
awk 'FNR==NR{$NF="";$(NF-1)="";$(NF-2)="";a[$2]=$0;next} # Read first file, make $NF,$(NF-1) and $(NF-2) NULL as we don't need those. Create an array a with $2 is index and which stores $0 ( whole line).
{if(a[$2]){print a[$2],$8}}' file1 file2 > temp # Here we are reading second file. Check if $2 present in this file or not. If it is present then file1 line matches with file2 line then print line from file1 and append $8 as you have requested. And write this to a new temp file.
awk 'FNR==NR{a[$2]=$0;next} # Read temp file create an array a, which stores $0(whole line)
{if(a[$2]){print a[$2],$9,$10}}' temp file3 # read second file. If there is match between files then print temp file line and append $9 and $10 from file3 as per your required output.
Thanks pamu and Scrutinizer. I will try your suggestions and read up a bit on awk and will let you guys know.
Thanks a lot!
---------- Post updated at 02:40 PM ---------- Previous update was at 11:39 AM ----------
pamu, your code actually works very well! I would just need to adapt it because my files don't really have 10 columns, more like 80 but the principle is the same. It would just be quite a long command.
Scrutinizer, your code works as well but what if file1 has more columns? How could I remove some of them if I don't want them?
Also, do I have to run the command with the spacing as is? When I tried to put into one long line I got some syntax errors.
Your command as far as I understand takes the values i want from files 2 and 3 and puts them in the same positions in file1 and prints the rest of file 1 as is. Let's say I want to do all that and on top keep a subset of columns, say column 65 and 75 from the remaining columns. Is that clear? Could I just replace "print" with "print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$65,$75" since now $8,$9 and $10 have the values from the other files?
I Think it does work!
---------- Post updated at 03:21 PM ---------- Previous update was at 03:08 PM ----------