An algorithm to be written as a Linux shell script

Hi All,

I wrote the following script in R. However, I cannot run it because the data file is too big, so I need to rewrite it as a shell script. Could you please help me?

######################################

data=as.matrix(read.table("data.txt"))
file=as.matrix(read.table("file.txt"))
n1=dim(file)[1]    # number of lines in file.txt
n2=dim(data)[1]  # number of lines in data.txt
control=file[,3:4] # 3rd and 4th columns of file.txt
new=matrix(nrow=n1, ncol=1)  # new matrix to store the output
count=0
for (j in 1:n1)
{
 count=count+1
  for (i in 1:n2)
 {  
  # TRUE when either allele of the pair is not one of the two control alleles of SNP j
  if (!all(data[i, ((2*j)-1):(2*j)] %in% control[j,]))
   {
    new[count]=file[j,1]
   }
  } 
}
 

################################
data.txt is genotype data and looks like

G A G A G A G G G A G A ...
G A G G G A A G G G G G ...
...
G A G A G A G A ...

file.txt looks like

snp1 265 G T
snp2 546 A G
snp3 905 A G
snp4 965 T G
...
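
Judging from the R script, data.txt is expected to hold two columns for every line of file.txt (columns 2j-1 and 2j belong to SNP j). A minimal sketch of a sanity check for that layout is given here; the miniature input files and the variable names nsnp and bad_rows are made up for illustration and are not part of the original post.

```shell
#!/bin/sh
# Made-up miniature inputs so the check runs on its own.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'snp1 265 G T' 'snp2 546 A G' > file.txt
printf '%s\n' 'G A G A' 'G A G G' > data.txt

# Number of SNPs = number of lines in file.txt.
nsnp=$(awk 'END { print NR }' file.txt)

# Count the data.txt rows whose field count is not 2 * nsnp.
bad_rows=$(awk -v want="$((2 * nsnp))" 'NF != want { c++ } END { print c + 0 }' data.txt)

echo "SNPs: $nsnp, malformed rows: $bad_rows"
```

Both awk calls read their input line by line, so the check should not need to hold the whole data file in memory.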

new.txt, which is the output, should look like

snp1
snp4
...

So, the algorithm compares a pair of columns from data.txt,
e.g. the 1st and 2nd columns

G A
G A
..
G A

with the 3rd and 4th columns of the 1st line of file.txt (G T), and if a pair is not one of the combinations (G T, G G, T G, T T), it writes the SNP name to new.txt
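
The rule for a single pair can be sketched as below. The function name pair_ok is hypothetical, and the three sample pairs are chosen only to illustrate the rule for snp1 (control alleles G and T): a pair is allowed when both of its alleles are among the two control alleles, which covers exactly the four listed combinations.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when both alleles of a pair ($1, $2)
# belong to the two control alleles of the SNP ($3, $4).
pair_ok() {
  { [ "$1" = "$3" ] || [ "$1" = "$4" ]; } &&
  { [ "$2" = "$3" ] || [ "$2" = "$4" ]; }
}

# snp1 has the control alleles G and T.
for pair in "G T" "T T" "G A"; do
  set -- $pair   # split the pair into $1 and $2
  if pair_ok "$1" "$2" G T; then
    echo "$pair: allowed"
  else
    echo "$pair: reported"
  fi
done
```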

Does that make sense?

Thanks in advance,

For each record in "data.txt", you want to compare $1 and $2 with $3 and $4 of the equivalent record in "file.txt".

If not the same, then display $1 from "file.txt".

Is this correct?

Yes, this is correct.

See if this works for you:

#!/usr/bin/ksh
cut -d' ' -f1,2 data.txt > data2.txt
# Input for loop will be: G A snp1 265 G T
paste data2.txt file.txt |
while read m1 m2 m3 m4 m5 m6; do
  # Print the SNP name when the pair is not the same as the control alleles.
  if [[ "${m1}" != "${m5}" || "${m2}" != "${m6}" ]]; then
    echo "${m3}"
  fi
done
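
Note that the loop above pairs line i of data.txt with line i of file.txt and looks only at the first two columns, whereas the R script checks columns 2j-1 and 2j of every data row against SNP j. A minimal awk sketch of that column-wise reading follows. It assumes every row of data.txt has two columns per SNP, and it uses the sample rows from the question truncated to four SNPs as stand-in input; the array names (name, a, b, bad) are illustrative only.

```shell
#!/bin/sh
# Stand-in inputs built from the samples in the question (not the real data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > file.txt <<'EOF'
snp1 265 G T
snp2 546 A G
snp3 905 A G
snp4 965 T G
EOF

# One allele pair per SNP in each row: columns 1-2 -> snp1, 3-4 -> snp2, ...
cat > data.txt <<'EOF'
G A G A G A G G
G A G G G A A G
EOF

awk '
  # First file (file.txt): remember the name and control alleles of SNP j.
  NR == FNR { name[FNR] = $1; a[FNR] = $3; b[FNR] = $4; n = FNR; next }
  # Second file (data.txt): flag SNP j if either allele of its pair
  # is not one of the two control alleles.
  {
    for (j = 1; j <= n; j++) {
      x = $(2*j - 1); y = $(2*j)
      if ((x != a[j] && x != b[j]) || (y != a[j] && y != b[j]))
        bad[j] = 1
    }
  }
  # Print each flagged SNP once, in file.txt order.
  END { for (j = 1; j <= n; j++) if (bad[j]) print name[j] }
' file.txt data.txt > new.txt

cat new.txt
```

Only file.txt is held in memory, while data.txt is streamed one row at a time, which is the reason for trying awk on a data file that is too big for R. With the stand-in input, new.txt contains snp1 and snp4.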