How to remove a subset of data from a large dataset based on values on one line

davegen · November 23, 2011, 10:28am

Hello. I was wondering if anyone could help. I have a file containing a large table in the format:

      marker1 marker2 marker3 marker4
      position1 position2 position3 position4
      genotype1 genotype2 genotype3 genotype4

with marker being a name, position a numeric measure of distance and genotype also a number.

I need to remove columns based on the values in the "position" line, i.e. I need a script that takes a given marker and removes the adjacent columns whose positions are within a certain distance of that marker's position. So if the file looked like this:

rs1 rs2 rs3 rs4 rs5
1 2 3 4 5
2 3 1 1 2

and I was dealing with rs3 and the distance I wanted to remove was 1, I would want the output:

rs1 rs3 rs5
1 3 5
2 1 2

Does anyone know a way to do this? I appreciate any help, and I hope I haven't been too confusing!

Shell_Life · November 23, 2011, 11:08am

See if this works for you:

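#!/usr/bin/ksh
# Drop the columns immediately before and after column mMarker.
# Note: this works on column numbers, not on the values in the position line.
typeset -i mMax=5          # total number of columns in the file
typeset -i mMarker=3       # column number of the marker of interest
typeset -i mBefore=${mMarker}-1
typeset -i mAfter=${mMarker}+1
typeset -i mFld=1
mList=''
while [[ ${mFld} -le ${mMax} ]]; do
  if [[ ${mFld} -ne ${mBefore} && ${mFld} -ne ${mAfter} ]]; then
    # Insert a comma only between fields so the list has no leading comma.
    mList=${mList}${mList:+,}${mFld}
  fi
  mFld=${mFld}+1
done
cut -d' ' -f"${mList}" Your_File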
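
The script above selects columns by number, so it only matches the request when positions happen to be one unit apart. A rough sketch of a variant that compares the actual values on the position line might look like the following. It assumes a space-separated file with marker names on line 1 and positions on line 2; the function name drop_near and its argument order are made up for illustration, and the file is read twice so the columns can be chosen before anything is printed.

```shell
#!/bin/sh
# drop_near MARKER DISTANCE FILE   (hypothetical helper, not from the thread)
# Assumes line 1 holds marker names and line 2 their numeric positions;
# every later line is filtered with the same column selection.
drop_near() {
  awk -v marker="$1" -v dist="$2" '
    # First pass: locate the marker, then decide which columns to keep.
    NR == FNR {
      if (FNR == 1) {
        for (i = 1; i <= NF; i++) if ($i == marker) col = i
      } else if (FNR == 2) {
        if (!col) { print "marker not found: " marker > "/dev/stderr"; exit 1 }
        n = 0
        for (i = 1; i <= NF; i++) {
          d = $i - $col; if (d < 0) d = -d
          # Keep the marker itself and any column farther away than dist.
          if (i == col || d > dist) keep[++n] = i
        }
      }
      next
    }
    # Second pass: print only the kept columns of every line.
    {
      out = ""
      for (k = 1; k <= n; k++) out = out (k > 1 ? " " : "") $(keep[k])
      print out
    }
  ' "$3" "$3"
}
```

With the five-marker example from the first post, calling `drop_near rs3 1 Your_File` should keep rs1, rs3 and rs5, since rs2 and rs4 both lie exactly 1 away from rs3.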

davegen · November 24, 2011, 7:12am

Thanks, that's really helpful! I hate to ask for more but could I alter that to make it run through every position?
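
One possible reading of "run through every position" is to walk the position line from left to right and keep a marker only if it lies more than the chosen distance from the last marker that was kept. The sketch below follows that reading, which is an assumption rather than something stated in the thread. It also assumes the positions are in ascending order and the file is space-separated; the function name thin_markers is invented for the example.

```shell
#!/bin/sh
# thin_markers DISTANCE FILE   (hypothetical helper, not from the thread)
# Assumes line 2 of FILE holds numeric positions in ascending order.
thin_markers() {
  awk -v dist="$1" '
    # First pass: use only the position line (line 2) to pick columns.
    NR == FNR {
      if (FNR == 2) {
        n = 0
        for (i = 1; i <= NF; i++) {
          # Keep the first column, then any column farther than dist
          # from the most recently kept one.
          if (n == 0 || $i - last > dist) { keep[++n] = i; last = $i }
        }
      }
      next
    }
    # Second pass: print the kept columns of every line.
    {
      out = ""
      for (k = 1; k <= n; k++) out = out (k > 1 ? " " : "") $(keep[k])
      print out
    }
  ' "$2" "$2"
}
```

Called as `thin_markers 1 Your_File` on the example table, this keeps rs1, then skips rs2 (only 1 away), keeps rs3, skips rs4 and keeps rs5. If the intent was instead to produce a separate output for each marker in turn, a different loop would be needed.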