Selecting random columns from large dataset in UNIX

sajmar · July 4, 2015, 11:55pm

Dear folks

I have a large data set that contains 400K columns. I want to select 50K specific columns out of the 400K. Is there a UNIX command that can do this for me? I should also mention that I have stored the IDs of the wanted columns in one file, which may help in selecting them from the full set.

Regards
Saj

Don_Cragun · July 5, 2015, 12:49am

What operating system are you using?

Your large dataset clearly is not a text file. What type of file is it?

What delimits columns in your dataset?

What separates records in your dataset?

What is the format of column IDs?

What is the format of the file containing column IDs?

sajmar · July 5, 2015, 1:02pm

1. What operating system are you using?
Linux
2. Your large dataset clearly is not a text file. What type of file is it?
ASCII text
3. What delimits columns in your dataset?
One space delimits columns
4. What separates records in your dataset?
One space between each record
5. What is the format of column IDs?
All of the columns contain 0, 1 or 2
6. What is the format of the file containing column IDs?
Integer

senhia83 · July 5, 2015, 11:09pm

Lots of assumptions here, since your request is not clear at all.

If your data is space-delimited, its first line holds the column names, and the columns you want to extract are listed in a file with one column name per line, this is worth a try.

Save the following as selectcols.sh

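#!/bin/bash

dlf=${1:-data.txt}   # data file; the first line must hold the column names
clf=${2:-list.txt}   # list file: one wanted column name per line

awk -v colsFile="$clf" '
   BEGIN {
     # Read the wanted column names in the order they are listed
     j=1
     while ((getline < colsFile) > 0) {
        col[j++] = $1
     }
     n=j-1
     close(colsFile)
     # Map each wanted name to its position in the output
     for (i=1; i<=n; i++) s[col[i]]=i
   }
   NR==1 {
     # Header row: record which input field carries each wanted name
     for (f=1; f<=NF; f++)
       if ($f in s) c[s[$f]]=f
     next
   }
   { sep=""
     for (f=1; f<=n; f++) {
       printf("%s%s",sep,$c[f])
       sep=FS
     }
     print ""
   }
' "$dlf"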

Then run it, after adjusting the paths to the script and the files:

selectcols.sh datafile listofcolsfile
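
As a rough illustration of the input this approach seems to assume (the thread never shows it, so the file contents and the names snpA, snpB, snpC below are made up), here is a compact, self-contained sketch of name-based selection. The helper select_by_name is hypothetical and is not the script above, only the same idea in fewer lines.

```shell
# Hypothetical sample files: a header row of column names, then data rows.
tmp=$(mktemp -d)
printf '%s\n' 'id snpA snpB snpC' 'r1 0 1 2' 'r2 2 1 0' > "$tmp/data.txt"
printf '%s\n' 'snpC' 'snpA' > "$tmp/list.txt"

# select_by_name LISTFILE DATAFILE
# Prints the named columns of DATAFILE in the order given in LISTFILE.
# Assumes every listed name really occurs in the header row.
select_by_name() {
  awk '
    FNR == NR { want[++n] = $1; next }                        # list file: wanted names
    FNR == 1  { for (f = 1; f <= NF; f++) pos[$f] = f; next } # header: name -> field number
    {
      out = $(pos[want[1]])
      for (i = 2; i <= n; i++) out = out " " $(pos[want[i]])
      print out
    }
  ' "$1" "$2"
}

select_by_name "$tmp/list.txt" "$tmp/data.txt"
```

Note that the header row is consumed and not printed, so a data file without a header loses its first record and nothing in it will match by name.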

sajmar · July 20, 2015, 12:18pm

To senhia83:

Thanks for your awk script suggestion. But after running the code you posted, I did not get my expected result.

If you assume my "datafile" is:

1 2 1 0 2 0 1 0
2 2 2 1 1 1 0 0
1 1 0 0 0 2 2 2
2 2 2 1 1 0 0 0

and my "listofcolsfile" is:

1
4
8

my desired output is:

1 0 0
2 1 0
1 0 2
2 1 0

Regards
SAJ

RudiC · July 20, 2015, 12:31pm

Try

awk 'FNR==NR {C[++j]=$1;next} {for (i=1;i<=j;i++) printf "%s ", $C[i]; printf "\n"}' file2 file1
1 0 0 
2 1 0 
1 0 2 
2 1 0
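
For a short list of column numbers, a possible alternative is cut with a field list built by paste. This is only a sketch on the sample data from this thread; select_cols is a hypothetical helper name. Two caveats: cut always prints fields in their original file order regardless of the order in the list, and a list as long as 50K entries may exceed the system limit on the length of a single command-line argument, in which case the awk approach is the safer choice.

```shell
# Sample files copied from the posts above.
tmp=$(mktemp -d)
printf '%s\n' '1 2 1 0 2 0 1 0' '2 2 2 1 1 1 0 0' \
              '1 1 0 0 0 2 2 2' '2 2 2 1 1 0 0 0' > "$tmp/file1"
printf '%s\n' 1 4 8 > "$tmp/file2"

# select_cols LISTFILE DATAFILE
# paste -s joins the column numbers into one comma-separated string
# (here "1,4,8"), which cut accepts as its -f field list.
select_cols() {
  cut -d ' ' -f "$(paste -sd, "$1")" "$2"
}

select_cols "$tmp/file2" "$tmp/file1"
```

Unlike the printf loop above, this leaves no trailing space at the end of each line.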