Hello!
Can anybody suggest the fastest way of extracting "n" random columns from a very large tab-separated file having thousands of columns, where n can be any specified number?
Thanks!
Just one line of thousands of columns?
If you have or can use the rl utility, it is quite fast indeed:
rl -d $'\t' -c 32 tabseparatedfile
You can use the "cut" command; its default delimiter is the horizontal tab:
cut -fn Inp_File
where "n" is the column number, or a list or range of columns (e.g. 3,7 or 2-5).
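For instance, assuming a small tab-separated file (the sample data here is made up), cut can pull out several columns at once:

```shell
# Make a tiny tab-separated sample file (hypothetical data).
printf 'a\tb\tc\td\ne\tf\tg\th\n' > sample.tsv

# Extract columns 2 and 4; cut splits on tab by default.
cut -f2,4 sample.tsv
```

cut also accepts ranges, such as -f2-4 for columns 2 through 4.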
Hey! Maybe I was not very clear. I want to extract 'n' random columns, where 'n' can be 250, 400, or any other number. The file has thousands of columns (let's say 5000) and hundreds of rows (let's say 400).
Thanks!
Actually, it is not very clear.
Let's say you give n=100; do you want to extract the first 100 columns from all the rows?
regards,
Ahamed
---------- Post updated at 02:12 AM ---------- Previous update was at 02:06 AM ----------
To extract columns m to n, where n > m (note that cut's default delimiter is already the tab, so the gsub to spaces is only needed if you want space-separated output):
awk '{gsub("\t"," ",$0);print}' file | cut -d" " -f"m-n"
e.g., columns 3 to 5:
awk '{gsub("\t"," ",$0);print}' file | cut -d" " -f3-5
e.g., the first 400 columns:
awk '{gsub("\t"," ",$0);print}' file | cut -d" " -f1-400
regards,
Ahamed
Thanks for your reply, ahamed, but I need to extract random columns. If I say I want 50 random columns, it should pick them at random from all the columns; let's say it extracts column numbers 10, 28, 40, 57, 92, 500, 740, 2540, ... or any other columns for that matter, but the selection should be random, not ordered.
Of course the total number of columns and rows in the file is fixed; only 'n', the number of random columns required, can vary.
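One way to sketch this with standard GNU tools (assuming shuf is available; N and the file name below are placeholders) is to pick n random field numbers first and then hand the list to cut:

```shell
N=3            # how many random columns to take (placeholder)
FILE=data.tsv  # tab-separated input file (placeholder)

# Count the columns in the first line.
ncols=$(head -n1 "$FILE" | awk -F'\t' '{print NF}')

# Choose N distinct column numbers at random, joined with commas.
cols=$(seq 1 "$ncols" | shuf -n "$N" | paste -sd, -)

# Note: cut always emits fields in ascending file order,
# regardless of the order in the list.
cut -f"$cols" "$FILE"
```

The caveat in the last comment matters: cut re-sorts the field list, so the chosen columns come out in file order, not in the random order they were drawn.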
Ruby (1.9.1+, for Array#sample):
n=4                                # number of random columns to pick
f=open("file")
numcols=f.readline.split.size      # column count from the first line
cols=(1..numcols).to_a.sample(n)   # n distinct random column numbers
f.rewind                           # back up so the first row is printed too
while not f.eof?
line=f.gets.split
cols.each{|x| printf "%s " , line[x-1]}
puts
end
f.close
Hi, mira.
Providing an example would help.
Given this simple model of data in "row,column" notation:
1,1 1,2 1,3 1,4 1,5 1,6 1,7 1,8 1,9 1,10
2,1 2,2 2,3 2,4 2,5 2,6 2,7 2,8 2,9 2,10
3,1 3,2 3,3 3,4 3,5 3,6 3,7 3,8 3,9 3,10
4,1 4,2 4,3 4,4 4,5 4,6 4,7 4,8 4,9 4,10
what would be your expected results of, say, two runs: one choosing 2 and one choosing 3 of your random columns? This will help us understand your intentions.
Best wishes ... cheers, drl
OK, let's say I need 2 random columns; it can be any two, e.g.
1,6 1,10
2,6 2,10
3,6 3,10
4,6 4,10
and if I say I need 3 random columns, it can be any three. What is important is that the choice is random, perhaps by using some function for choosing columns randomly,
e.g. output will be :
1,1 1,4 1,7
2,1 2,4 2,7
3,1 3,4 3,7
4,1 4,4 4,7
Thanks! drl..
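If the selected columns should also come out in a random order (cut would re-sort them), a plain awk sketch can do the whole job in one pass; n=3 and data.tsv below are placeholders, and n is assumed to be at most the number of columns:

```shell
# Choose n distinct random columns once (from the first line's field
# count), then print those columns, tab-separated, for every row.
awk -F'\t' -v n=3 '
BEGIN { srand() }
NR == 1 {
    # Fisher-Yates shuffle of the column indices 1..NF;
    # the first n entries of idx[] are the random selection.
    for (i = 1; i <= NF; i++) idx[i] = i
    for (i = NF; i > 1; i--) {
        j = int(rand() * i) + 1
        t = idx[i]; idx[i] = idx[j]; idx[j] = t
    }
}
{
    out = ""
    for (i = 1; i <= n; i++) out = out (i > 1 ? "\t" : "") $(idx[i])
    print out
}' data.tsv
```

Because the indices are fixed on the first line, every row is cut at the same randomly chosen columns, as the expected output above requires.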
What languages do you have available? I'm pondering a solution in C.
---------- Post updated at 01:27 PM ---------- Previous update was at 12:20 PM ----------
/**
* rc.c picks random columns from whitespace-separated input from stdin.
*/
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
const char ofs=' '; // output field separator
const char *fs=" \r\n\t"; // input field separator
int main(int argc, char *argv[])
{
size_t size; // buf[] size in bytes
char *buf=malloc(size=65536L);
int pos=0; // How much's been read into buf[]
int coln=0, colmax; // How many columns read, max columns in cols[]
char **cols=malloc(sizeof(char *) * (colmax=512L));
int choose=1; // How many columns to choose
if(argc != 2)
{
fprintf(stderr, "Usage: %s columns < datafile\n",argv[0]);
return(1);
}
if((sscanf(argv[1], "%d", &choose) != 1) || (choose <= 0))
{
fprintf(stderr, "Bad count '%s'\n", argv[1]);
return(1);
}
srand(time(NULL)^getpid()); // Make results random
// Read until end of file
while(fgets(buf+pos, size-pos, stdin) != NULL)
{
int c;
char *tok;
pos += strlen(buf+pos);// Find the end of line
if(pos <= 0) continue; // Don't bother checking empty line
// Check the end of line for \n
if(buf[pos-1] != '\n')
{ // Didn't get entire line, make buffer bigger
// then get the rest
buf=realloc(buf, size += size>>1);
continue;
}
// Break into columns across whitespace
tok=strtok(buf, fs);
while(tok != NULL) // strtok returns NULL on a blank line; skip it safely
{
// Check if we have enough room for columns.
// Add more if necessary.
if(colmax <= coln)
cols=realloc(cols, sizeof(char *)*
(colmax+=(colmax>>1)));
cols[coln++]=tok;
tok=strtok(NULL, fs);
}
for(c=0; (c<choose)&&(coln>0); c++)
{
int m=rand()%coln;
char *pick=cols[m];
cols[m]=cols[--coln]; // Remove from list
if(c != 0) putc(ofs, stdout);
fputs(pick, stdout);
}
putc('\n', stdout);
// Reset everything for next line
coln=0; pos=0;
}
return(0);
}
This should handle very large lines and thousands of columns without problems.
Hi.
With standard utilities:
#!/usr/bin/env bash
# @(#) s1 Demonstrate extraction of random number of columns.
# Utility functions: print-as-echo, print-line-with-visual-space, debug.
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for i;do printf "%s" "$i";done; printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && . $C seq rl tr sed cut
N=${1-3}
FILE=data1
# Generate N numbers in random sequence, range 1 - number-of-columns
cols=$( seq 1 10 | rl -c $N | tr '\n' ',' | sed 's/.$//' )
pl " Random columns: $cols"
cut -d" " -f"$cols" $FILE
exit 0
producing:
% ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian GNU/Linux 5.0.7 (lenny)
GNU bash 3.2.39
seq (GNU coreutils) 6.10
rl 0.2.7
tr (GNU coreutils) 6.10
GNU sed version 4.1.5
cut (GNU coreutils) 6.10
-----
Random columns: 7,9,2
1,2 1,7 1,9
2,2 2,7 2,9
3,2 3,7 3,9
4,2 4,7 4,9
and choosing a different N:
% ./s1 5
... omitted
-----
Random columns: 8,2,9,1,5
1,1 1,2 1,5 1,8 1,9
2,1 2,2 2,5 2,8 2,9
3,1 3,2 3,5 3,8 3,9
4,1 4,2 4,5 4,8 4,9
Best wishes ... cheers, drl