Hello!
Can anybody suggest the fastest way of extracting "n" random columns from a very large tab-separated file having thousands of columns, where n can be any specified number?
Thanks!
Just one line of thousands of columns?
If you have or can use the rl utility, it is quite fast indeed:
rl -d $'\t' -c 32 tabseparatedfile
You can use the "cut" command; its default delimiter is the horizontal tab:
cut -fn Inp_File
where "n" is the column number, or a list or range of columns (e.g. 3,7 or 2-5).
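For instance, assuming a small tab-separated file (the sample data here is made up), cut can pull out several columns at once:

```shell
# Make a tiny tab-separated sample file (hypothetical data).
printf 'a\tb\tc\td\ne\tf\tg\th\n' > sample.tsv

# Extract columns 2 and 4; cut splits on tab by default.
cut -f2,4 sample.tsv
```

cut also accepts ranges, such as -f2-4 for columns 2 through 4.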
Hey! Maybe I was not very clear. I want to extract 'n' random columns, where 'n' can be 250, 400, or any other number. The file has thousands of columns (let's say 5000) and hundreds of rows (let's say 400).
Thanks!
Actually, it is not very clear.
Let's say you give n=100; do you want to extract the first 100 columns from all the rows?
regards,
Ahamed
---------- Post updated at 02:12 AM ---------- Previous update was at 02:06 AM ----------
To extract columns m to n, where n > m (note that cut's default delimiter is already the tab, so the gsub to spaces is only needed if you want space-separated output):
awk '{gsub("\t"," ",$0);print}' file | cut -d" " -f"m-n"
e.g., columns 3 to 5:
awk '{gsub("\t"," ",$0);print}' file | cut -d" " -f3-5
e.g., the first 400 columns:
awk '{gsub("\t"," ",$0);print}' file | cut -d" " -f1-400
regards,
Ahamed
Thanks for your reply, ahamed, but I need to extract random columns. If I say I want 50 random columns, it should pick them at random from all the columns; let's say it extracts column numbers 10, 28, 40, 57, 92, 500, 740, 2540, ... or any other columns for that matter, but the selection should be random, not ordered.
Of course the total number of columns and rows in the file is fixed; only 'n', the number of random columns required, can vary.
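One way to sketch this with standard GNU tools (assuming shuf is available; N and the file name below are placeholders) is to pick n random field numbers first and then hand the list to cut:

```shell
N=3            # how many random columns to take (placeholder)
FILE=data.tsv  # tab-separated input file (placeholder)

# Count the columns in the first line.
ncols=$(head -n1 "$FILE" | awk -F'\t' '{print NF}')

# Choose N distinct column numbers at random, joined with commas.
cols=$(seq 1 "$ncols" | shuf -n "$N" | paste -sd, -)

# Note: cut always emits fields in ascending file order,
# regardless of the order in the list.
cut -f"$cols" "$FILE"
```

The caveat in the last comment matters: cut re-sorts the field list, so the chosen columns come out in file order, not in the random order they were drawn.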
Ruby (1.9.1+, for Array#sample):
n=4                                # number of random columns to pick
f=open("file")
numcols=f.readline.split.size      # column count from the first line
cols=(1..numcols).to_a.sample(n)   # n distinct random column numbers
f.rewind                           # back up so the first row is printed too
while not f.eof?
line=f.gets.split
cols.each{|x| printf "%s " , line[x-1]}
puts
end
f.close
Hi, mira.
Providing an example would help.
Given this simple model of data in "row,column" notation:
1,1 1,2 1,3 1,4 1,5 1,6 1,7 1,8 1,9 1,10
2,1 2,2 2,3 2,4 2,5 2,6 2,7 2,8 2,9 2,10
3,1 3,2 3,3 3,4 3,5 3,6 3,7 3,8 3,9 3,10
4,1 4,2 4,3 4,4 4,5 4,6 4,7 4,8 4,9 4,10
what would be your expected results of, say, two runs: one choosing 2 and one choosing 3 of your random columns? This will help us understand your intentions.
Best wishes ... cheers, drl
OK, let's say I need 2 random columns; it can be any two, e.g.
1,6 1,10
2,6 2,10
3,6 3,10
4,6 4,10
and if I say I need 3 random columns, it can be any three. What is important is that the choice is random, perhaps by using some function for choosing columns randomly,
e.g. output will be :
1,1 1,4 1,7
2,1 2,4 2,7
3,1 3,4 3,7
4,1 4,4 4,7
Thanks! drl..
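If the selected columns should also come out in a random order (cut would re-sort them), a plain awk sketch can do the whole job in one pass; n=3 and data.tsv below are placeholders, and n is assumed to be at most the number of columns:

```shell
# Choose n distinct random columns once (from the first line's field
# count), then print those columns, tab-separated, for every row.
awk -F'\t' -v n=3 '
BEGIN { srand() }
NR == 1 {
    # Fisher-Yates shuffle of the column indices 1..NF;
    # the first n entries of idx[] are the random selection.
    for (i = 1; i <= NF; i++) idx[i] = i
    for (i = NF; i > 1; i--) {
        j = int(rand() * i) + 1
        t = idx[i]; idx[i] = idx[j]; idx[j] = t
    }
}
{
    out = ""
    for (i = 1; i <= n; i++) out = out (i > 1 ? "\t" : "") $(idx[i])
    print out
}' data.tsv
```

Because the indices are fixed on the first line, every row is cut at the same randomly chosen columns, as the expected output above requires.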
What languages do you have available? I'm pondering a solution in C.
---------- Post updated at 01:27 PM ---------- Previous update was at 12:20 PM ----------
/**
* rc.c picks random columns from whitespace-separated input from stdin.
*/
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
const char ofs=' '; // output field separator
const char *fs=" \r\n\t"; // input field separator
int main(int argc, char *argv[])
{
size_t size; // buf[] size in bytes
char *buf=malloc(size=65536L);
int pos=0; // How much's been read into buf[]
int coln=0, colmax; // How many columns read, max columns in cols[]
char **cols=malloc(sizeof(char *) * (colmax=512L));
int choose=1; // How many columns to choose
if(argc != 2)
{
fprintf(stderr, "Usage: %s columns < datafile\n",argv[0]);
return(1);
}
if((sscanf(argv[1], "%d", &choose) != 1) || (choose <= 0))
{
fprintf(stderr, "Bad count '%s'\n", argv[1]);
return(1);
}
srand(time(NULL)^getpid()); // Make results random
// Read until end of file
while(fgets(buf+pos, size-pos, stdin) != NULL)
{
int c;
char *tok;
pos += strlen(buf+pos);// Find the end of line
if(pos <= 0) continue; // Don't bother checking empty line
// Check the end of line for \n
if(buf[pos-1] != '\n')
{ // Didn't get entire line, make buffer bigger
// then get the rest
buf=realloc(buf, size += size>>1);
continue;
}
// Break into columns across whitespace
tok=strtok(buf, fs);
while(tok != NULL) // strtok returns NULL on a blank line; skip it safely
{
// Check if we have enough room for columns.
// Add more if necessary.
if(colmax <= coln)
cols=realloc(cols, sizeof(char *)*
(colmax+=(colmax>>1)));
cols[coln++]=tok;
tok=strtok(NULL, fs);
}
for(c=0; (c<choose)&&(coln>0); c++)
{
int m=rand()%coln;
char *pick=cols[m];
cols[m]=cols[--coln]; // Remove from list
if(c != 0) putc(ofs, stdout);
fputs(pick, stdout);
}
putc('\n', stdout);
// Reset everything for next line
coln=0; pos=0;
}
return(0);
}
This should handle very large lines and thousands of columns without problems.
Hi.
With standard utilities:
#!/usr/bin/env bash
# @(#) s1 Demonstrate extraction of random number of columns.
# Utility functions: print-as-echo, print-line-with-visual-space, debug.
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for i;do printf "%s" "$i";done; printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && . $C seq rl tr sed cut
N=${1-3}
FILE=data1
# Generate N numbers in random sequence, range 1 - number-of-columns
cols=$( seq 1 10 | rl -c $N | tr '\n' ',' | sed 's/.$//' )
pl " Random columns: $cols"
cut -d" " -f"$cols" $FILE
exit 0
producing:
% ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian GNU/Linux 5.0.7 (lenny)
GNU bash 3.2.39
seq (GNU coreutils) 6.10
rl 0.2.7
tr (GNU coreutils) 6.10
GNU sed version 4.1.5
cut (GNU coreutils) 6.10
-----
Random columns: 7,9,2
1,2 1,7 1,9
2,2 2,7 2,9
3,2 3,7 3,9
4,2 4,7 4,9
and choosing a different N:
% ./s1 5
... omitted
-----
Random columns: 8,2,9,1,5
1,1 1,2 1,5 1,8 1,9
2,1 2,2 2,5 2,8 2,9
3,1 3,2 3,5 3,8 3,9
4,1 4,2 4,5 4,8 4,9
Best wishes ... cheers, drl