Split file into n parts.

Hi all:

I have a 5-column tab-separated file.
All I want to do with it is split it.
However, I want to split it in an 80/20 proportion -- randomized, if possible.
I know that something like:

awk '{print $0 ""> "file" NR}' RS='' input-file

will work, but it only splits into equally sized files -- and does not randomize.

Does anyone know if that is possible using awk/grep?

Look into awk's srand() and rand() functions.
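
For anyone who has not used them, here is a minimal sketch of how the two behave. srand() seeds the generator (from the clock when called without an argument) and rand() returns a number from 0 up to, but not including, 1. The helper name tag_lines and the seed 42 are made up for this example.

```shell
# tag_lines is a hypothetical helper name used only for this example.
# It prefixes every input line with a random number in [0, 1).
tag_lines() {
  # srand(seed) makes runs repeatable; plain srand() seeds from the clock.
  awk -v seed="${1:-42}" 'BEGIN { srand(seed) } { printf "%.3f\t%s\n", rand(), $0 }'
}

printf 'a\nb\nc\n' | tag_lines 42
```

Comparing that number against a threshold such as 0.8 is what lets awk send each line to one of two files.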

Thanks:
Something like this would pick a random sample of the lines (n is the number of lines wanted, 100 here):

$ awk -v n=100 -v N="$(wc -l < FILE)" 'BEGIN {srand()} rand() < n/N' FILE

But it only prints roughly that fixed number of lines.
Is it possible to split the file into two instead -- one with 80% of the lines and the other with the remaining 20%?

A bit verbose, but it works.
The output goes to the files p1 and p2.

awk -f ow.awk myFile
or
awk -v p1=60 -v p2=40 -f ow.awk myFile

ow.awk:

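# Return a random integer in the range 1..n.
function genrand(n)
{
  return int(n * rand()) + 1
}

BEGIN {
  srand()
  # Default to an 80/20 split. Only p1 sets the cut point;
  # p2 is accepted so the command line above keeps working.
  if (!p1) p1 = 80
  if (!p2) p2 = 100 - p1
}
# Store every line, indexed by its line number.
{ a[FNR] = $0; fnr = FNR }
END {
  for (i = 1; i <= fnr; i++) {
    # Pick a random line number; retry if that line was already written.
    g = genrand(fnr)
    if (!(g in a))
      i--
    else {
      # The first p1 percent of the draws go to file p1, the rest to p2.
      # Naming the files directly avoids a clash when p1 equals p2.
      out = ((i / fnr) * 100 <= p1) ? "p1" : "p2"
      # '>' truncates each file when it is first opened, so a rerun
      # does not append to old output.
      print a[g] > out
      delete a[g]
    }
  }
}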

Try also

sort -R file | split -l $(( $(wc -l < file) * 8 / 10 ))
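
One caveat with sort -R: GNU sort orders lines by a random hash of the key, so identical lines end up next to each other, whereas shuf does a real shuffle. Below is a sketch of the same idea with shuf, assuming GNU coreutils is installed. The helper name split_80_20 and the .part80/.part20 suffixes are made up for this example.

```shell
# split_80_20 is a hypothetical helper; the output suffixes are assumptions.
split_80_20() {
  src=$1
  total=$(wc -l < "$src")
  n80=$(( total * 8 / 10 ))            # number of lines for the 80% part
  tmp=$(mktemp) || return 1
  shuf "$src" > "$tmp"                 # shuffle once so both parts share one order
  head -n "$n80" "$tmp" > "$src.part80"
  tail -n +"$(( n80 + 1 ))" "$tmp" > "$src.part20"
  rm -f "$tmp"
}
```

Calling split_80_20 file would then leave file.part80 and file.part20 next to the input, with explicit names rather than split's default xaa and xab.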

If you just want to split the file, you can also try this:

split -l $(( $(wc -l < file) * 70 / 100 )) file output_prefix

This cuts after 70% of the lines; use 80 for an 80/20 split.
See man split for the details.
Note that this split is not random.

If approximately 80/20 is sufficient:

awk 'BEGIN {srand()} {print > (rand() < .8 ? "f1" : "f2")}' file

Regards,
Alister
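
If the split has to be exactly 80/20 and a single awk pass is still wanted, one option is selection sampling: each line is picked with probability (lines still needed) / (lines still left). A minimal sketch, assuming the line count is taken with wc -l first. The helper name exact_split is made up, and f1/f2 just reuse the file names from the one-liner above.

```shell
# exact_split is a hypothetical helper; f1 and f2 are written to the
# current directory, matching the names used in the one-liner above.
exact_split() {
  src=$1
  total=$(wc -l < "$src")
  awk -v total="$total" -v pct="${2:-80}" '
    BEGIN { srand(); need = int(total * pct / 100) }
    {
      left = total - NR + 1            # lines not yet decided, including this one
      # Pick this line with probability need/left.
      if (rand() * left < need) { print > "f1"; need-- }
      else                      { print > "f2" }
    }' "$src"
}
```

Run as exact_split file 80. Unlike the shuffle-based answers, this keeps the lines in their original order inside each output file; only the choice of which file gets each line is random.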