hi all, I need some help with processing just a sample from a large .txt file.
I have a large file with many lines (say above 200,000 lines), and I need a script that processes just a random sample of it, say 10,000 lines (taking rows from top to bottom).
Could someone help? I'd also be happy if you could let me give the sample size as an input - for example, if I need 20,000 instead of 10,000, to pass the 20,000 as input.
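One common approach is an awk one-liner that prints each line with probability n/total. A minimal sketch, assuming a file named big.txt (a hypothetical stand-in generated here with seq so the example is self-contained) and the sample size held in a shell variable:

```shell
# Demo input: 200,000 numbered lines standing in for the real big.txt
seq 200000 > big.txt

n=10000                      # desired sample size; in a real script, take it from "$1"
total=$(wc -l < big.txt)     # count the lines first so we can form the probability

# Print each line with probability n/total; the original top-to-bottom
# order of the kept lines is preserved.
awk -v n="$n" -v total="$total" 'BEGIN { srand() } rand() <= n/total' big.txt > sample.txt

wc -l < sample.txt           # roughly 10,000 lines, not exactly
```

To sample 20,000 lines instead, just set n=20000; everything else stays the same. Note that the output size is only approximately n, for the reason discussed below.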
Could you please let me know how you know that it will give ca. 10%, for example? If I know the number of rows beforehand, how should I adjust the test?
According to the standards, awk's rand() function returns a pseudo-random number x such that 0 <= x < 1. If we assume that the pseudo-random numbers are uniformly distributed (which is not required by the standards), then [ICODE](rand() <= .01)[/ICODE] should be true approximately 1% of the time. For 10%, the test would be [ICODE](rand() <= .1)[/ICODE] (again, assuming rand() produces a uniformly distributed set of values).
Note that when using "random" numbers, there is certainly no guarantee that using (rand() <= 10000/200000) as the test in Jotne's script will print exactly 10,000 lines from a 200,000 line file.
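If you need exactly n lines rather than approximately n, a different technique than the rand() <= p test is needed. One option is one-pass reservoir sampling, which the awk test above does not do; this is a sketch of that alternative, again using a generated big.txt as a stand-in for the real file:

```shell
# Demo input standing in for the real 200,000-line file
seq 200000 > big.txt

n=10000   # exact sample size wanted

# One-pass reservoir sampling: keep a "reservoir" of exactly n lines,
# with every line of the input equally likely to end up in it.
awk -v n="$n" '
    BEGIN { srand() }
    NR <= n { r[NR] = $0; next }      # fill the reservoir with the first n lines
    {
        i = int(rand() * NR) + 1      # pick a slot in 1..NR
        if (i <= n) r[i] = $0         # replace with probability n/NR
    }
    END { for (j = 1; j <= n; j++) print r[j] }
' big.txt > sample.txt

wc -l < sample.txt                    # exactly 10,000 lines
```

The trade-off is memory (the reservoir holds n lines) and the fact that the sampled lines come out in reservoir-slot order rather than strictly top to bottom; if order matters, the lines would need to be tagged with their line numbers and re-sorted afterwards.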