How to select lines randomly without replacement in UNIX?

Dear Folks

I have a file with one column of 15000 lines, and I want to randomly select 5000 of them, five different times, without replacement. I am aware that the commands 'shuf' and 'sort -R' can select lines randomly, but I am not sure how to avoid selecting the same line more than once. Does anyone have a suggestion?

We can't read your mind, so please provide us with a sample of the input and the desired output...

Say, I have this small input file:

1
2
3
4
5
6
7
8
9
10

My desired output is to select three numbers at a time, with different numbers in each draw; for example, two draws might give:

2
4
9
5
8
1

The Linux shuf utility can select random lines without repeats. Run it once and read three lines at a time from its output in a loop.

shuf < inputfile | while read LINE1 && read LINE2 && read LINE3
do
...
done
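
To make that concrete, here is a minimal sketch, assuming each group of three lines should go to its own numbered file (the set.$N names are just for illustration, not part of the original suggestion):

N=1
shuf < inputfile | while read LINE1 && read LINE2 && read LINE3
do
        # each iteration consumes three fresh lines of the shuffle,
        # so no line ever appears in more than one group
        printf '%s\n' "$LINE1" "$LINE2" "$LINE3" > "set.$N"
        N=$((N + 1))
done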

Thank you Corona688 for your suggestion. I only presented a small example; in my case, I want to randomly choose 5000 lines out of 15000. What should I do in this situation?

Try to come up with a smart hash function that'd let you choose a different set of lines in each iteration...

shuf < inputfile | split -l 5000 - sample.

This will create files sample.aa, sample.ab, and sample.ac, each containing 5000 randomly chosen, non-repeating lines.
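
As a quick sanity check (assuming a 15000-line input file), wc should report 5000 lines in each piece:

wc -l sample.*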

I've used shuf on megabytes of data to select random samples before, and it was reasonably efficient.


Hello sajmar,

Could you please try the following and let me know if it helps you.
1st solution:

cat script.ksh
lines_in_file=$(wc -l < Input_file)
file=Input_file
for val in $(seq 1 5000); do line=$((RANDOM * lines_in_file / 32768 + 1)); sed "${line}q;d" $file >> "output"; done

OR

cat script.ksh
lines_in_file=$(wc -l < Input_file)
file=Input_file
for val in $(seq 1 5000)
do
        line=$((RANDOM * lines_in_file / 32768 + 1))    # random line number between 1 and lines_in_file
        sed "${line}q;d" $file >> "output"              # print that line and quit
done

2nd solution:

cat script.ksh
lines_in_file=$(wc -l < Input_file)
file=Input_file
for val in $(seq 1 5000); do line=$((RANDOM * lines_in_file / 32768 + 1)); awk -v line="$line" 'FNR==line' $file >> "output"; done

OR

cat script.ksh
lines_in_file=$(wc -l < Input_file)
file=Input_file
for val in $(seq 1 5000)
do
        line=$((RANDOM * lines_in_file / 32768 + 1))    # random line number between 1 and lines_in_file
        awk -v line="$line" 'FNR==line' $file >> "output"       # print only line number "line"
done

Where $RANDOM generates random integers from 0 to 32767; dividing by 32768 maps that range onto line numbers 1 through lines_in_file.
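
For example, with 15000 lines in the file, a $RANDOM value of 16384 maps to line 7501:

echo $(( 16384 * 15000 / 32768 + 1 ))   # prints 7501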
EDIT: The above 2 solutions sample with replacement, so the same line can be selected more than once. If every selected line must be different, the following may help:

awk 'FNR==NR {A[$1]; next} FNR in A' <(shuf -i 1-15000 -n 5000) Input_file > output
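
A sketch of the same idea without hard-coding the line count (assuming GNU coreutils and a shell with process substitution, such as bash or ksh93):

awk 'FNR==NR {A[$1]; next} FNR in A' <(shuf -i "1-$(wc -l < Input_file)" -n 5000) Input_file > output

Here the process substitution supplies 5000 distinct line numbers, awk records them in array A while FNR==NR (i.e. while reading its first input), and then prints only the lines of Input_file whose record number FNR is in A.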

Thanks,
R. Singh


Thanks to all the folks for their suggestions. My requirement is still not met. As I said, I have a file with 15000 lines and I want to select 5000 lines five times. However, in each of these five times, I want a different set of 5000 selected lines. In other words, I am looking for five different sets of 5000 randomly selected lines from the whole set of 15000.

Actually, your specification has never been clear. First, you wanted two 3-line output files from a 10-line input file with no duplicates in either of the output files. Then you wanted a single 5000-line file from a 15000-line file. Then you wanted three 5000-line output files from a 15000-line input file. And now you want five 5000-line output files from a 15000-line input file. How do you randomly select 25000 lines from a 15000-line file without replacement?

If you mean that you want five 5000-line files, each drawn from a 15000-line file with no replacement within any one of the 5 output files, why doesn't:

shuf < 15000LineFile | head -n 5000 > 5000LineFile

give you what you want (or to get 5 output files):

for i in 1 2 3 4 5
do	shuf < 15000LineFile | head -n 5000 > 5000LineFile$i
done

And, of course, Corona688's suggestion would have given you three 5000-line files with no duplicates from your 15000-line file, and a second run would give you three more 5000-line files to choose from...

But, of course, all of these assume that there are no duplicated lines in 15000LineFile (or if there are duplicates, you don't mind them being duplicated in one of your output files as long as there aren't more than N duplicates in an output file if there are N duplicates in your input file). Is there a chance for duplicated lines in your input file? If so, do those duplicates have to be removed before creating output files?

If we had a clearer specification of how lines in one of the output files are related to lines in other output files, and whether or not there could be duplicated lines in the input file (and, if so, how they are to be handled), all of the output files could be created by a single invocation of awk.
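
For illustration, here is a minimal sketch of that single-awk approach, assuming a standard awk with srand()/rand(), that the whole file fits in memory, and (as discussed above) that lines may repeat across output files but never within one; the 5000LineFile1 ... 5000LineFile5 names follow the naming used above:

awk -v files=5 -v count=5000 '
{ line[NR] = $0 }                               # read the whole file into memory
END {
        srand()
        # assumes count <= NR, i.e. 5000 <= 15000
        for (f = 1; f <= files; f++) {
                # partial Fisher-Yates shuffle: draw count distinct line numbers
                for (i = 1; i <= NR; i++) idx[i] = i
                for (i = 1; i <= count; i++) {
                        j = i + int(rand() * (NR - i + 1))
                        t = idx[i]; idx[i] = idx[j]; idx[j] = t
                        print line[idx[i]] > ("5000LineFile" f)
                }
                close("5000LineFile" f)
        }
}' 15000LineFile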

Knowing what operating system and shell you're using would also help for several possible script suggestions.

5 * 5000 = 25000 > 15000, so duplicates across the five samples are unavoidable (though there will be none within any single sample).

If you don't care about duplicates, it's easy to create as many 5000-line shuffles as you want.

for N in 1 2 3 4 5
do
        shuf < inputfile | head -n 5000 > output.$N
done
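
If you want to see how much overlap the five samples actually ended up with, something like this counts the distinct lines that landed in more than one file (assuming the input itself contains no duplicate lines):

sort output.1 output.2 output.3 output.4 output.5 | uniq -d | wc -l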