I have one column of 15000 lines and want to randomly select 5000 of them, five different times, without replacement. I am aware that the commands 'shuf' and 'sort -R' can select lines randomly, but I am not sure how to avoid selecting the same line more than once. Does anyone have a suggestion?
Thank you Corona688 for your suggestion. I only presented a small example. In my actual case, I want to randomly choose 5000 lines out of 15000 lines. What should I do in this situation?
Could you please try the following and let me know if it helps?
1st solution:
cat script.ksh
lines_in_file=$(wc -l < Input_file)
file=Input_file
for val in $(seq 1 5000); do line=$(($RANDOM*${lines_in_file}/32768+1)); sed "${line}q;d" $file >> "output"; done
OR
cat script.ksh
lines_in_file=$(wc -l < Input_file)
file=Input_file
for val in `seq 1 5000`
do
line=$(($RANDOM*${lines_in_file}/32768+1))
sed "${line}q;d" $file >> "output"
done
2nd solution:
cat script.ksh
lines_in_file=$(wc -l < Input_file)
file=Input_file
for val in $(seq 1 5000); do line=$(($RANDOM*${lines_in_file}/32768+1)); awk -v line="$line" 'FNR==line' $file >> "output"; done
OR
cat script.ksh
lines_in_file=$(wc -l < Input_file)
file=Input_file
for val in `seq 1 5000`
do
line=$(($RANDOM*${lines_in_file}/32768+1))
awk -v line="$line" 'FNR==line' $file >> "output"
done
Here $RANDOM generates a random integer between 0 and 32767, which is scaled down to a line number between 1 and the number of lines in the file (dividing by 32768 rather than 32767 keeps the result from overshooting the last line).
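As an aside, that scaling can be demonstrated on its own. A minimal sketch, assuming bash or ksh (where $RANDOM ranges over 0..32767):

```shell
# Map $RANDOM (0..32767) onto a line number in 1..N.
# Dividing by 32768 (not 32767) keeps the result inside 1..N even
# when $RANDOM happens to be exactly 32767.
N=15000
line=$(( RANDOM * N / 32768 + 1 ))
echo "$line"
```

Running this prints some line number between 1 and 15000.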
EDIT: The above 2 solutions can pick the same line more than once (sampling with replacement). If you need 5000 distinct lines, the following should help:
awk 'FNR==NR {A[$1]; next} {if (FNR in A) print}' <(shuf -i 1-15000 -n5000) Input_file > output
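To see why this avoids duplicates: shuf -i 1-15000 -n5000 emits 5000 distinct line numbers, awk loads them into array A while reading that first input (FNR==NR), and then prints only the lines of Input_file whose line numbers are in A. A small-scale sketch of the same idea (file names are illustrative; process substitution requires bash or ksh):

```shell
# Pick 3 distinct lines from a 10-line sample file.
seq 10 > sample.txt
awk 'FNR==NR {A[$1]; next} FNR in A' <(shuf -i 1-10 -n 3) sample.txt
```

This prints 3 different lines of sample.txt, in file order.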
Thanks to all the folks for their suggestions, but my requirement is still not met. As I said, I have a file with 15000 lines and I want to select 5000 lines five times. However, in each of these five times, I want a different set of 5000 selected lines. In other words, I am looking for five different sets of 5000 randomly selected lines from the whole set of 15000.
Actually, your specification has never been clear. First, you wanted two 3-line output files from a 10-line input file with no duplicates in either of the output files. Then you wanted a single 5000-line file from a 15000-line file. Then you wanted three 5000-line output files from a 15000-line input file. And now you want five 5000-line output files from a 15000-line input file. How do you randomly select 25000 lines from a 15000-line file without replacement?
If you mean that you want five 5000-line files, each of which takes its lines from a 15000-line file with no replacements within any one of the five output files, why doesn't:
shuf < 15000LineFile | head -n 5000 > 5000LineFile
give you what you want (or, to get five output files):
for i in 1 2 3 4 5
do shuf < 15000LineFile | head -n 5000 > 5000LineFile$i
done
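Each file this loop produces is a sample without replacement on its own; the five files may still overlap with each other. As a quick check (a sketch assuming GNU coreutils) that no line is repeated within any one output file:

```shell
for i in 1 2 3 4 5
do
  # uniq -d prints lines that occur more than once in the sorted file;
  # an empty result means no line was selected twice in that file.
  sort "5000LineFile$i" | uniq -d
done
```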
And, of course, Corona688's suggestion would have given you three 5000-line files with no duplicates from your 15000-line file, and a second run would give you three more 5000-line files to choose from...
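A shuffle-then-split approach along those lines looks like this (a sketch assuming GNU shuf and split; the names set_aa, set_ab, set_ac come from split's defaults). Because the file is shuffled once and then cut into consecutive 5000-line chunks, the chunks are guaranteed to be mutually disjoint:

```shell
# Shuffle the whole file once, then split it into 5000-line pieces.
# From a 15000-line input this yields set_aa, set_ab, set_ac:
# three disjoint files of 5000 lines each.
shuf Input_file | split -l 5000 - set_
```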
But, of course, all of these assume that there are no duplicated lines in 15000LineFile (or, if there are duplicates, that you don't mind them being duplicated in one of your output files as long as there are no more than N duplicates in an output file when there are N duplicates in your input file). Is there a chance of duplicated lines in your input file? If so, do those duplicates have to be removed before creating the output files?
If we had a clearer specification of how lines in one output file are related to lines in the other output files, and of whether there could be duplicated lines in the input file (and, if so, how they are to be handled), all of the output files could be created by a single invocation of awk.
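As a sketch of what a single awk invocation could look like under the simplest reading (mutually disjoint output files, duplicates in the input left alone), shuffle once and let awk route each line to one of the output files; the names output1..output3 are illustrative:

```shell
# One pass over the shuffled input: lines 1-5000 go to output1,
# 5001-10000 to output2, 10001-15000 to output3.
shuf Input_file | awk '
    NR > 15000 { exit }
    { print > ("output" (int((NR-1)/5000) + 1)) }'
```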
Knowing what operating system and shell you're using would also help in making several possible script suggestions.