If I understand correctly, your problem is to randomly generate N integers in the range [1;M], where M is the number of lines in the file and N is a rounded percentage of M.
Before going on: are repetitions allowed? In other words, may a number (a line in the file) be randomly selected more than once?
And: what are your shell and OS?
EDIT: solution proposed
In the meantime, here is a possible solution that works on Linux/bash, with repetitions not allowed:
#!/bin/bash
range=$(wc -l "$1" | cut -d " " -f1)
(( $range > 32768 )) && exit ### max number of lines for this script: 32768
percent=$2 ### set this as you like, [1-100]
(( 0 < $2 )) && (( $2 < 101 )) || exit
lim=$(( $range * $percent / 100 ))
for ((i=0;i<lim;i++)); do
    num=$(( $RANDOM % $range + 1 ));
    arr[$i]=$num;
    for ((j=0;j<i;j++)); do
        (( ${arr[$j]} == $num )) && {
            let i--
            break; }
    done
done
for linenum in "${arr[@]}"; do
    line=$(sed -n "$linenum p" "$1")
    ### your stuff here, for example: ###
    echo "$linenum"$'\t'"### $line"
done
exit 0
Usage:
lem@biggy:/tmp$ ./file2 file2 30
6 ### lim=$(( $range * $percent / 100 ))
19 ### ### your stuff here, for example: ###
11 ### (( ${arr[$j]} == $num )) && {
10 ### for ((j=0;j<i;j++)); do
5 ### (( 0 < $2 )) && (( $2 < 101 )) || exit
4 ### percent=$2 ### set this as you like, [1-100]
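For comparison, where GNU coreutils is available, `shuf` can do the non-repeating selection by itself. What follows is only a sketch under that assumption: `sample_lines` is a made-up helper name, and the output format simply imitates the script above.

```bash
#!/bin/bash
# sample_lines FILE PERCENT (hypothetical helper, not part of the script above):
# print PERCENT% of FILE's lines, chosen at random without repetition,
# each prefixed with its line number. Assumes GNU coreutils `shuf`.
sample_lines() {
    local file=$1 percent=$2 range lim
    range=$(wc -l < "$file")              # number of lines in the file
    lim=$(( range * percent / 100 ))      # integer division, fraction dropped
    # awk numbers every line; shuf -n keeps $lim of them in random order.
    awk '{ print NR "\t### " $0 }' "$file" | shuf -n "$lim"
}
```

For example, `sample_lines file2 30` would print 30% of the lines of file2. Since `shuf` does the picking, the 32768-line limit that comes from $RANDOM does not apply to this variant.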
## Let's say that:
## the number of lines in your file is 200 (range=200);
## you want 30% of the lines in your file: percent=30.
## So we'll have that lim= 200 * 30 / 100 = 60.
for ((i=0;i<lim;i++)); do
## We start the first step of our loop with i=0. We check that i<lim, that is 0<60. It is, so this time
## we run the loop.
## At the end of the loop i value will be increased by one: this is the meaning of i++.
## Let's say that now i=15.
num=$(( $RANDOM % $range + 1 ));
## $RANDOM looks like a parameter, but you'd better think of it as a special function.
## Every time you call it, you get a pseudo-random number between 0 and 32767.
## Let's say this time we get 8512. We calculate the remainder of the division
## 8512/200. So: 8512 = 200*42 + 112, and 112 is the remainder. We add 1, and we get
## num=113. Note: num will always be between 1 and 200.
arr[$i]=$num;
## We save this value as arr[15], the 16th element of our array. At the end our array will have
## lim elements, so 60 elements in this example, indexed from arr[0] to arr[59].
for ((j=0;j<i;j++)); do
## This is our control loop.
## Now we want to check if the 16th element we've just found is a repetition. We
## already know that our first 15 elements are all different, because we've already run
## this same test for each of the past 15 elements.
(( ${arr[$j]} == $num )) && {
## So we compare arr[15] with arr[0], then with arr[1], then..., then with arr[14].
## If at any point we find it is indeed a repetition (that is: if we find that our
## 16th value is equal to one of the previous 15 values), we
let i--
## decrease the i value by 1, so now i=14 and
break; }
## we immediately exit from our control loop: no need to waste time. So now we're
## back to our main loop, where i is increased by one: it gets back to 15 again,
## and again we try to find a new 16th element.
done
## If instead we complete all 15 steps of our control loop without a break, we know that
## arr[15] is not a repetition, and we go back to our main loop. As we said, the value of i
## is incremented by 1, so now it is 16, and we move on to generating the 17th element.
done
## After we've found 60 different elements, we're done.
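The control loop compares each new number with all the previous ones. As a variation, and only as a sketch, an associative array can remember which numbers have already been taken, so no inner loop is needed. This assumes bash 4 or later; `pick_unique` and `seen` are names made up for the illustration.

```bash
#!/bin/bash
# pick_unique RANGE LIM (hypothetical helper): print LIM distinct integers
# in [1, RANGE], one per line. Needs bash 4+ for associative arrays.
pick_unique() {
    local range=$1 lim=$2 num
    local -A seen=()
    (( lim <= range )) || return 1     # cannot pick more distinct values than exist
    while (( ${#seen[@]} < lim )); do
        num=$(( RANDOM % range + 1 ))  # same formula as in the script
        # Record and print the number only the first time it shows up.
        if [[ -z ${seen[$num]} ]]; then
            seen[$num]=1
            echo "$num"
        fi
    done
}
```

With the values of the example, `pick_unique 200 60` would print 60 different numbers between 1 and 200. A repeated number is simply drawn again, which is the same effect as the `let i--` trick.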
Feel free to ask again if my explanation wasn't clear.
--
Bye
My question, please: how can I exclude 50 and 51? I mean, I'd like 25% of this list, leaving out the two lines 50 and 51, and I'd like to specify them on the command line like this:
./program Myfile 10 50 51
Do you have any idea, please?
Thank you so much for your help.
#!/bin/bash
inputfile="/tmp/buddyfile"
cp "$1" $inputfile
(( $# < 3 )) || {
    string="-${3}$"
    for ((a=4;a<=$#;a++)); do
        string+="\|-${!a}$"
    done
    sed -i "/$string/d" $inputfile; }
range=$(wc -l $inputfile | cut -d " " -f1)
(( $range > 32768 )) && exit ### max number of lines for this script: 32768
percent=$2 ### set this as you like, [1-100]
(( 0 < $2 )) && (( $2 < 101 )) || exit
lim=$(( $range * $percent / 100 ))
for ((i=0;i<lim;i++)); do
    num=$(( $RANDOM % $range + 1 ));
    arr[$i]=$num;
    for ((j=0;j<i;j++)); do
        (( ${arr[$j]} == $num )) && {
            let i--
            break; }
    done
done
for linenum in "${arr[@]}"; do
    line=$(sed -n "$linenum p" $inputfile)
    ### your stuff here, for example: ###
    echo "$linenum"$'\t'"### $line"
done
exit 0
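The changes with respect to the first script are: the copy of the file to $inputfile, the block that builds a sed pattern from the third argument onwards, and $inputfile replacing "$1" in the wc and sed calls. I didn't test it, but it should work.
As written, the script removes the lines that end in -50, -51 and so on. Of course, if 50, 51 and friends are true line numbers counting from 1 (first line) onwards, we can simplify the new part of the script.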
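If the extra arguments were plain line numbers, the simplification could look roughly like the sketch below. It is untested against your real file, and `drop_lines` is a hypothetical helper name: it builds a sed script such as `50d;51d;` and prints the file without those lines.

```bash
#!/bin/bash
# drop_lines FILE N... (hypothetical helper): print FILE without the lines
# whose numbers (counting from 1) are given as the remaining arguments.
drop_lines() {
    local file=$1 expr="" n
    shift
    for n in "$@"; do
        [[ $n =~ ^[0-9]+$ ]] || return 1   # accept plain line numbers only
        expr+="${n}d;"                     # e.g. 50d;51d;
    done
    sed "$expr" "$file"                    # an empty script prints the file as is
}
```

In the script, something like `drop_lines "$1" "${@:3}" > "$inputfile"` could then take the place of the cp and the pattern-building block. Keep in mind that the line numbers printed afterwards refer to the filtered copy, not to the original file.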
Beware: if you have a file of 100 lines, and you exclude 40 lines, and you want 30% of the lines, you'll get 30% of the remaining 60 lines, so you'll get 18 lines. Is this right for you, or do you still want 30 lines among the 60 remaining lines?
--
Bye
No, 25% is calculated on the remaining lines (after you excluded two lines).
So 25% of 2 is 0.5, which bash's integer division truncates to zero.
Try: ./program Myfile 50 50 52
You'll get 1 line randomly chosen among 2 good lines.
4 total lines - 2 excluded lines = 2 good lines, and 50% of 2 good lines is 1 - randomly chosen - line.
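The arithmetic can be checked directly in the shell. The two helpers below are hypothetical names used only for this illustration: the first mirrors the `lim=` formula of the script, the second is one possible way to round to the nearest integer instead, in case that is what you prefer.

```bash
#!/bin/bash
# Hypothetical helpers: two ways of turning "PERCENT% of COUNT lines"
# into a whole number with bash integer arithmetic.
pct_trunc() { echo $(( $1 * $2 / 100 )); }          # what the script does
pct_round() { echo $(( ($1 * $2 + 50) / 100 )); }   # rounds half up instead

pct_trunc 2 25   # prints 0
pct_round 2 25   # prints 1
pct_trunc 2 50   # prints 1
```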
--
Bye
In this file "-94" and the like are not at the end of the lines, as they were before. So no lines were excluded. You got 50% of 5 lines, so 2 lines (2.5 is truncated to 2).
If you want a reliable exclusion pattern, please show exactly how the lines can be built.
Would a pattern that matches -XY. be good, where X and Y are digits? That means: do you have -XY. once and only once in each line?
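If the answer is yes, the exclusion could be sketched as follows. The "-XY." layout is only an assumption until you confirm it, and `drop_tagged` is a hypothetical helper name; it joins the numbers into an alternation such as `-(50|51)\.` and lets grep discard the matching lines.

```bash
#!/bin/bash
# drop_tagged FILE NUM... (hypothetical helper): print FILE without the lines
# that contain -NUM. for any of the given numbers. The "-XY." line format is
# an assumption taken from the question above, not a confirmed layout.
drop_tagged() {
    local file=$1 alt
    shift
    (( $# > 0 )) || { cat "$file"; return; }   # nothing to exclude
    alt=$(IFS='|'; echo "$*")                  # join the numbers: 50|51
    grep -v -E -e "-(${alt})\." "$file"
}
```

For example, `drop_tagged Myfile 50 51` would keep a line containing -94. and drop the lines containing -50. or -51. The arguments are assumed to be plain digits.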
--
Bye