Get 20% of lines in File randomly

chercheur857 · September 28, 2012, 5:55am

Hello,

This is my code:

nb_lignes=$(wc -l < "$1")
for i in $(seq "$nb_lignes")
do
m=$(head -n "$i" "$1" | tail -n 1)
# command
done

How can I change it to pick 20% of the lines in the file at random and apply "command" to each selected line? The percentage (20%, 40%, 60%, ...) should be a parameter.

Thank you.

balajesuri · September 28, 2012, 7:06am

[root@host dir]# head -$(( $(wc -l file | cut -d" " -f1) * 40 / 100 )) file
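
Note that head returns the first 40% of the file, not a random selection. A minimal sketch of a random variant, assuming GNU coreutils shuf is installed (the function name random_percent is made up for this example):

```shell
#!/bin/bash
# random_percent FILE PERCENT: print PERCENT% of FILE's lines, chosen at
# random without repetition. Assumes GNU coreutils shuf is available.
random_percent() {
    local file=$1 percent=$2 total count
    total=$(wc -l < "$file")
    count=$(( total * percent / 100 ))   # integer division drops the fraction
    shuf -n "$count" "$file"
}
```

The output can then be fed to the per-line command, for example: random_percent file 40 | while IFS= read -r line; do echo "command $line"; done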

Lem · September 28, 2012, 7:39am

If I'm right your problem is to randomly generate N integers in the range [1;M], where M is the number of lines in the file, and N is a rounded percentage of M.

Before going on: are repetitions allowed? In other words, may a number (a line in the file) be randomly selected more than once?

And: what are your shell and OS?

EDIT: solution proposed

However, this is a possible solution working on linux/bash, with repetitions not allowed:

#!/bin/bash
range=$(wc -l "$1" | cut -d " " -f1)
(( $range > 32768 )) && exit              ### max number of lines for this script: 32768
percent=$2                                ### set this  as you like, [1-100]
(( 0 < $2 )) && (( $2 < 101 )) || exit
lim=$(( $range * $percent / 100 ))
for ((i=0;i<lim;i++)); do
        num=$(( $RANDOM % $range + 1 ));
        arr[$i]=$num;
        for ((j=0;j<i;j++)); do
                (( ${arr[$j]} == $num )) && {
                        let i--
                        break; }
        done
done

for linenum in "${arr[@]}"; do
        line=$(sed -n "$linenum p" "$1")
        ### your stuff here, for example: ###
        echo "$linenum"$'\t'"### $line"
done

exit 0

Usage:

lem@biggy:/tmp$ ./file2 file2 30
6	### lim=$(( $range * $percent / 100 ))
19	###         ### your stuff here, for example: ###
11	###                 (( ${arr[$j]} == $num )) && {
10	###         for ((j=0;j<i;j++)); do
5	### (( 0 < $2 )) && (( $2 < 101 )) || exit
4	### percent=$2                                ### set this  as you like, [1-100]

--
Bye

tukuyomi · September 28, 2012, 10:46am

Here is a solution using awk:

~/unix.com$ awk 'BEGIN{srand()} int(100*rand())%5<1' file

5 and 1 are the parameters you want to modify here : 1/5 = 20% in this example

To be more specific in your requirements:

~/unix.com$ awk 'BEGIN{srand()} int(100*rand())<value' value=20 file | while IFS= read -r line; do echo "command $line"; done

Set value=... to get ...% of lines
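
These awk filters keep each line with a fixed probability, so the number of lines printed only approximates the requested percentage and varies from run to run. If an exact count is needed, one option is selection sampling: keep each line with probability (lines still needed) / (lines still to read). This is a sketch of that different technique, not the one-liner above, and the function name awk_sample is hypothetical:

```shell
#!/bin/bash
# awk_sample FILE PERCENT: print exactly floor(total * PERCENT / 100) lines
# of FILE, in their original order, chosen at random.
awk_sample() {
    local file=$1 percent=$2 total need
    total=$(wc -l < "$file")
    need=$(( total * percent / 100 ))
    awk -v need="$need" -v total="$total" '
        BEGIN { srand() }    # seed from the clock; same second, same picks
        # total - NR + 1 is the number of lines not yet decided, this one included
        need > 0 && rand() * (total - NR + 1) < need { print; need-- }
    ' "$file"
}
```

Usage: awk_sample file 20 | while IFS= read -r line; do echo "command $line"; done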

chercheur857 · October 1, 2012, 12:44pm

@Lem: thank you so much for your help. I tested your solution and it works.

chercheur857 · October 8, 2012, 5:30pm

@Lem: could you please explain this loop with an example?

for ((i=0;i<lim;i++)); do
        num=$(( $RANDOM % $range + 1 ));
        arr[$i]=$num;
        for ((j=0;j<i;j++)); do
                (( ${arr[$j]} == $num )) && {
                        let i--
                        break; }
        done
done

I'm sorry to bother you. Thank you so much.

Lem · October 8, 2012, 6:47pm

## Let's say that:
## the number of lines in your file is 200 (range=200);
## you want 30% of the lines in your file: percent=30.
## So we'll have that lim= 200 * 30 / 100 = 60.

for ((i=0;i<lim;i++)); do
## We start the first pass of our loop with i=0. We check that i<lim, that is 0<60. It is,
## so this time we run the loop.
## At the end of each pass the value of i is increased by one: this is the meaning of i++.
## Let's say that now i=15.

num=$(( $RANDOM % $range + 1 ));
## $RANDOM looks like a parameter, but you'd better think of it as a special function.
## Every time you call it, you get a pseudo random number between 0 and 32767.
## Let's say this time we get 8512. We calculate the remainder of the division
## 8512/200. So: 8512=200*42+112. 112 is the remainder. We add 1, and we get
## num=113. Note: num will always be between 1 and 200.

arr[$i]=$num;
## We save this value as arr[15], the 16th element of our array.  At the end our array will have
## lim elements, so 60 elements in this example, indexed from arr[0] to arr[59].

for ((j=0;j<i;j++)); do
## This is our control loop.
## Now we want to check if the 16th element we've just found is a repetition. We 
## already know that our first 15 elements are all different, because we've already run
## this same test for each of the past 15 elements.

(( ${arr[$j]} == $num )) && {
## So we compare arr[15] with arr[0], then with arr[1], then..., then with arr[14].
## If at any point we find it is indeed a repetition (that is: if we find that our
## 16th value is equal to one of the previous 15 values), we

let i--
## decrease the i value by 1, so now i=14 and

break; }
## we immediately exit from our control loop: no need to waste time. So now we're
## back to our main loop, where i is increased by one: it gets back to 15 again,
## and again we try to find a new 16th element.

done
## If instead we complete all 15 steps of our control loop, without a break, we know that
## arr[15] is not a repetition, and we go back to our main loop. As we said i value
## is incremented by 1, and  so now it is 16. And we go forth for the 17th element generation.

done
## After we've found 60 different elements, we're done
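
To watch the loop at work, it can be wrapped in a function and run with small numbers. This is only a sketch: the function name pick_unique is made up for the illustration, and the body is the same loop as above.

```shell
#!/bin/bash
# pick_unique RANGE LIM: print LIM distinct random integers in [1, RANGE],
# one per line. RANGE must not exceed 32768 and LIM must not exceed RANGE.
pick_unique() {
    local range=$1 lim=$2 i j num
    local -a arr=()
    (( lim > 0 )) || return 0           # nothing to pick
    for ((i = 0; i < lim; i++)); do
        num=$(( RANDOM % range + 1 ))
        arr[i]=$num
        for ((j = 0; j < i; j++)); do
            if (( arr[j] == num )); then
                (( i-- ))               # repetition: redo this slot next pass
                break
            fi
        done
    done
    printf '%s\n' "${arr[@]}"
}
```

For example, pick_unique 200 60 prints 60 different numbers between 1 and 200, and pick_unique 5 5 prints the numbers 1 to 5 in a random order.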

Feel free to ask again if I couldn't explain myself.
--
Bye

chercheur857 · October 20, 2012, 4:43am

@Lem Thank you so much for your help.
I'd like to apply your program to this file, Myfile:

xxxxxxxx-50
xxxxxxxx-51
xxxxxxxx-52
xxxxxxxx-53
./program Myfile 10     #25 percent

My question: how can I exclude the lines ending in 50 and 51? I'd like 25% of this list, not counting those two lines, and I'd like to specify them on the command line like this:

./program Myfile 10 50 51

Do you have any idea?
Thank you so much for your help.

Lem · October 22, 2012, 5:34am

chercheur857:

My question: how can I exclude the lines ending in 50 and 51? I'd like 25% of this list, not counting those two lines, and I'd like to specify them on the command line like this:
./program Myfile 10 50 51
Do you have any idea?
Thank you so much for your help.

#!/bin/bash

inputfile="/tmp/buddyfile"
cp "$1" $inputfile
(( $# < 3 )) || {
        string="-${3}$"
        for ((a=4;a<=$#;a++)); do
                string+="\|-${!a}$"
        done
        sed -i "/$string/d" $inputfile; }

range=$(wc -l $inputfile | cut -d " " -f1)
(( $range > 32768 )) && exit              ### max number of lines for this script: 32768
percent=$2                                ### set this  as you like, [1-100]
(( 0 < $2 )) && (( $2 < 101 )) || exit
lim=$(( $range * $percent / 100 ))
for ((i=0;i<lim;i++)); do
        num=$(( $RANDOM % $range + 1 ));
        arr[$i]=$num;
        for ((j=0;j<i;j++)); do
                (( ${arr[$j]} == $num )) && {
                        let i--
                        break; }
        done
done

for linenum in "${arr[@]}"; do
        line=$(sed -n "$linenum p" $inputfile)
        ### your stuff here, for example: ###
        echo "$linenum"$'\t'"### $line"
done

exit 0

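The changes are the new block at the top (the copy to $inputfile and the sed exclusion) and the use of $inputfile in place of "$1" further down. I didn't test it, but it should work.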

Of course if 50, 51 and friends are true line numbers counting from 1 (first line) onwards, we can simplify the new part of the script.
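
As a rough idea of that simpler case, assuming the extra arguments really are line numbers, the exclusion could be a single sed call with one "d" command per number (the function name drop_lines is hypothetical):

```shell
#!/bin/bash
# drop_lines FILE N...: print FILE without the lines whose line numbers
# (counting from 1) are given as extra arguments.
drop_lines() {
    local file=$1; shift
    local n script=""
    for n in "$@"; do
        script+="${n}d;"        # e.g. "50d;51d;"
    done
    sed "$script" "$file"       # an empty script prints the file unchanged
}
```

For example, drop_lines Myfile 50 51 > /tmp/buddyfile would replace the cp and the sed -i block.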

Beware: if you have a file of 100 lines, and you exclude 40 lines, and you want 30% of the lines, you'll get 30% of the remaining 60 lines, so you'll get 18 lines. Is this right for you, or do you still want 30 lines among the 60 remaining lines?
--
Bye

chercheur857 · October 22, 2012, 7:01am

I tested your code on my example, but it displays nothing.
I ran

./program Myfile 25 50 52

buddyfile contains

xxxxxxxx-51
xxxxxxxx-53

2 lines (but 25% of the lines = 1)
Do you have any idea?
Thank you so much for your help.

Lem · October 22, 2012, 7:54am

No, 25% is calculated on the remaining lines (after you excluded two lines).
So 25% of 2 is 0.5, which the integer division in the script truncates to zero.

Try: ./program Myfile 50 50 52
You'll get 1 line randomly chosen among 2 good lines.
4 total lines - 2 excluded lines = 2 good lines, and 50% of 2 good lines is 1 - randomly chosen - line.
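
The arithmetic behind these numbers: bash's $(( )) works on integers only, so range * percent / 100 drops the fractional part. A small sketch comparing that with rounding to the nearest integer (both function names are hypothetical; the script in this thread uses the truncating form):

```shell
#!/bin/bash
# Number of lines to pick, as the script computes it (fraction dropped).
lim_truncated() { echo $(( $1 * $2 / 100 )); }

# Alternative: round to the nearest integer, halves rounded up.
lim_rounded() { echo $(( ($1 * $2 + 50) / 100 )); }

lim_truncated 2 25    # prints 0  (25% of 2 = 0.5)
lim_truncated 2 50    # prints 1
lim_rounded 2 25      # prints 1
```

Swapping the lim= line of the script for the rounded form would return one line for 25% of 2, if that is what you prefer.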
--
Bye

chercheur857 · October 22, 2012, 8:37am

Thank you so much for your help, Lem.
I tried your script on this file:

xxxxxxxx-94.yyyyy.zzzzzzzz.aa
xxxxxxxx-95.yyyyy.zzzzzzzz.aa
xxxxxxxx-96.yyyyy.zzzzzzzz.aa
xxxxxxxx-97.yyyyy.zzzzzzzz.aa
xxxxxxxx-98.yyyyy.zzzzzzzz.aa

I ran

./prog.sh  file.txt 50 94 97

It displays:

4	### xxxxxxxx-97.yyyyy.zzzzzzzz.aa
3	### xxxxxxxx-96.yyyyy.zzzzzzzz.aa

Do you have any idea? I'm sorry to bother you.
Thank you so much.
Best regards

Lem · October 22, 2012, 9:00am

In this file "-94" and the like are not at the end of the lines, as they were before. So no lines were excluded. You got 50% of 5 lines, so 2 lines (2.5 is truncated to 2).

If you want a good exclusion pattern, please show exactly how the lines are built.
Would a pattern that matches -XY. be good, where X and Y are digits? That is: do you have -XY. once and only once in each line?
--
Bye

chercheur857 · October 22, 2012, 9:06am

where X and Y are digits? That means: do you have -XY. once and only once in each line?

Yes, X and Y are digits, and XY is not repeated on any other line (it appears once and only once in each line).

Lem · October 22, 2012, 11:55am

Ok. So we have a sequence of chars (a -, then some digits you may want to use for exclusion, then a .) once and only once in each line. So here it is:

#!/bin/bash

inputfile="/tmp/buddyfile"
cp "$1" $inputfile
(( $# < 3 )) || {
        string="-${3}\."
        for ((a=4;a<=$#;a++)); do
                string+="\|-${!a}\."
        done
        sed -i "/$string/d" $inputfile; }

range=$(wc -l $inputfile | cut -d " " -f1)
(( $range > 32768 )) && exit              ### max number of lines for this script: 32768
percent=$2                                ### set this  as you like, [1-100]
(( 0 < $2 )) && (( $2 < 101 )) || exit
lim=$(( $range * $percent / 100 ))
for ((i=0;i<lim;i++)); do
        num=$(( $RANDOM % $range + 1 ));
        arr[$i]=$num;
        for ((j=0;j<i;j++)); do
                (( ${arr[$j]} == $num )) && {
                        let i--
                        break; }
        done
done

for linenum in "${arr[@]}"; do
        line=$(sed -n "$linenum p" $inputfile)
        ### your stuff here, for example: ###
        echo "$linenum"$'\t'"### $line"
done

exit 0

The change is in the exclusion pattern: each number is now followed by \. (a literal dot) instead of $ (end of line).
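
If you'd rather not go through a copy at a fixed path in /tmp, the same exclusion can be sketched as a filter with grep -v. This is an untested alternative to the sed -i block, assuming the extra arguments are plain digits; the function name exclude_ids is made up:

```shell
#!/bin/bash
# exclude_ids FILE ID...: print FILE without the lines that contain "-ID."
# for any of the given IDs. IDs are assumed to be plain digits.
exclude_ids() {
    local file=$1; shift
    local id
    local -a patterns=()
    for id in "$@"; do
        patterns+=( -e "-${id}\." )     # a "-", the ID, then a literal dot
    done
    if (( ${#patterns[@]} == 0 )); then
        cat "$file"                     # nothing to exclude
    else
        grep -v "${patterns[@]}" "$file"
    fi
}
```

With the five-line file above, exclude_ids file.txt 94 97 prints the lines for 95, 96 and 98, which can then be sampled as before.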
--
Bye

chercheur857 · October 22, 2012, 3:38pm

Thank you so much for help