Randomly replace occurrences in bash

allabiba · February 20, 2020, 3:49am

I want to replace only a percentage of the occurrences of a pattern in my text (not all of them; for example, half), and the occurrences to replace should be chosen randomly (or, for example, every third occurrence).
For example, I have a text like the one below.
The text contains 20 occurrences of a.

a a a a a a a a a a
a a a a a a a a a a

I want to replace half of the a's (not all of them), chosen randomly (not just the first half of the occurrences).

Replacing a with b, the result would be something like this:

a b b b a b a b a a
b a b a b a a a b b

best regards

vbe · February 20, 2020, 4:16am

Welcome on board!
As you will have seen in our rules, we are here to help you, not to do the work for you. What have you tried so far?
We also need to know a few basics, such as your OS and version and the shell you are using.

RudiC · February 20, 2020, 6:13am

How "randomly" should those replacements occur? Exactly half? Half per line, or half of the total over all lines?

allabiba · February 20, 2020, 10:24am

1- I tried to put the text into one array and replace items, something like this:

text=(
    a a a a a a a a a a
    a a a a a a a a a a
)

# Get the number of items
number=${#text[@]}

# Get half of it
half=$((number/2))

# In a loop from 1 to half, pick a random index and replace that item in the array
# (the same index can be picked twice, so fewer than half may end up replaced)
for i in $(seq $half); do
    rnd=$((RANDOM % number))
    text[$rnd]=b
done

# Echo the result
echo "${text[@]}"

But this code doesn't keep the newlines and echoes all the text on one line. Also, my text is very big, about 15 GB, so I can't put it in one array.

2- I want to choose randomly among all occurrences of one word or token in the whole text.
I can simply get the number of occurrences with something like this: ( grep "a" | wc -l )

3- I can use Python to do this, but my text is huge and I want to use bash since it is faster than Python. In Python I use a set containing (a, b), and in the replace function I use a random function to choose a or b from that set.

4- I use Ubuntu 18.04, sed (GNU sed) 4.4, GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)

5- I have simplified the problem. The main problem is this: I want to normalize a text corpus for training a trigram language model, where the sequence of words is important. I normalize numbers to words, so for example I convert every 30 to thirty, but we often say half instead of thirty when telling the time (e.g. 8:30). I want to randomly replace only some of these occurrences, not all of them.

best regards
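A note on point 2: grep "a" | wc -l counts the lines that contain a match, not the individual occurrences, so a line with ten a's is counted once. A minimal sketch of the difference, assuming GNU grep; the sample file is hypothetical and created only for this illustration:

```shell
# Hypothetical sample input, created only for this illustration.
sample=$(mktemp)
printf 'a a a a a a a a a a\na a a a a a a a a a\n' > "$sample"

# Number of lines that contain the pattern (same as grep ... | wc -l).
matching_lines=$(grep -c 'a' "$sample")

# Number of individual occurrences: -o prints every match on its own line,
# -w restricts matches to whole words.
occurrences=$(( $(grep -o -w 'a' "$sample" | wc -l) ))

echo "lines=$matching_lines occurrences=$occurrences"
rm -f "$sample"
```

For the sample text this reports 2 lines and 20 occurrences. Both commands stream the file, so they should also be usable on a very large input.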

RudiC · February 20, 2020, 10:29am

How about

awk 'BEGIN {srand()} {for (i=1; i<=NF; i++) $i=(int(.5+rand()))?$i:"b"} 1' file
b a b a b a b a b a
b b b b b a a b b b

allabiba · February 20, 2020, 11:16am

How can I set the strings to replace?
I mean, how can I set "a" and "b" in that code?

rdrtx1 · February 20, 2020, 11:31am

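awk '
BEGIN { srand() }
{
    # Build a ":"-separated list of distinct random field numbers,
    # covering about pct percent of the fields on this line (rounded down)
    fstr = ":"
    for (fld in fields) delete fields[fld]
    while (gsub(":", ":", fstr) <= (NF * (pct / 100.0))) {
        rnd = rand()
        sub("^[^.]*[.]", "", rnd)       # keep only the digits after the decimal point
        fld = ((rnd + 1) % NF) + 1
        if (fstr !~ ":" fld ":") fstr = fstr fld ":"
    }
    sub("^:*", "", fstr)
    sub(":*$", "", fstr)
    n = split(fstr, arr, ":")
    for (i = 1; i <= n; i++) fields[arr[i]] = i
    # Replace the chosen fields, whatever they contain
    for (i = 1; i <= NF; i++) $i = (i in fields) ? repstr : $i
    print $0
}
' pct=50 repstr="b" file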
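The scripts so far choose replacements line by line. If exactly half of all occurrences in the whole file should be replaced, one possible approach is two passes over the file: count the occurrences first, then select each one with probability need/left (selection sampling). The following is only a sketch under assumptions: the function name replace_exact and the sample file are hypothetical, an "occurrence" means a whitespace-separated field equal to the search string, and the input must be a regular file because it is read twice.

```shell
# Hypothetical helper: replace exactly PCT percent (rounded down) of the
# fields equal to FROM with TO, chosen at random over the entire file.
# usage: replace_exact FROM TO PCT FILE   (FILE is read twice)
replace_exact() {
    awk -v from="$1" -v to="$2" -v pct="$3" '
        BEGIN { srand() }
        # Pass 1: only count the occurrences.
        NR == FNR {
            for (i = 1; i <= NF; i++) if (($i "") == from) total++
            next
        }
        # Start of pass 2: how many replacements are needed in total.
        FNR == 1 { need = int(total * pct / 100); left = total }
        # Pass 2: pick each occurrence with probability need/left.
        {
            for (i = 1; i <= NF; i++)
                if (($i "") == from) {
                    if (rand() * left < need) { $i = to; need-- }
                    left--
                }
            print
        }
    ' "$4" "$4"
}

# Hypothetical sample input, created only for this illustration.
sample=$(mktemp)
printf 'a a a a a a a a a a\na a a a a a a a a a\n' > "$sample"
result=$(replace_exact a b 50 "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Memory use stays constant because nothing is stored per occurrence, which matters for a 15 GB input. Note that assigning to a field makes awk rebuild the line with single spaces, so runs of whitespace on modified lines are collapsed.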

nezabudka · February 20, 2020, 2:43pm

Hi

awk -F '' ' # replace about one in three occurrences (an empty -F splits into single characters in GNU awk)
BEGIN   {srand(); split("a a b", ab, " ")}
        {for(i=1; i<=NF; i++)
                if($i == "a") $i = ab[int(rand()*3)+1]
        } 1' OFS='' file

a a a a a a a a a b
a b a a a b a a b b
the table stands at the window bnd
the picture hang on the wbll.
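Splitting into single characters also changes letters inside other words, as in "bnd" and "wbll" above. For the corpus use case (thirty to half), a variant that compares whole whitespace-separated words may be closer to what is wanted. This is a minimal sketch; the function name replace_words and the sample file are hypothetical, and punctuation attached to a word (for example "a.") would not match.

```shell
# Hypothetical helper: replace each field equal to FROM with TO
# with a probability of PCT percent.
# usage: replace_words FROM TO PCT FILE
replace_words() {
    awk -v from="$1" -v to="$2" -v pct="$3" '
        BEGIN { srand() }
        {
            for (i = 1; i <= NF; i++)
                # ($i "") forces a string comparison, so "30" does not match "30.0"
                if (($i "") == from && rand() * 100 < pct) $i = to
            print
        }
    ' "$4"
}

# Hypothetical sample input, created only for this illustration.
sample=$(mktemp)
printf 'a table stands at a wall\n' > "$sample"

all=$(replace_words a b 100 "$sample")    # every whole-word a is replaced
none=$(replace_words a b 0 "$sample")     # nothing is replaced
some=$(replace_words a b 50 "$sample")    # each a is replaced with probability 0.5

printf '%s\n' "$all" "$none" "$some"
rm -f "$sample"
```

With pct=100 the sample line becomes "b table stands at b wall", and the a's inside "table", "stands", "at" and "wall" are left alone. Unlike an exact count over the whole file, this only gives about pct percent on average.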