I want to replace only a percentage of the occurrences of a pattern in my text (not all of them; for example, half), and the replaced occurrences should be chosen randomly (or, alternatively, be every third occurrence).
For example, I have a text like the one below, which contains 20 occurrences of a:
a a a a a a a a a a
a a a a a a a a a a
I want to replace half of the a's with b (not all of them), chosen at random rather than simply taking the first half of the occurrences, so that 10 of the 20 a's, scattered across both lines, become b.
Welcome on board!
As you have seen in our rules, we are here to help you, not to do the work for you. What have you tried so far?
We also need to know a minimum, such as your OS and version and the shell you are using.
1- I tried to turn the text into an array and replace items, something like this:
text=(
a a a a a a a a a a
a a a a a a a a a a
)
# Get the number of items
number=${#text[@]}
# Get half of it
half=$((number/2))
# Loop "half" times: pick a random index and replace that item in the text array
# (the same index can be picked twice, so fewer than half may end up replaced)
for i in $(seq "$half"); do
    rnd=$((RANDOM % number))
    text[rnd]=b
done
# Print the result
echo "${text[@]}"
But this code does not keep the newlines; it echoes all the text on one line. Also, my text is very big, about 15 GB, and cannot be put in one array.
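2- The replacements should be chosen randomly from all occurrences of one word or token in the whole text.
The number of occurrences can be obtained with something like this ( grep -o "a" | wc -l ); without -o, grep prints matching lines, so wc -l would count lines rather than occurrences.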
3- I could use Python to do this, but my text is huge and I want to use bash, since it is faster than Python. In Python I would use a set containing (a, b) and, in the replace function, use a random function to choose (a or b) from that set.
4- I use Ubuntu 18.04, sed (GNU sed) 4.4, GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
5- I simplified the problem. The main problem is this: I want to normalize a text corpus for training a trigram language model, and in a language model the sequence of words is important. I normalize numbers to words, so for example I convert every 30 to thirty, but we often say half instead of thirty when reporting the time (e.g. 8:30). I want to replace such cases randomly, not all of them.
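For what it is worth, the two problems from point 1 (lost newlines, text too big for an array) and the counting idea from point 2 could perhaps be handled by streaming the file through awk line by line. The following is only a minimal sketch under some assumptions: tokens are separated by whitespace, and the helper names replace_some and replace_exact, their argument order, and the seed parameter are all made up for illustration. Note that grep -o piped to wc -l counts matches rather than matching lines, but the sketch counts in awk instead so that both passes agree on what a token is.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; both assume whitespace-separated tokens.
# Lines that receive a replacement are rebuilt with single spaces
# between tokens; newlines are always preserved.

# replace_some FROM TO P SEED: reads stdin, replaces each FROM with
# probability P, so roughly (not exactly) that share is replaced.
replace_some() {
    awk -v from="$1" -v to="$2" -v p="$3" -v seed="$4" '
        BEGIN { srand(seed) }
        {
            for (i = 1; i <= NF; i++)
                if ($i == from && rand() < p)
                    $i = to
            print
        }'
}

# replace_exact FROM TO FILE SEED: two passes over FILE, replaces exactly
# half of the occurrences (rounded down), chosen uniformly at random.
replace_exact() {
    local from=$1 to=$2 file=$3 seed=$4 total
    # Pass 1: count tokens equal to FROM.
    total=$(awk -v from="$from" '
        { for (i = 1; i <= NF; i++) if ($i == from) n++ }
        END { print n + 0 }' "$file")
    # Pass 2: selection sampling; pick each occurrence with
    # probability need/left, which yields exactly "need" picks.
    awk -v from="$from" -v to="$to" -v left="$total" \
        -v need="$((total / 2))" -v seed="$seed" '
        BEGIN { srand(seed) }
        {
            for (i = 1; i <= NF; i++)
                if ($i == from) {
                    if (rand() * left < need) { $i = to; need-- }
                    left--
                }
            print
        }' "$file"
}

# Demo on the 20-token example from the question.
printf 'a a a a a a a a a a\na a a a a a a a a a\n' | replace_some a b 0.5 42
```

Both helpers keep only one line in memory at a time, so the 15 GB size should not matter beyond run time. replace_some needs a single pass but only approximates the requested share; replace_exact reads the file twice in exchange for hitting exactly half. For the real corpus, FROM and TO would be words such as thirty and half, and tokens with attached punctuation would need extra handling that this sketch does not attempt.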