Random letters

eldeingles · August 23, 2018, 3:12am

Hi there,
First of all, this is not homework; it is a new type of exercise for practicing vocabulary with my students.
I have a file with two tab-separated columns, one entry per line, each line pairing a word with its definition.

What I need is to replace a number of random letters of the defined word with underscores. The number of letters would depend on the length of the word, but half of them would be OK. Any ideas?

Much appreciated.

RudiC · August 23, 2018, 3:19am

An input sample, and a desired output would help.

eldeingles · August 23, 2018, 3:26am

INPUTFILE

a true ... of Islam \t follower
the recent ... of two CIA agents \t disappearance
The restructuring is designed to give a sharper ... on key markets. \t focus
a large country house with beautiful landscaped ... \t gardens

OUTPUTFILE

a true ... of Islam \t fo _ _o_er
the recent ... of two CIA agents \t d_ _a_ _ear_ ce
The restructuring is designed to give a sharper ... on key markets. \t f c_s
a large country house with beautiful landscaped ... \t _ar_e_s

RudiC · August 23, 2018, 3:45am

A zeroth approximation: you need to eliminate the leading spaces in $2, and no check is done to avoid replacing an already set "_" with another one, nor to avoid replacing adjacent characters. Try

awk -F"\t" '
        {LEN = split ($2, T, "")
         $2 = ""
         for (i=1; i<=LEN/2; i++) T[int(rand()*LEN)] = "_"
         for (i=1; i<=LEN;   i++) $2 = $2 T[i]
        }
1
' file
a true ... of Islam  fol__w_r
the recent ... of two CIA agents  di_app_a___ce
The restructuring is designed to give a sharper ... on key markets.  f_c_s
a large country house with beautiful landscaped ...  ___de_s

eldeingles · August 23, 2018, 5:01am

Wow, RudiC, that's awesome!
A final thought: how could I add a blank space when 2 underscores are together?

Great help!

------ Post updated at 03:40 AM ------

Forget it! I figured it out myself.

------ Post updated at 03:51 AM ------

How could I get the resulting blanked-out word separated from the definition by a tab? This I can't figure out.

------ Post updated at 04:01 AM ------

For some reason these words didn't come out well:

...
footprints left in the hard dried ... mud
...
a hit TV ... show
...
a blanket advertising ... on tobacco ban
I need my beauty ... . rest
Elephants have a very tough ... . hide

RudiC · August 23, 2018, 5:20am

Facts / data, please.
Where exactly do you need the <TAB> separation?
What and how "didn't come out well"?

For the space separated underscore chars, use " _ " for the T[...] assignment.

eldeingles · August 23, 2018, 5:26am

This is what I need, to be imported into an Excel sheet:

A ransom ... has been made for the kidnapped racehorse.  \t _  _ m _ nd

A third of the country's population is of mixed racial ... . \t h _ ri _  _  _ e

the professional ... of the lawyers and accountants involved  \t _ ees

And the words that didn't come out well are the ones in which no letter was replaced by a blank.

RudiC · August 23, 2018, 5:38am

Set the output field separator to <TAB> by putting OFS="\t" just before the file name.
The non-substitution was possibly due to several factors:

A three-character word offers few options to substitute; in fact, going for LEN/2 characters gives one single substitution only.
The index computed as int(rand()*LEN) can come out as zero, in which case none of the string's characters (1 - LEN) is targeted.

Try

awk -F"\t" '
        {LEN = split ($2, T, "")
         $2 = ""
         for (i=1; i<=(LEN+1)/2; i++) T[int(.5+rand()*LEN)] = " _ "
         for (i=1; i<=LEN;       i++) $2 = $2 T[i]
        }
1
' OFS="\t" file
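
To see why some words can still escape untouched, here is a minimal sketch (not part of the posted solution) that tallies which array index the expression int(.5+rand()*LEN) produces. The function name index_histogram, the number of draws and the fixed seed are made up for illustration. The indices run from 0 to LEN, and a draw of 0 hits no letter at all.

```shell
# Hypothetical helper: count how often each index comes up for a word of length $1.
index_histogram() {
    awk -v LEN="$1" 'BEGIN {
        srand(1)                                  # fixed seed, for a repeatable tally
        for (n = 1; n <= 60000; n++)
            hits[int(.5 + rand() * LEN)]++        # same expression as in the script above
        for (k = 0; k <= LEN; k++)
            printf "index %d: %d\n", k, hits[k]   # index 0 is not a letter of the word
    }'
}

index_histogram 3
```

For a three-letter word, index 0 should collect roughly one draw in six. With two draws per word, about one such word in 36 would then keep all its letters.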

eldeingles · August 23, 2018, 5:51am

Worked like magic!

------ Post updated at 04:51 AM ------

Final result: approx. 1200 words out of a total of 48k came out without any letter blanked out. I can't see any pattern or similarity among them.

RudiC · August 23, 2018, 5:56am

Again, facts, please. Does it happen again, identically and to the same words, on a second run?

eldeingles · August 23, 2018, 6:06am

Yes, always the same words.

eldeingles · August 23, 2018, 6:17am

Here is the file with the words that were not blanked out.

RudiC · August 23, 2018, 6:24am

Extract a few of the offending lines into a small file and run the script several times - there should be varying substitutions; at least there are when I do.

As said in the beginning - it's an approximation to be refined if further conditions need to be met.

eldeingles · August 23, 2018, 6:48am

Yes, the offending lines become fewer as I run the code and reprocess the resulting lines. It works perfectly for me!

Thanks again RudiC!

Shall we close the thread?

------ Post updated at 05:48 AM ------

It took me 9 rounds. The final 2 rounds consisted of 3-letter words and a very few 2-letter ones.

RudiC · August 23, 2018, 6:49am

I did a thorough analysis of the script and its results. There's no systematic error that ignores certain words or words of a certain length. The algorithm to select characters is a rudimentary one and has its flaws, as pointed out before. By defining more sophisticated rules on how to use / evaluate / improve the random number generation and application (no zero target, no double substitution), you could certainly stabilize the results, but to what avail?
I'm OK to close the thread.
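
For anyone who does want the stabilized variant, a rough sketch follows, under a few assumptions: positions are drawn from 1 to LEN only, no position is picked twice, and srand() is called so that runs differ. The function name blank_words and its optional seed argument are invented for the example. substr() stands in for split(..., "") only because splitting on an empty string is an extension that not every awk provides. In gawk, for example, rand() replays the same sequence on every run unless srand() is called, which may be why the same words stayed untouched each time.

```shell
# Hypothetical wrapper: $1 = seed (empty string means seed from the clock), $2 = input file.
blank_words() {
    awk -F"\t" -v seed="$1" '
        BEGIN { OFS = "\t"; if (seed == "") srand(); else srand(seed) }
        {
            len  = length($2)
            want = int((len + 1) / 2)      # how many letters to hide
            split("", picked)              # clear the set of chosen positions
            got = 0
            while (got < want) {
                p = int(rand() * len) + 1  # always in 1..len, never 0
                if (!(p in picked)) { picked[p] = 1; got++ }
            }
            out = ""
            for (i = 1; i <= len; i++)
                out = out ((i in picked) ? "_" : substr($2, i, 1))
            $2 = out
            print
        }
    ' "$2"
}

# Usage (file is the tab-separated input from this thread):
#   blank_words "" file      # different blanks on each run
#   blank_words 42 file      # same blanks each run with the same awk
```

Every word then gets exactly half its letters (rounded up) blanked in a single pass, so no re-runs over the leftover lines should be needed. For the space-separated look, " _ " can replace "_" in the output loop.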

eldeingles · August 23, 2018, 6:50am

Kudos to you, RudiC!