Random letters

Hi there,
First of all, this is not homework; it is a new type of exercise for practicing vocabulary with my students.
I have a file with two tab-separated columns. Each line holds a word and its definition, and the lines are separated by line breaks.

What I need is to replace a number of random letters of the defined word with underscores. The number of letters would depend on the length of the word, but half of them would be OK. Any ideas?

Much appreciated.

An input sample, and a desired output would help.

INPUTFILE

a true ... of Islam \t follower
the recent ... of two CIA agents \t disappearance
The restructuring is designed to give a sharper ... on key markets. \t focus
a large country house with beautiful landscaped ... \t gardens

OUTPUTFILE

a true ... of Islam \t fo _ _o_er
the recent ... of two CIA agents \t d_ _a_ _ear_ ce
The restructuring is designed to give a sharper ... on key markets. \t f_c_s
a large country house with beautiful landscaped ... \t _ar_e_s

Zeroth approximation: you still need to eliminate the leading spaces in $2, and no check is done to avoid replacing an already set "_" with another one, nor to avoid replacing adjacent characters. Try

awk -F"\t" '
        {LEN = split ($2, T, "")                                # one array element per character
         $2 = ""
         for (i=1; i<=LEN/2; i++) T[int(rand()*LEN)] = "_"      # blank about half the characters
         for (i=1; i<=LEN;   i++) $2 = $2 T[i]                  # reassemble the word
        }
1
' file
a true ... of Islam  fol__w_r
the recent ... of two CIA agents  di_app_a___ce
The restructuring is designed to give a sharper ... on key markets.  f_c_s
a large country house with beautiful landscaped ...  ___de_s

Wow, RudiC, that's awesome!
A final thought: how could I add a blank space when 2 underscores are together?

Great help!

------ Post updated at 03:40 AM ------

Forget it! I figured it out myself.

------ Post updated at 03:51 AM ------

How could I get the blanked-out resulting word separated from the definition by a tab? This one I can't figure out.

------ Post updated at 04:01 AM ------

For some reason these words didn't come out well:

  • ...
  • footprints left in the hard dried ... mud
  • ...
  • a hit TV ... show
  • ...
  • a blanket advertising ... on tobacco ban
  • I need my beauty ... . rest
  • Elephants have a very tough ... . hide

Facts / data, please.
Where exactly do you need the <TAB> separation?
What and how "didn't come out well"?

For the space separated underscore chars, use " _ " for the T[...] assignment.

That's what I would need to be incorporated into an Excel sheet.

A ransom ... has been made for the kidnapped racehorse.  \t _  _ m _ nd
A third of the country's population is of mixed racial ... . \t h _ ri _  _  _ e
the professional ... of the lawyers and accountants involved  \t _ ees

And the words that didn't come out well are the ones in which no letter was replaced by a blank.

Set the output field separator to <TAB>: OFS="\t" just before the file name.
The non-substitution was possibly due to several factors:

  • a three char word would offer few options to subst; in fact, we would have one single substitution only, when going for LEN/2 chars.
  • the rand() function can return a zero value, so none of the string's characters (1 - LEN) were targeted.

Try

awk -F"\t" '
        {LEN = split ($2, T, "")                                        # one array element per character
         $2 = ""
         for (i=1; i<=(LEN+1)/2; i++) T[int(.5+rand()*LEN)] = " _ "     # blank about half, rounded up
         for (i=1; i<=LEN;       i++) $2 = $2 T[i]                      # reassemble the word
        }
1
' OFS="\t" file
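To see which array indices the two expressions used in this thread can actually hit, a quick tally may help. This is only a sketch: index_tally is a made-up helper name, the seed and the word length of 3 are arbitrary, and the third expression, int(rand()*LEN)+1, is an alternative that does not appear in the scripts above. For a three-letter word the first expression can only reach 0, 1 and 2 (so the last letter is never picked and index 0 is a wasted draw), while the second reaches 0 through 3 with both ends about half as likely as the middle.

```shell
index_tally() {
    awk 'BEGIN {
        srand(1); len = 3; n = 60000
        for (k = 1; k <= n; k++) {
            a[int(rand() * len)]++          # first script: indices 0 .. len-1
            b[int(.5 + rand() * len)]++     # second script: indices 0 .. len
            c[int(rand() * len) + 1]++      # alternative: indices 1 .. len
        }
        # unset counters print as 0
        for (i = 0; i <= len; i++)
            printf "index %d: first=%d second=%d alt=%d\n", i, a[i], b[i], c[i]
    }'
}
index_tally
```

Every draw that lands on index 0 changes nothing in the word, which would fit the observation that short words are the ones most often left untouched.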

Worked like magic!

------ Post updated at 04:51 AM ------

Final result: approx. 1200 words out of a total of 48k came out without any letter blanked out. I can't see any patterns or similarities in them.

Again, facts, please. Does it happen again / identical to the same words on a second run?

Yes, always the same words.

Here is the file with the non-blanked-out words.

Extract a few of the offending lines into a small file and run the script several times - there should be varying substitutions; at least there are when I do.

As said in the beginning - it's an approximation to be refined if further conditions need to be met.
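One possible explanation for "always the same words" on one machine and varying results on another is seeding. In some awk implementations (gawk, for one) rand() starts from the same seed on every run unless srand() is called, while others (mawk, as far as I know) seed themselves at startup. Which awk each poster used is not stated in the thread, so take this as an assumption. A minimal way to check:

```shell
# No srand(): with gawk this prints the same number on every run.
awk 'BEGIN { print rand() }'

# srand() without an argument seeds from the clock, so runs started
# in different seconds should differ.
awk 'BEGIN { srand(); print rand() }'

# A fixed seed makes a run repeatable on purpose.
awk 'BEGIN { srand(42); print rand() }'
```

If the first command keeps printing the same value, adding BEGIN {srand()} to the script should make repeated runs pick different letters.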

Yes, the offending lines become fewer each time I run the code again on the remaining lines. It works perfectly for me!

Thanks again RudiC!

Shall we close the thread?

------ Post updated at 05:48 AM ------

It took me 9 rounds. The final 2 rounds consisted of 3-letter words and a very few 2-letter words.

I did a thorough analysis of the script and its results. There's no systematic error ignoring certain words or certain word lengths. The algorithm to select characters is a rudimentary one and has its flaws, as pointed out before. By defining more sophisticated rules on how to use / evaluate / improve the random number generation and application (no zero target, no double substitution), you could certainly stabilize the results, but to what avail?
I'm OK to close the thread.
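For anyone who does want the stabilized variant, the two rules mentioned (no zero target, no double substitution) could look roughly like the sketch below. It is not the script used in this thread: mask_words and its optional seed argument are made-up names, it reads the tab-separated lines from standard input, it uses a plain "_" (swap in " _ " if you want the spacing), and it assumes plain ASCII words, since some awks count bytes rather than characters.

```shell
mask_words() {
    awk -F'\t' -v seed="${1:-}" '
        BEGIN { OFS = "\t"; if (seed != "") srand(seed); else srand() }
        {
            word = $2
            gsub(/^[ \t]+|[ \t]+$/, "", word)      # drop blanks around the word
            len  = length(word)
            need = int((len + 1) / 2)              # half the letters, rounded up
            split("", used)                        # portable way to empty the array
            done = 0
            while (done < need) {
                p = int(rand() * len) + 1          # 1 .. len, never 0
                if (!(p in used)) { used[p] = 1; done++ }   # each position at most once
            }
            out = ""
            for (i = 1; i <= len; i++)
                out = out ((i in used) ? "_" : substr(word, i, 1))
            $2 = out
            print
        }
    '
}
```

Usage would be something like mask_words < file > newfile, or mask_words 42 < file for a repeatable run. With this, every word gets exactly half of its letters (rounded up) blanked in a single pass, so no repeated rounds should be needed.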


Kudos to you, RudiC!