I'm experimenting with a script that finds anagrams.
So far this does the job:
#!/bin/bash
# anagramfinder
while :
do
    WORD=$(shuf -n 1 /data/korpus2k/kord)
    AG=$(echo "$WORD" | fold -w 1 | sort -R | tr -d '\n')
    CHECK=$(grep -w "$AG" /data/korpus2k/kord | wc -l)
    if [ "$CHECK" -eq 0 ]
    then true
    elif [ "$WORD" = "$AG" ]
    then true
    else
        echo "$WORD" "$AG"
    fi
done
The file kord is a word list containing 162,060 meaningful Danish words, one per line.
Is there a way to make the script work on the first word of kord, shuffle its letters until it finds itself or another word, then move on to the next word and so on, instead of picking random words in an 'eternal' loop?
Why randomize at all? You could scan kord, eliminating each word in turn if it is not the same length or does not use the same letters with the same frequencies. 162,060 words isn't a lot to brute-force.
Thanks for catching and correcting my too-hastily posted code.
Looking back at my history, I originally used:
echo hello | fold -w 1 | unsort | tr -d '\n'
but then I remembered that I had written unsort (back in 1996), and I didn't want to post that code, so I punted with GNU sort, and got the case wrong. Mea culpa.
As it turns out, Linux also provides an unsort, so I could have left it in, sigh ... cheers, drl
The following reads the dictionary from stdin and takes the word for which anagrams are to be found as the first argument to the script. It is not going to be very efficient (for that, better to implement a solution in AWK or Perl), nor is it very convenient even if you consider it efficient enough: for example, you can't provide more than one word to match against per invocation. Perhaps it's enough to get you started on a better shell script (or an AWK/Perl solution).
# w: word for which matches are sought
# k: kord word being tested
# fw: letter frequency in w
# fk: letter frequency in k
# Usage: script word < kord
w=$1
# w's letter frequencies never change, so compute them once, outside the loop
fw=$(printf '%s\n' "$w" | fold -w1 | sort | uniq -c)
while read -r k; do
    [ ${#w} -ne ${#k} ] && continue
    fk=$(printf '%s\n' "$k" | fold -w1 | sort | uniq -c)
    [ "$fw" != "$fk" ] && continue
    printf '%s\n' "$k"
done
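As suggested, an AWK version avoids spawning fold/sort/uniq once per dictionary word. Here is a sketch, not a tested drop-in: it assumes GNU awk (where split with an empty separator splits a string into single characters) and single-byte letters.

```shell
# find_anagrams WORD  (word list on stdin)
# Prints each dictionary word that is an anagram of WORD but not WORD itself.
find_anagrams() {
    awk -v w="$1" '
    # sig(s): the letters of s in sorted order -- two words are anagrams
    # exactly when their signatures are equal
    function sig(s,    a, n, i, j, v, out) {
        n = split(s, a, "")                 # one letter per array slot (gawk)
        for (i = 2; i <= n; i++) {          # insertion sort of the letters
            v = a[i]
            for (j = i - 1; j >= 1 && a[j] > v; j--) a[j+1] = a[j]
            a[j+1] = v
        }
        out = ""
        for (i = 1; i <= n; i++) out = out a[i]
        return out
    }
    BEGIN { ws = sig(w) }
    length($0) == length(w) && $0 != w && sig($0) == ws
    '
}
```

Usage would be e.g. `find_anagrams ordet < /data/korpus2k/kord`; the length check is cheap, so most non-matches are rejected before any sorting happens.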
It runs more slowly than my rough "try a random permutation and see whether it matches a different word" script.
Ideally, I would like a list of all possible anagrams in Danish. So if I ever get this script done, it should work on word 1 in the word list, check whether the letters can be combined in a way that matches one or more other words in the same word list, then move on to the next word, and so on.
I also wonder if there is a (hopefully simple) way to generate a list of all possible combinations of letters.
@jeppe83
Hold on!
Are you trying to write an anagram cracker (like you would use for crosswords) against your Danish words list?
If so, you don't need to generate all the combinations at all.
Ideally you would have your words list indexed by a key made of the letters in the word sorted, with duplicate keys allowed. Then take the input string, sort the letters and look up the anagrams.
This is trivial in most modern database packages, and is often set as a final piece. I haven't seen this one set as Homework on a Shell course so we shall assume that this is hobby computing.
It is also fairly trivial in Shell, but the important part is the script to prepare your look-up file(s), with each record containing a sorted-letter key field and the matching word. When working with flat files, splitting the data by word length into separate files should be faster, but it depends on how many seconds you are prepared to wait for an answer.
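A minimal shell sketch of that preparation step and the matching look-up (illustrative only; it assumes single-byte letters, since fold -w1 splits UTF-8 characters such as æ, ø, å into bytes):

```shell
# build_index < wordlist > indexfile
# Emits one "sorted-letters word" line per word, sorted on the key,
# so that all anagram groups end up on adjacent lines.
build_index() {
    while IFS= read -r word; do
        key=$(printf '%s' "$word" | fold -w1 | sort | tr -d '\n')
        printf '%s %s\n' "$key" "$word"
    done | sort
}

# lookup WORD INDEXFILE
# Prints every word sharing WORD's key (the word itself included).
lookup() {
    key=$(printf '%s' "$1" | fold -w1 | sort | tr -d '\n')
    grep "^$key " "$2" | cut -d' ' -f2-
}
```

The index is built once; after that each query is a single grep over a sorted file, which is the whole point of the scheme.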
Did you mention anything about your computer or your own skills?
Operating System and version.
Preferred Shell.
Any programming languages which you know?
P.S. I have an old-technology version: The Longman Anagram Dictionary (a book). It is ordered first by the length of the word, then by the sorted letters of the word in alphabetical order. If you can beat me with that book in my hand, your program is good!
I realize I don't need to generate all combinations, but I would like to know how anyway.
It's not homework. I'm a linguist, and years ago I followed a course, "information technology for linguists", where I was introduced to the wonders of grep and sed etc. I've recently taken it up again, just for fun and to see how much I can remember. I don't aspire to be a scripting wizard.
I use
GNU bash, version 4.1.5(1)-release (x86_64-pc-linux-gnu)
I only have experience with bash-scripts and a little knowledge of awk.
Generating all combinations of a string once-only is not a trivial piece of code. I last wrote a program to do this in Basic-A (for those with long memories) to drive a stage light show.
The essence is that you take each character in turn then remove it from its position in the original string and then insert it into every possible position in the remaining string (including front and back). At the end of the process you have every possible permutation once-only. Purists would take account of duplicate letters (I didn't).
Somebody who has this GNU bash will be able to use its built-in substring expansion (like the substring functions in Basic-A), which makes this easy.
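A bash sketch of that idea, using the common equivalent formulation that selects each letter in turn (rather than re-inserting it at every position); this is my own illustration, not the 1996 original, and it relies on bash's ${string:offset:length} substring expansion:

```shell
# permute PREFIX LETTERS
# Prints every permutation of LETTERS, each preceded by PREFIX.
# As noted above, duplicate letters yield duplicate permutations.
permute() {
    local prefix=$1 rest=$2 i
    if [ -z "$rest" ]; then
        printf '%s\n' "$prefix"
        return
    fi
    for (( i = 0; i < ${#rest}; i++ )); do
        # take letter i out and permute what remains
        permute "$prefix${rest:i:1}" "${rest:0:i}${rest:i+1}"
    done
}
```

Calling `permute '' abc` prints the 3! = 6 orderings of abc; bear in mind the output grows factorially, so this is only practical for short words.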
I'm still in the dark as to how it can work on the first word, check for anagrams, then the next etc, instead of choosing a random word with shuf -n 1.
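One way to do exactly that, building on the sorted-letters idea from the replies above: key every word once, let sort bring anagrams together, and print the groups. A sketch (assumes single-byte letters; the shell read loop will be slow over 162,060 words, so the keying step is a candidate for AWK too):

```shell
# anagram_groups < wordlist
# Keys each word by its sorted letters, sorts on the key, then prints
# one line per group of two or more words sharing a key.
anagram_groups() {
    while IFS= read -r word; do
        key=$(printf '%s' "$word" | fold -w1 | sort | tr -d '\n')
        printf '%s %s\n' "$key" "$word"
    done |
    sort |
    awk '$1 != prev { if (n > 1) print line; line = ""; n = 0; prev = $1 }
         { line = (n ? line " " : "") $2; n++ }
         END { if (n > 1) print line }'
}
```

Run as `anagram_groups < /data/korpus2k/kord`; this visits each word exactly once, in order, instead of sampling with shuf -n 1 forever.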
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian GNU/Linux 5.0.8 (lenny)
Description: anagram generator
Wordplay generates anagrams of words or phrases. For example,
"Debian GNU/Linux" = "laud benign unix", "nubian lug index",
"dang nubile unix", or "I debug in lax nun".
wordplay Scrutinizer
Anagrams found:
1. ERIC RITZ SUN
2. RICE RITZ SUN
...
48. CRUZ REST I IN
I understand what you mean. I cracked the Wordstar Thesaurus file format and extracted 60,000 unique words into my own crossword-cracker database. Then I spent just a few minutes a day updating the list against my preferred reference dictionary and adding the corresponding definition and Thesaurus cross-reference. After only 30+ years and several changes of PC and software, the current list of around 320,000 words (and over one million Thesaurus entries) is nearly perfect. I have deliberately avoided writing automatic conjugation processes because they are nigh on impossible to get right (which is why so many spell-checkers allow dubious agent nouns and dubious Latin conjugations).
I have a separate database of proper nouns organised by category (e.g. Capital Cities; Characters in Shakespeare plays) which has become rather big over the years, but the number of updates is now relatively low because I now only update what is relevant to the puzzle in front of me.
All this because I like to complete crosswords. It's a hobby.