I'm experimenting with a script that finds anagrams.
So far this does the job:
#!/bin/bash
# anagramfinder
while :
do
    WORD=$(shuf -n 1 /data/korpus2k/kord)
    AG=$(echo "$WORD" | fold -w 1 | sort -R | tr -d '\n')
    CHECK=$(grep -w "$AG" /data/korpus2k/kord | wc -l)
    if [ "$CHECK" -eq 0 ]
    then true
    elif [ "$WORD" = "$AG" ]
    then true
    else
        echo "$WORD" "$AG"
    fi
done
The file kord is a word list containing 162,060 meaningful Danish words, one per line.
Is there a way to make the script work on the first word of kord, shuffle its letters until it finds itself or another word, then move on to the next word and so on, instead of picking random words in an 'eternal' loop?
Why randomize at all? You could scan kord, eliminating each word in turn if it is not the same length or does not use the same letters with the same frequencies. 162,060 words isn't a lot to brute-force.
Thanks for catching and correcting my too-hastily posted code.
Looking back at my history, I originally used:
echo hello | fold -w 1 | unsort | tr -d '\n'
but then I remembered that I had written unsort (back in 1996), and I didn't want to post that code, so I punted with GNU sort, and got the case wrong. Mea culpa.
As it turns out, Linux also provides an unsort, so I could have left it in, sigh ... cheers, drl
The following reads the dictionary from stdin and takes the word for which anagrams are to be found as the first argument to the script. It is not going to be very efficient (for that, better to implement a solution in AWK or Perl), nor is it very convenient even if you consider it efficient enough: for example, you can't provide more than one word to match against per invocation. Perhaps it's enough to get you started on a better shell script (or an AWK/Perl solution).
# w: word for which matches are sought
# k: kord word being tested
# fw: letter frequency in w
# fk: letter frequency in k
# Usage: script word < kord
w=$1
# w's letter frequencies never change, so compute them once, outside the loop
fw=$(printf '%s\n' "$w" | fold -w1 | sort | uniq -c)
while read -r k; do
    [ ${#w} -ne ${#k} ] && continue
    fk=$(printf '%s\n' "$k" | fold -w1 | sort | uniq -c)
    [ "$fw" != "$fk" ] && continue
    printf '%s\n' "$k"
done
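As suggested, an AWK version avoids spawning fold/sort/uniq once per dictionary word. Here is a sketch, not a tested drop-in: it assumes GNU awk (where split with an empty separator splits a string into single characters) and single-byte letters.

```shell
# find_anagrams WORD  (word list on stdin)
# Prints each dictionary word that is an anagram of WORD but not WORD itself.
find_anagrams() {
    awk -v w="$1" '
    # sig(s): the letters of s in sorted order -- two words are anagrams
    # exactly when their signatures are equal
    function sig(s,    a, n, i, j, v, out) {
        n = split(s, a, "")                 # one letter per array slot (gawk)
        for (i = 2; i <= n; i++) {          # insertion sort of the letters
            v = a[i]
            for (j = i - 1; j >= 1 && a[j] > v; j--) a[j+1] = a[j]
            a[j+1] = v
        }
        out = ""
        for (i = 1; i <= n; i++) out = out a[i]
        return out
    }
    BEGIN { ws = sig(w) }
    length($0) == length(w) && $0 != w && sig($0) == ws
    '
}
```

Usage would be e.g. `find_anagrams ordet < /data/korpus2k/kord`; the length check is cheap, so most non-matches are rejected before any sorting happens.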
It runs more slowly than my rough "try a random permutation and see whether it matches a different word" script.
Ideally, I would like a list of all possible anagrams in Danish. So if I ever get this script done, it should work on word 1 in the word list, check whether the letters can be combined in a way that matches one or more other words in the same word list, then move on to the next word, and so on.
I also wonder if there is a (hopefully simple) way to generate a list of all possible combinations of letters.
@jeppe83
Hold on!
Are you trying to write an anagram cracker (like you would use for crosswords) against your Danish words list?
If so, you don't need to generate all the combinations at all.
Ideally you would have your words list indexed by a key made of the letters in the word sorted, with duplicate keys allowed. Then take the input string, sort the letters and look up the anagrams.
This is trivial in most modern database packages, and is often set as a final piece. I haven't seen this one set as Homework on a Shell course so we shall assume that this is hobby computing.
It is also fairly trivial in Shell, but the important part is the script to prepare your look-up file(s), with each record containing a sorted-letter key field and the matching word. When working with flat files, splitting the data by word length into separate files should be faster, but it depends on how many seconds you are prepared to wait for an answer.
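A minimal shell sketch of that preparation step and the matching look-up (illustrative only; it assumes single-byte letters, since fold -w1 splits UTF-8 characters such as æ, ø, å into bytes):

```shell
# build_index < wordlist > indexfile
# Emits one "sorted-letters word" line per word, sorted on the key,
# so that all anagram groups end up on adjacent lines.
build_index() {
    while IFS= read -r word; do
        key=$(printf '%s' "$word" | fold -w1 | sort | tr -d '\n')
        printf '%s %s\n' "$key" "$word"
    done | sort
}

# lookup WORD INDEXFILE
# Prints every word sharing WORD's key (the word itself included).
lookup() {
    key=$(printf '%s' "$1" | fold -w1 | sort | tr -d '\n')
    grep "^$key " "$2" | cut -d' ' -f2-
}
```

The index is built once; after that each query is a single grep over a sorted file, which is the whole point of the scheme.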
Did you mention anything about your computer or your own skills?
Operating System and version.
Preferred Shell.
Any programming languages which you know?
P.S. I have an old-technology version: The Longman Anagram Dictionary (a book). It is ordered first by the length of the word, then by the sorted letters of the word in alphabetical order. If you can beat me with that book in my hand, your program is good!
I realize I don't need to generate all combinations, but I would like to know how anyway.
It's not homework. I'm a linguist, and years ago I followed a course, "information technology for linguists", where I was introduced to the wonders of grep and sed etc. I've recently taken it up again, just for fun and to see how much I can remember. I don't aspire to be a scripting wizard.
I use
GNU bash, version 4.1.5(1)-release (x86_64-pc-linux-gnu)
I only have experience with bash-scripts and a little knowledge of awk.
Generating all combinations of a string once-only is not a trivial piece of code. I last wrote a program to do this in Basic-A (for those with long memories) to drive a stage light show.
The essence is that you take each character in turn then remove it from its position in the original string and then insert it into every possible position in the remaining string (including front and back). At the end of the process you have every possible permutation once-only. Purists would take account of duplicate letters (I didn't).
Somebody who has this GNU bash will be able to use its built-in substring expansion (like the substring functions in Basic-A), which makes this easy.
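A bash sketch of that idea, using the common equivalent formulation that selects each letter in turn (rather than re-inserting it at every position); this is my own illustration, not the 1996 original, and it relies on bash's ${string:offset:length} substring expansion:

```shell
# permute PREFIX LETTERS
# Prints every permutation of LETTERS, each preceded by PREFIX.
# As noted above, duplicate letters yield duplicate permutations.
permute() {
    local prefix=$1 rest=$2 i
    if [ -z "$rest" ]; then
        printf '%s\n' "$prefix"
        return
    fi
    for (( i = 0; i < ${#rest}; i++ )); do
        # take letter i out and permute what remains
        permute "$prefix${rest:i:1}" "${rest:0:i}${rest:i+1}"
    done
}
```

Calling `permute '' abc` prints the 3! = 6 orderings of abc; bear in mind the output grows factorially, so this is only practical for short words.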
I'm still in the dark as to how it can work on the first word, check for anagrams, then the next etc, instead of choosing a random word with shuf -n 1.
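One way to do exactly that, building on the sorted-letters idea from the replies above: key every word once, let sort bring anagrams together, and print the groups. A sketch (assumes single-byte letters; the shell read loop will be slow over 162,060 words, so the keying step is a candidate for AWK too):

```shell
# anagram_groups < wordlist
# Keys each word by its sorted letters, sorts on the key, then prints
# one line per group of two or more words sharing a key.
anagram_groups() {
    while IFS= read -r word; do
        key=$(printf '%s' "$word" | fold -w1 | sort | tr -d '\n')
        printf '%s %s\n' "$key" "$word"
    done |
    sort |
    awk '$1 != prev { if (n > 1) print line; line = ""; n = 0; prev = $1 }
         { line = (n ? line " " : "") $2; n++ }
         END { if (n > 1) print line }'
}
```

Run as `anagram_groups < /data/korpus2k/kord`; this visits each word exactly once, in order, instead of sampling with shuf -n 1 forever.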
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian GNU/Linux 5.0.8 (lenny)
Description: anagram generator
Wordplay generates anagrams of words or phrases. For example,
"Debian GNU/Linux" = "laud benign unix", "nubian lug index",
"dang nubile unix", or "I debug in lax nun".
wordplay Scrutinizer
Anagrams found:
1. ERIC RITZ SUN
2. RICE RITZ SUN
...
48. CRUZ REST I IN
I understand what you mean. I cracked the Wordstar Thesaurus file format and extracted 60,000 unique words into my own crossword-cracker database. Then I spent just a few minutes a day updating the list against my preferred reference dictionary and adding the corresponding definition and Thesaurus cross-reference. After only 30+ years and several changes of PC and software, the current list of around 320,000 words (and over one million Thesaurus entries) is nearly perfect. I have deliberately avoided writing automatic conjugation processes because they are nigh on impossible to get right (which is why so many spell-checkers allow dubious agent nouns and dubious Latin conjugations).
I have a separate database of proper nouns organised by category (e.g. Capital Cities; Characters in Shakespeare plays) which has become rather big over the years, but the number of updates is now relatively low because I now only update what is relevant to the puzzle in front of me.
All this because I like to complete crosswords. It's a hobby.