Hi @ProdiJay,
as already noted by our team members, sed is not the right tool here. You need at least a word list with the most common English words. A so-called AI can probably do that by now, too.
Here is a relatively simple brute-force method that does the following (see also the comments in the code):
- First a word list is read in and adjusted a bit.
- Then a dictionary/map is created which contains, for each position/index (starting at 0), a list of all words that begin there. For the string “howdoyoudo” it looks like this:
0 ['how', 'ho']
1 ['ow']
2 []
3 ['do']
4 []
5 ['yo', 'you']
6 []
7 []
8 ['do']
9 []
- Then search recursively: Add the first word of the first position to the checklist. Every time the lookup() function is called, the checklist is joined into a string and compared with the target string; on a match it is printed. Then add the first word from the list at the position right after the words collected so far, etc. Once a list has been processed completely, go back to the previous position and try the next word from its list, etc. It then looks like this:
0 []
3 ['how']
5 ['how', 'do']
7 ['how', 'do', 'yo']
8 ['how', 'do', 'you']
10 ['how', 'do', 'you', 'do']
how do you do
2 ['ho']
For example, the checklist ["how", "ow"] ([pos0/word0, pos1/word0]) never appears: "howow" does not match the target string, so after "how" the search continues at position 3, not at position 1.
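The two steps above can be tried out in isolation with a small sketch. It is not the script itself: the tiny word list SAMPLE_WORDS and the function names build_wordpos and segmentations are made up for illustration, and the search is written as a generator that yields results instead of printing them.

```python
def build_wordpos(target, words):
    """Map every start index of target to the words that begin there."""
    wordpos = {p: [] for p in range(len(target))}
    for word in words:
        if not word:
            continue  # an empty word would "match" at every index
        start = target.find(word)
        while start != -1:
            wordpos[start].append(word)
            # step by one so overlapping occurrences are found as well
            start = target.find(word, start + 1)
    return wordpos


def segmentations(target, wordpos, pos=0, words=None):
    """Yield every list of words whose concatenation equals target."""
    words = [] if words is None else words
    if pos == len(target):
        yield list(words)  # copy, because words is modified afterwards
        return
    for word in wordpos.get(pos, []):
        words.append(word)
        # continue right after the word that was just added
        yield from segmentations(target, wordpos, pos + len(word), words)
        words.pop()


# Hypothetical mini word list, just enough for the example string
SAMPLE_WORDS = ["how", "ho", "ow", "do", "yo", "you"]

wordpos = build_wordpos("howdoyoudo", SAMPLE_WORDS)
for found in segmentations("howdoyoudo", wordpos):
    print(" ".join(found))  # prints: how do you do
```

Because every entry in the map really occurs at its position, the joined checklist is always a prefix of the target, so this variant only has to check whether the end of the string has been reached.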
The main problem is short words, as you can already see with "himyname":
hi myna me
hi my name
hi my na me
Eliminating that might be difficult, though. And, of course, for large strings this may take some time...
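One partial mitigation, sketched here under assumptions that are not part of the script: accept very short words only if they are on a hand-picked list of common ones. The set COMMON_SHORT and the function plausible are hypothetical, and the list is certainly incomplete.

```python
# Hypothetical, hand-picked list of short words that are allowed to appear
COMMON_SHORT = {
    "a", "i", "am", "an", "as", "at", "be", "by", "do", "go", "he", "hi",
    "if", "in", "is", "it", "me", "my", "no", "of", "on", "or", "so", "to",
    "up", "us", "we",
}


def plausible(words, max_short_len=2):
    """True if every word is longer than max_short_len or a known short word."""
    return all(len(w) > max_short_len or w in COMMON_SHORT for w in words)


candidates = [
    ["hi", "myna", "me"],
    ["hi", "my", "name"],
    ["hi", "my", "na", "me"],
]
kept = [c for c in candidates if plausible(c)]
for words in kept:
    print(" ".join(words))
```

This drops "hi my na me" because "na" is not on the list, but it cannot decide between "hi myna me" and "hi my name". That would need something like word frequencies.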
Call the script via python3 script.py target_string.
#!/usr/bin/python3
import sys
import re

STR = sys.argv[1].lower()
STRLEN = len(STR)

# from wamerican package (Debian based)
with open("/usr/share/dict/american-english") as fin:
    # convert words into lowercase, remove possible duplicates via set()
    WORDS = set(ln.strip().lower() for ln in fin)

# filter out words containing a quote and single-character words except 'a' and 'i'
# (everything is lowercase by now, so 'I' is stored as 'i')
WORDS = [w for w in WORDS if ("'" not in w and len(w) > 1) or w in ("a", "i")]

def lookup(wordpos, pos, words):
    # uncomment for debug
    #print(pos, words)
    joined = "".join(words)
    # return immediately if leading chars don't match
    if not STR.startswith(joined):
        return
    # else check length resp. match
    if len(joined) >= STRLEN:
        # found!
        if joined == STR:
            print(" ".join(words))
        return
    # else loop over all possible words at current position (list may be empty)
    for word in wordpos[pos]:
        # append word and ...
        words.append(word)
        # ... look at next position
        lookup(wordpos, pos+len(word), words)
        # when done, remove word and continue with next word from list
        words.pop()

# main data struct
# {pos: [word]}
wordpos = {p: [] for p in range(STRLEN)} # init
for word in WORDS:
    # search each word in STR & store its position(s) if found;
    # re.escape() keeps regex metacharacters in a word literal, and the
    # lookahead (?=...) also reports overlapping occurrences
    for pos in re.finditer("(?=" + re.escape(word) + ")", STR):
        wordpos[pos.start()].append(word)

# uncomment for debug
#for (pos, words) in sorted(wordpos.items()):
#    print(pos, words)
#print()

# go!
lookup(wordpos, 0, [])
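The script builds the map by searching the target string once per dictionary word. A possible alternative, shown only as a sketch and not taken from the script, is to go the other way round: cut every substring out of the target and look it up in a set. The function name build_wordpos_by_slicing and the parameter max_len are assumptions made for this example.

```python
def build_wordpos_by_slicing(target, vocabulary, max_len=None):
    """Map each start index to the vocabulary words beginning there.

    Instead of scanning target once per word, every substring of target
    (up to max_len characters) is looked up in a set.
    """
    vocabulary = set(vocabulary)
    if max_len is None:
        # no substring longer than the longest word can be a hit
        max_len = max((len(w) for w in vocabulary), default=0)
    wordpos = {p: [] for p in range(len(target))}
    for start in range(len(target)):
        last_end = min(len(target), start + max_len)
        for end in range(start + 1, last_end + 1):
            piece = target[start:end]
            if piece in vocabulary:
                wordpos[start].append(piece)
    return wordpos


# Hypothetical mini word list, just enough for the example string
sample = build_wordpos_by_slicing(
    "howdoyoudo", ["how", "ho", "ow", "do", "yo", "you"]
)
for pos, found in sample.items():
    print(pos, found)
```

The number of set lookups depends on the length of the target string and the longest word, not on the size of the word list. Note that the words at each position come out shortest first, so the results would be printed in a different order than by the script.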