Linux command sed

Hi,

I have a file containing a whole text, but there are no spaces between the words. How can I, using the sed command, add spaces between the words?

For example, my file contains the text:

himynameisjoenicetomeetyou

I would like to use the sed command to add spaces between the right words so it becomes:

hi my name is joe nice to meet you.

thank you

@ProdiJay, welcome.
A few minor points to begin with: read the forum rules/guidelines w.r.t. posts/requests/formatting.

As for this small sample: it is easy for a human, but sed is not a grammar parser, so something other than (or in addition to) sed will be needed.

Have you made any attempts? (Share them if so; try some if not.) The forum is a collaboration: we do not, in general, write solutions without requester participation, i.e. post your attempts regardless of success/failure; that shows your commitment to the challenge at hand.

  • Hint: a supporting dictionary would be helpful

  • Is this homework? (Allowed, but good to know if it is.)

This can be rather indeterminate.

neverthelessoncerealisemotive

nevertheless once realise motive
never the lesson cereal is emotive

Even the first 3 characters in your example show a problem. Does it start "hi" or "him"? You don't find out until you discover there are no words in your dictionary that start with "yn".

I think this might require some esoteric DSA (data structures and algorithms) to get an efficient solution (an inefficient one could run for a lifetime, I feel). Possibly start by parsing a complete dictionary into a trie [sic] to look up possible initial sequences of words, and put your possible multiple solutions into a tree [sic] so you can prune off partial solutions that cannot be completed.
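To make that concrete, here is a rough Python sketch of the trie idea (not sed, and only an illustration): it loads a word list into a nested-dict trie and recursively segments the input, abandoning any branch that leaves the trie. The path /usr/share/dict/words, the END marker and the function names build_trie()/segment() are my own choices for the sketch, not part of any standard recipe.

import sys

END = object()  # marker key meaning "a complete word ends at this node"

def build_trie(words):
    # nested dicts keyed by character; END marks the end of a word
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def segment(text, trie, start=0, partial=None, out=None):
    # recursively split text into dictionary words, pruning dead branches early
    if partial is None:
        partial, out = [], []
    if start == len(text):
        out.append(" ".join(partial))
        return out
    node = trie
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:      # no dictionary word continues this way: prune
            break
        if END in node:       # a complete word ends here: recurse on the remainder
            segment(text, trie, i + 1, partial + [text[start:i + 1]], out)
    return out

if __name__ == "__main__":
    with open("/usr/share/dict/words") as fin:
        words = {w.strip().lower() for w in fin if w.strip().isalpha()}
    # drop single letters other than 'a' and 'i'; they explode the search
    words = {w for w in words if len(w) > 1 or w in ("a", "i")}
    for split in segment(sys.argv[1].lower(), build_trie(words)):
        print(split)

Called as, say, python3 trie_split.py himynameisjoe, it prints every segmentation the word list allows; the pruning keeps the search from exploring prefixes that can never complete.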

Any abbreviation (like DSA), proper noun (like Gladys), or misspelling (like mispelling) is very likely to terminate your algorithm prematurely with no matches. Plurals of nouns and tenses of verbs also tend not to be given in dictionaries.

I also considered whether there were "rare" letter pairs that might indicate word boundaries, like your "himyname". I scanned a million lines of text, and found 158 reasonably common words that broke this idea (examples below with frequencies). Nevertheless, scanning a complete dictionary for all letter pairs, counting their frequency, and looking for text that contained (say) "yn" and was not in that list, might be used to break up a long text into shorter sections at some word boundaries.

cynical             94
dynamic            187
keynote             45
laryngitis           9
photosynthesis      10
shyness              3
syndicate          205
syndrome           329
synthetic          143
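For what it's worth, here is a rough Python sketch of just the counting-and-flagging half of that idea; it does not implement the "check against the list of exception words" refinement, and the word-list path and the cut-off of 5 are arbitrary choices for illustration.

import sys
from collections import Counter

with open("/usr/share/dict/words") as fin:
    words = [w.strip().lower() for w in fin if w.strip().isalpha()]

# count every adjacent letter pair that occurs inside a dictionary word
pair_freq = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))

text = sys.argv[1].lower()
# pairs that (almost) never occur inside a word are candidate boundaries
for i in range(len(text) - 1):
    pair = text[i:i + 2]
    if pair_freq[pair] < 5:    # arbitrary cut-off, tune to taste
        print("possible word boundary inside %r at position %d" % (pair, i))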

If you were looking for something to write a PhD thesis on, this could be it.


Hi @ProdiJay,

As already noted by our team members, sed is useless here on its own. You need at least a word list with the most common English words. A so-called AI can probably do that now, too.

Here is a relatively simple brute-force method that does the following (see also the comments in the code):

  • First a word list is read in and adjusted a bit.
  • Then a dictionary/map is created that contains, for each position/index (starting at 0), a list of all possible words at that position. For the string "howdoyoudo" it looks like this:
0 ['how', 'ho']
1 ['ow']
2 []
3 ['do']
4 []
5 ['yo', 'you']
6 []
7 []
8 ['do']
9 []
  • Then search recursively: add the first word from the first position's list to the checklist. Every time the lookup() function is called, the checklist is joined into a string and compared with the target string, and printed on a match. Then add the first word from the next non-empty position's list, and so on. Once all positions have been processed, go back to the last position and try the next word from its list, etc. It then looks like this:
0 []
3 ['how']
5 ['how', 'do']
7 ['how', 'do', 'yo']
8 ['how', 'do', 'you']
10 ['how', 'do', 'you', 'do']
how do you do
2 ['ho']

For example, the checklist ["how", "ow"] ([pos0/word0, pos1/word0]) does not appear because "howow" does not match the target string.

The main problem is short words, as you can already see with "himyname":

hi myna me
hi my name
hi my na me

Eliminating that might be difficult, though 🙂 And, of course, for large strings this may take some time...

Call the script via python3 script.py target_string.

#!/usr/bin/python3

import sys
import re


STR = sys.argv[1].lower()
STRLEN = len(STR)

# from wamerican package (Debian based)
with open("/usr/share/dict/american-english") as fin:
    # convert words into lowercase, remove possible duplicates via set()
    WORDS = set(ln.strip().lower() for ln in fin)
    # filter out words containing an apostrophe and single-character words except 'a' and 'i'
    # (parentheses make the precedence explicit and keep empty lines out)
    WORDS = [w for w in WORDS if ("'" not in w and len(w) > 1) or w in ("a", "i")]

def lookup(wordpos, pos, words):
    # uncomment for debugging
    #print(pos, words)
    prefix = "".join(words)
    # return immediately if the leading chars don't match
    if not STR.startswith(prefix):
        return
    # else check the length and for an exact match
    if len(prefix) >= STRLEN:
        # found!
        if prefix == STR:
            print(" ".join(words))
        return
    # else loop over all possible words at current position (list may be empty)
    for word in wordpos[pos]:
        # append word and ...
        words.append(word)
        # ... look at next position
        lookup(wordpos, pos+len(word), words)
        # when done, remove word and continue with next word from list
        words.pop()

# main data struct
# {pos: [word]}
wordpos = {p: [] for p in range(STRLEN)}  # init
for word in WORDS:
    # search each word in STR & store its position(s) if found
    # (lookahead so that overlapping occurrences are found too, re.escape to be safe)
    for m in re.finditer("(?=" + re.escape(word) + ")", STR):
        wordpos[m.start()].append(word)
# uncomment for debugging
#for (pos, words) in sorted(wordpos.items()):
#    print(pos, words)
#print()

# go!
lookup(wordpos, 0, [])
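For example, with the "himyname" fragment from above (the exact lines and their order depend on the installed word list):

$ python3 script.py himyname
hi myna me
hi my name
hi my na me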
