sed random \n for "n" range of character occurrences

I'd like to insert a paragraph break (\n\n) after a random number of occurrences of the dot character (.), somewhere between 5 and 10, throughout an entire text file. How can I do that?

In other words, a new paragraph should start after every 5 to 10 sentences. The dot (.) is used only to end sentences in this file.

Thanks in advance for help.

Is this a homework assignment? Homework and coursework questions can only be posted in the Homework & Coursework forum with posts following special homework rules.

If you did post homework in the main forums, please repost your question in the proper forum as described above.

If this is not homework please explain:

  1. why you would want to do this to a normal text file,
  2. why you think sed would be an appropriate tool to calculate random numbers,
  3. what operating system you're using,
  4. what shell you're using, and
  5. what you have tried to solve this problem on your own?

Hi, thanks, it's not homework.

1) because I like to experiment with word lists and grammar.
2) because sed has a random function, and sed examples from forums have helped me so far.
3) Linux Mint.
4) Bash shell.
5) Something like this:
sed '(RANDOM(5~10)s/$/\n/g)'< in.txt > out.txt

I'm not aware of a random function in sed; which version do you use?

And, while your code sample (ignoring the RANDOM stuff) will add a <NL> char at some EOLs (ends of lines), in post #1 you specify that the breaks should follow dots. Am I right in guessing that dots are not necessarily at EOLs?

Would a non-sed solution be acceptable?

Hi, thanks for responding,

sed (GNU sed) 4.2.2

No, dots are not replaced. The idea is to put random paragraph breaks in a chunk of text, every 5 - 10 sentences, for example.

Yes I'm open to any other method.

Generally speaking, you have two problems to solve if you use sed for this. The first is generating a random number, which would probably take a great deal of effort (although it should be possible in principle, because sed is a Turing-complete language).

The second is that you are not working context-free. This is always a hassle, because sed has little to offer apart from a powerful regex engine. Such problems are generally easier to solve in awk, because it offers a programming language in addition to the regex engine.

I hope this helps.

bakunin

Please show us some sample input (i.e., the contents of the file in.txt), the output you are hoping to have produced (in the file out.txt), and the output that you did get (both any diagnostic messages produced and the output stored in out.txt) when you ran the command:

sed '(RANDOM(5~10)s/$/\n/g)'< in.txt > out.txt

on your system.

Running that command on my system produces the diagnostic message:

sed: 1: "(RANDOM(5~10)s/$/\n/g)": invalid command code (

I don't really know what I'm doing with that sed command; it's just showing my thought process as to what I want to achieve.

Input example is a blob of text with periods (.) marking sentences. The following output example would have random \n\n breaks, between every random 3 - 6 periods (for example):

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Nam nibh. Nunc  varius facilisis eros. Sed erat. 

In in velit quis arcu ornare laoreet.  Curabitur adipiscing luctus massa. Integer ut purus ac augue commodo  commodo. Nunc nec mi eu justo tempor consectetuer. Etiam vitae nisl. In  dignissim lacus ut ante. 

Cras elit lectus, bibendum a, adipiscing vitae,  commodo et, dui. Ut tincidunt tortor. Donec nonummy, enim in lacinia  pulvinar, velit tellus scelerisque augue, ac posuere libero urna eget  neque. 

Cras ipsum. Vestibulum pretium, lectus nec venenatis volutpat,  purus lectus ultrices risus, a condimentum risus mi et quam.  Pellentesque auctor fringilla neque. Duis eu massa ut lorem iaculis  vestibulum. Maecenas facilisis elit sed justo.

Given that the input is one single line of text, try

awk '
        {N = split ($0, T, ".")
         CNT = 1
         while (CNT <= N)       {RND = int(5*(1+rand()))
                                 for (i=CNT; i<CNT+RND && i<N; i++) printf "%s.", T[i]
                                 printf "\n\n"
                                 CNT += RND
                                }
        }
' file
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Nam nibh. Nunc  varius facilisis eros. Sed erat. In in velit quis arcu ornare laoreet.

  Curabitur adipiscing luctus massa. Integer ut purus ac augue commodo  commodo. Nunc nec mi eu justo tempor consectetuer. Etiam vitae nisl. In  dignissim lacus ut ante. Cras elit lectus, bibendum a, adipiscing vitae,  commodo et, dui.

 Ut tincidunt tortor. Donec nonummy, enim in lacinia  pulvinar, velit tellus scelerisque augue, ac posuere libero urna eget  neque. Cras ipsum. Vestibulum pretium, lectus nec venenatis volutpat,  purus lectus ultrices risus, a condimentum risus mi et quam.  Pellentesque auctor fringilla neque. Duis eu massa ut lorem iaculis  vestibulum. Maecenas facilisis elit sed justo.

Thank you RudiC, that is awesome! I see what you did here

{RND = int(5*(1+rand()))

as rand() is between 0 and 1 (so between 5 and 5*2 periods). I can adjust the values to create new ranges. T (or any other name, such as a) is the array that split() fills.
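A side note on the range, offered as a sketch rather than as part of RudiC's answer: rand() returns a value that is at least 0 but always less than 1, so int(5*(1+rand())) yields 5 through 9 and never 10. One common way to get an inclusive range is int(lo + rand()*(hi-lo+1)). Calling srand() first seeds the generator from the clock; without it, most awk implementations produce the same sequence on every run. The function name rand_range and the variables lo, hi and count are made up for this illustration.

```shell
# Hypothetical helper: print COUNT random integers in [LO, HI], inclusive.
rand_range() {
    awk -v lo="$1" -v hi="$2" -v count="$3" '
    BEGIN {
        srand()   # seed from the time of day so runs differ
        for (k = 0; k < count; k++)
            print int(lo + rand() * (hi - lo + 1))
    }'
}

# Example: twenty paragraph lengths between 5 and 10 sentences.
rand_range 5 10 20
```

In the loop above, the same expression could replace int(5*(1+rand())) if 10-sentence paragraphs are wanted.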

Yes, split($0, T, ".") creates an array named T with each element of T[] containing the text between periods in the input line. If you'd like to get rid of leading whitespace characters at the start of each paragraph, you might want to consider this slight modification of RudiC's suggestion:

awk '
{	N = split ($0, T, ".")
	CNT = 1
	while (CNT <= N) {
		RND = int(5*(1+rand()))
		for (i=CNT; i<CNT+RND && i<N; i++) {
			if(i == CNT) sub(/^[[:space:]]*/, "", T[i])
			printf "%s.", T[i]
		}
		printf "\n\n"
		CNT += RND
	}
}' file

If you'd like to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.

Note that on some versions of awk neither of these suggestions will work if the single-line input file is longer than 2048 bytes (or whatever the command:

getconf LINE_MAX

returns on your system if it isn't 2048).
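If the line-length limit is a concern, one possible workaround (a sketch under assumptions, not something tested on every awk) is to set the record separator to a dot, so awk reads one sentence per record instead of the whole file as one line. This also copes with input that is spread over several lines. It assumes the text ends with a period; a trailing fragment without one would have a period added. The function name random_paragraphs and the variables lo, hi, left and sep are invented for this example.

```shell
# Hypothetical helper: random_paragraphs FILE [LO] [HI]
# Reads sentences one at a time (RS=".") and starts a new paragraph
# after every LO..HI sentences (defaults 5 and 10).
random_paragraphs() {
    awk -v lo="${2:-5}" -v hi="${3:-10}" '
    BEGIN {
        RS = "."
        srand()
        left = int(lo + rand() * (hi - lo + 1))   # sentences until next break
    }
    {
        gsub(/\n/, " ")            # fold line breaks inside a sentence
        sub(/^[ \t]+/, "")         # drop leading blanks
        if ($0 == "") next         # skip the empty tail after the last dot
        printf "%s%s.", sep, $0
        sep = " "
        if (--left == 0) {         # paragraph is full: break and redraw
            printf "\n\n"
            sep = ""
            left = int(lo + rand() * (hi - lo + 1))
        }
    }
    END { if (sep != "") printf "\n" }   # finish a partial last paragraph
    ' "$1"
}
```

It would be called as random_paragraphs in.txt > out.txt, or with an explicit range such as random_paragraphs in.txt 3 6.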


Thanks very much Don. That's perfect; replaces this step I was doing:

awk '{$1=$1}1' in > left-trim
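For anyone comparing the two approaches, here is a small sketch of what that earlier step does. Assigning a field to itself makes awk rebuild the line from its fields joined by single spaces, so it trims both ends of each line and also collapses runs of blanks inside it, which the sub() call above does not. The function name squeeze_blanks is only for illustration.

```shell
# Hypothetical wrapper around the one-liner: normalize blanks on every line.
squeeze_blanks() {
    awk '{$1=$1}1' "$@"
}

# Leading, trailing and repeated inner blanks are all reduced.
printf '   Nam nibh.   Sed  erat.  \n' | squeeze_blanks
# prints "Nam nibh. Sed erat."
```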