AWK scripting

Muki101 · November 26, 2008, 8:24am

I have a text file in which the text is divided into paragraphs (two line breaks or a tab mark a new paragraph). I want to write a script that removes the line breaks within each paragraph and separates the paragraphs with two line breaks.

So, if my input file is:

     The first line.
Second line.

First line of the second paragraph.
Second line of the second paragraph.

I want the output to be something like:

The first line. Second line.

First line of the second paragraph. Second line of the second paragraph.

I have tried now for some hours to come up with something reasonable, but I seem to be heading the wrong way. I would be really pleased if someone gave their idea of how to solve the problem.

Thanks!

Tytalus · November 26, 2008, 8:38am

#  paste - - - <infile | sed G
     The first line.    Second line.

First line of the second paragraph.     Second line of the second paragraph.

Tytalus · November 26, 2008, 8:44am

or in nawk:

# nawk 'NR%3 {printf "%s ", $0;next}1(NR+1)%3{print"\n"}' infile
     The first line. Second line.

First line of the second paragraph. Second line of the second paragraph.

vgersh99 · November 26, 2008, 9:23am

And what if the paragraph is more than 2 lines long?

nawk 'BEGIN {FS=RS=""; ORS="\n\n\n"} $1=$1' infile

Muki101 · November 29, 2008, 12:25pm

The output of Tytalus' code is pretty much what I needed, but yes, I want it to work with paragraphs longer than 2 lines as well. I didn't understand the code well enough to change it properly, though. Could someone explain it a bit or give ideas of how to change it?

vgersh99 · November 29, 2008, 2:43pm

Have you tried my suggestion?

Franklin52 · November 29, 2008, 4:04pm

vgersh99, with your solution I get this:

$ nawk 'BEGIN {FS=RS=""; ORS="\n\n\n"} $1=$1' file
          T h e   f i r s t   l i n e .
 S e c o n d   l i n e .


F i r s t   l i n e   o f   t h e   s e c o n d   p a r a g r a p h .
 S e c o n d   l i n e   o f   t h e   s e c o n d   p a r a g r a p h .

Set the record separators to 2 newlines:

$ awk 'BEGIN {RS=ORS="\n\n"} $1=$1' file
The first line. Second line.

First line of the second paragraph. Second line of the second paragraph.

Regards

Muki101 · December 1, 2008, 5:44am

Thanks Franklin52, the code you gave works just fine, except for one problem: I want it to start a new paragraph when a tab is used as well.

So if the input file is:

First line of the second paragraph.
Second line of the second paragraph.
     Third as well.

The output is:

First line of the second paragraph. Second line of the second paragraph.
     
Third as well.

I tried adding "\t" to RS in addition to the "\n\n". Is that the right thing to do anyway? I guess it isn't, because when I replaced the "\n\n" with "\t" I expected it to make a new paragraph, but it didn't. So any further assistance is greatly appreciated.

Thanks in advance!

Franklin52 · December 1, 2008, 7:38am

You can leave the awk code unaltered. Translate the tabs to newlines with tr and pipe the output to the awk command:

tr '\t' '\n' < file | awk 'BEGIN {RS=ORS="\n\n"} $1=$1'

This is what I get:

$ cat file
    The first line.
Second line.

First line of the second paragraph.
Second line of the second paragraph.
    Third as well.
$
$ tr '\t' '\n' < file | awk 'BEGIN {RS=ORS="\n\n"} $1=$1'
The first line. Second line.

First line of the second paragraph. Second line of the second paragraph.

Third as well

Regards

Celos · December 1, 2008, 9:35am

Can someone explain the '$1=$1' part of the script? What exactly is it doing?

Franklin52 · December 1, 2008, 1:09pm

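It's a trick to force awk to rebuild the record in the buffer ($0). Assigning to a field makes awk rejoin all fields with OFS (a single space by default), so runs of whitespace inside the record, including the remaining newlines, collapse to single spaces. The assignment also acts as the pattern: it is true when $1 is non-empty, so the rebuilt record is printed, followed by the new ORS.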
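A small demonstration may make this easier to see. It is only a sketch: the sample string is made up, and it assumes any POSIX-style awk. Note that the pattern would be false for a record whose first field is "0" or empty, so such records would not be printed.

```shell
# Hypothetical sample: leading blanks, a tab and repeated spaces between words.
sample=$(printf '   one \t two    three')

# Without the assignment, the record is printed unchanged.
raw=$(printf '%s\n' "$sample" | awk '1')

# With $1=$1, awk rebuilds $0 from its fields, joined by OFS (a single space).
rebuilt=$(printf '%s\n' "$sample" | awk '$1=$1')

printf '[%s]\n[%s]\n' "$raw" "$rebuilt"
```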

Regards

summer_cherry · December 2, 2008, 4:39am

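Hi, I think Perl is a little bit easier than awk:

$/ = "\n\n";                      # read one paragraph at a time
open my $fh, '<', 'a.txt' or die "Cannot open a.txt: $!";
while (<$fh>) {
	s/\n+\z//;                # drop the trailing paragraph separator
	tr/\n/ /;                 # join the lines of the paragraph with spaces
	print $_, "\n\n";         # separate paragraphs with a blank line
}
close $fh;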

Muki101 · December 2, 2008, 11:07am

Thanks a lot Franklin52! Works just fine.

But I was wondering whether it's possible to do it entirely in AWK?

radoulov · December 2, 2008, 11:29am

You already have it entirely in AWK.
Just remove the tr part:

awk 'BEGIN{RS=ORS="\n\n"}$1=$1' infile

Edit: Keep in mind that most AWK implementations do not support multiple characters for RS.
If the AWK code provided by Franklin52 works for you, you are presumably using GNU AWK or tawk.
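For awks without multi-character RS support, a portable variant is possible. The following is a sketch, not something posted in this thread: it relies on the POSIX rule that an empty RS selects paragraph mode, where blank lines separate records and a newline always separates fields. Unlike the earlier RS="" attempt, FS is left at its default. The sample input is modelled on the first post, without tabs.

```shell
# Sample input modelled on the first post: two paragraphs split by a blank line.
input=$(printf 'The first line.\nSecond line.\n\nFirst line of the second paragraph.\nSecond line of the second paragraph.\n')

# RS="" selects paragraph mode; FS stays at its default.
# $1=$1 rejoins the fields with single spaces; ORS adds the blank line.
out=$(printf '%s\n' "$input" | awk 'BEGIN { RS = ""; ORS = "\n\n" } { $1 = $1; print }')

printf '%s\n' "$out"
```

This does not treat a tab as a paragraph break; it only replaces the two-character RS.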

Or, if Perl is acceptable:

perl -00ple'tr/\t//d;tr/\n/ /' infile

Muki101 · December 4, 2008, 10:01am

I have a question. My script is like this at the moment.

#!/usr/bin/awk -f

BEGIN {RS=ORS="\n\n"}


//{
      gsub("\t", "\n")
}

$1=$1


END {}

Why won't it make a new line? It recognizes the tabs, and if I replace them with, let's say, the letter "a", it works. But why won't it make a new line in this case?
As I understand it, this somehow conflicts with the $1=$1 part, because when I applied gsub to the input file and just printed the result, it worked.

Any suggestions of how to fix it?

Thanks!
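
A possible explanation, as far as I can tell: with the default FS a newline counts as whitespace, so the newline that gsub inserts becomes an ordinary field separator, and $1=$1 then rejoins the fields with single spaces and the newline disappears. One way around it is to split each paragraph on tabs first and normalise every piece separately. The script below is only a sketch under that assumption; the array name part and the sample input are made up for illustration, and it uses paragraph mode (RS="") instead of the two-character RS.

```shell
# Sample input modelled on the thread; \t in the format is a real tab character.
input=$(printf '\tThe first line.\nSecond line.\n\nFirst line of the second paragraph.\nSecond line of the second paragraph.\n\tThird as well.\n')

out=$(printf '%s\n' "$input" | awk '
BEGIN { RS = ""; ORS = "\n\n" }       # paragraph mode in, blank line out
{
    n = split($0, part, "\t")         # a tab also starts a new paragraph
    for (i = 1; i <= n; i++) {
        $0 = part[i]                  # re-split this piece into fields
        if (NF) {                     # skip empty pieces (e.g. a leading tab)
            $1 = $1                   # rejoin the fields with single spaces
            print
        }
    }
}')

printf '%s\n' "$out"
```

Every tab starts a new paragraph here, including a tab in the middle of a line.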