Don't understand how RS functions in awk

joe228 · August 27, 2010, 11:31pm

I learned in this forum how to use RS in awk to extract a portion of a file, which is a wonderful solution to the problem. However, I don't understand exactly how it operates.

I don't quite understand how searching for /DATA2/ can result in extracting the whole section under "DATA2".

sample

DATA1
data11
data12

DATA2
data21
data22

DATA3
data31
data32

$ cat sample | awk 'BEGIN {RS=""} /DATA2/'
DATA2
data21
data22

Since RS is set to the empty string, I thought each line would now be treated as a field, so I expected printing $1 and $2 to give me DATA1 and data11, but it didn't. Instead, it returned what is shown below:

$ cat sample | awk 'BEGIN {RS=""} { print $1 }'
DATA1
DATA2
DATA3
$ cat sample | awk 'BEGIN {RS=""} { print $2 }'
data11
data21
data31

So, can someone explain why it behaves this way? Thanks!

Scrutinizer · August 28, 2010, 12:59am

It looks fine to me. In your latter examples you did not specify a pattern to select a record, so the action runs for every record and prints that field from each one. For comparison:

$ awk 'BEGIN {RS=""} /DATA2/{ print $1,$2 }' infile
DATA2 data21
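
If the goal was only the first record's fields (DATA1 and data11), a condition on the record number can select it in the same way the /DATA2/ pattern does. A minimal sketch, assuming the sample data is recreated in a temporary file; the variable names sample and first are made up for illustration:

```shell
# Recreate the sample from the first post in a temporary file (hypothetical name).
sample=$(mktemp)
printf 'DATA1\ndata11\ndata12\n\nDATA2\ndata21\ndata22\n\nDATA3\ndata31\ndata32\n' > "$sample"

# NR == 1 restricts the action to the first record only.
first=$(awk 'BEGIN { RS = "" } NR == 1 { print $1, $2 }' "$sample")
echo "$first"    # DATA1 data11

rm -f "$sample"
```

Without any pattern or condition in front of the braces, the action applies to all three records, which is why the original commands printed three lines.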

bartus11 · August 28, 2010, 3:17am

It behaves like that because setting RS to an empty string puts AWK into a special mode (often called paragraph mode), where records are separated by empty lines. So in your example you end up with three records:

DATA1        |
data11       |   1st record (NR=1)
data12       |

DATA2        |
data21       |   2nd record (NR=2)
data22       |

DATA3        |
data31       |   3rd record (NR=3)
data32       |

In that mode one more thing changes: a newline always acts as a field separator, in addition to the usual spaces and tabs. So inside each of those records you end up with three fields:

DATA1        <=  1st field ($1)
data11       <=  2nd field ($2)   
data12       <=  3rd field ($3)
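
The record and field layout described above can be checked by asking awk to print NR, NF and every field for each record. This is only a sketch, assuming the sample data is recreated in a temporary file; the variable names sample and layout are made up for illustration:

```shell
# Recreate the sample from the first post in a temporary file (hypothetical name).
sample=$(mktemp)
printf 'DATA1\ndata11\ndata12\n\nDATA2\ndata21\ndata22\n\nDATA3\ndata31\ndata32\n' > "$sample"

# For every record, print its number, its field count, and each field.
layout=$(awk 'BEGIN { RS = "" }
{
    printf "record %d has %d fields:", NR, NF
    for (i = 1; i <= NF; i++) printf " $%d=%s", i, $i
    printf "\n"
}' "$sample")
echo "$layout"
# record 1 has 3 fields: $1=DATA1 $2=data11 $3=data12
# record 2 has 3 fields: $1=DATA2 $2=data21 $3=data22
# record 3 has 3 fields: $1=DATA3 $2=data31 $3=data32

rm -f "$sample"
```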

I hope it cleared things up for you.

rdcwayx · August 28, 2010, 6:17am

The default RS is "\n" (newline), and it is used to separate the records. So by default, each line is a record.

If RS="", then blank lines are used as the record separator.
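
One way to see the difference is to count records both ways on the same input. A rough sketch, assuming the sample from the first post is recreated in a temporary file; the variable names sample, by_line and by_block are made up for illustration:

```shell
# Recreate the sample from the first post in a temporary file (hypothetical name).
sample=$(mktemp)
printf 'DATA1\ndata11\ndata12\n\nDATA2\ndata21\ndata22\n\nDATA3\ndata31\ndata32\n' > "$sample"

# Default RS: every line, including the two blank ones, counts as a record.
by_line=$(awk 'END { print NR }' "$sample")

# RS="": each block separated by blank lines counts as one record.
by_block=$(awk 'BEGIN { RS = "" } END { print NR }' "$sample")

echo "$by_line $by_block"    # 11 3

rm -f "$sample"
```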