awk - treating remaining columns as one

ppucci · November 13, 2012, 6:13pm

Hi all,

For no particular reason, I would like to use awk on a file that contains multiple columns separated by spaces. Only columns 1 and 2 hold single values; the remainder of each line is free text that I would like to treat as one column, e.g.:

alpha 200 this is a comment for this record
bravo 400 this is another comment for this record

I would like awk to output $1, $2 and $3 as

$1 = alpha
$2 = 200
$3 = this is a comment for this record

Ideas?

ctsgnb · November 13, 2012, 6:30pm

Assuming your comments do not contain double quotes (or single quotes or backslashes, which xargs also interprets):

sed 's/ / "/2;s/$/"/' yourfile | xargs -n1

ppucci · November 13, 2012, 6:32pm

nice, but I was looking for a way to do it in awk... any ideas?

ctsgnb · November 13, 2012, 6:36pm

awk '{a=$1"\n"$2;sub(".*"$3,$3);print a"\n"$0}' yourfile

awk 'sub(".*"$3,$1RS$2RS$3)' yourfile

ppucci · November 13, 2012, 6:41pm

That is nice... would it be too much trouble to explain what each part does? (Sorry, I am really a newcomer.)

balajesuri · November 13, 2012, 6:45pm

This doesn't capture all fields from $3 to NF. Please re-check.

Here's one:

awk '{a=$1; b=$2; c=$3; for(i=4;i<=NF;i++){c=c" "$i}; print a"\n"b"\n"c} ' file

ppucci · November 13, 2012, 6:55pm

That looks better. Sorry, maybe I did not express myself correctly: I still need the columns to appear on the same row. I just want awk to treat everything from $3 onward (however many delimiters that line has, i.e. $4, $5, $6, etc.) as $3, so the output would be $1 $2 $3, e.g.

print $3 would print "this is a comment for this record"
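
One way to get that on a single row, sketched here on the assumption that fields are separated by spaces or tabs: strip the first two fields from a copy of $0 with an anchored regex and keep what is left in a variable. The names rest_as_third and rest are made up for this illustration, and rest stands in for the "$3" being asked for.

```shell
# rest_as_third is a hypothetical helper name, not something from the thread.
rest_as_third() {
  awk '{
    rest = $0    # work on a copy so $1 and $2 stay intact
    # Strip optional leading blanks, field 1, its separator, field 2, its separator.
    sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+/, "", rest)
    print $1 "|" $2 "|" rest    # "|" is only there to make the three columns visible
  }' "$@"
}

printf '%s\n' \
  'alpha 200 this is a comment for this record' \
  'bravo 400 this is another comment for this record' | rest_as_third
# alpha|200|this is a comment for this record
# bravo|400|this is another comment for this record
```

Because the regex is anchored at the start of the line, repeated words inside the comment do not matter. Lines with fewer than three fields are not handled specially in this sketch.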

ppucci · November 13, 2012, 7:25pm

I am almost achieving my desired output, however I am getting something funny.

Consider

alpha 100 this is a comment
bravo 200 this is another comment

When using

awk '{print $2,sub (".*"$3,$3) $0}' filename

I am getting

100 1this is a comment
200 1this is another comment

Where is the "1" between the columns coming from?

balajesuri · November 13, 2012, 7:38pm

@ctsgnb: It does work for file t1 in post #8. You might also want to take a look at this. Maybe the awk installed on my system is having a hangover:

[root@host dir]# cat file
alpha 200 this is a comment for this record
bravo 400 this is another comment for this record
[root@host dir]#
[root@host dir]# awk 'sub(".*"$3,$1RS$2RS$3)' file
alpha
200
this record
bravo
400
this record
[root@host dir]#

Instead of "this is a comment for this record", it just prints "this record". I'm using GNU Awk 3.1.5

ctsgnb · November 13, 2012, 8:18pm

@Bala :

Ah yup, not the same version here: it was working with GNU awk 3.1.1

Could you try to run

awk 'sub(".*"$3,$1 RS $2 RS $3,$0)' yourfile

Just to see if it works better ?

---------- Post updated at 02:14 AM ---------- Previous update was at 02:06 AM ----------

@ Bala,

By the way, I am also curious to see what output you get by running:

awk '{a=$1"\n"$2;sub(".*"$3,$3);print a"\n"$0}' yourfile

---------- Post updated at 02:18 AM ---------- Previous update was at 02:14 AM ----------

@pucci:

Fix your code :

awk '{print $1"\n"$2;sub(".*"$3,$3);print $0}' filename

or

awk '{print $1 RS $2 RS ((sub(".*"$3,$3))?$0:z)}' filename
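
As for where the stray "1" came from: sub() edits its target ($0 by default) in place and returns the number of substitutions it made, which is 0 or 1. In print $2,sub(...) $0 that return value is concatenated with the already modified $0. A minimal sketch to make this visible (show_sub_result is a hypothetical name used only here):

```shell
# show_sub_result is a hypothetical helper name for this illustration.
show_sub_result() {
  awk '{
    n = sub(".*" $3, $3)    # edits $0 in place; n is the number of replacements
    print "sub() returned: " n
    print "$0 is now: " $0
  }' "$@"
}

printf 'alpha 100 this is a comment\n' | show_sub_result
# sub() returned: 1
# $0 is now: this is a comment
```

This is why the fixes above either call sub() as a separate statement or use its return value only as a condition.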

Scrutinizer · November 13, 2012, 8:36pm

How about:

awk '{$1=$1; sub(FS,RS); sub(FS,RS)}1' infile

balajesuri · November 13, 2012, 8:51pm

Same. I tried on a different version too (CYGWIN, GNU Awk 4.0.0) and got the same result. Baffling!

[user@home-pc ~]$ cat file
alpha 200 this is a comment for this record
bravo 400 this is another comment for this record
[user@home-pc ~]$
[user@home-pc ~]$ awk '{a=$1"\n"$2;sub(".*"$3,$3);print a"\n"$0}' file
alpha
200
this record
bravo
400
this record
[user@home-pc ~]$

Strange that it works on GNU Awk 3.1.1! Anyway, this one by Scrutinizer is pretty creative. Good one, mate!

awk '{$1=$1; sub(FS,RS); sub(FS,RS)}1'

Scrutinizer · November 14, 2012, 1:10am

@cts, @bala: the difference occurs not because of awk versions but because you are using different data samples. With bala's data sample this part of ctsgnb's code is problematic: sub(".*"$3,$3), which matches up to the last occurrence of "this" because of greedy matching.

--
@cts: nice sed/xargs
@bala: thanks
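
The greedy match described in this post can be made visible with match(), which sets RSTART and RLENGTH to the extent of what the regex consumed. This is only a sketch for inspection; show_greedy is a hypothetical name.

```shell
# show_greedy is a hypothetical helper name for this illustration.
show_greedy() {
  awk '{
    re = ".*" $3    # with the sample line, $3 is "this", so re is ".*this"
    if (match($0, re))
      print "matched: [" substr($0, RSTART, RLENGTH) "]"
    sub(re, $3)     # replace everything that was matched with $3
    print "left: [" $0 "]"
  }' "$@"
}

printf 'alpha 200 this is a comment for this record\n' | show_greedy
# matched: [alpha 200 this is a comment for this]
# left: [this record]
```

The ".*" runs as far right as it can while still letting "this" match, so it swallows the first "this" and stops at the second one.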

ctsgnb · November 14, 2012, 3:11am

@Scruti

Ah OK, I get it now (gosh, how did I miss it!), thanks!

By the way, in the code:

awk '{$1=$1; sub(FS,RS); sub(FS,RS)}1' infile

The $1=$1 is not necessary, is it ?

Scrutinizer · November 14, 2012, 9:30am

The $1=$1 is not strictly necessary with the sample provided, but it makes the script more robust: if the data were to include TABs or multiple spaces, or if there were a space before the first field, it might break otherwise...

ctsgnb · November 14, 2012, 9:48am

Hi Scruti,

OK, so you mean that $1=$1 trims the blanks around the retained $1? Or am I misunderstanding?

Do you have a short example to illustrate what you mean?

Thanks in advance for your time and your help.

Scrutinizer · November 14, 2012, 10:15am

Yes, the $1=$1 trims the blanks (assigning to a field makes awk rebuild $0 with the fields joined by single spaces), so that we can be sure there is no leading whitespace and only a single space separating the fields, so that the subs can succeed.

For example:

$ printf 'alpha 200 this is a comment for this record\n' | awk '{sub(FS,RS); sub(FS,RS)}1'
alpha
200
this is a comment for this record
$ printf '  alpha \t 200   this is a comment for this record\n' | awk '{sub(FS,RS); sub(FS,RS)}1'


alpha 	 200   this is a comment for this record
$ printf '  alpha \t 200   this is a comment for this record\n' | awk '{$1=$1; sub(FS,RS); sub(FS,RS)}1'
alpha
200
this is a comment for this record

only4satish · November 14, 2012, 11:30am

Could you please explain how the code below works?

sub(FS,RS); sub(FS,RS)

awk '{$1=$1; sub(FS,RS); sub(FS,RS)}1' infile

ctsgnb · November 14, 2012, 12:07pm

sub(FS,RS) substitutes the Field Separator (whose default value is a space " ") with the Record Separator (whose default value is a newline "\n").

It does this substitution only once, so only the first FS encountered is changed into an RS.
That is why it is important to make sure that the first FS encountered is the one between the first field $1 and the second field $2.
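
To watch the explanation above happen one step at a time, here is a small trace. It is a sketch rather than the original command: it substitutes "|" where the real one-liner uses RS, purely so that each intermediate $0 stays on one line, and trace_subs is a hypothetical name.

```shell
# trace_subs is a hypothetical helper name for this illustration.
trace_subs() {
  awk '{
    $1 = $1         # rebuild $0 with single spaces between fields
    print "after $1=$1: [" $0 "]"
    sub(FS, "|")    # replaces the first space only
    print "after 1st sub: [" $0 "]"
    sub(FS, "|")    # replaces the next remaining space
    print "after 2nd sub: [" $0 "]"
  }' "$@"
}

printf '  alpha \t 200   this is a comment\n' | trace_subs
# after $1=$1: [alpha 200 this is a comment]
# after 1st sub: [alpha|200 this is a comment]
# after 2nd sub: [alpha|200|this is a comment]
```

In the real command the trailing 1 is a pattern that is always true with no action attached, so awk applies its default action and prints the modified $0.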

only4satish · November 15, 2012, 8:04am

# cat t1
1 2 This is a comment

awk '{a=$1"\n"$2;sub(".*"$3,$3);print a"\n"$0}'  t1

sub(".*"$3,$3) replaces, in the current line ($0), everything up to and including $3 with the current value of $3
print a"\n"$0 prints the variable a and the current line, separated by a newline

Referring to the file t1 above, the current value of $3 is 'this', right? I am a bit confused; could you please explain how this code works?

sub(".*"$3,$3)

Will it return the value 'this' or 'This is a comment'?

Please help me, I am a newbie.

---------- Post updated at 06:34 PM ---------- Previous update was at 04:45 PM ----------

Got it.
It looks like the above code does not give the desired result if the file contains a string ("this") that is repeated, as below:

# cat t1
1 2 This is a comment this is not a comment this this1

awk '{a=$1"\n"$2;sub(".*"$3,$3);print a"\n"$0}' t1
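
One detail worth checking on this sample: awk regexes are case-sensitive by default, so whether the repeat bites depends on the capitalisation. The sketch below runs the same command on the line as typed (capital "This" in $3) and on an all-lowercase variant, which is an assumption added here for comparison; greedy_third is a hypothetical name.

```shell
# greedy_third wraps the command from the post; the name is hypothetical.
greedy_third() {
  awk '{a=$1"\n"$2; sub(".*"$3,$3); print a"\n"$0}' "$@"
}

# As typed above: $3 is "This", which occurs only once with a capital T.
printf '1 2 This is a comment this is not a comment this this1\n' | greedy_third
# 1
# 2
# This is a comment this is not a comment this this1

# Assumed all-lowercase variant: ".*this" now reaches the "this" inside "this1".
printf '1 2 this is a comment this is not a comment this this1\n' | greedy_third
# 1
# 2
# this1
```

So the failure shows up as soon as $3 reappears later in the line with the same case, even as part of a longer word.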