Edit a file using awk ?

jimbob01 · February 2, 2012, 2:26pm

Hey guys,

I'm trying to learn a bit of awk/sed and I'm using different sites to learn it from, and I think I'm starting to get confused (doesn't take much!).

Anyway, say I have a csv file which has something along the lines of the following in it:

"test","127.0.0.1","startup timestamp",,,,"1327702381482",
"test","127.0.0.1","cpu combined","cpu 0",,,,"0.0900"
"test","127.0.0.1","cpu idle","cpu 0",,,,"0.9100"
"test","127.0.0.1","cpu nice","cpu 0",,,,"0.0000"
"test","127.0.0.1","cpu sys","cpu 0",,,,"0.0360"
"test","127.0.0.1","cpu user","cpu 0",,,,"0.0540"
"test","127.0.0.1","cpu wait","cpu 0",,,,"0.0010"

Basically, what I want is to edit the file via a script rather than manually. For the first line, for instance, all I want left on that line is startup timestamp,1327702381482 and on the second line cpu combined,0.0900 and so on, so that the file now looks something like:

startup timestamp,1327702381482
cpu combined,0.0900
cpu idle,0.9100
cpu nice,0.0000
cpu sys,0.0360
cpu user,0.0540
cpu wait,0.0010

Anyway, while learning, I've tried various different commands to do this, so, say for the first line, I tried the following (and it didn't work!):

 awk '{if (NR==1) print{"$19"} print{"$24"} print{"$26"}}' myfile.csv > mynewfile.csv

Something tells me I'm hopeless at this! Any help would be gratefully received before it drives me insane!

Also, I read that the O'Reilly sed & awk (second edition) book is worth buying. Would any of you guys recommend it? I looked it up on Amazon and it was published in 1997, but that seems to be the current edition. Or, if you guys could recommend another book on sed/awk, that would be much appreciated!

Cheers

Jim

bartus11 · February 2, 2012, 2:35pm

Try:

awk -F, -vOFS="," '{print $3,$8}' myfile.csv > mynewfile.csv

Corona688 · February 2, 2012, 2:39pm

First off, awk doesn't know that it's supposed to split on commas unless you tell it. It won't know to print out commas unless you tell it, either.

# -F controls the input separator.  You want ,
# -v can set any variable inside the program before it starts running.
# This includes special variables like OFS, the output separator.
awk -F, -v OFS="," ...
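To make the difference visible, here is a small sketch using one line of the sample data. The variable names (line, no_f, no_ofs, both) are just made up for the illustration.

```shell
line='"test","127.0.0.1","cpu idle","cpu 0",,,,"0.9100"'

# No -F: awk splits on whitespace, so $3 is not the third CSV column.
no_f=$(printf '%s\n' "$line" | awk '{ print $3 }')

# -F, but no OFS: the right fields, but joined with a space on output.
no_ofs=$(printf '%s\n' "$line" | awk -F, '{ print $3, $8 }')

# -F, and OFS: comma-separated in, comma-separated out.
both=$(printf '%s\n' "$line" | awk -F, -v OFS="," '{ print $3, $8 }')

printf '%s\n' "$no_f" "$no_ofs" "$both"
# 0",,,,"0.9100"
# "cpu idle" "0.9100"
# "cpu idle","0.9100"
```

The quotation marks are still there in all three cases; stripping them is a separate step.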

Second, instead of one big if/then statement, you can write separate statements triggered by different situations. Putting a condition in front of a code block controls when it gets run. For each line, the statements are checked in order.

So:

$ awk -F, -v OFS="," '
# This statement runs first, every line.  Deletes all quotation marks.
{ for(N=1; N<=NF; N++) gsub(/"/,"", $N); }
# This statement runs only on the first line.  Print fields 3 and 7.
NR==1{ print $3, $7 }
# This only runs for every line thereafter.  Print fields 3 and 8.
NR>1 { print $3, $8 }' data

startup timestamp,1327702381482
cpu combined,0.0900
cpu idle,0.9100
cpu nice,0.0000
cpu sys,0.0360
cpu user,0.0540
cpu wait,0.0010

$

jimbob01 · February 2, 2012, 3:06pm

Thanx guys, that was exactly the sort of information I was after!

Yous wouldn't happen to know of any decent tutorials on the net about awk ? Think the ones I've been reading might not be that great.

Jim

Corona688 · February 2, 2012, 3:32pm

The Linux manual page for it is a wealth of information, documenting a lot of syntax, all the special built-in variables, every built-in function, etc. It's far too much information to give someone who's never used it before, since it's gibberish out of context, but now that you know a little of the basics I think it'd be very helpful. It's also a good reference.

One thing I might want to clear up is that $ does NOT mean variable. Variables in awk are just names, like abc=32; . $ is actually an operator which means "turn a number into a field".

$3 turns the number 3 into field number 3. You could also set X=3 then use $X to get the third field. That's what I'm doing in my for loop, and why I'm using N in most places but $N in one particular place. You can even do expressions in it, like $(X+3), which would get you field 6 if X was 3. NF is the number of fields, and since fields start at 1, $NF is the very last field.
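A quick sketch of $ as an operator, run on a made-up six-field line (the input and the shell variable names are only for illustration):

```shell
line='a,b,c,d,e,f'

# X is an ordinary variable holding a number; $X turns it into a field.
third=$(printf '%s\n' "$line" | awk -F, '{ X = 3; print $X }')

# Expressions work too: $(X+3) is field 6 when X is 3.
sixth=$(printf '%s\n' "$line" | awk -F, '{ X = 3; print $(X+3) }')

# NF is the number of fields, so $NF is the last one.
last=$(printf '%s\n' "$line" | awk -F, '{ print NF, $NF }')

printf '%s\n' "$third" "$sixth" "$last"
# c
# f
# 6 f
```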

You can write to fields, too. $1="asdf" is perfectly fine to do.

$0 means the entire current line. You can write to it too. Changes made to $1, etc turn up in $0 where you'd expect them to, and vice versa.
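For example, a minimal sketch of writing to a field and to $0 (the sample text is made up):

```shell
# Assigning to a field rebuilds $0, joining the fields with OFS.
rebuilt=$(echo 'one two three' | awk -v OFS="-" '{ $1 = "asdf"; print $0 }')

# Assigning to $0 re-splits the line into fields.
resplit=$(echo 'one two three' | awk '{ $0 = "x y"; print NF, $2 }')

printf '%s\n' "$rebuilt" "$resplit"
# asdf-two-three
# 2 y
```

Note that the rebuilt $0 uses OFS between fields, not whatever separator was in the input.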

Technically, awk isn't "line-based", just "record-based". By default it uses \n as its record separator, but that's a special variable too, RS, which you can set as you please. Setting it blank makes it split upon blank lines.
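Here is a rough sketch of the blank-RS case, with two made-up records separated by a blank line. With RS blank, newlines also act as field separators, so each multi-line block is split into fields as if it were one line.

```shell
# Two records of two lines each, separated by a blank line.
records=$(printf 'name: a\nsize: 1\n\nname: b\nsize: 2\n' |
  awk 'BEGIN { RS = "" } { print NR ": " $2 " " $4 }')

printf '%s\n' "$records"
# 1: a 1
# 2: b 2
```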

jimbob01 · February 2, 2012, 4:13pm

Thanx Corona688, that helps a lot.

I'll check out the man page/s for it in a little while. I'm coming from a Windows background, and been learning Linux (to enhance career opportunities) and I wish I had converted sooner - Linux is so much more powerful and fascinating. These forums, to me, are invaluable.

One question about the awk script you posted above. Say, for instance, in another script there were more commas separating the data on a different line. Could I use an 'if NR>10 && NR<20' in there too, or would that not work ? If it wouldn't work what would be the best approach ?

Just been reading this blog article about cleaning csv data with awk, but seems to be confusing me even more! Think I need sleep.

Cheers

Jim

Corona688 · February 2, 2012, 4:22pm

It would work fine. You can put expressions of any complexity you want in there, including brackets, variables, and regular expressions.

You might want a >= or <= in there somewhere so you don't leave out line 10 or 20 by accident.
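To show the difference, here's a small sketch that counts how many of 25 generated lines each condition matches (the generated input is made up for illustration):

```shell
# Generate 25 numbered lines to test against.
gen() { awk 'BEGIN { for (i = 1; i <= 25; i++) print "line", i }'; }

# Strict comparisons skip lines 10 and 20 themselves.
exclusive=$(gen | awk 'NR > 10 && NR < 20 { n++ } END { print n + 0 }')

# >= and <= include both ends of the range.
inclusive=$(gen | awk 'NR >= 10 && NR <= 20 { n++ } END { print n + 0 }')

printf '%s\n' "$exclusive" "$inclusive"
# 9
# 11
```

As with NR==1 and NR>1, the condition goes in front of the code block; no 'if' is needed.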

Think of BEGIN as just another statement. The code block with BEGIN in front of it gets run whenever it's true, and it's true once the program has finished loading but before any lines are processed. It's handy for setting things up. There's an equal and opposite END one, too, so you can have an awk script that does Z += $3 for a thousand lines, then prints the total in the END section.
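A minimal sketch of that totalling idea, with three made-up lines instead of a thousand:

```shell
# BEGIN runs before any input, the middle block runs per line, END runs last.
total=$(printf 'a b 10\nc d 20\ne f 12\n' |
  awk 'BEGIN { Z = 0 } { Z += $3 } END { print Z }')

echo "$total"
# 42
```

The BEGIN block here is optional, since an unset variable counts as 0 in arithmetic, but it shows where setup code would go.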

The FS variable is what you're setting with the -F flag. You can set it in BEGIN as easily as anywhere else; it's up to your preferences. By default, awk splits fields on runs of spaces and tabs.
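A short sketch comparing the two ways of setting FS with the default behaviour, on a made-up line that contains both a comma and a space:

```shell
line='a,b c,d'

# -F on the command line and FS in BEGIN do the same job.
with_flag=$(printf '%s\n' "$line" | awk -F, '{ print $2 }')
with_begin=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "," } { print $2 }')

# With no FS set, the line is split on whitespace instead.
default_fs=$(printf '%s\n' "$line" | awk '{ print $2 }')

printf '%s\n' "$with_flag" "$with_begin" "$default_fs"
# b c
# b c
# c,d
```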

You can use regular expressions, too. awk '/asdf/ { print $3 }' for instance would only print the third field in lines containing 'asdf'.

And if you leave off the code block completely, it becomes a control for when lines are printed. awk '/asdf/' is equivalent to grep 'asdf' for example. So if you're doing grep | awk, you can probably just put the entire thing in awk somehow...
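For instance, a rough sketch with three made-up lines, showing a regex pattern with and without a code block, and a grep | awk pipeline folded into a single awk:

```shell
data='x asdf 1
y qwer 2
z asdf 3'

# Pattern plus code block: third field of matching lines only.
thirds=$(printf '%s\n' "$data" | awk '/asdf/ { print $3 }')

# Pattern alone: matching lines are printed whole, like grep.
via_awk=$(printf '%s\n' "$data" | awk '/asdf/')
via_grep=$(printf '%s\n' "$data" | grep 'asdf')

# grep | awk and the single-awk equivalent.
piped=$(printf '%s\n' "$data" | grep 'asdf' | awk '{ print $1 }')
single=$(printf '%s\n' "$data" | awk '/asdf/ { print $1 }')

printf '%s\n' "$thirds"
# 1
# 3
```

Here via_awk and via_grep come out the same, and so do piped and single.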