Sed or awk script to remove text or perform calculations on large CSV files

metronomadic · June 17, 2009, 1:08pm

I have large CSV files (e.g. 2 million records) and am hoping to do one of two things. I have been trying to use awk and sed, but I am a newbie and can't figure out how to get them to work. Any help you could offer would be greatly appreciated. I'm stuck trying to remove the colon and wildcards in sed, and the averaging sample I've found using awk is giving me values of around 4e08.

The CSV file looks like this:

Date,AIRCOMPRESSOR\FLARE_FLOW,AIRCOMPRESSOR\FLARE_TEMP
3/1/2008,1044.83215332,1090.88208008
3/1/2008 12:00:10 AM,1044.83215332,1090.88208008
3/1/2008 12:00:21 AM,1046.71142578,1090.88208008
3/1/2008 12:00:31 AM,1044.83215332,1090.88208008
3/1/2008 12:00:41 AM,1048.59057617,1083.96069336
3/1/2008 12:00:51 AM,1044.83215332,1083.96069336

I am hoping either to use sed or another script to remove the seconds portion of the data lines (i.e. remove ":10 AM" and all similar occurrences) or, preferably, to use awk to average the flow rates for each minute or each 15 minutes (i.e. the column right after the time).

Thanks in advance for any help you can offer.

vgersh99 · June 17, 2009, 1:20pm

something to start with:

nawk -F, '$1~":" {match($1,"\:[^:]*$"); $1=substr($1,1,RSTART-1)}1' OFS=, myFile
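In case it helps to see what the one-liner is doing, here is a spelled-out sketch of the same idea. It assumes plain awk behaves like nawk for this script; the function name strip_seconds and the sample rows piped into it are only for illustration.

```shell
# Hypothetical helper: reads CSV on stdin, writes the trimmed CSV to stdout.
strip_seconds() {
    awk -F, -v OFS=, '
    $1 ~ /:/ {                          # only rows whose first field contains a time
        match($1, /:[^:]*$/)            # locate the last colon and everything after it
        $1 = substr($1, 1, RSTART - 1)  # keep what comes before it, e.g. "3/1/2008 12:00"
    }
    { print }                           # print every row, changed or not
    '
}

# Sample rows copied from the question.
printf '%s\n' \
    'Date,AIRCOMPRESSOR\FLARE_FLOW,AIRCOMPRESSOR\FLARE_TEMP' \
    '3/1/2008,1044.83215332,1090.88208008' \
    '3/1/2008 12:00:10 AM,1044.83215332,1090.88208008' |
    strip_seconds
```

The header and the date-only row pass through untouched because their first field has no colon; the third row comes out as 3/1/2008 12:00,1044.83215332,1090.88208008.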

metronomadic · June 17, 2009, 2:02pm

Thanks for your prompt response, but it looks like I don't have nawk (I'm running Mac OS X). I'll see if I can get it through MacPorts and try again, but if there's any help that can be offered using awk, sed, or tr I know that I have those at my disposal.

EDIT: Installed nawk, and it worked like a charm. Thank you very much.

vgersh99 · June 17, 2009, 2:03pm

try 'awk' instead of 'nawk'.

ahmad.diab · June 17, 2009, 2:40pm

To remove the seconds (e.g. the :10 in 12:21:10), use the sed command below:

sed 's/.*:\([^,*]*\) AM/\1/g' file.txt
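One caveat with this pattern: the leading .*: is greedy, so it swallows everything up to the last colon, and the whole timestamp is replaced by just the seconds digits (3/1/2008 12:00:10 AM,... becomes 10,...). A minimal sketch that drops only the seconds and the AM/PM marker, assuming the timestamp is the first field and always ends in :SS AM or :SS PM as in the sample data (the function name strip_seconds_sed is made up for illustration):

```shell
# Hypothetical helper: reads CSV on stdin, writes the trimmed CSV to stdout.
strip_seconds_sed() {
    # Keep the first field up to the minutes, then swallow ":SS AM," or ":SS PM,"
    # and put the field separator back.
    sed 's/^\([^,]*\):[0-9][0-9] [AP]M,/\1,/'
}

# Sample rows copied from the question.
printf '%s\n' \
    '3/1/2008,1044.83215332,1090.88208008' \
    '3/1/2008 12:00:10 AM,1044.83215332,1090.88208008' |
    strip_seconds_sed
```

Rows without a time, such as the header and the date-only row, do not match and are left as they are.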

To get the total, use:

awk ' BEGIN{c=0} {a[$1]+=$2;b[$1]+=$3;c++} END{for (i in a) {print "Total", a/c,b/c} ' file.txt

BR

metronomadic · June 17, 2009, 3:21pm

ahmad.diab:

To remove the seconds (e.g. the :10 in 12:21:10), use the sed command below:
sed 's/.*:\([^,*]*\) AM/\1/g' file.txt
To get the total, use:
awk ' BEGIN{c=0} {a[$1]+=$2;b[$1]+=$3;c++} END{for (i in a) {print "Total", a/c,b/c} ' file.txt
BR

Thanks Ahmad. I tried the awk code (which I think needs an extra } to close out the for loop?), but I think it might be calculating something else. I am trying to get the average flow (column three) for each minute (or each 15-minute span) of each day. I am not sure I understand the code, but from the output it looks like it is gathering each day's worth of records and dividing them by the number of days?

I don't mean to be a bother, but can you tell me if this is what is going on?

ahmad.diab · June 17, 2009, 3:49pm

Sorry, please add the -F"," option and the [i] subscripts, as in the command below:

awk -F"," ' BEGIN{c=0} {a[$1]+=$2;b[$1]+=$3;c++} END{for (i in a) {print "Total", a[i]/c,b[i]/c}} ' file.txt
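Note that the command above keys on the full timestamp (seconds included) and divides every sum by c, the total number of lines read, so it does not yet give a per-minute average. A rough sketch of per-minute or per-15-minute averaging of the flow column follows. It assumes the timestamps look exactly like the sample (M/D/YYYY H:MM:SS AM/PM), that a date-only row means midnight, and that flow is the second field; the function name avg_flow and the 24-hour output format are my own choices.

```shell
# Hypothetical helper: avg_flow SPAN < file.csv
# SPAN is the bucket width in minutes (e.g. 1 or 15); CSV is read from stdin.
avg_flow() {
    awk -F, -v span="${1:-1}" '
    NR == 1 { next }                          # skip the header row
    {
        n = split($1, dt, " ")                # dt[1]=date, dt[2]=time, dt[3]=AM/PM
        if (n < 3) { h = 0; m = 0 }           # date-only row: assumed to be midnight
        else {
            split(dt[2], t, ":")
            h = t[1] % 12                     # 12 AM -> 0, 12 PM -> 0 (+12 below)
            if (dt[3] == "PM") h += 12
            m = t[2]
        }
        mins  = h * 60 + m                    # minutes since midnight
        start = mins - mins % span            # first minute of this bucket
        key = sprintf("%s %02d:%02d", dt[1], int(start / 60), start % 60)
        if (!(key in cnt)) order[++k] = key   # remember buckets in first-seen order
        sum[key] += $2                        # flow column
        cnt[key]++                            # rows in this bucket only
    }
    END {
        for (i = 1; i <= k; i++)
            printf "%s,%.4f\n", order[i], sum[order[i]] / cnt[order[i]]
    }
    '
}

# Sample rows copied from the question, averaged per minute.
printf '%s\n' \
    'Date,AIRCOMPRESSOR\FLARE_FLOW,AIRCOMPRESSOR\FLARE_TEMP' \
    '3/1/2008,1044.83215332,1090.88208008' \
    '3/1/2008 12:00:10 AM,1044.83215332,1090.88208008' \
    '3/1/2008 12:00:21 AM,1046.71142578,1090.88208008' |
    avg_flow 1
```

Calling avg_flow 15 instead groups rows into 15-minute spans starting at :00, :15, :30 and :45. Each sum is divided by the number of rows in its own bucket rather than by the total line count.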