Removing repeating lines from a data frame (AWK)

gd9629 · July 19, 2011, 5:33am

Hey Guys!

I have written a code which combines lots of files into one big file(.csv).

However, each of the original files had headers on the first line, and now that I've combined the files the headers are interspersed throughout the new combined data frame. For example, throughout the data I will have a line;

//DATE TIME

FRAC_DAYS_SINCE_JAN1

FRAC_HRS_SINCE_JAN1 EPOCH_TIME

ALARM_STATUS

with each of these headers in a different column.

Is there a way of removing these lines when these headers arise?

Thanks!

Skrynesaver · July 19, 2011, 5:47am

Presuming your CSV file has all the columns from the various files alligned then would the following work

grep -v '^//DATE'

or is it more complex?

bartus11 · July 19, 2011, 5:49am

awk '!/^\/\//' file

gd9629 · July 19, 2011, 6:04am

sorry I edited my original post to make it clearer. Yes the .csv has all the files and headers aligned, I just need to remove all the text (except the very first line) in the .csv file so that I can process the data.

grep doesn't seem to work? as I'm editing the file using GAWK? I'm pretty new to programming so I'm not sure how to find my way around this apparent conflict lol.

bartus, that code just added the first two column headers together and put them after the rest of the other headers.

Any ideas?

Cheers!

bartus11 · July 19, 2011, 6:14am

Can you post sample data showing some lines that should be kept and the header to be removed and the desired output for that sample? Use code tags for that please.

gd9629 · July 19, 2011, 6:27am

My .csv file looks like below, with each header separated out into a separate column. And data
in the columns below. These headers are interspersed throughout the file.

//DATE TIME 	             FRAC_DAYS_SINCE_JAN1	EPOCH_TIME
25/06/2011 07:03	   175.2938079		        1308985385
25/06/2011 07:03	   175.2938657		        1308985390
25/06/2011 07:03	   175.2939236                  1308985395
25/06/2011 07:03	   175.2939815		        1308985400
25/06/2011 07:03	   175.2940394		        1308985405
25/06/2011 07:03	   175.2940972		        1308985410
25/06/2011 07:03	   175.2941551		        1308985415
25/06/2011 07:03	   175.294213		        1308985420

I want to remove all the interspersed headers from the file so that there is just numeric data.
Just like below

25/06/2011 07:03	   175.2938079		        1308985385
25/06/2011 07:03	   175.2938657		        1308985390
25/06/2011 07:03	   175.2939236                    1308985395
25/06/2011 07:03	   175.2939815		        1308985400
25/06/2011 07:03	   175.2940394		        1308985405
25/06/2011 07:03	   175.2940972		        1308985410
25/06/2011 07:03	   175.2941551		        1308985415
25/06/2011 07:03	   175.294213		        1308985420

So basically I need a way of 'cutting' out lines which begin with "//DATE TIME" per se.

Is there any way of doing this?

Thanks

bartus11 · July 19, 2011, 6:33am

As you can see below my code is working as expected...

solaris% cat file
//DATE TIME                  FRAC_DAYS_SINCE_JAN1       EPOCH_TIME
25/06/2011 07:03           175.2938079                  1308985385
25/06/2011 07:03           175.2938657                  1308985390
25/06/2011 07:03           175.2939236                  1308985395
//DATE TIME                  FRAC_DAYS_SINCE_JAN1       EPOCH_TIME
25/06/2011 07:03           175.2939815                  1308985400
25/06/2011 07:03           175.2940394                  1308985405
25/06/2011 07:03           175.2940972                  1308985410
25/06/2011 07:03           175.2941551                  1308985415
25/06/2011 07:03           175.294213                   1308985420
solaris% awk '!/^\/\//' file                 
25/06/2011 07:03           175.2938079                  1308985385
25/06/2011 07:03           175.2938657                  1308985390
25/06/2011 07:03           175.2939236                  1308985395
25/06/2011 07:03           175.2939815                  1308985400
25/06/2011 07:03           175.2940394                  1308985405
25/06/2011 07:03           175.2940972                  1308985410
25/06/2011 07:03           175.2941551                  1308985415
25/06/2011 07:03           175.294213                   1308985420

gd9629 · July 19, 2011, 6:48am

It's not working for me, maybe cos I'm an idiot.

I'm writing my code in notepad++ and running it through a shell. I'm using a .sh file to combine all my .dat files into one big .csv file, then a GAWK file to edit the format.

.sh file

#!/bin/bash

# input files from each day of data in june and combine into one big file
find /u/Picarro/DataLog/2011/june -type f -name "*dat" -exec cat {} > june.dat \;

#use new combined data as input file
IN_ALL='/u/Picarro/DataLog/2011/june/june.dat' 
	
# the csv file to create for all data called 'june.csv' in the respective directory
OUT_all='/u/Picarro/DataLog/2011/Awk/june.csv'		

# gawk files to create csv file
GAWK='/u/Picarro/DataLog/2011/Awk/Format_trial.csv.awk'

#produce the OUT file from the IN file(s)
$GAWK $IN_all > $OUT_all

GAWK file

#!/bin/gawk -f

# This file is to restructure the picarro data into the correct .csv columns for R

# create a header with same headings as variable in table
# also set other variables before parsing data
BEGIN   {
        OFS="," 	# tells awk that the output separator is a comma
        ORS=""  	# tells awk to not print newline after each print command so all records are
					# on the same line until we want a new line "\n"			
		getline}	# removes 1st line of input file ie header so we can replace it with correct one

# rearrange the yyyy-mm-dd | hh:mm:ss date and time to single date column of yyyy/mm/dd hh:mm:ss needed for openair
{print substr($1,9,2) "/" substr($1,6,2) "/" substr($1,1,4) " " substr($2,1,5)}		

# print the rest of variables as columns 
{print (" ", $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20)
}


$1 $2 != prev {
    	
	{print "\n"}	# newline after each 5 seconds of data has been parsed
  
; prev=$1 $2}

I've tried putting the code you gave me into various places in the GAWK file but it doesn't seem to work.

Where am I going wrong

bartus11 · July 19, 2011, 6:53am

Don't append it to your code, just run that line on the file that you need to clean headers from. Put this after the find line in your script:

awk '!/^\/\//' /u/Picarro/DataLog/2011/june/june.dat > /u/Picarro/DataLog/2011/june/june.dat.tmp
mv /u/Picarro/DataLog/2011/june/june.dat.tmp /u/Picarro/DataLog/2011/june/june.dat

gd9629 · July 19, 2011, 7:28am

Thanks for the help bartus, still not working. The headers are still interspersed throughout the data frame. I've tried putting the new code you gave me in different places too but it doesn't do anything to the file.

Any way around this?

mirni · July 19, 2011, 7:53am

that smells like some windows mess... can you please post the output of this:

head -1 /u/Picarro/DataLog/2011/june/june.dat | od -c

which will show what exactly are the header characters

---------- Post updated at 01:53 AM ---------- Previous update was at 01:44 AM ----------

And you could try this to strip the headers

awk  '!/\/\/ *DATE .*/' /u/Picarro/DataLog/2011/june/june.dat

Looking at your awk script, I think this might be the culprit:

# print the rest of variables as columns 
{print (" ", $3, ...

If you have a space there at the beginning, than the regex in awk will not match that line.
When you pipe the header through 'od', as i suggested, it will show

gd9629 · July 19, 2011, 8:57am

Hey mirni,

when I put in the code you gave me

head -1 /u/Picarro/DataLog/2011/june/june.dat | od -c

it returned

0000000   D   A   T   E                                                
0000020                                           T   I   M   E        
0000040                                                                
0000060                   F   R   A   C   _   D   A   Y   S   _   S   I
0000100   N   C   E   _   J   A   N   1                           F   R
0000120   A   C   _   H   R   S   _   S   I   N   C   E   _   J   A   N
0000140   1                               E   P   O   C   H   _   T   I
0000160   M   E                                                        
0000200           A   L   A   R   M   _   S   T   A   T   U   S        
0000220                                                   s   p   e   c
0000240   i   e   s                                                    
0000260                           s   o   l   e   n   o   i   d   _   v
0000300   a   l   v   e   s                                            
0000320   M   P   V   P   o   s   i   t   i   o   n                    
0000340                                           O   u   t   l   e   t
0000360   V   a   l   v   e                                            
0000400                   C   a   v   i   t   y   P   r   e   s   s   u
0000420   r   e                                                   C   a
0000440   v   i   t   y   T   e   m   p                                
0000460                                   W   a   r   m   B   o   x   T
0000500   e   m   p                                                    
0000520           E   t   a   l   o   n   T   e   m   p                
0000540                                                   D   a   s   T
0000560   e   m   p                                                    
0000600                           C   O   2   _   s   y   n   c        
0000620                                                                
0000640   C   O   2   _   d   r   y   _   s   y   n   c                
0000660                                           C   H   4   _   s   y
0000700   n   c                                                        
0000720                   C   H   4   _   d   r   y   _   s   y   n   c
0000740                                                           H   2
0000760   O   _   s   y   n   c                                        
0001000                                  \r  \n
0001012

those are the headers that are interspersed throughout the combined .dat file. Is this bad?

bartus11 · July 19, 2011, 9:01am

I don't see the "//" characters in from of the header. Can you also post output of:

head -1 /u/Picarro/DataLog/2011/june/june.dat | cat -Te

gd9629 · July 19, 2011, 9:13am

The "'DATE TIME" header is in the .csv file, I think it appears once I combined the two columns using the GAWK code I posted above.

The result from

head -1 /u/Picarro/DataLog/2011/june/june.dat | cat -Te

The result

DATE                      TIME                      FRAC_DAYS_SINCE_JAN1      FRAC_HRS_SINCE_JAN1       EPOCH_TIME                ALARM_STATUS              species                   solenoid_valves           MPVPosition               OutletValve               CavityPressure            CavityTemp                WarmBoxTemp               EtalonTemp                DasTemp                   CO2_sync                  CO2_dry_sync              CH4_sync                  CH4_dry_sync              H2O_sync                  ^M$

To clarify,

The headers shown above are in the .dat files (both single files and big combined file)

The "// DATE TIME" (replaces the separate "DATE" and "TIME" header into one column) header arises in the .csv file, after the .dat file has been 'GAWKed'.

Does that help?

bartus11 · July 19, 2011, 9:17am

Post the output of

head -1 /u/Picarro/DataLog/2011/june/june.csv | cat -Te

gd9629 · July 19, 2011, 9:23am

date,fracdays,frachours,epoch,alarm,species,solenoid,mpv,outlet,cavp,cavt,warmbox,etalon,dastemp,co2sync,co2,ch4sync,ch4,h2osync$

These headers are the headers at the very top of the .csv file, which I manually put in myself as these are required just once at the top of the .csv file.

bartus11 · July 19, 2011, 9:25am

OK... so post the output of:

grep '//' /u/Picarro/DataLog/2011/june/june.csv | head -5 | cat -Te

gd9629 · July 19, 2011, 9:28am

//DATE TIME ,FRAC_DAYS_SINCE_JAN1,FRAC_HRS_SINCE_JAN1,EPOCH_TIME,ALARM_STATUS,species,solenoid_valves,MPVPosition,OutletValve,CavityPressure,CavityTemp,WarmBoxTemp,EtalonTemp,DasTemp,CO2_sync,CO2_dry_sync,CH4_sync,CH4_dry_sync,H2O_sync$
//DATE TIME ,FRAC_DAYS_SINCE_JAN1,FRAC_HRS_SINCE_JAN1,EPOCH_TIME,ALARM_STATUS,species,solenoid_valves,MPVPosition,OutletValve,CavityPressure,CavityTemp,WarmBoxTemp,EtalonTemp,DasTemp,CO2_sync,CO2_dry_sync,CH4_sync,CH4_dry_sync,H2O_sync$
//DATE TIME ,FRAC_DAYS_SINCE_JAN1,FRAC_HRS_SINCE_JAN1,EPOCH_TIME,ALARM_STATUS,species,solenoid_valves,MPVPosition,OutletValve,CavityPressure,CavityTemp,WarmBoxTemp,EtalonTemp,DasTemp,CO2_sync,CO2_dry_sync,CH4_sync,CH4_dry_sync,H2O_sync$
//DATE TIME ,FRAC_DAYS_SINCE_JAN1,FRAC_HRS_SINCE_JAN1,EPOCH_TIME,ALARM_STATUS,species,solenoid_valves,MPVPosition,OutletValve,CavityPressure,CavityTemp,WarmBoxTemp,EtalonTemp,DasTemp,CO2_sync,CO2_dry_sync,CH4_sync,CH4_dry_sync,H2O_sync$
//DATE TIME ,FRAC_DAYS_SINCE_JAN1,FRAC_HRS_SINCE_JAN1,EPOCH_TIME,ALARM_STATUS,species,solenoid_valves,MPVPosition,OutletValve,CavityPressure,CavityTemp,WarmBoxTemp,EtalonTemp,DasTemp,CO2_sync,CO2_dry_sync,CH4_sync,CH4_dry_sync,H2O_sync$

bartus11 · July 19, 2011, 9:35am

So now post:

awk '!/^\/\//' /u/Picarro/DataLog/2011/june/june.csv | grep '//' | head -5 | cat -Te

gd9629 · July 19, 2011, 9:46am

hmm. no output from the shell, and the .csv file is created with the headers still there