Have problem with extra EOLs in my CSV - need help cleaning out

Hi Everyone,

Searching the forum, I came across a closed thread that appears to describe either the same problem or one very close to what I'm experiencing. The closed thread, for reference: Removing cr,lf till number of fields are full

I have a report which generates 7 fields, using a comma as FS:
-Page
-EmplEvName
-Date
-Number of Views (no pct)
-Number of Visits (no pct)
-Number of Visitors (no pct)
-Empl Page Views

However, recently some rows have been generated with extra end-of-line characters (Vim output is shown below, with $ marking each end of line). This breaks our other scripts that run against the output, since they expect to see 7 fields on every row.

The closed thread offered some solutions using awk/sed, but they didn't seem to work for me, probably because I didn't completely understand them.

All of our scripts are written in bash, and ideally I need to stick to a solution that can be implemented in a bash shell script.

What would be the cleanest way to eliminate the extra "$" shown on the last four lines of the sample below? The data in the last four fields are always integers, but the first three fields vary in length and format.

"Page","EmplEvName","Date","Number of Views (no pct)","Number of Visits (no pct)","Number of Visitors (no pct)","Empl Page Views"$
"Empltvdemo.edgesuite.net/software/analytics/cognos/videos/htvs/cm102","connect_to_a_server",6/14/2014,16,16,15,0$
"Empltvdemo.edgesuite.net/software/analytics/cognos/videos/htvs/cm102","connect_to_a_server",6/11/2014,14,14,14,0$
"Empltvdemo.edgesuite.net/software/analytics/cognos/videos/htvs/cm102","connect_to_a_server",6/7/2014,13,13,13,0$
"Empltvdemo.edgesuite.net/software/analytics/cognos/videos/htvs/cm102","connect_to_a_server",6/1/2014,7,7,6,0$
"Empltvdemo.edgesuite.net/software/analytics/cognos/videos/htvs/cm102","connect_to_a_server",6/8/2014,6,5,5,0$
"Empltvdemo.edgesuite.net/software/analytics/cognos/videos/htvs/cm102","$
Access reports$
",6/12/2014,5,5,4,0$
"Empltvdemo.edgesuite.net/software/analytics/cognos/videos/htvs/cm102","$
Access reports$
",6/5/2014,5,5,5,0$
"Empltvdemo.edgesuite.net/software/analytics/cognos/videos/htvs/cm102","$
Access reports$
",6/13/2014,4,4,4,0$
"Empltvdemo.edgesuite.net/software/analytics/cognos/videos/htvs/cm102","$
Access reports$
",6/2/2014,4,4,4,1$

Logically, it seems fair to remove ALL EOLs, and then insert new ones after every 7th field. This would include instances when field 2 is empty (but still a field).

Thanks,

Rich

Logically, you should fix whatever is producing your broken CSVs rather than kludge them back together afterwards, but:

awk -F, -v OFS="," 'NR==1 { MINREC=NF; next }            # learn the field count from the header, then skip it
{
        if(PREV) {      $0=PREV $0;     $1=$1   }        # prepend the saved fragment, force a re-split
        if(NF >= MINREC) { print; PREV=""; }             # complete record: print it
        else    PREV=$0                                  # still short: save it for the next line
}' < inputfile > outputfile

This will fail if a comma appears inside your quoted fields.
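For anyone who wants to try it, here is a self-contained reproduction using a trimmed-down sample (the file names and shortened values are illustrative, not from the real report). Note that the script consumes the header line to learn the expected field count, so the header does not appear in the output:

```shell
# Build a small sample with one clean row and one row split across three lines.
cat > sample.csv <<'EOF'
"Page","EmplEvName","Date","Views","Visits","Visitors","EmplViews"
"demo/cm102","connect_to_a_server",6/14/2014,16,16,15,0
"demo/cm102","
Access reports
",6/12/2014,5,5,4,0
EOF

# Rejoin any record that has fewer fields than the header.
awk -F, -v OFS="," 'NR==1 { MINREC=NF; next }
{
        if(PREV) {      $0=PREV $0;     $1=$1   }
        if(NF >= MINREC) { print; PREV=""; }
        else    PREV=$0
}' < sample.csv > fixed.csv

cat fixed.csv
# "demo/cm102","connect_to_a_server",6/14/2014,16,16,15,0
# "demo/cm102","Access reports",6/12/2014,5,5,4,0
```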

Thanks Corona688, works like a charm. I agree that fixing the root cause is what should be done; however, at our company that's handled by a different group, so we must wait for them to implement a fix. :slight_smile:

Could I bother you for a brief explanation of what is going on, especially on lines 3 and 4?

Cheers,

R

$0 is a special variable in awk which means "the entire line".

NF is a special variable which means "the number of fields", i.e. columns.

So when a saved fragment exists, we set the entire line to the PREV contents followed by the current contents, and if there are enough fields, print it. Otherwise, we save it all in PREV for next time.
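To add to that: the `$1=$1` on line 3 is a common awk idiom that forces the record to be rebuilt after `$0` is reassigned. A tiny illustration of both variables (assuming any POSIX awk):

```shell
# NF is recomputed whenever $0 is assigned or the line is read:
echo 'a,b,c' | awk -F, '{ print NF }'                    # prints 3

# Assigning to any field (the $1=$1 trick) rebuilds $0,
# replacing the input separator FS with the output separator OFS:
echo 'a,b,c' | awk -F, -v OFS=';' '{ $1 = $1; print }'   # prints a;b;c
```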

Nice! Thank you again!