Auto-correct a CSV file using a UNIX shell script

Hi All,

There is a list of 4-5 .csv files, each with 12 columns. In some cases, one of the records is split into 2 records. This split record needs to be auto-corrected and written back into the csv file.

Eg:

Let us consider a sample.csv file; under normal conditions the file would look like this:

a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,g,h,i,j,k,l,m,n

In some cases the sample.csv file would look like this:

a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,g,h,
i,j,k,l,m,n
a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,g,h
,i,j,k,l,m,n
a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,g,h,i,j,k,l,m,n

If you look at lines 2 and 3, the record is broken; it should be a single record.

How to identify the broken records?
How can we correct the file to look like the normal sample.csv file written earlier?
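One way to spot the broken records first is to count the fields on each line with awk. This is a sketch with made-up sample data; it assumes a well-formed record always has exactly 12 comma-separated fields:

```shell
# Build a small sample with one record split across two lines (hypothetical data)
cat > sample.csv <<'EOF'
a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,g,h,
i,j,k,l,m,n
a,b,c,d,g,h,i,j,k,l,m,n
EOF

# Report every line whose comma-separated field count is not 12
awk -F, 'NF != 12 {print NR": "NF" fields"}' sample.csv
# → 2: 7 fields
# → 3: 6 fields
```

Note that the trailing comma on line 2 produces an empty 7th field, which is why awk reports 7 fields rather than 6.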

Thanks,
Karthik

If lines are only broken in two and not more, you could try something like this:

awk 'NF<12{getline p; $0=$0 p}1' FS=, OFS=, file

This should fix the file even if lines are broken into more than two lines:

perl -0pe 's/,\n/,/g;s/\n,/,/g' file

True, unless there are empty fields in the first or last position, like

a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,g,h,
i,j,k,l,m,n
a,b,c,d,g,h,i,j,k,l,m,
,b,c,d,g,h,i,j,k,l,m,n

--edit--
This may also work for multiple lines and cases of empty fields, if we know it should always be exactly 12 fields per line:

awk '{while(NF<12){getline p; $0=$0 p}}1' FS=, OFS=, file
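A quick demonstration of that awk approach on a record broken across three lines (a sketch with made-up data):

```shell
# One record split across three lines (hypothetical data)
cat > demo.csv <<'EOF'
a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,
g,h,
i,j,k,l,m,n
EOF

# Keep pulling in the next line until the record has 12 fields, then print it
awk '{while(NF<12){getline p; $0=$0 p}}1' FS=, OFS=, demo.csv
# → a,b,c,d,g,h,i,j,k,l,m,n
# → a,b,c,d,g,h,i,j,k,l,m,n
```

One caveat: if the file ends in the middle of a record, getline fails at EOF without changing anything and the while loop never terminates, so this assumes every broken record is eventually completed within the file.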

I tried the codes you guys gave. None of them work; all they do is write the records out as they are.

All of that works for me.
You are aware that these don't edit the original file, but rather print the result to the screen, right?

Post the sample of the actual file and the code you have used for us to help you.

--ahamed

Yes, I'm aware that these don't edit the original file.

All i'm doing is redirecting the output of these commands into another file.

I can't share the actual data, but it is exactly the same format as the sample I mentioned: 12 columns, separated by commas.

This is the result with the sample you gave us

root@maximus:/tmp# cat infile
a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,g,h,
i,j,k,l,m,n
a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,g,h,i,j,k,l,m,n

root@maximus:/tmp# perl -0pe 's/,\n/,/g;s/\n,/,/g' infile
a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,g,h,i,j,k,l,m,n
a,b,c,d,g,h,i,j,k,l,m,n

Which one are you using, awk or perl? Which is your OS?

--ahamed

I tried all the commands, both perl and awk, in a shell script on UNIX.

Unless you post the actual data, it is difficult for us to help.
Which is your OS?

--ahamed

This is the actual data in a csv file.

 
abcd,789756,20140211,20140202,464532.00,C,XZY,,,Settlement of purchase.,,1
abcd,795156,20140211,20140202,548962.52,C,JPY,,,Payement to: tokyo BNF=N,,1
abcd,763256,20140211,20140202,42188.80,C,USD,,,recieved from ,,1
abcd,756556,20140211,20140202,456898.28,C,EUR,,,Payment via.,,1
abcd,732656,20140211,20140202,58461.10,C,INR,,,Settlement of purchase.,,1
abcd,798756,20140211,20140202,323156.82,C,AUD,,,Settlement of purchase.,,1
abcd,789856,20140211,20140202,84621.63,C,CHF,,,Settlement of sale no B35453 EUR 471.60,,1
abcd,785456,20140211,20140202,86421.40,C,USD,,,Settlement of purchase.,,1

Sometimes one or two records will have a newline character in the middle, causing them to split into two lines like this:

 
abcd,789756,20140211,20140202,464532.00,C,XZY,,,Settlement of purchase.,,1
abcd,795156,20140211,20140202,548962.52,C,JPY,,,Payement to: tokyo BNF=N,,1
abcd,763256,20140211,20140202,42188.80,C,USD,,,recieved from ,,1
abcd,756556,20140211,20140202,456898.28,C,EUR,,,Payment via.,,1
abcd,732656,20140211,20140202,
58461.10,C,INR,,,Settlement of purchase.,,1
abcd,798756,20140211,20140202,323156.82,C,AUD,,,Settlement of purchase.,,1
abcd,789856,20140211,20140202,84621.63,C,CHF,,,Settlement of sale no B35453 EUR 471.60,,1
abcd,785456,20140211,20140202,86421.40,C,USD,,,Settlement of purchase.,,1

I use the Windows OS.

---------- Post updated at 11:47 AM ---------- Previous update was at 11:45 AM ----------

If you look at the 5th record, it is split into 2.
This needs to be auto-corrected, either in the same file or in another file with the same name as the original.

Try this... it only handles the case where the split record ends with a ,
If this works, we will check the next one...

awk '/,$/{printf $0;next}1' infile

Do you use Cygwin? What is your version of awk?

--ahamed

This didn't work either!

I'm using the command you suggested in a shell script file:

 
awk '/,$/{printf $0;next}1' ${infile} >> ${outfile}

The outfile and infile are exactly the same.

Ahamed, can you try something that removes the carriage return from lines having fewer than 12 fields?

Try this...

#!/bin/bash

# incom=1 means the previous line ended in a comma, i.e. the record is incomplete
incom=0
while IFS= read -r line
do
  if [ $incom -eq 1 ]; then
    incom=0
    printf "%s%s\n" "$data" "$line"
    data=""
    continue
  fi

  if grep -q ",$" <<< "$line"; then
    incom=1
    data="$line"
    continue
  fi
  echo "$line"
done < infile

If you still don't get the result, post the output of bash -x script infile

--ahamed

---------- Post updated at 10:57 PM ---------- Previous update was at 10:54 PM ----------

Also, can you post the output of od -bc infile
If you have the dos2unix utility I suggest you run that first on the file and then try the commands.
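If dos2unix isn't available, tr can strip the carriage returns as well. A sketch; the file names here are placeholders:

```shell
# Build a DOS-style (CRLF-terminated) test file
printf 'a,b,c,\r\nd,e,f\r\n' > dosfile.csv

# Delete every carriage return, writing a clean Unix-style copy
tr -d '\r' < dosfile.csv > unixfile.csv

# Inspect the result: no \r bytes should remain
od -c unixfile.csv
```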

--ahamed

1 Like

Since you're running on Windows, it may be that your input file's lines are terminated by a carriage return character and a newline character. Since awk is expecting just newline as the line terminator, the search for a comma at the end of a line fails because there is a carriage return character following the comma. To verify this, show us the first few lines of output from:

od -bc Your_Input_File

Assuming that output contains some sequences like \r (indicating carriage return characters), try the following:

awk '{gsub(/\r/, "")}/,$/{printf("%s",$0);next}1' Your_Input_File > Your_Output_File

or, if your output file has to have carriage return and newline terminators:

awk '{gsub(/\r/, "")}/,$/{printf("%s",$0);next}{printf("%s\r\n", $0)}' Your_Input_File > Your_Output_File
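To check that the carriage-return handling works, here is a small round trip with a CRLF input (a sketch; the data is made up):

```shell
# Build a DOS-style (CRLF) file with one record split after a trailing comma
printf 'a,b,c,d,g,h,i,j,k,l,m,n\r\na,b,c,d,g,h,\r\ni,j,k,l,m,n\r\n' > crlf.csv

# Strip \r first, then glue any line ending in "," onto the line that follows it
awk '{gsub(/\r/, "")}/,$/{printf("%s",$0);next}1' crlf.csv
# → a,b,c,d,g,h,i,j,k,l,m,n
# → a,b,c,d,g,h,i,j,k,l,m,n
```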
1 Like

There are also (single) quoting issues with awk on Windows, I seem to recall. He may need to run it as

awk "{while(NF<12){getline p; \$0=\$0 p}}1" FS=, OFS=, file

or something, or put the script in a file.

1 Like

Thanks a lot Ahamed.

I used the below to auto-correct and it worked :slight_smile:

perl -pi -e 's/,\r+\n//g;s/\n,/,/g' infile

I have a few more queries.

Below is a description of what I'm working on:

> There will be a directory with 5 or more .csv files
> One or more files will sometimes have this issue

What needs to be done:
> I don't want to overwrite the original file.
> I want to retain the original file.

How do I proceed?

Thanks,
Karthik

This is the reason why we ask you to post the OS details in the first place.

For your second query, remove the -i option and redirect the output to another file, or

perl -pi.orig -e 's/,\r+\n//g;s/\n,/,/g' infile

which will create a backup file with extension .orig
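To process every .csv in a directory while keeping the originals untouched, a loop along these lines could work. This is a sketch: the broken.csv input and the fixed/ output directory are assumptions, and it uses perl's slurp mode (-0777) so the substitutions can see the line breaks:

```shell
# Hypothetical input: a .csv in the current directory with a split record
printf 'a,b,c,d,g,h,\ni,j,k,l,m,n\n' > broken.csv

# Write corrected copies into fixed/, leaving the originals alone
mkdir -p fixed
for f in *.csv; do
  # -0777 slurps the whole file; the first s/// joins lines broken after a
  # trailing comma, the second joins continuation lines starting with a comma
  perl -0777 -pe 's/,\r?\n/,/g; s/\r?\n,/,/g' "$f" > "fixed/$f"
done

cat fixed/broken.csv
# → a,b,c,d,g,h,i,j,k,l,m,n
```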

--ahamed

But the problem is: what if the broken line starts with a comma (,), like this?

 
 
abcd,732656,20140211,20140202,58461.10
,C,INR,,,Settlement of purchase.,,1

This code won't work in such cases.

How do I modify the code to work in both cases?

You need to similarly modify the second statement (the one after the ;).

--ahamed