Parse csv file

Hi,

Our requirement is to parse an input file in .csv format. Each column in the file is delimited by a comma. We need to take each column and apply some business validation rules to it.

If the data itself contains a comma, the field is enclosed in double quotes ("). You can see these double quotes by opening the csv file in Notepad or WordPad.

Sometimes the data itself contains double quotes. In that case the field is enclosed in double quotes, and each double quote inside the data is doubled (one extra double quote is added). You can see this as well if you open the file in Notepad or WordPad.

A sample of the csv file (test.csv), as it appears when opened in Notepad, is shown below:

This,test is,with out comma or double quotes
This,"test, contains",comma
This,"test contains ""double quotes"" ","and , comma"

Output required is:
1st Line:
Field1:This
Field2:test is
Field3:with out comma or double quotes

2nd Line:
Field1:This
Field2:test, contains
Field3:comma

3rd Line:
Field1:This
Field2:test contains "double quotes"
Field3:and , comma

If you save the above contents into a csv file and open it in Excel, you will see that the column values on each line match my required output.

I have tried hard but could not find the right way to do this. Please help me achieve the above.

I have attached test.csv for reference.

Thanks in advance,
Venkat

Could anyone please help me write this script? I am somewhat stuck with it.

Thanks.
Venkat

Bumping up posts or double posting is not permitted in these forums.

Please read the rules, which you agreed to when you registered, if you have not already done so.

You may receive an infraction for this. If so, don't worry, just try to follow the rules more carefully. The infraction will expire in the near future.

Thank You.

The UNIX and Linux Forums.

Maybe you should post your problem in the Homework & Coursework Questions forum, but before that you should read:

  1. Community Spirit and Ethos on Homework & Coursework
  2. Rules for Homework & Coursework Questions Forum

... and don't forget to read the Forum Rules

I am sorry for this. Henceforth, I will be more careful.

Thanks
Venkat

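use strict;
use warnings;

while (my $line = <DATA>) {
	chomp $line;    # drop the newline so it does not end up in the last field
	# Split only on commas followed by an even number of double quotes,
	# i.e. commas that sit outside a quoted field.
	my @fields = split(/,(?=(?:[^"]*$)|(?:(?:[^"]*"[^"]*"[^"]*)*$))/, $line);
	print "$. Line:\n";
	for my $i (1 .. @fields) {
		my $field = $fields[$i - 1];
		$field =~ s/^"|"$//g;    # strip the enclosing double quotes
		$field =~ s/""/"/g;      # collapse doubled double quotes
		print "Field$i:$field\n";
	}
	print "\n";
}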
__DATA__
This,test is,with out comma or double quotes
This,"test, contains",comma
This,"test contains ""double quotes"" ","and , comma"

Hi summer_cherry,

Thank you very much for the code.

I have only written small shell scripts using basic commands, so my expertise in Unix shell is limited.

Could you please let me know what should be passed to <DATA> in the while loop? I assume it should be the input file. I tried passing that, but the program did not work.

I know I am missing something here. It would be really great if this script could be changed by substituting the right values for <DATA> and so on.

Thanks.
Venkat

---------- Post updated 08-18-09 at 12:25 AM ---------- Previous update was 08-17-09 at 11:43 PM ----------

Hi summer_cherry,

Please ignore my previous request. I was able to execute this script.

Thank you very much for your help.

Thanks,
Venkat

Hi Summer_cherry,

This script gives the required output. I do not have expertise in Perl scripts, so I was not able to proceed to the next level. As I said in my original request, I need to take each column and apply some business rules.

It would be really great if this script could be converted into a Unix shell script (maybe awk). Then it would be easy for me to handle the further logic.

Thanks,
Venkat
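
For the awk conversion requested above, one possible approach is to walk each line character by character and track whether the current position is inside a quoted field. This is only a minimal sketch, not a full CSV parser: the function name parse_csv and the "Line N:" label are made up for illustration, it assumes a field never contains an embedded newline, and it keeps whitespace inside quotes as-is (so Field2 of the third sample line keeps its trailing space).

```shell
#!/bin/sh
# parse_csv [FILE...]: hypothetical helper that prints every field of a
# quoted CSV file on its own line. Reads standard input if no file is given.
parse_csv() {
    awk '
    {
        n = 0; field = ""; in_quotes = 0
        len = length($0)
        for (i = 1; i <= len; i++) {
            c = substr($0, i, 1)
            if (in_quotes) {
                if (c == "\"") {
                    if (substr($0, i + 1, 1) == "\"") {
                        # doubled quote inside a quoted field: keep one quote
                        field = field "\""
                        i++
                    } else {
                        # closing quote of the field
                        in_quotes = 0
                    }
                } else {
                    field = field c
                }
            } else if (c == "\"") {
                # opening quote of a quoted field
                in_quotes = 1
            } else if (c == ",") {
                # comma outside quotes ends the current field
                fields[++n] = field
                field = ""
            } else {
                field = field c
            }
        }
        fields[++n] = field

        printf "Line %d:\n", NR
        for (k = 1; k <= n; k++) {
            # business validation for fields[k] could be added here
            printf "Field%d:%s\n", k, fields[k]
        }
        print ""
    }' "$@"
}
```

It can be called as `parse_csv test.csv`. Since each value ends up in fields[k], the per-column business rules could be applied inside the final loop instead of, or in addition to, printing.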