CSV with commas in field values, remove duplicates, cut columns

Hi

Description of input file I have:
-------------------------
1) CSV with double quotes around string fields.
2) Some string fields have commas as part of the field value.
3) Has duplicate lines.
4) Has 200 columns/fields.
5) File size is more than 10 GB.

Description of output file I need:
-------------------------------
1) Can be CSV or pipe-delimited.
2) But commas within field values should remain.
3) No duplicate lines.
4) I need only the first 150 columns.

Code I used till now:
-------------------

cat file.in | awk -F"," '!($1$2$3 in a){a[$1$2$3];print $0}' | cut -d, -f1-150 > file.out

But with this code, commas within field values are treated as delimiters.

How then do you intend to deal with cases such as

lpr,'Krishna\'s printer','Beside my desk',631

I'd recommend using the Text::CSV module in Perl. (It's not a core module, but I've never been on a site where it wouldn't install, and it can be shipped with the script if dependencies are an issue.)
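A minimal sketch of that approach (the file name file.in, the pipe as output delimiter and the 150-column cut are taken from your description; adjust as needed). Text::CSV parses each record so commas inside quoted fields stay put, then the first 150 fields are written back out pipe-delimited:

#! /usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $in  = Text::CSV->new({ binary => 1 }) or die Text::CSV->error_diag;
my $out = Text::CSV->new({ binary => 1, sep_char => '|', eol => "\n" })
              or die Text::CSV->error_diag;

open my $fh, '<', 'file.in' or die "file.in: $!";
while (my $row = $in->getline($fh)) {                # quote-aware parse of one record
    $out->print(\*STDOUT, [ @{$row}[0 .. 149] ]);    # keep only the first 150 fields
}
close $fh;

Since the comma is no longer the output delimiter, commas inside field values survive, and duplicate lines can still be removed afterwards by piping the result through sort -u.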

I don't have such scenarios. My trouble is ignoring the commas within double quotes.

First, selecting 150 fields is tricky when you've got a complex delimiter. You could do it in awk, but only in some versions of awk, and it probably wouldn't be fast enough. Do you have a C compiler?

#include <stdio.h>
#include <string.h>

#define FIELDS  150

int main(void)
{
        char buf[32768];

        // Assumes every field is double-quoted, so "," only ever
        // appears between fields, never inside one.
        while(fgets(buf, sizeof buf, stdin))
        {
                int n;
                char *c=strstr(buf, "\",\""); // Find end of 1st field

                // Advance to the end of field FIELDS
                for(n=0; c && (n<(FIELDS-1)); n++)
                        c=strstr(c+1, "\",\"");

                if(c)  strcpy(c+1, "\n"); // end the line early
                fputs(buf, stdout); // print the (possibly shortened) line
        }
        return 0;
}
$ gcc 150cols.c -o 150cols
$ cat 200cols
"1:asdf","2:asdf","3:asdf","4:asdf",...,"199:asdf","200:asdf"
$ ./150cols < 200cols > 150cols
$ cat 150cols
"1:asdf","2:asdf","3:asdf","4:asdf","5:asdf","6:asdf",...,"150:asdf"

Now that you can do that, I think you're going to need to sort your data in order to remove duplicates. The alternative, storing up to 10 gigabytes in memory during processing so you can tell whether a line is a duplicate or not, just isn't feasible. So use it in combination with sort to remove duplicate lines:

sort -u < input | ./150cols > output

The program below will print 4 fields out of 6. It may not be efficient, as it loops over every character of every line. Or, if you have Perl's Text::CSV module installed, you may try that too (see the sketch after the output below).

input.txt:

test0,"hello,world,earth,universe","asia",antartica,"arctic",pacific,
test1,"abc,pqr,xyz,123","sed,awk,grep",perl,c,java
test2,"hel,wor,ear,uni","asi",ant,"arc",pac
test3,"start,middle,end",blue,green,red,white,

#! /usr/bin/perl -w
use strict;
open INPUT, "< input.txt" or die "Cannot open input.txt: $!";
while (<INPUT>) {                          # read one line at a time instead of slurping the whole file
    my @y = split //, $_;                  # break the line into single characters
    my ($flag, $ln, $count) = (0, '', 0);  # $flag: inside quotes?  $count: unquoted commas seen
    for (@y) {
        if ($count == 4) {                 # 4 unquoted commas seen: 4 fields collected
            print "$ln\n";
            last;
        }
        if ($_ eq '"' && $flag == 0) {     # opening quote
            $flag = 1;
            $ln .= $_;
        }
        elsif ($_ eq '"' && $flag == 1) {  # closing quote
            $flag = 0;
            $ln .= $_;
        }
        elsif ($_ eq ',' && $flag == 0) {  # comma outside quotes: a real delimiter
            $count++;
            $ln .= $_;
        }
        else {                             # anything else, including commas inside quotes
            $ln .= $_;
        }
    }
}
close INPUT;

output:

test0,"hello,world,earth,universe","asia",antartica,
test1,"abc,pqr,xyz,123","sed,awk,grep",perl,
test2,"hel,wor,ear,uni","asi",ant,
test3,"start,middle,end",blue,green,
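For comparison, here is a rough Text::CSV version of the same thing, assuming the module is installed. The output differs slightly: there is no trailing comma, and fields are only re-quoted when they actually contain a comma or a quote.

#! /usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, eol => "\n" }) or die Text::CSV->error_diag;

open my $fh, '<', 'input.txt' or die "input.txt: $!";
while (my $row = $csv->getline($fh)) {              # module handles the quoted commas
    $csv->print(\*STDOUT, [ @{$row}[0 .. 3] ]);     # keep only the first 4 fields
}
close $fh;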