CSV with commas in field values, remove duplicates, cut columns

Hi

Description of input file I have:
-------------------------
1) CSV with double quotes around string fields.
2) Some string fields have commas as part of the field value.
3) Has duplicate lines.
4) Has 200 columns/fields.
5) File size is more than 10 GB.

Description of output file I need:
-------------------------------
1) Can be CSV or pipe-delimited.
2) But commas within field values should remain.
3) No duplicate lines.
4) I need only the first 150 columns.

Code I used till now:
-------------------

cat file.in | awk -F"," '!($1$2$3 in a){a[$1$2$3];print $0}' | cut -d, -f1-150 > file.out

But with this code, commas within field values are treated as delimiters.

How then do you intend to deal with cases such as

lpr,'Krishna\'s printer','Beside my desk',631

I'd recommend using the Text::CSV module in Perl. (It's not a core module, but I've never been on a site where it wouldn't install, and it can be shipped with the script if dependencies are an issue.)
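A minimal sketch of that approach (the file name file.in, the pipe as output delimiter and the 150-column cut are taken from your description; adjust as needed). Text::CSV parses each record so commas inside quoted fields stay put, then the first 150 fields are written back out pipe-delimited:

#! /usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $in  = Text::CSV->new({ binary => 1 }) or die Text::CSV->error_diag;
my $out = Text::CSV->new({ binary => 1, sep_char => '|', eol => "\n" })
              or die Text::CSV->error_diag;

open my $fh, '<', 'file.in' or die "file.in: $!";
while (my $row = $in->getline($fh)) {                # quote-aware parse of one record
    $out->print(\*STDOUT, [ @{$row}[0 .. 149] ]);    # keep only the first 150 fields
}
close $fh;

Since the comma is no longer the output delimiter, commas inside field values survive, and duplicate lines can still be removed afterwards by piping the result through sort -u.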

I don't have such scenarios. My trouble is ignoring the commas within double quotes.

First, selecting 150 fields is tricky when you've got a complex delimiter. You could do it in awk, but only in some versions of awk, and it probably wouldn't be fast enough. Do you have a C compiler?

#include <stdio.h>
#include <string.h>

#define FIELDS  150

int main(void)
{
        char buf[32768];

        // Assumes every field is double-quoted, so "," only ever
        // appears between fields, never inside one.
        while(fgets(buf, sizeof buf, stdin))
        {
                int n;
                char *c=strstr(buf, "\",\""); // Find end of 1st field

                // Advance to the end of field FIELDS
                for(n=0; c && (n<(FIELDS-1)); n++)
                        c=strstr(c+1, "\",\"");

                if(c)  strcpy(c+1, "\n"); // end the line early
                fputs(buf, stdout); // print the (possibly shortened) line
        }
        return 0;
}
$ gcc 150cols.c -o 150cols
$ cat 200cols
"1:asdf","2:asdf","3:asdf","4:asdf",...,"199:asdf","200:asdf"
$ ./150cols < 200cols > 150cols
$ cat 150cols
"1:asdf","2:asdf","3:asdf","4:asdf","5:asdf","6:asdf",...,"150:asdf"

Now that you can do that, I think you're going to need to sort your data in order to remove duplicates. The alternative, storing up to 10 gigabytes in memory during processing so you can tell whether a line is a duplicate or not, just isn't feasible. So use it in combination with sort to remove duplicate lines:

sort -u < input | ./150cols > output

The program below will print 4 fields out of 6. It may not be efficient, as it loops over every character of every line. Or, if you have Perl's Text::CSV module installed, you may try that too (see the sketch after the output below).

input.txt:

test0,"hello,world,earth,universe","asia",antartica,"arctic",pacific,
test1,"abc,pqr,xyz,123","sed,awk,grep",perl,c,java
test2,"hel,wor,ear,uni","asi",ant,"arc",pac
test3,"start,middle,end",blue,green,red,white,

#! /usr/bin/perl -w
use strict;
open INPUT, "< input.txt" or die "Cannot open input.txt: $!";
while (<INPUT>) {                          # read one line at a time instead of slurping the whole file
    my @y = split //, $_;                  # break the line into single characters
    my ($flag, $ln, $count) = (0, '', 0);  # $flag: inside quotes?  $count: unquoted commas seen
    for (@y) {
        if ($count == 4) {                 # 4 unquoted commas seen: 4 fields collected
            print "$ln\n";
            last;
        }
        if ($_ eq '"' && $flag == 0) {     # opening quote
            $flag = 1;
            $ln .= $_;
        }
        elsif ($_ eq '"' && $flag == 1) {  # closing quote
            $flag = 0;
            $ln .= $_;
        }
        elsif ($_ eq ',' && $flag == 0) {  # comma outside quotes: a real delimiter
            $count++;
            $ln .= $_;
        }
        else {                             # anything else, including commas inside quotes
            $ln .= $_;
        }
    }
}
close INPUT;

output:

test0,"hello,world,earth,universe","asia",antartica,
test1,"abc,pqr,xyz,123","sed,awk,grep",perl,
test2,"hel,wor,ear,uni","asi",ant,
test3,"start,middle,end",blue,green,
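For comparison, here is a rough Text::CSV version of the same thing, assuming the module is installed. The output differs slightly: there is no trailing comma, and fields are only re-quoted when they actually contain a comma or a quote.

#! /usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, eol => "\n" }) or die Text::CSV->error_diag;

open my $fh, '<', 'input.txt' or die "input.txt: $!";
while (my $row = $csv->getline($fh)) {              # module handles the quoted commas
    $csv->print(\*STDOUT, [ @{$row}[0 .. 3] ]);     # keep only the first 4 fields
}
close $fh;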