Filtering first file columns based on second file column

ks_reddy · October 15, 2012, 10:15am

Hi friends,

I have one file like below. (.csv type)

SNo,data1,data2
1,1,2
2,2,3
3,3,2

and another file like below.

Exclude
data1

where Exclude should be treated as column name in file2.
I want the output shown below.

SNo,data2
1,2
2,3
3,2

Where my data1 column got removed from my first file as it is mentioned in next file.
In reality i have 1000's of columns in my file 1 and want to remove some of them by keep updating in second file.
I got a one liner R code, but to load the first file and writing the result into another file is taking lot of time as well as more memory commit.

Regards
Sidda

joeyg · October 15, 2012, 10:48am

The following will help find the column to be excluded:

$ echo sn,data1,data2 | tr "," "\n" | cat
sn
data1
data2

$ echo sn,data1,data2 | tr "," "\n" | cat -n
     1  sn
     2  data1
     3  data2

$ echo sn,data1,data2 | tr "," "\n" | cat -n | grep "data1"
     2  data1

$ echo sn,data1,data2 | tr "," "\n" | cat -n | grep "data1" | cut -f1
     2

Now, to exclude a column, you can see if your 'cut' command recognizes the --complement option. Something like:

cut -f2 -- complement sample1.txt

Or...

 awk '{$2=""; print}' sample1.txt

Don_Cragun · October 15, 2012, 1:55pm

If you save the following in a file named dropheaders and make it executable, have a file named input that contains the data, and a file named exclude that contains the list of headers to skip:

#!/bin/ksh
# Usage: drophdrs [data [exclude]]
awk 'BEGIN{FS = OFS = ","}
dbg{    printf("FILENAME=%s, FNR=%d, NR=%d, NF=%d, $0=\"%s\"\n",
                FILENAME, FNR, NR, NF, $0)
}
FNR==1{ if(dbg) printf("%s file header with %d fields: %s\n", FILENAME, NF, $0)
        if(FNR==NR) {
                efn = FILENAME # Save filename of exclude file for diagnostics.
                next
        }
        # Determine which fields to skip from headers in the data file.
        for(i = 1; i <= NF; i++) if($i in skiphdr) {
                sf
                delete skiphdr[$i]
                if(dbg) printf("Field %d added to sf[] for header %s.\n", i, $i)
        }
        first = 1
        for(i in skiphdr) {
                if(first) {
                        first = 0
                        printf("File %s will not be processed because:\n",
                                FILENAME)
                }
                printf("\theader \"%s\" in exclude file (%s) was not found\n",
                        i, efn, FILENAME)
        }
        if(first == 0) exit 1
}
FNR==NR{# gather names of columns to be skipped from exclude (1st) file
        skiphdr[$1]
        if(dbg) printf("%s added to skiphdr\n", $1)
        next
}
{       sep = ""
        for(i = 1; i <= NF; i++)
                if(!(i in sf)) {
                        printf("%s%s", sep, $i)
                        sep = OFS
                }
        printf("\n")
}' ${2:-exclude} ${1:-input}

should do what you want just by entering the command:

dropheaders

If your data and exclude files have different names, use:

dropheaders data_file_name exclude_file_name