Filtering first file columns based on second file column

Hi friends,

I have one file like below. (.csv type)

SNo,data1,data2
1,1,2
2,2,3
3,3,2

and another file like below.

Exclude
data1

where Exclude should be treated as column name in file2.
I want the output shown below.

SNo,data2
1,2
2,3
3,2

Where my data1 column got removed from my first file as it is mentioned in next file.
In reality i have 1000's of columns in my file 1 and want to remove some of them by keep updating in second file.
I got a one liner R code, but to load the first file and writing the result into another file is taking lot of time as well as more memory commit.

Regards
Sidda

The following will help find the column to be excluded:

$ echo sn,data1,data2 | tr "," "\n" | cat
sn
data1
data2

$ echo sn,data1,data2 | tr "," "\n" | cat -n
     1  sn
     2  data1
     3  data2

$ echo sn,data1,data2 | tr "," "\n" | cat -n | grep "data1"
     2  data1

$ echo sn,data1,data2 | tr "," "\n" | cat -n | grep "data1" | cut -f1
     2

Now, to exclude a column, you can see if your 'cut' command recognizes the --complement option. Something like:

cut -f2 -- complement sample1.txt

Or...

 awk '{$2=""; print}' sample1.txt

If you save the following in a file named dropheaders and make it executable, have a file named input that contains the data, and a file named exclude that contains the list of headers to skip:

#!/bin/ksh
# Usage: drophdrs [data [exclude]]
awk 'BEGIN{FS = OFS = ","}
dbg{    printf("FILENAME=%s, FNR=%d, NR=%d, NF=%d, $0=\"%s\"\n",
                FILENAME, FNR, NR, NF, $0)
}
FNR==1{ if(dbg) printf("%s file header with %d fields: %s\n", FILENAME, NF, $0)
        if(FNR==NR) {
                efn = FILENAME # Save filename of exclude file for diagnostics.
                next
        }
        # Determine which fields to skip from headers in the data file.
        for(i = 1; i <= NF; i++) if($i in skiphdr) {
                sf
                delete skiphdr[$i]
                if(dbg) printf("Field %d added to sf[] for header %s.\n", i, $i)
        }
        first = 1
        for(i in skiphdr) {
                if(first) {
                        first = 0
                        printf("File %s will not be processed because:\n",
                                FILENAME)
                }
                printf("\theader \"%s\" in exclude file (%s) was not found\n",
                        i, efn, FILENAME)
        }
        if(first == 0) exit 1
}
FNR==NR{# gather names of columns to be skipped from exclude (1st) file
        skiphdr[$1]
        if(dbg) printf("%s added to skiphdr\n", $1)
        next
}
{       sep = ""
        for(i = 1; i <= NF; i++)
                if(!(i in sf)) {
                        printf("%s%s", sep, $i)
                        sep = OFS
                }
        printf("\n")
}' ${2:-exclude} ${1:-input}

should do what you want just by entering the command:

dropheaders

If your data and exclude files have different names, use:

dropheaders data_file_name exclude_file_name
1 Like