Creating subset of a file based on specific columns

smitra · June 16, 2013, 3:36am

Hello Unix experts,
I need a help to create a subset file. I know with cut comand, its very easy to select many different columns, or threshold. But here I have a bit problem as in my data file is big. And I don't want to identify the column numbers or names manually. I am trying to find any way to automatise this.

For example I have a file with about 1500 columns from TRFLP intensity data.

And the column names are like:

   [1] "Sample.Name"    "Marker"         "RE"             "Dye"            "Allele.1"       "Size.1"         "Height.1"       "Peak.Area.1"    "Data.Point.1"  
  [10] "Allele.2"       "Size.2"         "Height.2"       "Peak.Area.2"    "Data.Point.2"   "Allele.3"       "Size.3"         "Height.3"       "Peak.Area.3"   
  [19] "Data.Point.3"   "Allele.4"       "Size.4"         "Height.4"       "Peak.Area.4"    "Data.Point.4"   "Allele.5"       "Size.5"         "Height.5"      
  [28] "Peak.Area.5"    "Data.Point.5"   "Allele.6"       "Size.6"         "Height.6"       "Peak.Area.6"    "Data.Point.6"   "Allele.7"       "Size.7"        
  [37] "Height.7"       "Peak.Area.7"    "Data.Point.7"   "Allele.8"       "Size.8"         "Height.8"       "Peak.Area.8"    "Data.Point.8"   "Allele.9"      
  [46] "Size.9"         "Height.9"       "Peak.Area.9"    "Data.Point.9"   "Allele.10"      "Size.10"        "Height.10"      "Peak.Area.10"   "Data.Point.10"  .....

Suppose I want to create a subset selecting all the columns with name Peak.Area.1,Peak.Area.2 etc (as in unix Peak.Area.*)
How can I do that in easy way?
Thanks a lot for the help.
Best wishes,
Mitra

bartus11 · June 16, 2013, 4:37am

Please post sample input (using code tags) consisting of few columns and desired output for that data.

smitra · June 16, 2013, 6:38am

My Data is e.g.:

   Sample.Name                      Marker  RE Dye Allele.1 Size.1 Height.1 Peak.Area.1 Data.Point.1 Allele.2 Size.2 Height.2 Peak.Area.2 Data.Point.2
1       D71I1A  _Internal_Marker_Dye_Blue_ ALU   B        0     NA       NA          NA           NA        0     NA       NA          NA           NA
2       D71I1A _Internal_Marker_Dye_Green_ ALU   G        0     NA       NA          NA           NA        0     NA       NA          NA           NA
3       D71I1A  _Internal_Marker_Dye_Blue_ BSU   B        0     NA       NA          NA           NA        0     NA       NA          NA           NA
4       D71I1A _Internal_Marker_Dye_Green_ BSU   G        0     NA       NA          NA           NA        0     NA       NA          NA           NA
5       D71I1B  _Internal_Marker_Dye_Blue_ ALU   B        0     NA       NA          NA           NA        0  55.54       20         211         1576
6       D71I1B _Internal_Marker_Dye_Green_ ALU   G        0     NA       NA          NA           NA        0     NA       NA          NA           NA
7       D71I1B  _Internal_Marker_Dye_Blue_ BSU   B        0     NA       NA          NA           NA        0     NA       NA          NA           NA
8       D71I1B _Internal_Marker_Dye_Green_ BSU   G        0     NA       NA          NA           NA        0     NA       NA          NA           NA
9       D71I1C  _Internal_Marker_Dye_Blue_ ALU   B        0     NA       NA          NA           NA        0  55.38       18         192         1554
10      D71I1C _Internal_Marker_Dye_Green_ ALU   G        0     NA       NA          NA           NA        0     NA       NA          NA           NA

And I want a output like:

  Peak.Area.1 Peak.Area.2
1           NA          NA
2           NA          NA
3           NA          NA
4           NA          NA
5           NA         211
6           NA          NA
7           NA          NA
8           NA          NA
9           NA         192
10          NA          NA

But this is just an example.. I want it for a big file where there are over 1000 columns... thus I can't specify column 8 and 13 like in this example.
But I want to use the name Peak.Area.1,Peak.Area.2,Peak.Area.3 etc...something like Peak.Area.*.
Thanks,
Mitra

bartus11 · June 16, 2013, 6:57am

Try:

awk 'NR==1{for (i=1;i<=NF;i++) if ($i~"^Peak.Area") {printf $i" ";a[i+1]=1};printf "\n"}
NR>1{printf $1" ";for (i=2;i<=NF;i++) if (i in a) printf $i" ";printf "\n"}' file

smitra · June 16, 2013, 7:25am

Hello bartus11,
I am trying this...

smitra:TRFLP-RawData smitra$ awk 'NR==1{for (i=1;i<=NF;i++) if ($i~"^Peak.Area") {printf $i" ";a[i+1]=1};printf "\n"}
NR>1{printf $1" ";for (i=2;i<=NF;i++) if (i in a) printf $i" ";printf "\n"}' TRF_raw_data_reactor1.txt > test1.txt
smitra:TRFLP-RawData smitra$

or with csv file:

smitra:TRFLP-RawData smitra$ awk 'NR==1{for (i=1;i<=NF;i++) if ($i~"^Peak.Area") {printf $i" ";a[i+1]=1};printf "\n"}
NR>1{printf $1" ";for (i=2;i<=NF;i++) if (i in a) printf $i" ";printf "\n"}' TRF_raw_data_reactor1.csv > test1.txt
smitra:TRFLP-RawData smitra$

But somehow its returning a empty file..
No idea what did I do wrong

bartus11 · June 16, 2013, 7:33am

Post output of:

head -5 TRF_raw_data_reactor1.csv | cut -c 1-100

smitra · June 16, 2013, 7:38am

smitra:TRFLP-RawData smitra$ head -5 TRF_raw_data_reactor1.csv | cut -c 1-100
Sample Name,Marker,RE,Dye,Allele 1,Size 1,Height 1,Peak Area 1,Data Point 1,Allele 2,Size 2,Height 2
smitra:TRFLP-RawData smitra$

I also tried with

smitra:TRFLP-RawData smitra$ awk 'NR==1{for (i=1;i<=NF;i++) if ($i~"^Peak Area") {printf $i" ";a[i+1]=1};printf "\n"}
NR>1{printf $1" ";for (i=2;i<=NF;i++) if (i in a) printf $i" ";printf "\n"}' TRF_raw_data_reactor1.txt>test1.txt

, but still the same

bartus11 · June 16, 2013, 7:46am

Your "real" data is different from the one you posted as sample (comma as delimiter)... Try this:

awk -F"," 'NR==1{for (i=1;i<=NF;i++) if ($i~"^Peak Area") {printf $i" ";a[i+1]=1};printf "\n"}
NR>1{printf $1" ";for (i=2;i<=NF;i++) if (i in a) printf $i" ";printf "\n"}' TRF_raw_data_reactor1.txt>test1.txt