Hello Unix experts,
I need help creating a subset of a file. I know that with the cut command it's very easy to select many different columns, or a threshold. My problem is that my data file is big, and I don't want to identify the column numbers or names manually. I am trying to find a way to automate this.
For example I have a file with about 1500 columns from TRFLP intensity data.
Suppose I want to create a subset by selecting all the columns named Peak.Area.1, Peak.Area.2, etc. (as in Unix, Peak.Area.*).
How can I do that in an easy way?
Thanks a lot for the help.
Best wishes,
Mitra
Sample.Name Marker RE Dye Allele.1 Size.1 Height.1 Peak.Area.1 Data.Point.1 Allele.2 Size.2 Height.2 Peak.Area.2 Data.Point.2
1 D71I1A _Internal_Marker_Dye_Blue_ ALU B 0 NA NA NA NA 0 NA NA NA NA
2 D71I1A _Internal_Marker_Dye_Green_ ALU G 0 NA NA NA NA 0 NA NA NA NA
3 D71I1A _Internal_Marker_Dye_Blue_ BSU B 0 NA NA NA NA 0 NA NA NA NA
4 D71I1A _Internal_Marker_Dye_Green_ BSU G 0 NA NA NA NA 0 NA NA NA NA
5 D71I1B _Internal_Marker_Dye_Blue_ ALU B 0 NA NA NA NA 0 55.54 20 211 1576
6 D71I1B _Internal_Marker_Dye_Green_ ALU G 0 NA NA NA NA 0 NA NA NA NA
7 D71I1B _Internal_Marker_Dye_Blue_ BSU B 0 NA NA NA NA 0 NA NA NA NA
8 D71I1B _Internal_Marker_Dye_Green_ BSU G 0 NA NA NA NA 0 NA NA NA NA
9 D71I1C _Internal_Marker_Dye_Blue_ ALU B 0 NA NA NA NA 0 55.38 18 192 1554
10 D71I1C _Internal_Marker_Dye_Green_ ALU G 0 NA NA NA NA 0 NA NA NA NA
And I want an output like:
Peak.Area.1 Peak.Area.2
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA 211
6 NA NA
7 NA NA
8 NA NA
9 NA 192
10 NA NA
But this is just an example. I want it for a big file with over 1000 columns, so I can't specify columns 8 and 13 as in this example.
Instead I want to use the names Peak.Area.1, Peak.Area.2, Peak.Area.3, etc., something like Peak.Area.*.
Thanks,
Mitra
awk 'NR==1{for (i=1;i<=NF;i++) if ($i~/^Peak\.Area/) {printf "%s ", $i; a[i+1]=1}; printf "\n"}
# data rows start with a row number that the header lacks, hence a[i+1] above
NR>1{printf "%s ", $1; for (i=2;i<=NF;i++) if (i in a) printf "%s ", $i; printf "\n"}' file
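For anyone who wants to reuse this with other column names, the same idea can be wrapped in a small shell function. This is only a sketch under a few assumptions: the function name `select_cols` is made up for illustration, the input is whitespace-separated, and every data row carries one leading row-number field that the header line lacks (as in the sample above), which is why matching header positions are shifted by one.

```shell
# select_cols PATTERN FILE
# Hypothetical helper: print the row label plus every column whose
# header name matches the regular expression PATTERN.
select_cols() {
    awk -v pat="$1" '
    NR == 1 {
        # Header line: remember matching columns, shifted by one because
        # data rows are assumed to start with an extra row-number field.
        for (i = 1; i <= NF; i++)
            if ($i ~ pat) {
                keep[i + 1] = 1
                hdr = hdr (hdr == "" ? "" : " ") $i
            }
        print hdr
        next
    }
    {
        # Data line: row label first, then only the remembered fields.
        line = $1
        for (i = 2; i <= NF; i++)
            if (i in keep) line = line " " $i
        print line
    }' "$2"
}
```

It could be called as `select_cols '^Peak[.]Area[.]' file`. The dots are written as `[.]` so they match a literal dot without relying on backslash handling in `-v` assignments.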
smitra:TRFLP-RawData smitra$ head -5 TRF_raw_data_reactor1.csv | cut -c 1-100
Sample Name,Marker,RE,Dye,Allele 1,Size 1,Height 1,Peak Area 1,Data Point 1,Allele 2,Size 2,Height 2
smitra:TRFLP-RawData smitra$
I also tried with
smitra:TRFLP-RawData smitra$ awk 'NR==1{for (i=1;i<=NF;i++) if ($i~"^Peak Area") {printf $i" ";a[i+1]=1};printf "\n"}
NR>1{printf $1" ";for (i=2;i<=NF;i++) if (i in a) printf $i" ";printf "\n"}' TRF_raw_data_reactor1.txt>test1.txt
Your "real" data differs from the sample you posted: it uses a comma as the delimiter. Try this:
awk -F"," 'NR==1{for (i=1;i<=NF;i++) if ($i~/^Peak Area/) {printf "%s ", $i; a[i+1]=1}; printf "\n"}
NR>1{printf "%s ", $1; for (i=2;i<=NF;i++) if (i in a) printf "%s ", $i; printf "\n"}' TRF_raw_data_reactor1.txt > test1.txt
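One thing to watch: the `a[i+1]` shift assumes each data row has one more leading field than the header, as in the space-separated sample. Whether that holds for the comma-separated file is not visible from the header line alone. A possible variant, sketched below, works the shift out from the field counts instead of hard-coding it. The function name `csv_cols` is hypothetical, the sketch assumes no quoted fields with embedded commas, and unlike the one-liner above it prints only the matching columns, not the first field of each row.

```shell
# csv_cols PATTERN FILE
# Hypothetical helper: keep only the columns of a comma-separated file
# whose header name matches the regular expression PATTERN.
csv_cols() {
    awk -F',' -v OFS=',' -v pat="$1" '
    NR == 1 {
        nhdr = NF                      # number of header fields
        for (i = 1; i <= NF; i++)
            if ($i ~ pat) {
                want[++n] = i          # header position of each match
                out = out (n > 1 ? OFS : "") $i
            }
        print out
        next
    }
    NF < nhdr { next }                 # skip blank or truncated lines
    {
        # off is 1 when rows carry an extra leading label, 0 otherwise
        off = NF - nhdr
        out = ""
        for (k = 1; k <= n; k++)
            out = out (k > 1 ? OFS : "") $(want[k] + off)
        print out
    }' "$2"
}
```

For the file in this thread the call would look like `csv_cols '^Peak Area' TRF_raw_data_reactor1.txt > test1.txt`.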