I am trying to remove al letters after : character on specific columns from 10th column till 827. I used sed and cut to do so but I am sure there is better one liner someone can think of from unix community members.
Huge file but it has this structure (Total number of Columns = 827, rows = 605278)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Control1 Control2 Case1 Case2
chr1 65872 . T G 2480.51 . AC=65;AN=92;SF=0,1;VRT=1 GT:GQ:DP:AD:PL 1/1:3:243:176,66:27,3,0 0/1:21:148:135,13:21,0,21 0/1:9:250:201,49:9,0,115 .
chr1 65893 . G A 433.77 . AC=7;AN=10;SF=0,1;VRT=1 . . 0/1:173:144,29:198,0,91:91 0/1:143:100,43:180,0,233:99
I am trying to get this structure
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Control1 Control2 Case1 Case2
chr1 65872 . T G 2480.51 . AC=65;AN=92;SF=0,1;VRT=1 GT:GQ:DP:AD:PL 1/1 0/1 0/1 .
chr1 65893 . G A 433.77 . AC=7;AN=10;SF=0,1;VRT=1 . . 0/1 0/1
However, this replaced only one of the columns then the rest all were empty.
Another bad idea was to cut the first 3 letter from the file.txt > file1.txt and re-join desired columns from file.txt to make file3.txt
It can be done but very lousy solution from my end. I thought of using awk find ":" replace with "" on specific columns 10th-827th but didnt know how to do so?
Please also explain what you mean by "remove al letters after : character". The sample data you provided did not contain any letters in the fields you specified and instead of removing letters after a <colon> character, you removed all punctuation and decimal digits from those fields after the 1st <colon> character and also removed that <colon> character.
Your input field separator seems to be a sequence of four <space> characters. Your output field separator seems to be a sequence of four or five <space> characters (although I didn't check all of the output sequences). Is there any reason why a single <space> is insufficient as a field separator in your output?
Assuming "letters" stands for "characters", and interpreting your desired output like "remove everything in a field after the first colon", and using <TAB> as the output field separator, would this come close to what you need:
awk 'NR > 1 {for (i=1; i<=NF; i++) sub (/:.*$/, _, $i)} 1' OFS="\t" file
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Control1 Control2 Case1 Case2
chr1 65872 . T G 2480.51 . AC=65;AN=92;SF=0,1;VRT=1 GT 1/1 0/1 0/1 .
chr1 65893 . G A 433.77 . AC=7;AN=10;SF=0,1;VRT=1 . . 0/1 0/1
EDIT: I made a typo: the loop should start at the 10. field only:
I apologies for not clarifying properly, its exactly how RuiC described.
RudiC your code worked as expected, is there a way to ignore certain columns e.g #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT (from 1-9) thanks again for your code and explanation.
I'm glad it helped and my assumptions / interpretations were adequate. But your new request again is ambiguous: Modify the header? Remove ALL columns 1 - 9 in ALL rows? Remove a selection of columns?
Not the Headers but the entire first 9 columns is there a way to make the code work but exempt the first 9 columns (i.e i want to keep the information after the : on the the the first 9 columns only) = Remove a selection of columns
I am trying to understand the logic behind this bit. I would be grateful if you can break it down ?
(/:.*$/, _, $i)
$ for the columns ? and the columns are specified earlier in I?
, _, substitute ? usually in sed i leave it empty for substitution when i want to remove it.
I always find awk is complicated, any good place to read more about it as I have more use for it for the same data to find in-common and unique values between range of columns (patients versus control)?
Thanks again
You are a life saver. I have been experimenting with this for 3 days
You got it, congrats! Here the sub logics (please note that $ means different things in regexes and in awk field designators):
(/:.*$/, # regex greedily matching ":" then ANY char (multiple) until EOL designated by $
_, # _ is an undefined thus empty variable, equivalent to ""
$i) # in the loop, $i results in fields $1 .. $9