Removing letters after a certain character within a range of columns

daashti · January 3, 2017, 3:57am

Hi there,

I am trying to remove al letters after : character on specific columns from 10th column till 827. I used sed and cut to do so but I am sure there is better one liner someone can think of from unix community members.

Huge file but it has this structure (Total number of Columns = 827, rows = 605278)

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Control1    Control2    Case1    Case2
chr1    65872    .    T    G    2480.51    .    AC=65;AN=92;SF=0,1;VRT=1    GT:GQ:DP:AD:PL    1/1:3:243:176,66:27,3,0    0/1:21:148:135,13:21,0,21    0/1:9:250:201,49:9,0,115    .
chr1    65893    .    G    A    433.77    .    AC=7;AN=10;SF=0,1;VRT=1        .    .    0/1:173:144,29:198,0,91:91    0/1:143:100,43:180,0,233:99

I am trying to get this structure

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Control1    Control2    Case1    Case2
chr1    65872    .    T    G    2480.51    .    AC=65;AN=92;SF=0,1;VRT=1     GT:GQ:DP:AD:PL    1/1     0/1    0/1    .
chr1    65893    .    G    A    433.77    .    AC=7;AN=10;SF=0,1;VRT=1         .    .    0/1     0/1

Thanks for advance

matrixmadhan · January 3, 2017, 4:07am

Welcome to the forum!

Can you please show what is that you have tried so far? Thanks

---------- Post updated at 02:37 PM ---------- Previous update was at 02:32 PM ----------

Here is a rough sample

>cat inputfile
a-b-c-d
e-f-g-h

awk -F"-" '{ printf "\n"; for (i=1; i<=2; i++) { printf "-%s", $i } }' inputfile

daashti · January 3, 2017, 4:17am

Thanks for the quick reply,

I did

 sed 's/:.*//' file.txt > file1.txt

However, this replaced only one of the columns then the rest all were empty.

Another bad idea was to cut the first 3 letter from the file.txt > file1.txt and re-join desired columns from file.txt to make file3.txt

It can be done but very lousy solution from my end. I thought of using awk find ":" replace with "" on specific columns 10th-827th but didnt know how to do so?

Don_Cragun · January 3, 2017, 4:28am

Please also explain what you mean by "remove al letters after : character". The sample data you provided did not contain any letters in the fields you specified and instead of removing letters after a <colon> character, you removed all punctuation and decimal digits from those fields after the 1st <colon> character and also removed that <colon> character.

Your input field separator seems to be a sequence of four <space> characters. Your output field separator seems to be a sequence of four or five <space> characters (although I didn't check all of the output sequences). Is there any reason why a single <space> is insufficient as a field separator in your output?

RudiC · January 3, 2017, 4:46am

Assuming "letters" stands for "characters", and interpreting your desired output like "remove everything in a field after the first colon", and using <TAB> as the output field separator, would this come close to what you need:

awk 'NR > 1 {for (i=1; i<=NF; i++) sub (/:.*$/, _, $i)} 1' OFS="\t" file
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Control1    Control2    Case1    Case2
chr1	65872	.	T	G	2480.51	.	AC=65;AN=92;SF=0,1;VRT=1	GT	1/1	0/1	0/1	.
chr1	65893	.	G	A	433.77	.	AC=7;AN=10;SF=0,1;VRT=1	.	.	0/1	0/1

EDIT: I made a typo: the loop should start at the 10. field only:

awk 'NR > 1 {for (i=10; i<=NF; i++) sub (/:.*$/, _, $i)} 1' OFS="\t" file

daashti · January 3, 2017, 5:13am

Thanks Don Cragun and RudiC,

I apologies for not clarifying properly, its exactly how RuiC described.
RudiC your code worked as expected, is there a way to ignore certain columns e.g #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT (from 1-9) thanks again for your code and explanation.

RudiC · January 3, 2017, 5:22am

I'm glad it helped and my assumptions / interpretations were adequate. But your new request again is ambiguous: Modify the header? Remove ALL columns 1 - 9 in ALL rows? Remove a selection of columns?

daashti · January 3, 2017, 11:01am

Again sorry,

Not the Headers but the entire first 9 columns is there a way to make the code work but exempt the first 9 columns (i.e i want to keep the information after the : on the the the first 9 columns only) = Remove a selection of columns

Thanks

RudiC · January 3, 2017, 11:19am

Yes. Look at the code and how it loops across fields 10 until end-of-line. Now imagine what were needed to make it loop across fields 1 to 9...

daashti · January 3, 2017, 12:13pm

awk 'NR > 1 {for (i=1; i<=9; i++) sub (/:.*$/, _, $i)} 1' OFS="\t" file

right?

I am trying to understand the logic behind this bit. I would be grateful if you can break it down ?

(/:.*$/, _, $i)

$ for the columns ? and the columns are specified earlier in I?
, _, substitute ? usually in sed i leave it empty for substitution when i want to remove it.
I always find awk is complicated, any good place to read more about it as I have more use for it for the same data to find in-common and unique values between range of columns (patients versus control)?

Thanks again
You are a life saver. I have been experimenting with this for 3 days

RudiC · January 3, 2017, 12:53pm

You got it, congrats! Here the sub logics (please note that $ means different things in regexes and in awk field designators):

(/:.*$/,        # regex greedily matching ":" then ANY char (multiple) until EOL designated by $
 _,             # _ is an undefined thus empty variable, equivalent to ""
 $i)            # in the loop, $i results in fields $1 .. $9