How to use substr to extract characters between two colons?

sajmar · March 13, 2015, 3:36pm

Dear folks
Hello
I have a data set in which one of the columns is a string, and I want to extract the number that lies between two ":" characters. I know the substr command can do this, but my problem is that the numbers between the two ":" have different numbers of digits, which makes the extraction difficult. Here is part of my data set for better understanding.

1/1:0,6:6:12:150,12,0
0/1:6,4:10:99:126,0,195
1/1:100,50:150:34,0,210

In this case, I want to extract the red number from my data set.
I would be glad if anyone could help me.
Thanks
Sajmar

kenshinhimura · March 13, 2015, 3:56pm


$ cat  yourfile
1/1:0,6:6:12:150,12,0
0/1:6,4:10:99:126,0,195
1/1:100,50:150:34,0,210


$ cat yourfile | awk 'BEGIN{FS=OFS=":"}{$3="";}1'
1/1:0,6::12:150,12,0
0/1:6,4::99:126,0,195
1/1:100,50::34,0,210

sajmar · March 13, 2015, 4:06pm

Thank you, kenshinhimura, for your suggestion. This column is the tenth column of my data set. Could you please tell me how I could specify the column number in this command?

kenshinhimura · March 13, 2015, 4:08pm

From:
awk 'BEGIN{FS=OFS=":"}{$3="";}1'



To:
awk 'BEGIN{FS=OFS=":"}{$10="";}1'

Don_Cragun · March 13, 2015, 4:15pm

Or to parameterize it:

FieldNumber=10
awk -v f="$FieldNumber" 'BEGIN{FS = OFS = ":"}{$f = ""}1' input_file

sajmar · March 13, 2015, 4:33pm

I want to extract the red numbers from the string column I have. When I run the command you suggested, it removes the red numbers that are between the two colons. I just want to keep those numbers.

---------- Post updated at 04:33 PM ---------- Previous update was at 04:29 PM ----------

Dear Don Cragun
I have 300 files, and I used the command below to extract columns 1, 2, 4, and 5 and the first three numbers of the tenth column of my data set.

for file in *.work; do awk '{print $1,$2,$4,$5,substr($10,1,1),substr($10,3,1)}' "$file" > "$(basename "$file" .work).info"; done

My problem is that the numbers in the middle of the string in the tenth column have different numbers of digits.

kenshinhimura · March 13, 2015, 4:36pm

Next time, give a desired output.

Maybe this is what you want:

awk -F: '{print $10}'

sajmar · March 13, 2015, 4:45pm

It seems this will not give my desired output.

kenshinhimura · March 13, 2015, 4:48pm

Can you post your desired output?

sajmar · March 13, 2015, 4:56pm

This is one row of my raw data set, with 10 columns:

gi|358485511|ref|NC_006088.3|         699545         .     A         G           122.03            PASS           AC=2;AF=1.00;AN=2;DP=6;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=56.04;MQ0=0;QD=20.34                            GT:AD:PP:GQ:PL                  1/1:0,6:8:12:150,12,0

My desired output is:

gi|358485511|ref|NC_006088.3|   699545     A       G   1 1    8

As a reminder, I have 300 files with the same data set structure.

vgersh99 · March 13, 2015, 5:19pm

awk '{split($NF,a, "[/:,]"); print $1,$2,$4,$5, a[1], a[2], a[5]}' myFile

Next time, please post EXACTLY what your input is and what the desired output would be.
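
To see what the split produces, here is a small sketch that runs the same awk program on the sample row posted above. It assumes the columns are separated by whitespace (the runs of spaces in the sample are collapsed to single spaces here) and that the colon-separated string is the last field; the variable names line and result are only for illustration.

```shell
# Sample row from the thread, with whitespace between columns collapsed.
line='gi|358485511|ref|NC_006088.3| 699545 . A G 122.03 PASS AC=2;AF=1.00;AN=2;DP=6;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=56.04;MQ0=0;QD=20.34 GT:AD:PP:GQ:PL 1/1:0,6:8:12:150,12,0'

# Split the last field on "/", ":" or ",". Pieces 1 and 2 are the two
# digits around the "/", and piece 5 is the value between the second
# and third colon, however many digits it has.
result=$(printf '%s\n' "$line" |
  awk '{ split($NF, a, "[/:,]"); print $1, $2, $4, $5, a[1], a[2], a[5] }')

printf '%s\n' "$result"
```

This prints gi|358485511|ref|NC_006088.3| 699545 A G 1 1 8. Note that a[5] is the wanted value only when the second sub-field holds exactly two comma-separated numbers, as it does in the rows shown; if that count could vary, splitting on ":" alone and taking the third piece would not depend on it.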

Don_Cragun · March 13, 2015, 5:43pm

Is each of your 300 files one line? Or are there multiple lines in each file?

What operating system are you using?

What shell are you using?

Does the command:

ls *.work

succeed, or does it give you a diagnostic saying that your argument list is too long?
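
If ls *.work does succeed, one possible approach (a sketch, not something posted in the thread) is to let a single awk invocation read all the files and switch output files whenever a new input file starts. It assumes whitespace-separated columns, the colon-separated string in column 10, and the .work/.info naming used in the loop earlier in the thread; process_all is a hypothetical helper name.

```shell
# process_all: hypothetical helper; pass the .work files as arguments.
# Every argument is assumed to end in ".work"; otherwise the output name
# would equal the input name.
process_all() {
  awk '
    FNR == 1 {                      # first line of each input file
      if (out != "") close(out)     # keep the number of open files small
      out = FILENAME
      sub(/\.work$/, ".info", out)  # foo.work -> foo.info
    }
    {
      split($10, part, ":")         # part[3]: text between 2nd and 3rd colon
      split(part[1], gt, "/")       # gt[1], gt[2]: digits around the "/"
      print $1, $2, $4, $5, gt[1], gt[2], part[3] > out
    }
  ' "$@"
}

# Usage: process_all *.work
```

Splitting on ":" first means the third piece is taken whole, so the number of digits in it does not matter.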