cmccabe
1
I am trying to parse the input with awk so that $4 keeps the |gc= part, but I have not been able to. The command below is close:
awk so far:
awk '{sub(/\|[^[:blank:]]+[[:blank:]]+[0-9]+/, ""); print }' input.txt
Input
chr1 955543 955763 AGRN-6|pr=2|gc=75 0 +
chr1 957571 957852 AGRN-7|pr=3|gc=61.2 0 +
chr1 970621 970740 AGRN-8|pr=1|gc=57.1 0 +
Current Output
chr1 955543 955763 AGRN-6 +
chr1 957571 957852 AGRN-7 +
chr1 970621 970740 AGRN-8 +
Desired Output (each field separated by a tab)
chr1 955543 955763 AGRN-6|gc=75 +
chr1 957571 957852 AGRN-7|gc=61.2 +
chr1 970621 970740 AGRN-8|gc=57.1 +
awk '{
printf("%s\t%s\t%s\t%s\t%s\n", $1,$2,$3,$4,$6)
}' oldfile >newfile
Just do not print column 5, assuming your input examples are correct. You can also set the awk OFS variable to get tab separation.
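A minimal sketch of the OFS idea, using the three sample lines from the first post held in a shell variable (the variable names `input` and `out` are only illustrative). With OFS set to a tab, a comma-separated `print` emits tab-delimited fields, so the explicit printf format is not needed. Note that $4 still carries the |pr= part here.

```shell
# Sample input copied from the first post; "input" is just an illustrative name.
input='chr1 955543 955763 AGRN-6|pr=2|gc=75 0 +
chr1 957571 957852 AGRN-7|pr=3|gc=61.2 0 +
chr1 970621 970740 AGRN-8|pr=1|gc=57.1 0 +'

# -v OFS='\t' sets the output field separator before any line is read.
# print joins its comma-separated arguments with OFS; column 5 is simply not listed.
out=$(printf '%s\n' "$input" | awk -v OFS='\t' '{ print $1, $2, $3, $4, $6 }')

printf '%s\n' "$out"
```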
cmccabe
3
That `awk` produces:
chr1 955543 955763 AGRN-6|pr=2|gc=75 +
chr1 957571 957852 AGRN-7|pr=3|gc=61.2 +
chr1 970621 970740 AGRN-8|pr=1|gc=57.1 +
The |pr=2, |pr=3, and |pr=1 parts are not needed, and there appears to be a blank line after each row, which may be problematic for later analysis.
Thank you :).
Aia
4
awk '{n=split($4, a, "|"); print $1, $2, $3, a[1]"|"a[n], $6}' cmccabe.file
or:
awk '{n=split($4, a, "|"); print $1,$2,$3,a[1]"|"a[n],$6}' OFS="\t" cmccabe.file
cmccabe
5
I had something similar @Aia
awk '{split($4,a,"|"); print $1,$2,$3,a[1],"|",a[3],$6}' input
chr1 955543 955763 AGRN-6 | gc=75 +
chr1 957571 957852 AGRN-7 | gc=61.2 +
chr1 970621 970740 AGRN-8 | gc=57.1 +
but that outputs everything on one line. Your `awk` is much better, thank you :).
Aia
6
Yes, the commas in the print statement get translated into OFS.
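To illustrate the point about commas, here is a small sketch on a single sample line (the variable names are only illustrative). Arguments separated by commas in `print` are joined with OFS, while adjacent expressions with no comma between them are concatenated directly.

```shell
# One sample line from the first post.
line='chr1 955543 955763 AGRN-6|pr=2|gc=75 0 +'

# Commas: print inserts OFS (a single space by default) on both sides of the "|".
with_commas=$(printf '%s\n' "$line" | awk '{ split($4, a, "|"); print a[1], "|", a[3] }')

# No commas: the three pieces are concatenated, so no separator is inserted.
concatenated=$(printf '%s\n' "$line" | awk '{ split($4, a, "|"); print a[1] "|" a[3] }')

printf '%s\n' "$with_commas" "$concatenated"
```

The first form prints `AGRN-6 | gc=75` and the second prints `AGRN-6|gc=75`.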
Here's a Perl alternative:
perl -pe 's/(\|\w+=[\w\.]+){1,2}\s+\d+/$1/' cmccabe.file
RudiC
7
How about
awk '{sub (/\|.*\|/, "|")}1' file
chr1 955543 955763 AGRN-6|gc=75 0 +
chr1 957571 957852 AGRN-7|gc=61.2 0 +
chr1 970621 970740 AGRN-8|gc=57.1 0 +
?
Aia
8
Part of the requirement is to remove the 5th field.
awk '{sub (/\|.*\|/, "|", $4)}{print $1,$2,$3,$4,$6}' file
chr1 955543 955763 AGRN-6|gc=75 +
chr1 957571 957852 AGRN-7|gc=61.2 +
chr1 970621 970740 AGRN-8|gc=57.1 +