I have a file of ~500,000 entries in the following format:
file.txt
chr1 11868 12227 ENSG00000223972.5 . + HAVANA exon . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 12009 12057 ENSG00000223972.5 . + HAVANA exon . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 12178 12227 ENSG00000223972.5 . + HAVANA exon . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
I am using Cygwin on Windows 7 and am trying to parse each line into 5 columns:
column 1 = $1 of the text
column 2 = $2 of the text
column 3 = $3 of the text
column 4 = gene_name "..." with the quotes removed
column 5 = exon_number "...." with the quotes removed
Example of desired output:
chr1 11868 12227 DDX11L1 1
chr1 12009 12057 DDX11L1 1
chr1 12178 12227 DDX11L1 2
I was able to generate the file.txt but cannot seem to parse it correctly.
Thank you :).
After 2 years and 457 posts in this forum you must have tried something? As far as I know, the format of your file.txt is a pretty standard gtf/gff, so there is nothing to generate really. Please show what you tried.
Yes, the gtf/gff was created with:
wget ftp://ftp.sanger.ac.uk/pub/gencode/release_21/gencode.21.annotation.gtf.gz
gunzip --stdout gencode.v21.annotation.gtf.gz \
| gtf2bed - \
| grep "exon" \
> gencode.exons.bed
bedmap --echo --echo-map Regions.bed gencode.exons.bed
This produced output that was close, but not what I wanted, and I thought that parsing the input first might help. That is, an exon file with only 5 columns might work better.
I'm not sure but maybe:
awk -f FNR > 1{for(i=1;i<=NF;i++) {n=split($i,a, "[.:>_]") print a[1]+0,a[2]+0,a[3]+0,substr(a[gene_name],length(a[exon_number])), a[n]} } OFS='\t' gencode.exons.txt > parse.txt
Thank you :).
One way is below, using split; other methods include parsing by regex. Try:
awk -F"\t" '{n=split($10,a,";"); for (i=1;i<=n;i++) if (a[i]~/gene_name/) { split(a[i],b,"\"");x=b[2] } else if (a[i]~/exon_number/) {split(a[i],c," ");y=c[2]}; print $1,$2,$3,x,y}' OFS="\t" file
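Spread over several lines, the split approach might look like the sketch below. The sample record and the function name split_attrs are made up for illustration, and it assumes the tab-separated gtf2bed layout in which the attribute list is field 10.

```shell
# One record in the tab-separated gtf2bed layout; the attribute list is trimmed for brevity.
sample=$(printf 'chr1\t12178\t12227\tENSG00000223972.5\t.\t+\tHAVANA\texon\t.\tgene_id "ENSG00000223972.5"; gene_name "DDX11L1"; exon_number 2; level 2;')

# Hypothetical helper: print chr, start, end, gene name and exon number.
split_attrs() {
    awk -F'\t' -v OFS='\t' '{
        n = split($10, a, ";")              # one "key value" pair per element
        for (i = 1; i <= n; i++) {
            if (a[i] ~ /gene_name/) {
                split(a[i], b, "\"")        # b[2] is the text between the quotes
                gene = b[2]
            } else if (a[i] ~ /exon_number/) {
                split(a[i], c, " ")         # c[1] is the key, c[2] the number
                exon = c[2]
            }
        }
        print $1, $2, $3, gene, exon
    }'
}

printf '%s\n' "$sample" | split_attrs
```

Taking the element count from the return value of split, rather than from length(a), keeps the loop working in awk versions that do not accept an array argument to length.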
Thank you :), works great and thank you for introducing me to split :).
RudiC
July 8, 2015, 2:47pm
Try also
awk '
{match ($0, /gene_name [^ ]*/)
T1=substr ($0, RSTART+11, RLENGTH-13)
match ($0, /exon_number [^ ]*/)
T2=substr ($0, RSTART+11, RLENGTH-12)
print $1, $2, $3, T1, T2
}
' FS="\t" OFS="\t" file
chr1 11868 12227 DDX11L1 1
chr1 12009 12057 DDX11L1 1
chr1 12178 12227 DDX11L1 2
cmccabe
September 17, 2015, 12:53pm
That code works great.
If I was trying to get the output to look like:
chr1 11868 12227 DDX11L1:1
chr1 12009 12057 DDX11L1:1
chr1 12178 12227 DDX11L1:2
Basically, just gene:exon in field 4
cmccabe@DTV-A5211QLM:~/Desktop/NGS$ awk '
> {match ($0, /gene_name [^ ]*/)
> T1=substr ($0, RSTART+11, RLENGTH-13)
> match ($0, /exon_number [^ ]*/)
> T2=substr ($0, RSTART+11, RLENGTH-12)
> print $1, $2, $3, T1:T2
> }
> ' FS="\t" OFS="\t" /home/cmccabe/Desktop/NGS/bed/gencode.exons.bed > /home/cmccabe/Desktop/NGS/bed/parse2_gencode.bed
awk: line 6: syntax error at or near :
Thank you
T1:T2 is not valid awk syntax. Did you mean T1":"T2 ?
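For reference, awk has no explicit concatenation operator: expressions written side by side are joined, so a literal colon has to be a quoted string. A minimal sketch, where the function name join_gene_exon and the sample values are only for illustration:

```shell
# Hypothetical helper: join a gene name and an exon number with ":".
join_gene_exon() {
    awk -v gene="$1" -v exon="$2" 'BEGIN {
        # Adjacent expressions are concatenated; ":" must be a string literal.
        print gene ":" exon
    }'
}

join_gene_exon DDX11L1 1
```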
cmccabe
September 17, 2015, 1:47pm
Thank you. I modified the awk (very little) to the below:
cmccabe@DTV-A5211QLM:~/Desktop/bed$ awk '
> {match ($0, /gene_name [^ ]*/)
> T1=substr ($0, RSTART+11, RLENGTH-13)
> match ($0, /exon_number [^ ]*/)
> T2=substr ($0, RSTART+11, RLENGTH-12)
> print $1, $2, $3, T1":""exon"T2
> }
> ' FS="\t" OFS="\t" /home/cmccabe/Desktop/NGS/bed/gencode.exons.bed > /home/cmccabe/Desktop/NGS/bed/parse2_gencode.bed
There appears to be a space between exon and 1 that may cause an issue later on. I'm not sure why the space is there or how to remove it? Thank you.
parse2.txt
chr1 11868 12227 DDX11L1:exon 1
chr1 11871 12227 DDX11L1:exon 1
chr1 11873 12227 DDX11L1:exon 1
I suspect the space is actually part of T2. I think this regex, /exon_number [^ ]*/, should be /exon_number *[^ ]*/ in case there are multiple spaces.
RudiC
September 17, 2015, 4:48pm
You'd need to count the spaces, then, as they would increase the RSTART+X value.
cmccabe
September 18, 2015, 1:34pm
The RSTART+X removed the space but now the output looks like:
chr1 11868 12227 DDX11L1:exonex
chr1 11871 12227 DDX11L1:exonex
chr1 11873 12227 DDX11L1:exonex
It should be:
chr1 11868 12227 DDX11L1:exon1
chr1 11871 12227 DDX11L1:exon1
chr1 11873 12227 DDX11L1:exon1
Thank you :).
RudiC
September 18, 2015, 1:52pm
In this case, X is not a variable with contents (used literally it evaluates to zero, as it's undefined), but a placeholder for the unknown number of spaces matched by the * in the regex.
Try
T2=substr ($0, RSTART+12, RLENGTH-13)
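To see what each pair of offsets picks out, it can help to print the extracted text between brackets. A small sketch, assuming a single blank after exon_number; the sample line is trimmed from the first record in the thread and show_t2 is a made-up helper name:

```shell
# Attribute text trimmed from the first record quoted in the thread.
line='gene_name "DDX11L1"; exon_number 1; exon_id "ENSE00002234944.1";'

show_t2() {
    # $1 = offset added to RSTART, $2 = amount subtracted from RLENGTH
    printf '%s\n' "$line" | awk -v off="$1" -v cut="$2" '{
        match($0, /exon_number [^ ]*/)      # matches: exon_number 1;
        # Brackets make a leading blank or a trailing ";" visible.
        print "[" substr($0, RSTART + off, RLENGTH - cut) "]"
    }'
}

show_t2 11 12   # offsets from the earlier script: prints [ 1]
show_t2 12 13   # offsets suggested above: prints [1]
```

The key exon_number is 11 characters long, so RSTART+11 lands on the blank and RSTART+12 on the first digit; subtracting 13 from RLENGTH removes the key, the blank and the trailing semicolon.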
cmccabe
September 18, 2015, 2:08pm
That worked. Thank you :).
This is the code:
awk '
{match ($0, /gene_name [^ ]*/)
T1=substr ($0, RSTART+11, RLENGTH-13)
match ($0, /exon_number [^ ]*/)
T2=substr ($0, RSTART+12, RLENGTH-13)
print $1, $2, $3, T1":""exon""."T2
}
' FS="\t" OFS="\t" /home/cmccabe/Desktop/NGS/bed/gencode.exons.bed > /home/cmccabe/Desktop/NGS/bed/parse_gencode.bed
I thought I understood, but I think I am a bit off in my thinking. Do you mind briefly explaining? Thank you very much.
RudiC
September 18, 2015, 2:33pm
match finds the pattern consisting of a constant and a regex, and sets RSTART (where the match begins) and RLENGTH (how long it is). substr extracts the substring starting at RSTART plus the length of the constant in the pattern, with a length of RLENGTH minus the length of the constant minus any additional closing characters ( "; or ; ). Then fields 1 to 3 are printed, followed by the two substrings obtained as above.
Use ":exon." instead of ":""exon"".".
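If counting characters by hand feels fragile, the same match/substr idea can take its offsets from length() of the key. This is only a sketch and not the code from this thread: the helper names attr and gene_exon_bed and the trimmed sample record are assumptions, and it stops the match at the semicolon instead of at a blank.

```shell
# One tab-separated record in the gtf2bed column layout, attributes trimmed for brevity.
record=$(printf 'chr1\t11868\t12227\tENSG00000223972.5\t.\t+\tHAVANA\texon\t.\tgene_id "ENSG00000223972.5"; gene_name "DDX11L1"; exon_number 1; exon_id "ENSE00002234944.1";')

gene_exon_bed() {
    awk -F'\t' -v OFS='\t' '
    # Hypothetical helper: value of attribute "key", or "" if it is absent.
    function attr(key,    val) {
        if (!match($0, key " [^;]*")) return ""
        # Skip the key and the blank after it; keep the rest of the match.
        val = substr($0, RSTART + length(key) + 1, RLENGTH - length(key) - 1)
        gsub(/"/, "", val)    # drop the surrounding quotes, if any
        return val
    }
    { print $1, $2, $3, attr("gene_name") ":exon." attr("exon_number") }
    '
}

printf '%s\n' "$record" | gene_exon_bed
```

Because the match ends before the semicolon, only the key and one blank have to be skipped, and the quotes are removed afterwards with gsub.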
cmccabe
September 18, 2015, 3:51pm
Thank you very much for your help and explanation, I appreciate it :).