Parse file for fields and specific text

cmccabe · July 7, 2015, 4:24pm

I have a file of ~500,000 entries in the following format:

file.txt

chr1	11868	12227	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	12009	12057	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	12178	12227	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";

I am using Cygwin on Windows 7 and am trying to parse each line into 5 columns:

In the below
column 1 = $1 of the text
column 2 = $2 of the text
column 3 = $3 of the text
column 4 = gene_name "..." - quotes removed
column 5 = exon_number "...." - quotes removed

Example of desired output:

chr1	11868	12227     DDX11L1     1 
chr1	12009	12057     DDX11L1     1  
chr1	12178	12227     DDX11L1     2

I was able to generate file.txt but cannot seem to parse it correctly.

Thank you :).

senhia83 · July 7, 2015, 4:46pm

After 2 years and 457 posts in this forum you must have tried something? As far as I know, the format of your file.txt is pretty standard gtf/gff, so there is nothing to generate really. Please show what you tried.

cmccabe · July 7, 2015, 5:30pm

Yes, the gtf/gff was created with:

wget ftp://ftp.sanger.ac.uk/pub/gencode/release_21/gencode.21.annotation.gtf.gz
gunzip --stdout gencode.v21.annotation.gtf.gz \
    | gtf2bed - \
    | grep "exon" \
    > gencode.exons.bed
bedmap --echo --echo-map Regions.bed gencode.exons.bed

This produced output that was close, but not what I wanted, so I thought parsing the input first might help. That is, an exon file with only 5 columns may work better.

I'm not sure but maybe:

 awk -f FNR > 1{for(i=1;i<=NF;i++) {n=split($i,a, "[.:>_]") print a[1]+0,a[2]+0,a[3]+0,substr(a[gene_name],length(a[exon_number])), a[n]} } OFS='\t' gencode.exons.txt > parse.txt

Thank you :).

senhia83 · July 7, 2015, 7:14pm

One way, below, uses split; other methods include parsing by regex.

Try:

awk -F"\t"  '{n=split($10,a,";"); for (i=1;i<=n;i++) if (a[i]~/gene_name/) { split(a[i],b,"\"");x=b[2] } else if (a[i]~/exon_number/) {split(a[i],c," ");y=c[2]}; print $1,$2,$3,x,y}' OFS="\t" file
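
To see what the split calls are working with, here is a minimal sketch of the same idea run on a shortened version of the first sample record. The attribute column is trimmed for readability, and the shell variables `line` and `out` are names picked only for this illustration.

```shell
# Shortened version of the first sample record: 9 tab-separated columns
# followed by the attribute column (trimmed here for readability).
line=$(printf 'chr1\t11868\t12227\tENSG00000223972.5\t.\t+\tHAVANA\texon\t.\tgene_id "ENSG00000223972.5"; gene_name "DDX11L1"; exon_number 1; level 2;')

out=$(printf '%s\n' "$line" | awk -F'\t' -v OFS='\t' '{
    n = split($10, a, ";")              # one "key value" pair per element
    for (i = 1; i <= n; i++) {
        if (a[i] ~ /gene_name/) {
            split(a[i], b, "\"")        # b[2] is the text between the quotes
            x = b[2]
        } else if (a[i] ~ /exon_number/) {
            split(a[i], c, " ")         # c[1] is the key, c[2] the number
            y = c[2]
        }
    }
    print $1, $2, $3, x, y
}')

printf '%s\n' "$out"
```

Keeping the count returned by split in `n` avoids calling length() on an array, which not every awk accepts.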

cmccabe · July 8, 2015, 10:50am

Thank you :), works great and thank you for introducing me to split :).

RudiC · July 8, 2015, 2:47pm

Try also

awk '
        {match ($0, /gene_name [^ ]*/)
         T1=substr ($0, RSTART+11, RLENGTH-13)
         match ($0, /exon_number [^ ]*/)
         T2=substr ($0, RSTART+11, RLENGTH-12)
         print $1, $2, $3, T1, T2
        }
' FS="\t" OFS="\t" file
chr1    11868    12227    DDX11L1     1
chr1    12009    12057    DDX11L1     1
chr1    12178    12227    DDX11L1     2

cmccabe · July 8, 2015, 4:26pm

Thank you :).

cmccabe · September 17, 2015, 12:53pm

That code works great.
If I was trying to get the output to look like:

 chr1    11868    12227    DDX11L1:1 
chr1    12009    12057    DDX11L1:1 
chr1    12178    12227    DDX11L1:2

Basically, just gene:exon in field 4

 
cmccabe@DTV-A5211QLM:~/Desktop/NGS$ awk '
>         {match ($0, /gene_name [^ ]*/)
>          T1=substr ($0, RSTART+11, RLENGTH-13)
>          match ($0, /exon_number [^ ]*/)
>          T2=substr ($0, RSTART+11, RLENGTH-12)
>          print $1, $2, $3, T1:T2
>         }
> ' FS="\t" OFS="\t" /home/cmccabe/Desktop/NGS/bed/gencode.exons.bed > /home/cmccabe/Desktop/NGS/bed/parse2_gencode.bed

awk: line 6: syntax error at or near :
Thank you.

Corona688 · September 17, 2015, 1:23pm

T1:T2 is not valid awk syntax. Did you mean T1":"T2 ?

RudiC · September 17, 2015, 1:24pm

Use quotes: T1":"T2
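
As a small sketch of why the quotes matter: awk joins values simply by writing them next to each other, so a literal colon has to be a quoted string placed between the two variables. The values assigned to T1 and T2 here are made up for the demonstration.

```shell
# awk concatenates adjacent values; a literal like ":" must be a quoted string.
out=$(awk 'BEGIN { T1 = "DDX11L1"; T2 = 1; print T1 ":" T2 }')
printf '%s\n' "$out"    # prints DDX11L1:1
```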

cmccabe · September 17, 2015, 1:47pm

Thank you. I modified (very little) the awk to the below:

cmccabe@DTV-A5211QLM:~/Desktop/bed$ awk '
>         {match ($0, /gene_name [^ ]*/)
>          T1=substr ($0, RSTART+11, RLENGTH-13)
>          match ($0, /exon_number [^ ]*/)
>          T2=substr ($0, RSTART+11, RLENGTH-12)
>          print $1, $2, $3, T1":""exon"T2
>         }
> ' FS="\t" OFS="\t" /home/cmccabe/Desktop/NGS/bed/gencode.exons.bed > /home/cmccabe/Desktop/NGS/bed/parse2_gencode.bed

There appears to be a space between exon and 1 that may cause an issue later on. I'm not sure why the space is there or how to remove it? Thank you.

parse2.txt

chr1    11868    12227    DDX11L1:exon 1
chr1    11871    12227    DDX11L1:exon 1
chr1    11873    12227    DDX11L1:exon 1

Corona688 · September 17, 2015, 2:42pm

I suspect the space is actually part of T2.

This regex /exon_number [^ ]*/ I think should be /exon_number *[^ ]*/ in case there are multiple spaces
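
A quick way to check whether the space is part of T2, sketched on a shortened attribute string (the shell variable `attrs` and the bracket trick are only for illustration): printing T2 between brackets makes any leading whitespace visible.

```shell
# A fragment of the attribute column, enough to reproduce the match.
attrs='gene_name "DDX11L1"; exon_number 1; exon_id "ENSE00002234944.1";'

out=$(printf '%s\n' "$attrs" | awk '{
    match($0, /exon_number [^ ]*/)
    T2 = substr($0, RSTART + 11, RLENGTH - 12)
    print "[" T2 "]"                    # brackets make whitespace visible
}')
printf '%s\n' "$out"    # prints [ 1]
```

The key exon_number is 11 characters long, so RSTART+11 points at the space that follows it rather than at the number.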

RudiC · September 17, 2015, 4:48pm

You'd need to count the spaces, then, as they would increase the RSTART+X value.
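
If counting spaces is a concern, one alternative (a sketch, not the approach used elsewhere in this thread) is to take the whole match and strip the key with sub, so no fixed offset is needed. The input with three spaces is a made-up example, and `attrs` is a hypothetical variable name.

```shell
# Made-up input with several spaces after the key, to exercise the idea.
attrs='gene_name "DDX11L1"; exon_number   12; exon_id "ENSE00002234944.1";'

out=$(printf '%s\n' "$attrs" | awk '{
    match($0, /exon_number +[^ ;]+/)    # key, one or more spaces, value
    T2 = substr($0, RSTART, RLENGTH)    # the whole matched text
    sub(/^exon_number +/, "", T2)       # drop the key and the spaces after it
    gsub(/"/, "", T2)                   # drop quotes, if any
    print "[" T2 "]"
}')
printf '%s\n' "$out"    # prints [12]
```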

cmccabe · September 18, 2015, 1:34pm

The RSTART+X removed the space but now the output looks like:

chr1    11868    12227    DDX11L1:exonex
chr1    11871    12227    DDX11L1:exonex
chr1    11873    12227    DDX11L1:exonex

should be:

chr1    11868    12227    DDX11L1:exon1
chr1    11871    12227    DDX11L1:exon1
chr1    11873    12227    DDX11L1:exon1

Thank you :).

RudiC · September 18, 2015, 1:52pm

In this case, X is not a variable holding a value (an undefined awk variable evaluates to zero), but a placeholder for the unknown number of spaces matched by the * regex.
Try

T2=substr ($0, RSTART+12, RLENGTH-13)

cmccabe · September 18, 2015, 2:08pm

That worked. Thank you :).

This is the code:

awk '
        {match ($0, /gene_name [^ ]*/)
         T1=substr ($0, RSTART+11, RLENGTH-13)
         match ($0, /exon_number [^ ]*/)
         T2=substr ($0, RSTART+12, RLENGTH-13)
         print $1, $2, $3, T1":""exon""."T2
        }
' FS="\t" OFS="\t" /home/cmccabe/Desktop/NGS/bed/gencode.exons.bed > /home/cmccabe/Desktop/NGS/bed/parse_gencode.bed

I thought I understood but I think I am a bit off in my thinking, do you mind briefly explaining. Thank you very much.

RudiC · September 18, 2015, 2:33pm

match finds the pattern, which consists of a constant and a regex, and sets RSTART and RLENGTH. substr extracts the substring that starts at RSTART plus the length of the constant in the pattern, and whose length is RLENGTH minus the length of the constant minus any closing characters ( "; or ; ). Then fields 1 to 3 are printed, followed by the two substrings obtained as above.
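
The arithmetic can be followed on a short attribute string. This is a sketch only: `attrs` is a hypothetical variable, and the printf lines are added just to show RSTART and RLENGTH after each match.

```shell
attrs='gene_name "DDX11L1"; exon_number 1;'

out=$(printf '%s\n' "$attrs" | awk '{
    match($0, /gene_name [^ ]*/)
    # constant part: gene_name plus space plus opening quote = 11 characters
    # closing part:  quote plus semicolon                    =  2 characters
    T1 = substr($0, RSTART + 11, RLENGTH - 13)
    printf "RSTART=%d RLENGTH=%d value=%s\n", RSTART, RLENGTH, T1

    match($0, /exon_number [^ ]*/)
    # constant part: exon_number plus space = 12 characters
    # closing part:  semicolon              =  1 character
    T2 = substr($0, RSTART + 12, RLENGTH - 13)
    printf "RSTART=%d RLENGTH=%d value=%s\n", RSTART, RLENGTH, T2
}')
printf '%s\n' "$out"
# prints:
# RSTART=1 RLENGTH=20 value=DDX11L1
# RSTART=22 RLENGTH=14 value=1
```

Both calls subtract 13 from RLENGTH, but for different reasons: 11+2 for the quoted gene name, 12+1 for the unquoted exon number.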

Use ":exon." instead of ":""exon""." .

cmccabe · September 18, 2015, 3:51pm

Thank you very much for your help and explanation, I appreciate it :).