Hello,
My input file1 is tab-delimited and looks like this:
chr1 mm10_knownGene stop_codon 3216022 3216024 0.000000 - . gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1 mm10_knownGene CDS 3216025 3216968 0.000000 - 2 gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1 mm10_knownGene exon 3214482 3216968 0.000000 - . gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1 mm10_knownGene CDS 3421702 3421901 0.000000 - 1 gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1 mm10_knownGene exon 3421702 3421901 0.000000 - . gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1 mm10_knownGene CDS 3670552 3671348 0.000000 - 0 gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1 mm10_knownGene start_codon 3671346 3671348 0.000000 - . gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1 mm10_knownGene exon 3670552 3671498 0.000000 - . gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";
chr1 mm10_knownGene start_codon 4857914 4857916 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene CDS 4857914 4857976 0.000000 + 0 gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene exon 4857694 4857976 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene CDS 4867470 4867532 0.000000 + 0 gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene exon 4867470 4867532 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene CDS 4878027 4878132 0.000000 + 0 gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene exon 4878027 4878132 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene CDS 4886744 4886831 0.000000 + 2 gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene exon 4886744 4886831 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene CDS 4889460 4889602 0.000000 + 1 gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene exon 4889460 4889602 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene CDS 4890740 4890796 0.000000 + 2 gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene exon 4890740 4890796 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene CDS 4891915 4892069 0.000000 + 2 gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene exon 4891915 4892069 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene CDS 4893417 4893563 0.000000 + 0 gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene exon 4893417 4893563 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene CDS 4894934 4895005 0.000000 + 0 gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene exon 4894934 4895005 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene CDS 4896356 4896361 0.000000 + 0 gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene stop_codon 4896362 4896364 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1";
chr1 mm10_knownGene exon 4896356 4897909 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1";
My input file2 looks like this:
uc007aeu.1 Xkr4
uc011wht.1 Tcea1
Now I want to replace the IDs in file1 (the values after gene_id and transcript_id) with the matching second-column value from file2. I tried splitting out the columns and joining on the ID, but join needs sorted input and I do NOT want the row order of file1 to change, so I could not get the output I need. Any ideas are highly appreciated.
Please note that the input file row order should not be changed.
Thanks
Not sure what you mean exactly. Perhaps something like this?
awk 'NR==FNR{A[$1]=$2; next} $2 in A{$2=$4=A[$2]}1' FS='\t' file2 FS=\" OFS=\" file1
Output:
chr1 mm10_knownGene stop_codon 3216022 3216024 0.000000 - . gene_id "Xkr4"; transcript_id "Xkr4";
chr1 mm10_knownGene CDS 3216025 3216968 0.000000 - 2 gene_id "Xkr4"; transcript_id "Xkr4";
chr1 mm10_knownGene exon 3214482 3216968 0.000000 - . gene_id "Xkr4"; transcript_id "Xkr4";
chr1 mm10_knownGene CDS 3421702 3421901 0.000000 - 1 gene_id "Xkr4"; transcript_id "Xkr4";
chr1 mm10_knownGene exon 3421702 3421901 0.000000 - . gene_id "Xkr4"; transcript_id "Xkr4";
chr1 mm10_knownGene CDS 3670552 3671348 0.000000 - 0 gene_id "Xkr4"; transcript_id "Xkr4";
chr1 mm10_knownGene start_codon 3671346 3671348 0.000000 - . gene_id "Xkr4"; transcript_id "Xkr4";
chr1 mm10_knownGene exon 3670552 3671498 0.000000 - . gene_id "Xkr4"; transcript_id "Xkr4";
chr1 mm10_knownGene start_codon 4857914 4857916 0.000000 + . gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene CDS 4857914 4857976 0.000000 + 0 gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene exon 4857694 4857976 0.000000 + . gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene CDS 4867470 4867532 0.000000 + 0 gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene exon 4867470 4867532 0.000000 + . gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene CDS 4878027 4878132 0.000000 + 0 gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene exon 4878027 4878132 0.000000 + . gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene CDS 4886744 4886831 0.000000 + 2 gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene exon 4886744 4886831 0.000000 + . gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene CDS 4889460 4889602 0.000000 + 1 gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene exon 4889460 4889602 0.000000 + . gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene CDS 4890740 4890796 0.000000 + 2 gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene exon 4890740 4890796 0.000000 + . gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene CDS 4891915 4892069 0.000000 + 2 gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene exon 4891915 4892069 0.000000 + . gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene CDS 4893417 4893563 0.000000 + 0 gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene exon 4893417 4893563 0.000000 + . gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene CDS 4894934 4895005 0.000000 + 0 gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene exon 4894934 4895005 0.000000 + . gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene CDS 4896356 4896361 0.000000 + 0 gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene stop_codon 4896362 4896364 0.000000 + . gene_id "Tcea1"; transcript_id "Tcea1";
chr1 mm10_knownGene exon 4896356 4897909 0.000000 + . gene_id "Tcea1"; transcript_id "Tcea1";
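For anyone who wants to see why this keeps the row order, here is the same idea spelled out as a small sketch. The file names (map.tsv, genes.gtf, renamed.gtf) and the array name are placeholders, and the two sample rows are just copied from the data above. The lookup table is built from the first file, then the second file is streamed line by line, so nothing is sorted.

```shell
# Placeholder sample files; the names are assumptions for this demo.
printf 'uc007aeu.1\tXkr4\nuc011wht.1\tTcea1\n' > map.tsv
printf 'chr1\tmm10_knownGene\tstop_codon\t3216022\t3216024\t0.000000\t-\t.\tgene_id "uc007aeu.1"; transcript_id "uc007aeu.1";\n' > genes.gtf
printf 'chr1\tmm10_knownGene\tstart_codon\t4857914\t4857916\t0.000000\t+\t.\tgene_id "uc011wht.1"; transcript_id "uc011wht.1";\n' >> genes.gtf

awk '
    # First file (tab-separated): remember the name for each ID.
    NR == FNR { name[$1] = $2; next }
    # Second file: fields are split on double quotes, so $2 and $4
    # hold the gene_id and transcript_id values.
    $2 in name { $2 = $4 = name[$2] }
    # Print every line, changed or not, in its original position.
    { print }
' FS='\t' map.tsv FS='"' OFS='"' genes.gtf > renamed.gtf

cat renamed.gtf
```

Lines whose ID is not in the map file pass through unchanged, because the replacement only runs when the `$2 in name` test succeeds.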
--
Or did you mean:
awk 'NR==FNR{A[$1]=$2; next} $2 in A{$0=$0 A[$2]}1' FS='\t' file2 FS=\" file1
chr1 mm10_knownGene stop_codon 3216022 3216024 0.000000 - . gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1 mm10_knownGene CDS 3216025 3216968 0.000000 - 2 gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1 mm10_knownGene exon 3214482 3216968 0.000000 - . gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1 mm10_knownGene CDS 3421702 3421901 0.000000 - 1 gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1 mm10_knownGene exon 3421702 3421901 0.000000 - . gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1 mm10_knownGene CDS 3670552 3671348 0.000000 - 0 gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1 mm10_knownGene start_codon 3671346 3671348 0.000000 - . gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1 mm10_knownGene exon 3670552 3671498 0.000000 - . gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1 mm10_knownGene start_codon 4857914 4857916 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene CDS 4857914 4857976 0.000000 + 0 gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene exon 4857694 4857976 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene CDS 4867470 4867532 0.000000 + 0 gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene exon 4867470 4867532 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene CDS 4878027 4878132 0.000000 + 0 gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene exon 4878027 4878132 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene CDS 4886744 4886831 0.000000 + 2 gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene exon 4886744 4886831 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene CDS 4889460 4889602 0.000000 + 1 gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene exon 4889460 4889602 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene CDS 4890740 4890796 0.000000 + 2 gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene exon 4890740 4890796 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene CDS 4891915 4892069 0.000000 + 2 gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene exon 4891915 4892069 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene CDS 4893417 4893563 0.000000 + 0 gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene exon 4893417 4893563 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene CDS 4894934 4895005 0.000000 + 0 gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene exon 4894934 4895005 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene CDS 4896356 4896361 0.000000 + 0 gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene stop_codon 4896362 4896364 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1 mm10_knownGene exon 4896356 4897909 0.000000 + . gene_id "uc011wht.1"; transcript_id "uc011wht.1";Tcea1
RudiC
November 15, 2016, 2:58pm
That specification is not too clear. If I interpreted it correctly, try
awk 'NR == FNR {T["\"" $1 "\";"] = $2; next} $12 in T {sub ($12 ".$", "& " T[$12])} 1' file2 file1
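The trick in that one-liner is that with default whitespace splitting the 12th field still carries its quotes and semicolon, so the lookup key is built to match that exact form. Below is a rough variant of the same idea, not the exact command above: it appends the name by reassigning $0 instead of using sub(), which sidesteps the dots in the ID being treated as regex wildcards. File names and the array name are placeholders, and it assumes the transcript_id value is always field 12.

```shell
# Placeholder sample files; the names are assumptions for this demo.
printf 'uc007aeu.1 Xkr4\nuc011wht.1 Tcea1\n' > map.txt
printf 'chr1\tmm10_knownGene\tCDS\t4857914\t4857976\t0.000000\t+\t0\tgene_id "uc011wht.1"; transcript_id "uc011wht.1";\n' > genes2.gtf

awk '
    # First file: store each name under the ID wrapped as "ID";
    # which is how it appears as a whitespace-separated field.
    NR == FNR { name["\"" $1 "\";"] = $2; next }
    # Second file: field 12 is the quoted transcript_id value.
    # Assigning to $0 appends the name without rebuilding the tabs.
    $12 in name { $0 = $0 " " name[$12] }
    { print }
' map.txt genes2.gtf > appended.gtf

cat appended.gtf
```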
Exactly what I was looking for. Thank you @Scrutinizer