sed replace file contents by reading from another file

jacobs.smith · November 15, 2016, 2:23pm

Hello,

My input file1 is tab-delimited and looks like this:

chr1	mm10_knownGene	stop_codon	3216022	3216024	0.000000	-	.	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1	mm10_knownGene	CDS	3216025	3216968	0.000000	-	2	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1	mm10_knownGene	exon	3214482	3216968	0.000000	-	.	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1	mm10_knownGene	CDS	3421702	3421901	0.000000	-	1	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1	mm10_knownGene	exon	3421702	3421901	0.000000	-	.	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1	mm10_knownGene	CDS	3670552	3671348	0.000000	-	0	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1	mm10_knownGene	start_codon	3671346	3671348	0.000000	-	.	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1	mm10_knownGene	exon	3670552	3671498	0.000000	-	.	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
chr1	mm10_knownGene	start_codon	4857914	4857916	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	CDS	4857914	4857976	0.000000	+	0	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	exon	4857694	4857976	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	CDS	4867470	4867532	0.000000	+	0	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	exon	4867470	4867532	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	CDS	4878027	4878132	0.000000	+	0	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	exon	4878027	4878132	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	CDS	4886744	4886831	0.000000	+	2	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	exon	4886744	4886831	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	CDS	4889460	4889602	0.000000	+	1	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	exon	4889460	4889602	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	CDS	4890740	4890796	0.000000	+	2	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	exon	4890740	4890796	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	CDS	4891915	4892069	0.000000	+	2	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	exon	4891915	4892069	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	CDS	4893417	4893563	0.000000	+	0	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	exon	4893417	4893563	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	CDS	4894934	4895005	0.000000	+	0	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	exon	4894934	4895005	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	CDS	4896356	4896361	0.000000	+	0	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	stop_codon	4896362	4896364	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; 
chr1	mm10_knownGene	exon	4896356	4897909	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1";

My input file2 looks like this:

uc007aeu.1	Xkr4
uc011wht.1	Tcea1

Now I want to replace the IDs that follow gene_id and transcript_id in file1 with the matching second-column value from file2. I tried splitting out the columns and joining on them, but join requires sorted input and I do NOT want the row order of file1 to change, so I could not get the output I need. Any ideas are highly appreciated.

Please note that the input file row order should not be changed.

Thanks

Scrutinizer · November 15, 2016, 2:52pm

Not sure what you mean exactly. Perhaps something like this?

awk 'NR==FNR{A[$1]=$2; next} $2 in A{$2=$4=A[$2]}1' FS='\t' file2 FS=\" OFS=\" file1

Output:

chr1	mm10_knownGene	stop_codon	3216022	3216024	0.000000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; 
chr1	mm10_knownGene	CDS	3216025	3216968	0.000000	-	2	gene_id "Xkr4"; transcript_id "Xkr4"; 
chr1	mm10_knownGene	exon	3214482	3216968	0.000000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; 
chr1	mm10_knownGene	CDS	3421702	3421901	0.000000	-	1	gene_id "Xkr4"; transcript_id "Xkr4"; 
chr1	mm10_knownGene	exon	3421702	3421901	0.000000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; 
chr1	mm10_knownGene	CDS	3670552	3671348	0.000000	-	0	gene_id "Xkr4"; transcript_id "Xkr4"; 
chr1	mm10_knownGene	start_codon	3671346	3671348	0.000000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; 
chr1	mm10_knownGene	exon	3670552	3671498	0.000000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; 
chr1	mm10_knownGene	start_codon	4857914	4857916	0.000000	+	.	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	CDS	4857914	4857976	0.000000	+	0	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	exon	4857694	4857976	0.000000	+	.	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	CDS	4867470	4867532	0.000000	+	0	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	exon	4867470	4867532	0.000000	+	.	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	CDS	4878027	4878132	0.000000	+	0	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	exon	4878027	4878132	0.000000	+	.	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	CDS	4886744	4886831	0.000000	+	2	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	exon	4886744	4886831	0.000000	+	.	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	CDS	4889460	4889602	0.000000	+	1	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	exon	4889460	4889602	0.000000	+	.	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	CDS	4890740	4890796	0.000000	+	2	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	exon	4890740	4890796	0.000000	+	.	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	CDS	4891915	4892069	0.000000	+	2	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	exon	4891915	4892069	0.000000	+	.	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	CDS	4893417	4893563	0.000000	+	0	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	exon	4893417	4893563	0.000000	+	.	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	CDS	4894934	4895005	0.000000	+	0	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	exon	4894934	4895005	0.000000	+	.	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	CDS	4896356	4896361	0.000000	+	0	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	stop_codon	4896362	4896364	0.000000	+	.	gene_id "Tcea1"; transcript_id "Tcea1"; 
chr1	mm10_knownGene	exon	4896356	4897909	0.000000	+	.	gene_id "Tcea1"; transcript_id "Tcea1";
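
In case it helps later readers, the one-liner above can be unpacked roughly as follows. This is only a sketch of the same idea: the function name replace_ids and the array name name are made up for illustration, and it assumes file2 is tab-separated with the ID in column 1, and that the gene_id value is the first double-quoted string on each line of file1 and the transcript_id value the second.

```shell
# Hypothetical wrapper around the lookup-and-replace shown above.
# $1 = two-column map file (ID <tab> name), $2 = annotation file to rewrite.
replace_ids() {
    awk '
        # First file only (NR == FNR): remember the name for each ID.
        NR == FNR { name[$1] = $2; next }

        # Second file: fields are split on double quotes, so $2 holds the
        # gene_id value and $4 the transcript_id value. Replace both when
        # the gene_id is known; lines with unknown IDs are left untouched.
        $2 in name { $2 = $4 = name[$2] }

        # Print every line in its original order, changed or not.
        { print }
    ' FS='\t' "$1" FS='"' OFS='"' "$2"
}
```

It would be called as replace_ids file2 file1 > file1.renamed (the output file name is arbitrary). Because file1 is read once from top to bottom and nothing is sorted, the row order stays as it was.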

--
Or did you mean:

awk 'NR==FNR{A[$1]=$2; next} $2 in A{$0=$0 A[$2]}1' FS='\t' file2 FS=\" file1

chr1	mm10_knownGene	stop_codon	3216022	3216024	0.000000	-	.	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1	mm10_knownGene	CDS	3216025	3216968	0.000000	-	2	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1	mm10_knownGene	exon	3214482	3216968	0.000000	-	.	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1	mm10_knownGene	CDS	3421702	3421901	0.000000	-	1	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1	mm10_knownGene	exon	3421702	3421901	0.000000	-	.	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1	mm10_knownGene	CDS	3670552	3671348	0.000000	-	0	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1	mm10_knownGene	start_codon	3671346	3671348	0.000000	-	.	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1	mm10_knownGene	exon	3670552	3671498	0.000000	-	.	gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; Xkr4
chr1	mm10_knownGene	start_codon	4857914	4857916	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	CDS	4857914	4857976	0.000000	+	0	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	exon	4857694	4857976	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	CDS	4867470	4867532	0.000000	+	0	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	exon	4867470	4867532	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	CDS	4878027	4878132	0.000000	+	0	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	exon	4878027	4878132	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	CDS	4886744	4886831	0.000000	+	2	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	exon	4886744	4886831	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	CDS	4889460	4889602	0.000000	+	1	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	exon	4889460	4889602	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	CDS	4890740	4890796	0.000000	+	2	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	exon	4890740	4890796	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	CDS	4891915	4892069	0.000000	+	2	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	exon	4891915	4892069	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	CDS	4893417	4893563	0.000000	+	0	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	exon	4893417	4893563	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	CDS	4894934	4895005	0.000000	+	0	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	exon	4894934	4895005	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	CDS	4896356	4896361	0.000000	+	0	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	stop_codon	4896362	4896364	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1"; Tcea1
chr1	mm10_knownGene	exon	4896356	4897909	0.000000	+	.	gene_id "uc011wht.1"; transcript_id "uc011wht.1";Tcea1

RudiC · November 15, 2016, 2:58pm

That specification is not too clear. If I interpreted it correctly, try

awk 'NR == FNR {T["\"" $1 "\";"] = $2; next} $12 in T {sub ($12 ".$", "& " T[$12])} 1' file2 file1
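
One thing to be aware of: the pattern $12 ".$" needs exactly one character after the final ";" (the trailing space in the sample), so a line that ends right at the ";" is left unchanged. A possible variant that does not depend on that is sketched below. It is not the command above, just an illustration of the same lookup: the function name append_name and the array name name are made up, and it assumes whitespace splitting still puts the quoted transcript_id value in field 12.

```shell
# Hypothetical variant: append the name from the map file to each matching
# line, whether or not the line ends in trailing blanks.
# $1 = two-column map file (ID, name), $2 = annotation file.
append_name() {
    awk '
        # First file: key the lookup on the ID as it appears in field 12,
        # that is, wrapped in double quotes and followed by a semicolon.
        NR == FNR { name["\"" $1 "\";"] = $2; next }

        # Second file: strip trailing blanks, then append one space and
        # the looked-up name to the end of the line.
        $12 in name {
            sub(/[ \t]*$/, "")
            $0 = $0 " " name[$12]
        }

        # Print every line in its original order.
        { print }
    ' "$1" "$2"
}
```

It would be called as append_name file2 file1. Note that this variant puts a single space before the appended name on every matching line.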

jacobs.smith · November 15, 2016, 3:03pm

Exactly what I was looking for. Thank you @Scrutinizer