Remove or rename based on contents of file

cmccabe · May 9, 2015, 2:11pm

I am trying to use the two files shown below to either remove or rename entries in one of them. If $5 in file1.txt matches $5 in file2.txt and $1 of file1.txt is not "No Match", then the $1 value from file1.txt replaces every occurrence of the matched value in $1 and $5 of file2.txt. If, however, $1 of file1.txt is "No Match", then the row in file2.txt containing that $5 value and the row below it are removed. Thank you :).

Contents of file1.txt

file1.txt
No Match	chr1	35696	36106	DTE3504500000004
PXL-A0000005	chr1	69066	69311	DTE3504500000005

Contents of file2.txt

RefPrimer	ref	antiref	omosome	PrimerSet	SeqRxn
AntirefPrimer	antiref	ref	omosome		
DTE3504500000001ref	34529	35031	1	DTE3504500000001	SeqRxn4
DTE3504500000001antiref	35031	34529	1		
DTE3504500000002ref	35032	35283	1	DTE3504500000002	SeqRxn4
DTE3504500000002antiref	35283	35032	1		
DTE3504500000003ref	35284	35506	1	DTE3504500000003	SeqRxn4
DTE3504500000003antiref	35506	35284	1		
DTE3504500000004ref	35696	36106	1	DTE3504500000004	SeqRxn4
DTE3504500000004antiref	36106	35696	1		
DTE3504500000004ref	69066	69311	1	DTE3504500000004	SeqRxn4
DTE3504500000004antiref	69311	69066	1

For example,
"DTE3504500000004" is the value of $5 in file1.txt, and it matches $5 in row 3 of file2.txt. Since the value in $1 of file1.txt is "No Match", rows 3 and 4 are removed from file2.txt.

"DTE3504500000005" is the value of $5 in file1.txt, and it matches $5 in row 9 of file2.txt. Since the value in $1 of file1.txt is not "No Match" but "PXL-A0000005", that new value replaces all occurrences of the old value.

Desired output:

RefPrimer	ref	antiref	omosome	PrimerSet	SeqRxn
AntirefPrimer	antiref	ref	omosome		
	(rows 3 and 4 removed)
DTE3504500000002ref	35032	35283	1	DTE3504500000002	SeqRxn4
DTE3504500000002antiref	35283	35032	1		
DTE3504500000003ref	35284	35506	1	DTE3504500000003	SeqRxn4
DTE3504500000003antiref	35506	35284	1		
DTE3504500000004ref	35696	36106	1	DTE3504500000004	SeqRxn4
PXL-A0000005ref	69066	69311	1	PXL-A0000005	SeqRxn4
PXL-A0000005antiref	69311	69066	1

Don_Cragun · May 10, 2015, 9:23pm

I'm lost.

The first line in file1.txt has DTE3504500000004 in field 5. From your description (with the 1st field on that line being No Match ), the last four lines of file2.txt should have been removed; not the 3rd and 4th lines.

The second line in file1.txt has DTE3504500000005 in field 5. Since that string does not appear in file2.txt , why should anything in file2.txt be changed because of that line?
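
A quick way to confirm this kind of mismatch is to list each field 5 value from file1.txt and report whether it occurs in field 5 of file2.txt. The following is only a sketch: it assumes both files are tab-separated, and it builds abbreviated copies of the posted data in a scratch directory (the names workdir and report are made up for the illustration).

```shell
# Scratch directory so no real files are touched (workdir is an illustrative name).
workdir=$(mktemp -d)

# file1.txt as posted, assumed tab-separated.
printf 'No Match\tchr1\t35696\t36106\tDTE3504500000004\n' > "$workdir/file1.txt"
printf 'PXL-A0000005\tchr1\t69066\t69311\tDTE3504500000005\n' >> "$workdir/file1.txt"

# Abbreviated file2.txt: the header line and the last four data lines as posted.
{
  printf 'RefPrimer\tref\tantiref\tomosome\tPrimerSet\tSeqRxn\n'
  printf 'DTE3504500000004ref\t35696\t36106\t1\tDTE3504500000004\tSeqRxn4\n'
  printf 'DTE3504500000004antiref\t36106\t35696\t1\t\t\n'
  printf 'DTE3504500000004ref\t69066\t69311\t1\tDTE3504500000004\tSeqRxn4\n'
  printf 'DTE3504500000004antiref\t69311\t69066\t1\n'
} > "$workdir/file2.txt"

# Read file2.txt first to collect its field 5 values, then test each key of file1.txt.
report=$(awk -F'\t' '
NR == FNR { seen[$5] = 1; next }
{ print $5 ": " (($5 in seen) ? "found" : "missing") }
' "$workdir/file2.txt" "$workdir/file1.txt")

printf '%s\n' "$report"
rm -rf "$workdir"
```

With the data as first posted, this reports DTE3504500000004 as found and DTE3504500000005 as missing.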

cmccabe · May 11, 2015, 1:20pm

I hope this is clearer:
I am trying to use the two files shown below to either remove or rename entries in one of them. If $5 in combine.txt matches $5 in output.txt and $1 of combine.txt is not "No Match", then the $1 value from combine.txt replaces every occurrence of the matched value in $1 and $5 of output.txt. If, however, $1 of combine.txt is "No Match", then the row in output.txt containing that $5 value and the row below it are removed. Thank you :).

For example,
"DTE3504500000004" is the value of $5 in combine.txt, and it matches $5 in row 9 of output.txt. Since the value in $1 of combine.txt is "No Match", rows 9 and 10 are removed from output.txt.

"DTE3504500000005" is the value of $5 in combine.txt, and it matches $5 in row 11 of output.txt. Since the value in $1 of combine.txt is not "No Match" but "PXL-A0000005", that new value replaces all occurrences of the old value in output.txt.

file1.txt
No Match    chr1    35696    36106    DTE3504500000004
PXL-A0000005    chr1    69066    69311    DTE3504500000005

Initial output.txt:
RefPrimer    ref    antiref    omosome    PrimerSet    SeqRxn
AntirefPrimer    antiref    ref    omosome        
DTE3504500000001ref    34529    35031    1    DTE3504500000001    SeqRxn4
DTE3504500000001antiref    35031    34529    1        
DTE3504500000002ref    35032    35283    1    DTE3504500000002    SeqRxn4
DTE3504500000002antiref    35283    35032    1        
DTE3504500000003ref    35284    35506    1    DTE3504500000003    SeqRxn4
DTE3504500000003antiref    35506    35284    1        
DTE3504500000004ref    35696    36106    1    DTE3504500000004    SeqRxn4
DTE3504500000004antiref    36106    35696    1        
DTE3504500000005ref    69066    69311    1    DTE3504500000005    SeqRxn4
DTE3504500000005antiref    69311    69066    1

Desired output.txt:
RefPrimer    ref    antiref    omosome    PrimerSet    SeqRxn
AntirefPrimer    antiref    ref    omosome        
DTE3504500000001ref    34529    35031    1    DTE3504500000001    SeqRxn4
DTE3504500000001antiref    35031    34529    1        
DTE3504500000002ref    35032    35283    1    DTE3504500000002    SeqRxn4
DTE3504500000002antiref    35283    35032    1        
DTE3504500000003ref    35284    35506    1    DTE3504500000003    SeqRxn4
DTE3504500000003antiref    35506    35284    1        
PXL-A0000005ref    69066    69311    1    PXL-A0000005    SeqRxn4
PXL-A0000005antiref    69311    69066    1

Don_Cragun · May 11, 2015, 5:25pm

Assuming that when you said:

file1.txt
No Match    chr1    35696    36106    DTE3504500000004
PXL-A0000005    chr1    69066    69311    DTE3504500000005

is what is in combine.txt , you really meant that the file you referred to as combine.txt is actually named file1.txt (rather than that the first line of combine.txt contains the text file1.txt ), then maybe something like:

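awk -F'\t' '
# First file (file1.txt): map the ID in field 5 to the value in field 1.
NR == FNR {
	r[$5] = $1
	next
}
# Second file, past the two header lines: a line whose field 5 is a known ID
# starts a two-line group (this line and the one below it).
FNR > 2 && m <= 0 && $5 in r {
	p = $5
	m = 2
}
# For each line in the group: drop it if the ID maps to "No Match",
# otherwise replace every occurrence of the old ID with the new value.
m-- > 0 {
	if(r[p] == "No Match")
		next
	gsub(p, r[p])
}
# Print every line that was not dropped.
1' file1.txt output.txt > output.$$ && cp output.$$ output.txt && rm -f output.$$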

will do what you want.

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk .
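
To try the script without touching real data, something like the following sketch can be used. It assumes the real files are tab-separated (as -F'\t' implies), builds file1.txt and an abbreviated output.txt from the sample data in a scratch directory, and writes the result to standard output instead of replacing output.txt. The names workdir and result are made up for the illustration.

```shell
# Scratch directory so no real files are touched (workdir is an illustrative name).
workdir=$(mktemp -d)

# file1.txt from the sample data, assumed tab-separated.
printf 'No Match\tchr1\t35696\t36106\tDTE3504500000004\n' > "$workdir/file1.txt"
printf 'PXL-A0000005\tchr1\t69066\t69311\tDTE3504500000005\n' >> "$workdir/file1.txt"

# Abbreviated output.txt: the two header lines and the last three ref/antiref pairs.
{
  printf 'RefPrimer\tref\tantiref\tomosome\tPrimerSet\tSeqRxn\n'
  printf 'AntirefPrimer\tantiref\tref\tomosome\t\t\n'
  printf 'DTE3504500000003ref\t35284\t35506\t1\tDTE3504500000003\tSeqRxn4\n'
  printf 'DTE3504500000003antiref\t35506\t35284\t1\t\t\n'
  printf 'DTE3504500000004ref\t35696\t36106\t1\tDTE3504500000004\tSeqRxn4\n'
  printf 'DTE3504500000004antiref\t36106\t35696\t1\t\t\n'
  printf 'DTE3504500000005ref\t69066\t69311\t1\tDTE3504500000005\tSeqRxn4\n'
  printf 'DTE3504500000005antiref\t69311\t69066\t1\n'
} > "$workdir/output.txt"

# Same awk program as above, but the result is captured rather than copied over output.txt.
result=$(awk -F'\t' '
NR == FNR { r[$5] = $1; next }
FNR > 2 && m <= 0 && $5 in r { p = $5; m = 2 }
m-- > 0 {
	if (r[p] == "No Match")
		next
	gsub(p, r[p])
}
1' "$workdir/file1.txt" "$workdir/output.txt")

printf '%s\n' "$result"
rm -rf "$workdir"
```

On this sample the DTE3504500000004 pair is dropped and the DTE3504500000005 pair comes out as PXL-A0000005ref and PXL-A0000005antiref. Note that gsub() treats its first argument as a regular expression; the IDs here contain no regex metacharacters, so they match literally, but IDs containing characters such as . or + would need extra care.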

cmccabe · May 12, 2015, 12:39pm

Works great.... thank you :).

ampsys · May 13, 2015, 1:33am

Holy shit.