Search strings and highlight them using Perl or bash/awk/sed

Hi,

I have two files: a.doc and b.txt

I want to search a.doc for the strings listed in b.txt and highlight each match in a.doc with different colours, using Perl or bash/awk/sed.

Please guide me. :)
Thanks!!!!!

Try:

perl -lpe 'BEGIN{open b, "b.txt";chomp(@b=<b>)}{for $i (@b) {s/$i/\033[31m$i\033[0m/g}}' a.doc

Thanks :)
I would highly appreciate it if you could explain this script.

I am getting the following error while running the above script (1.pl):

I need some help here so that I can help out :)
With this script I get only a single line, but the correct output is 5 lines. Why?

awk 'NR==FNR {a[$0]=$0;next} {for (i in a) {if ($0~a[i]) print}}' b.txt a.txt
Seq3 TTAAACTTTTTTCAACCCTAATG-----CGGTTTGAACCATTAACC-----------TAAC 48

The correct answer is:

Seq1 -------------------------------TTAAAAAGTTTGAGTTCTAAA---------------- 21
Seq2 -----CTTGGCTCTTTCGTAAGTTTTTCATTAAGGAACTTGAATACACGGTTT----AC- 50
Seq3 TTAAACTTTTTTCAACCCTAATG-----CGGTTTGAACCATTAACC-----------TAAC 48
Seq4 --------GAAAGGAGCGGAGTG-GTCACGTGACAAGTTCTCAGACGCACGTGC--TTGT 49
Seq4 --------GAAAGGAGCGGAGTG-GTCACGTGACAAGTTCTCAGACGCACGTGC--TTGT 49

Running a test like this correctly shows every combination:

awk 'NR==FNR {a[$0]=$0;next} {for (i in a) {print $0,a[i]}}' b.txt a.txt
Seq1 -------------------------------TTAAAAAGTTTGAGTTCTAAA---------------- 21 ACG
Seq1 -------------------------------TTAAAAAGTTTGAGTTCTAAA---------------- 21 TAATG
Seq1 -------------------------------TTAAAAAGTTTGAGTTCTAAA---------------- 21 AAAAAG
Seq1 -------------------------------TTAAAAAGTTTGAGTTCTAAA---------------- 21 GACAAGT
Seq1 -------------------------------TTAAAAAGTTTGAGTTCTAAA---------------- 21 CAAGC
Seq1 -------------------------------TTAAAAAGTTTGAGTTCTAAA---------------- 21 GCTTG
Seq2 -----CTTGGCTCTTTCGTAAGTTTTTCATTAAGGAACTTGAATACACGGTTT----AC- 50 ACG
Seq2 -----CTTGGCTCTTTCGTAAGTTTTTCATTAAGGAACTTGAATACACGGTTT----AC- 50 TAATG
Seq2 -----CTTGGCTCTTTCGTAAGTTTTTCATTAAGGAACTTGAATACACGGTTT----AC- 50 AAAAAG
Seq2 -----CTTGGCTCTTTCGTAAGTTTTTCATTAAGGAACTTGAATACACGGTTT----AC- 50 GACAAGT
Seq2 -----CTTGGCTCTTTCGTAAGTTTTTCATTAAGGAACTTGAATACACGGTTT----AC- 50 CAAGC
Seq2 -----CTTGGCTCTTTCGTAAGTTTTTCATTAAGGAACTTGAATACACGGTTT----AC- 50 GCTTG
Seq3 TTAAACTTTTTTCAACCCTAATG-----CGGTTTGAACCATTAACC-----------TAAC 48 ACG
Seq3 TTAAACTTTTTTCAACCCTAATG-----CGGTTTGAACCATTAACC-----------TAAC 48 TAATG
Seq3 TTAAACTTTTTTCAACCCTAATG-----CGGTTTGAACCATTAACC-----------TAAC 48 AAAAAG
Seq3 TTAAACTTTTTTCAACCCTAATG-----CGGTTTGAACCATTAACC-----------TAAC 48 GACAAGT
Seq3 TTAAACTTTTTTCAACCCTAATG-----CGGTTTGAACCATTAACC-----------TAAC 48 CAAGC
Seq3 TTAAACTTTTTTCAACCCTAATG-----CGGTTTGAACCATTAACC-----------TAAC 48 GCTTG
Seq4 --------GAAAGGAGCGGAGTG-GTCACGTGACAAGTTCTCAGACGCACGTGC--TTGT 49 ACG
Seq4 --------GAAAGGAGCGGAGTG-GTCACGTGACAAGTTCTCAGACGCACGTGC--TTGT 49 TAATG
Seq4 --------GAAAGGAGCGGAGTG-GTCACGTGACAAGTTCTCAGACGCACGTGC--TTGT 49 AAAAAG
Seq4 --------GAAAGGAGCGGAGTG-GTCACGTGACAAGTTCTCAGACGCACGTGC--TTGT 49 GACAAGT
Seq4 --------GAAAGGAGCGGAGTG-GTCACGTGACAAGTTCTCAGACGCACGTGC--TTGT 49 CAAGC
Seq4 --------GAAAGGAGCGGAGTG-GTCACGTGACAAGTTCTCAGACGCACGTGC--TTGT 49 GCTTG

bioinfo, don't put my code in a file. Simply run it in a terminal as I posted it, replacing a.doc and b.txt with the filenames you have (in case they differ from those two).


@bartus11
Here is what I got when I ran your code:

perl -lpe 'BEGIN{open b, "b.txt";chomp(@b=<b>)}{for $i (@b) {s/$i/\033[31m$i\033[0m/g}}' a.txt
Seq1 -------------------------------TTAAAAAGTTTGAGTTCTAAA---------------- 21
Seq2 -----CTTGGCTCTTTCGTAAGTTTTTCATTAAGGAACTTGAATACACGGTTT----AC- 50
Seq3 TTAAACTTTTTTCAACCCTAATG-----CGGTTTGAACCATTAACC-----------TAAC 48
Seq4 --------GAAAGGAGCGGAGTG-GTCACGTGACAAGTTCTCAGACGCACGTGC--TTGT 49

As you can see, it shows the lines with hits but highlights only one hit; look at my post #4.
The OP's request is colour on all the data shown in bold.

For some strange reason this has the same problem: it highlights only one hit.

awk 'NR==FNR {a[$0]=$0;next} {for (i in a) {gsub(a[i],"\033[1;31m&\033[0m",$0)}}1' b.txt a.txt
Seq1 -------------------------------TTAAAAAGTTTGAGTTCTAAA---------------- 21
Seq2 -----CTTGGCTCTTTCGTAAGTTTTTCATTAAGGAACTTGAATACACGGTTT----AC- 50
Seq3 TTAAACTTTTTTCAACCCTAATG-----CGGTTTGAACCATTAACC-----------TAAC 48
Seq4 --------GAAAGGAGCGGAGTG-GTCACGTGACAAGTTCTCAGACGCACGTGC--TTGT 49

Jotne, post the output of:

cat -ev a.txt
cat -ev b.txt

s#%#%&E#&%!!!

There were trailing spaces after some of the strings in b.txt.
Since I just copied it from post #1, I did not check.

cat -ev b.txt
GACAAGT $
AAAAAG $
TAATG$
CAAGC$
ACG $
GCTTG$
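
To guard against this kind of invisible trailing whitespace, the patterns could be trimmed while they are loaded. The following is only a sketch under assumptions: the function name highlight_trimmed and the demo files are made up, and it also strips carriage returns and skips empty lines, which nobody asked for here.

```shell
# Made-up demo data: two of the patterns carry a trailing space,
# as in the cat -ev output above.
tmp=$(mktemp -d)
printf 'GACAAGT \nTAATG\nACG \n' > "$tmp/b.txt"
printf 'Seq4 GTCACGTGACAAGTTC 49\n' > "$tmp/a.txt"

highlight_trimmed() {
    awk '
        # First file: strip trailing blanks, tabs and CRs, then store the
        # pattern as an array index. Empty lines are skipped.
        NR==FNR { sub(/[ \t\r]+$/, ""); if ($0 != "") b[$0] = 1; next }
        # Second file: highlight every occurrence of every pattern.
        { for (p in b) gsub(p, "\033[1;31m&\033[0m") }
        1
    ' "$1" "$2"
}

highlight_trimmed "$tmp/b.txt" "$tmp/a.txt"
```

With the trimming in place, both ACG and GACAAGT are highlighted in the demo line even though b.txt still contains the stray spaces.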

Now both awk and perl work fine.
Thanks.

awk 'NR==FNR {a[$0]=$0;next} {for (i in a) {gsub(a[i],"\033[1;31m&\033[0m",$0)}}1' b.txt a.txt
Seq1 -------------------------------TTAAAAAGTTTGAGTTCTAAA---------------- 21
Seq2 -----CTTGGCTCTTTCGTAAGTTTTTCATTAAGGAACTTGAATACACGGTTT----AC- 50
Seq3 TTAAACTTTTTTCAACCCTAATG-----CGGTTTGAACCATTAACC-----------TAAC 48
Seq4 --------GAAAGGAGCGGAGTG-GTCACGTGACAAGTTCTCAGACGCACGTGC--TTGT 49

Thanks bartus and Jotne. :)
Can you please explain the code?

Thanks.

awk '	NR==FNR {a[$0]=$0;next}
 	{for (i in a) 
		{gsub(a[i],"\033[1;31m&\033[0m",$0)}
	}1
	' b.txt a.txt

NR==FNR {a[$0]=$0;next}
NR==FNR This is a technique for doing something on the first file only when several files are listed, in this case b.txt. It is true only while the first file is being read.
a[$0]=$0 Stores every record of b.txt in an array named a, e.g. a[GACAAGT]=GACAAGT
next Skips the rest of the code for that record, so the highlighting part runs only for a.txt.
for (i in a) For every element in array a (the strings from b.txt), test it against a.txt
gsub(a[i],"\033[1;31m&\033[0m",$0) Tests every element in array a against the line $0 from a.txt; if it is found, the matched text is replaced with itself (&) wrapped in ANSI colour codes.
E.g. if AGC is found, it is replaced with \033[1;31mAGC\033[0m, which the terminal shows as AGC in bold red.
The 1 at the end then prints all lines from a.txt, with colour added to every match.
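
To see why NR==FNR picks out the first file, it can help to print both counters for each record. This is just an illustrative sketch; the function name show_counters and the demo files are made up.

```shell
# Made-up demo data: 2 records in b.txt, 3 records in a.txt.
tmp=$(mktemp -d)
printf 'ACG\nTAATG\n' > "$tmp/b.txt"
printf 'Seq1\nSeq2\nSeq3\n' > "$tmp/a.txt"

show_counters() {
    # NR counts records across all files; FNR restarts at 1 for each file.
    # They are equal only while the first file is being read.
    ( cd "$1" && awk '{ print FILENAME, "NR=" NR, "FNR=" FNR, (NR==FNR ? "first file" : "later file") }' b.txt a.txt )
}

show_counters "$tmp"
```

For b.txt both counters run 1, 2. For a.txt NR continues with 3, 4, 5 while FNR starts again at 1, so the test is false there.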

Edit: the $0 at the end of gsub is not needed, since $0 is the default target.
I also changed the array name to b, to make it clearer that it stores the content of b.txt.

awk 'NR==FNR {b[$0]=$0;next} {for (i in b) {gsub(b[i],"\033[1;31m&\033[0m")}}1' b.txt a.txt

Edit2: there is no need to store each line of b.txt as both the value and the index of array b; the index alone is enough.

awk 'NR==FNR {b[$0]++;next} {for (i in b) {gsub(i,"\033[1;31m&\033[0m")}}1' b.txt a.txt
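
The original question asked for different colours. One possible way, offered only as a sketch, is to give the n-th string of b.txt the n-th ANSI foreground colour. The function name highlight_multicolour, the array name colour, the cycling through codes 31 to 36, and the demo files are all assumptions. It also relies on the search strings never matching text inside the escape sequences themselves, which holds for DNA strings like these.

```shell
# Made-up demo data.
tmp=$(mktemp -d)
printf 'ACG\nTAATG\n' > "$tmp/b.txt"
printf 'Seq3 CCCTAATG--CGG 48\nSeq4 GTCACGTG 49\n' > "$tmp/a.txt"

highlight_multicolour() {
    awk '
        # First file: map each pattern to a colour code, cycling through
        # 31 (red) to 36 (cyan) in the order the patterns appear.
        NR==FNR { if ($0 != "") colour[$0] = 31 + (n++ % 6); next }
        # Second file: wrap each match in the colour of its pattern.
        { for (p in colour) gsub(p, "\033[1;" colour[p] "m&\033[0m") }
        1
    ' "$1" "$2"
}

highlight_multicolour "$tmp/b.txt" "$tmp/a.txt"
```

In the demo, ACG (first in b.txt) comes out red and TAATG (second) green. With more than six strings the colours repeat.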

Thanks a lot. :)
I will try them.