Hi,
I need help to match patterns from between two different files and extract region of strings.
inputfile1.fa
>l-WR24-1:1
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGG
GGCGGAGGGCGACGGCGGGTGGTGAGCGGCCCGGGAGGGGCCGGGCGGTGGGGTCACGTG
CGGCGGGCGGGGCGCGGGCTGACCCAGCTGTGCCCGCAGGCGCCCGGGCGGGCCGGGGGC
CGTGGCGGAGGAGGAGCGCTGCACGGTGGAGCGTCGGGCCGACCTCACCTACGCGGAGTT
CTACCACAAAGTGGACTTGCCCTTCCAGGAGTATGTGGAGCAGCTGCTGCACCCCCAGGA
CCCCACCTCCCTGGGCAATGACACCCTGTACTTCTTCGGGGACAACAACTTCACCGAGTG
GGCCTCTCTCTTTCGGCACTACTCCCCACCCCCATTTGGCCTGCTGGGAACCGCTCCAG
>l-ZF385A-2:1
CAGAATGTGGGTGAGGGTGGCGCCTATGAGGCTGAGCTTCGGGTCACCGCCCCTCCAGAG
GCTGAGTACTCAGGACTCGTCAGACACCCAGGGGTGAGATGAGACTCTCGAGTGGGATTT
GGGAGGATACCCCTCTAGAGGGGACACCAAAACCTGACCAGTGCCCACCCCATCTCCAGA
ACTTCTCCAGCCTGAGCTGTGACTACTTTGCCGTGAACCAGAGCCGCCTGCTGGTGTGTG
ACCTGGGCAACCCCATGAAGGCAGGAGCCAGTCTGTGGGGTGGCCTTCGGTTTACAGTCC
CTCATCTCCGGGACACTAAGAAAACCATCCAGTTTGACTTCCAGATCCTCAGGTAGGGAG
TGAGTGTGTCTAGGCTGGGGCTGAGCTGGGGACGGAAGGGAGGGCTGGGCGCCATTCTCA
CTGGCTGCACTCCAGCACCTCAGTCTTGCCTCCATCCCACAGCAAGAATCTCAACAACTC
GCAAAGCGACGTGGTTTCCTTTCGGCTCTCCGTGGAGGCTCAGGCCCAGGTCACCCTGAA
CGGGTCAGTGCCAGGCAAAATGGGGTCT
>l-YJC-1:1
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACG
CGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGT
CAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCACAGCATCCCCACGGG
CCTCCACGCCAACCTGTCCGAGGGCCGCCCCGTGGGTCCGGCCCGCCGTGGCGCCTCATC
GCTGCTCGGCCCGG
inputfile2.txt
l-WR24-1:1 1 71
l-WR73-7:4 28 506
l-WR86-1:1 140 138
l-YJC-1:1 1 161
l-YJC-1:1 1 165
l-ZFP57-11:1 1 991
l-ZF320-1:5 5 6031
l-ZMYND10-2:1 5 253
l-ZF329-4:1 151 5704
l-ZF708-1:1 195 3744
l-ZF843-3:1 14 1053
l-ZF385A-2:1 33 105
l-ZF843-3:2 4 235
The output file as below. It should contain the strings extracted from inputfile1.fa according to start and stop numbers (in blue) indicated in inputfile2.txt as shown above.
>l-WR24-1:1 1 71
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGG
GGCGGAGGGCG
>l-ZF385A-2:1 33 105
TGAGCTTCGGGTCACCGCCCCTCCAGAGGCTGAGTACTCAGGACTCGTCAGACACCCAGG
GGTGAGATGAGAC
>l-YJC-1:1 1 161
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACG
CGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGT
CAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCA
>l-YJC-1:1 1 165
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACG
CGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGT
CAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCA
I have tried to match these two files using bioinformatics tools like seqtk and bedtools but they did not give me what I want and some of their data requirements just don't fit mine. I am still learning awk and I want to do this in awk. Below are one of the codes that I did to match the two files but failed and I do not know how to extract certain regions of characters.
awk 'FNR == NR{A[$1]; next}/^\>/{f = (substr($1,2) in A) ? 1 : 0}f' inputfile2.txt inputfile1.fa
Would appreciate your kind help. thanks
a bit verbose, but a possible starter.
awk -f bunny.awk inputfile2.txt inputfile1.fa
where bunny.awk
is:
function printRec() {
#print a[f], s[f], e[f]
for ( i in s) {
split(i,t, OFS)
if (f == ">" t[1])
print ">" i ORS substr(a[f],s,e-s+1)
}
f=""
split("",a)
}
FNR==NR {
idx=$1 OFS $2 OFS $3
s[idx]=$2
e[idx]=$3
next
}
/>/ && f {
printRec()
}
f { a[f]=(f in a)?a[f] $1:$1 }
/^>/ { f=$1 }
END { printRec() }
results in:
>l-WR24-1:1 1 71
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGGGGCGGAGGGCG
>l-ZF385A-2:1 33 105
TGAGCTTCGGGTCACCGCCCCTCCAGAGGCTGAGTACTCAGGACTCGTCAGACACCCAGGGGTGAGATGAGAC
>l-YJC-1:1 1 161
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCA
>l-YJC-1:1 1 165
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCA
1 Like
RudiC
December 9, 2019, 1:59pm
3
Try also
awk '
NR==FNR {PAT[$1,$2,$3]
next
}
{IX = $1
L1 = length ($1) + 1
$1 = $1 "|"
$0 = $0
for (p in PAT) {split (p, T)
if (IX == T[1]) print RS p ORS substr ($0, T[2]+L1, T[3]-T[2]+1)
}
}
' SUBSEP="\t" inputfile2.txt RS=">" OFS="" inputfile1.fa
>l-WR24-1:1 1 71
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGGGGCGGAGGGCG
>l-ZF385A-2:1 33 105
TGAGCTTCGGGTCACCGCCCCTCCAGAGGCTGAGTACTCAGGACTCGTCAGACACCCAGGGGTGAGATGAGAC
>l-YJC-1:1 1 161
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCA
>l-YJC-1:1 1 165
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCA
If you really really need the output lines 60 chars in length, use
if (IX == T[1]) {print RS p
TMP = substr ($0, T[2]+L1, T[3]-T[2]+1)
PTR = 1
while (PTR < length (TMP)) {print substr (TMP, PTR, 60)
PTR += 60
}
}
3 Likes
if you need to "fold" the lines at 60 char width, pipe the output to fold
:
awk .... | fold -w 60
2 Likes
vgersh99:
a bit verbose, but a possible starter.
awk -f bunny.awk inputfile2.txt inputfile1.fa
where bunny.awk
is:
function printRec() {
#print a[f], s[f], e[f]
for ( i in s) {
split(i,t, OFS)
if (f == ">" t[1])
print ">" i ORS substr(a[f],s,e-s+1)
}
f=""
split("",a)
}
FNR==NR {
idx=$1 OFS $2 OFS $3
s[idx]=$2
e[idx]=$3
next
}
/>/ && f {
printRec()
}
f { a[f]=(f in a)?a[f] $1:$1 }
/^>/ { f=$1 }
END { printRec() }
results in:
>l-WR24-1:1 1 71
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGGGGCGGAGGGCG
>l-ZF385A-2:1 33 105
TGAGCTTCGGGTCACCGCCCCTCCAGAGGCTGAGTACTCAGGACTCGTCAGACACCCAGGGGTGAGATGAGAC
>l-YJC-1:1 1 161
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCA
>l-YJC-1:1 1 165
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCA
Hi vgersh99,
Your codes work like charm. Thanks a million.
--- Post updated at 03:59 AM ---
rudic:
Try also
awk '
NR==FNR {PAT[$1,$2,$3]
next
}
{IX = $1
L1 = length ($1) + 1
$1 = $1 "|"
$0 = $0
for (p in PAT) {split (p, T)
if (IX == T[1]) print RS p ORS substr ($0, T[2]+L1, T[3]-T[2]+1)
}
}
' SUBSEP="\t" inputfile2.txt RS=">" OFS="" inputfile1.fa
>l-WR24-1:1 1 71
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGGGGCGGAGGGCG
>l-ZF385A-2:1 33 105
TGAGCTTCGGGTCACCGCCCCTCCAGAGGCTGAGTACTCAGGACTCGTCAGACACCCAGGGGTGAGATGAGAC
>l-YJC-1:1 1 161
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCA
>l-YJC-1:1 1 165
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCA
If you really really need the output lines 60 chars in length, use
if (IX == T[1]) {print RS p
TMP = substr ($0, T[2]+L1, T[3]-T[2]+1)
PTR = 1
while (PTR < length (TMP)) {print substr (TMP, PTR, 60)
PTR += 60
}
}
Hi RudiC,
Your codes work great on my real data too. Thanks a million
--- Post updated at 04:43 AM ---
Ok, got it... thanks