please note that ----- is not present in actual file i just wrote to make the columns visible otherwise it is a tab separated file.
It has three columns-
column1 contains patterns from FILE1
column 2 contains the start position to cut the string
and column3 contains end position til the string has to be cut
WHERE POSITION 1 STARTS FROM THE LINE CONTAINING ALL THE ALPHABETS IN CAPS WITHOUT ANY SPACE AND ENDS WHERE NEXT '>' SYMBOL APPEARS
I want to have a new file that contains header(LINE BEGINING WITH '>' SYMBOL) maching first column of FILE2 and then only a substring
for eg- in case of first record of FILE2
The ouput file should contain-
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra] (20-32)
SVGIQGIDYPSV
and so on------ for all the other records of FILE2
Your contribution is highly appretiated.
Thanks
Smriti
hey! this is wht I ws exactly looking for. Thank you so much Jean.
you know what Jean? It didn't work initially. Then I removed the field separator - tab to keep the space as default value and I also changed it in FILE2. Then it worked perfectly.
Before also I noticed that ths field separator value does't work in my shell. Tell me if you know why this can happen.
I run that script on files which need to extract longer substrings and the format of output file is getting disturbed. I tried some options of printf but coudn't correct it. Please help me out.
where as the output should be like this:
>gi|37779709|gb|AAP20876.1| lectin [Pinellia ternata] (28-264)
NYLLSGETLDTDGHLKNGDFDFIMQEDCNAVLYNGNWQSNTANKGRDCKLTLTDRGELVINNGEGSAVWR
SGSQSAKGNYAAVLHPEGKLVIYGPSVFKINPWVPGLNSLRLGNVPFTCNMLFSGQVLYGDGKITARNHML
VMQGDCNLVLYGGKCDWQSNTHGNGEHCFLRLNHKGELIIKDDDFKSIWSSQSSSKQGDYVFILQDNGY
GVIYGPAIWATSSKRSVAAQETMIGM
>gi|146403769|gb|ABQ32294.1| lectin [Colocasia esculenta] (27-224)
NYLLSGQTLNTDGHLKNGDFDLVMQNDCNLVLYNGNWQSNTANNGRDCKLTLTDYGELVIKNGDGSTVWR
SRAKSVKGNYAAVLHPDGRLVVFGPSVFKIDPWVPGLNSLRFRNIPFTDNLLFSGQVLYGDGRLTAKNHQLV
MQGDCNLVLYGGKYGWQSNTHGNGEHCFLRLNHKGELIIKDDDFRPSGAAVPAPS
where each line after header line should not have more than 70 characters.
Should the overlong lines be truncated or folder over multiple lines, or what?
To truncate (this truncates to eight characters; feel free to add more dots):
sed -e 's/^\(........\).*/\1/'
To fold, have a look at the fold command, or maybe use something like
sed -e 's/\(.......\)\(.\)/\1\
\2/g'
(Yes, that's a literal newline, to do the folding.) Again, this example folds at every eight characters; feel free to add more dots. (Actually the first line will be one character shorter. I'm too lazy to fix that. Or rather, the fix will depend on your sed dialect, and I don't want to go there.)
This could just as well be done with awk or perl or the truncation even with cut.
I don't want to truncate it and cut will also do that only. one more thing is that i want to apply ths folding only on lines after the line with '>' symbol.
and
sed -e 's/\(.......\)\(.\)/\1\
\2/g'
is giving a vague output ---
It is giving a / after every eight characters.
I gave the command as
sed -e 's/\(.......\)\(.\)/\1\\2/g' filename
I tried
perl -pe 's/(.{70})(?!$)/$1\n/g unless m/^>/' file_test
it is giving the error---
perl -pe 's/(.{70})(?file_test)/$1\n/g unless m/^>/' file_test
Sequence (?f...) not recognized in regex; marked by <-- HERE in m/(.{70})(?f <-- HERE ile_test)/ at -e line 1.
Do you have a really old version of Perl? In Perl 5 regular expressions, ?! is a negative assertion lookahead; it checks that the text just ahead of the match does not match the given regular expression (in this case, end of line). I used that to avoid the problem I had with the sed script, which caused the first line to be wrapped one character earlier than subsequent ones.
after trying many other combinations I realised that it is related to length of COLUMN ONE of FILE2 , if it varies the output varies.
and sometimes it gives a blank line at the end in the output and if i put ths last entry of FILE2 somewhere in the middle all the entries after that are not processed. And I can't understand this.
i'll try by making the length of column 1 same in all entries.
The order of the output correspond to the order of headers in file1.
The logic is :
For each record in file 1
Proceed extractions specified in file 2 relative to this record
The script:
awk '
NR==FNR {
if (NR==1)
Key_len = length($1) + 1;
k = ">" substr($0,1, Key_len-1);
n = ++Keys[k];
From[k,n] = $2;
To[k,n] = $3;
Len[k,n] = $3 - $2;
next;
}
function print_selected( i,k,p,str) {
if (selected) {
k = substr(Header, 1, Key_len);
for (i=1; i<=Keys[k]; i++) {
printf("%s (%s-%s)\n", Header, From[k,i], To[k,i]);
str = substr(Alphabets, From[k,i], Len[k,i]);
p = 1
while (p<=Len[k,i]) {
print substr(str, p, 70);
p += 70;
}
}
}
}
/^>/ {
print_selected();
selected = (substr($0, 1, Key_len) in Keys);
Header = $0;
Alphabets = "";
next;
}
selected {
Alphabets = Alphabets $0;
}
END {
print_selected();
}
' sm2.dat sm1.dat
Ya I got your point but can we do it other way round also like-
For each record in file 2
Proceed extractions in file 1 relative to this record
If we can match first column of FILE2 as pattern in FILE1 header line and when it is found we proceed to extract the subtring in the lines following the header line till the next '>' appears by using values of column 2 and 3 of FILE2.
Just want to know your opinion on this logic as I am in learning phase. I'll try to work this out on my own if you think it will not be unnecessarily complicated.
Just want to develop my logics
Give your suggestions and hints to elaborate it if you find it ok.
The answer depends of the key length (first column of file2).
If keys are fixed length, there is no problem to invert the logic.
Read file1: Memorize each header and datas records in an array indexed by the key.
Read file2:For each record, if the record had been memorized (key is an index in the array) proceed extraction from memorized datas.
If keys are variable length, the header and records from file1 will be memorized in a array indexed by record number, we can't use another index because we don't know what part of the record is the key.
For each extraction specified in file 2, we will scan sequentially the array until we found the right record (more time consuming).