Sequence extraction

i want to extract specific region of interest from big file. i have only start position, end position and seq id, see my query is:

I have file1 is this

>GL3482.1
GAACTTGAGATCCGGGGA
GCAGTGGATCTCCACCAG
CGGCCAGAACTGGTGCAC
CTCCAGGCCAGCCTCGTC
CTGCGTGTC
>GL3550.1
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>GL3472.1
TTTTCCTGTTCACTGCTGCTTTTCTATAGACAGCA
GCAGCAAGCAGTAAGAGAAAGTA

file2 is this :

seq id     start    end
GL3482.1   323100   323743
GL3550.1   41911    40888
GL3472.1   274408   272617

and i want result like this:

>GL3482.1 
TTGAGA
>GL3550.1 
GGCATTTTTGTG
>GL3472.1
TTCCTGTT

when i run this command :

$ awk 'NR==FNR{if($0~/^>/){i=substr($0,2);getline};a=a $0;next}{print ">" $1 ORS substr(a[$1], $2, $3-$2+1)}' file1 FS=\\t file2

i got no output or output file. why so?

---------- Post updated at 01:28 AM ---------- Previous update was at 01:21 AM ----------

i am using Ubuntu OS

Maybe because FS is set to '\t' and there appear to be no TABs in your second sample file, so the fields will not match. And you are setting the index to the entire record, save the first character. which is probably not what you want (you probably meant to use $1 here, but that would not be a sure way to do it either, because in the FASTA format the identifier is allowed to contain spaces). And the FASTA file is word wrapped, so you need to take out the newlines and not use getline to get only the second line ....

The best way to do that is is to use ">" as a record separator and use "\n" as the field separator. By setting OFS as the empty string, and assigning a value to one of the fields, all newlines will be replaced by empty strings, so this will effectively remove the word wrap. And the sequence will become one continuous string, which will make it suitable for substring selection.

Using your file order, we would get something like this.

awk 'NR==FNR{i=$1; $1=x; A=$0; next} $1 in A{print ">" $1 ORS substr(A[$1], $2, $3-$2+1)}' RS=\> FS='\n' OFS= file1 FS=" " RS="\n" file2

If we read the files the other way around, then it becomes more memory efficient:

awk 'NR==FNR{S[$1]=$2; E[$1]=$3; next} $1 in S{i=$1; $1=x; print RS i FS substr($0,S,E-S+1)}' file2 RS=\> FS='\n' OFS= file1

With your two sample input files (with the combined lengths of the lines in each group that do not start with a > being less than 100 characters), I don't see how you would expect any output when the substring you are trying to extract from those strings starts more than 40,000 characters into that string, and in two of the three cases has an ending position in the string that comes before the starting position (thereby requesting a substring that has negative length).

In addition to those problems, as Scrutinizer said, your script specifies that the input field separator for file2 is a tab character, but there are no tab characters in the data you showed us. Therefore, you are requesting a substring of 1 character starting at position 0 (when arrays of characters in awk start at position 1).

Note also that although you might be able to create an array element in awk or gawk on Ubuntu that is more than 323,000 characters long; on most UNIX systems and BSD-based systems, awk won't let you read a line, write a single output string, or create a variable whose value is much more that LINE_MAX bytes long (on most systems LINE_MAX is 2,048).

1 Like

let me clear my query again :

i have 2 files, file1 is 1.fasta sequences file which are as follow:

>gi|547177824|gb|AWWX01000001.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig0, whole genome shotgun sequence
ACATAATCCCCGAGGCAAAATAAGTCTCTAATGAACTTGACCCTATGAGTGTCAAGGTGAGGGAGTCCTA
AGAGACTGACAAGGGCTGTGAAGGCTACAAGGGAAGAAGACAACAGGATCAGCTGGAAAAGGCTTAAAAT
TGCAGCCCTTGATTCTTCCACTGTGCCCTGGGGCCATGAATCGCTACAGCCTCACTGAAGGAATCTGAGG
TAGCATCTCAGAGCTCCCATGCCAAGACCATGGGGAACAAATTTGAGTTGGATGTGGCAGCATGGCCCAG
AGGTCATGAAATAACCAGCACAAACTTTGTTAGTGGATGACAACTGGCCCTTTTAAGTGATCACCTTAAT
AGACTATCATTCAGGGACTAAAAGAGAAAACTTAGTACCCCAAAAATTGTACCAAAGGACAGAAATTCAC
CTCAGAACTTTCCAGCAAAG
>gi|547177823|gb|AWWX01000002.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig1, whole genome shotgun sequence
CAATTTGATTCCAGCTTGTGTTCATCCATCTGGGAATTTTCATGATGTACTGTGCATATAACTTAAATAA
TCAGGGTGACAATATGCAGCCTTAATGTACTTGTGTCCCAATTTTGAACCTGTCTGTTGTTCCGTGTCCA
CTTCTAACTGTTGCTTCTTGACCTGTATACCAGTTTTTCAGAAGGCAGGTAAGGTGCTCTGGTATTCCCA
TCTCTTTAATAATTTTGCAGTTTCCTGTGATACACACAGACAAAGGCTTTTACATAGTCAAACAAGTAGA
AGTTTTTCTGGAGTTCTCTTGCTTTGCCTA
>gi|547177822|gb|AWWX01000003.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig2, whole genome shotgun sequence
GGTCCAAGAGGCTCCTTCTTTTGTCCAGAAAGAGTTCCTGTGAACAGGGCTCAGTGGCTGACTGATGGTT
AGTATCATCTGGAGAAGGGAAAGGGCTTTTGTAGGCAAATTTAGCATTGCCAAGCCAGCAGAGTCTACCT
GGCAAGTCTTTCAAAGTTCAGAATTTTTCCTAAAGGCAAGGAAAACATGGACATAAGGGACCTCGCTGGA
ATCCTAAGCGCAAATCTTCAGCAATGGACAGTCCCATCTGAAAAGGAAATGATGAATGCTCATAAGGGGC
TTCGTGTGTCACTGAATTCCCTTATGGACTTTTCCTGGCTGTGGCATCCCTCATCTGTGATCTTGCTGGG
CATAATGTGGTTCAGTTCTCGCCCCAGGGGCCGTTGTGGCTTCCCTGAGTATTTTTTTAAGAGATTATTT
ATGTGTTTAATTTTGGTTGCACTGGCTCTTCATTGCTCCATGTGTGTTCTCTAGTTTCGAGGAGCAGGGG
CTA
>gi|547177821|gb|AWWX01000004.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig3, whole genome shotgun sequence
CTATATAAATAAATCTATTTCTTGCCTAACACTTTGCCTGTTGCTGAATTCCTTCTGTGCCGAGACACAA
ACAATGTGAGCCACAGTACATCCAGGTACCAGATAAGTGATTCTAAATAAAAGACCATGGTTCAAGACTC
AAACTGGATTTTGGGGAGGGGGGGAGGTTGGAGTCCTGGCCATGTGGATTTTAGTCCAAACCTGATGTGA
TAGGTTTCAGGTAGAGATCATCTCCATGCCACCACATTGGAAATTTTATACTATGGTCTGTTATACTGGT
TCA
>gi|547177820|gb|AWWX01000005.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig4, whole genome shotgun sequence
TCTCAAGCAGACTTTACCTATAAAAAGGGCAATGTGTCCTGGAACTGCAAGCTGAGCTTAGGAACAGAGA
ACTGCAAAAACTATTGGCATGAACAACTCTGTCCAAACATCTACATTAGGAACCTTGACTGACCTCTAGC
ATGGTTTCTAGCAGCAACCTAAGGCCACGTTCTAGGACAACTCAGCTACCCCTGAGTTCCTGTCTAGAAA
ATTTCAAGGCTACCAAAGGAATCTGCTCCAGCCAACATCTGACATAAGCCCCTCATCTTCCTTTACTTAG
AGTGTCTATTTAAAACAAAGACCAAAAAAAAAAAAAACAAAAACAAAAAACCCTCACGATTACAAGAAAA
GTGTGACGGAAATAAACTAGAAATTGATAACAAAAAGACAGCAGGAAAATTAAAGATTATTTGCAGACTA
AGCAGAAAACATCTAAA
>gi|547177819|gb|AWWX01000006.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig5, whole genome shotgun sequence
ATCAGCAAGCTCAACCTCGGGCAAACGATTGTCCGAGTGATGATGGGGGCTGCCCAAGTCCGGTCTCCTC
ATGGTGCTCCTGGGGGTCATCTTCATGAATGGTAACCACGCCACCGAGGAGGAGGTCTGGGAATTCCTGA
GTATGTTGGGGATCTATGCTGGGAGGAGGCACTGGATCTTTGGGGAGCCCAGAAGGCTCATCACCAAAGA
TCTGGTGCAGAAGGAGTACCTGAACTACCGCCGGGTGCCCAATAGTGATCCTCCGCGCTACCAGTTGCTG
TGGGGCCCGAGAGCTTGTGCTGAGACCAGTAAGATGAAGGTACTGGAGGTTCTAGCCAAGTTCCACGGTA
GGGTCCCTAGTTCCTTCCCAGACCTATATGACGAGGCTCTGAGAGATCAGGCGGAGACAGCAGGGCGGAG
AGGTGTGGCCAGGGCTCCCGACCATGGCTGAGGCCAGTGCCC
>gi|547177818|gb|AWWX01000007.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig6, whole genome shotgun sequence
ATGAAGAAATGAAGAGAAAAAAAATGAGCACAAATACCTCTCATGAACATAGAAGCAAAATCCCCAACCA
AATATAAGAAAAGAAACCTAAAAATCCATAGAAGGAATTACATACTATATCTAAGTAGGATGCATTTCAG
ATGTGCAACATTTGAAAATCAGCAATCATAAATCATAATAACAATAGTCTAAGGAAGATAATCACAAGAT
GAGATCAACAGATGAAGGGAAAAATCATCTGATAAATCCAACACCCATTCATCATAAAAATTCTACAGCA
AACTAGATACAGACAGGCATTTCCTCAGCTTCATATAAAACATACTAAAAAGAAATACAGGTAAAACATA
CTTCATAGT
>gi|547177817|gb|AWWX01000008.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig7, whole genome shotgun sequence
AACTTTGGTGCCCCGTGGTCCTGGCATGGGGCCTCGGAGATGCTGCCTCAGATTCCTTCAGTGAAGCAGT
GGAGATTCATGATCCCAGGGCACAGTGCAAGAATCCAGGGCTCAAGTTTAAAGCATTTTTGCAGCTCATC
CTGTTGTCTGCTTCCCTTGCAGGCTGCACATCTATGCCAGTCTCTTAAGTACTCCCCCACCTAGGTACAC
ATAGGGTCAAGTTCAGTGGAGACTTACTGTCCATCAGGGATTATTTAATTTTATGTTTTCACTTTTGTTA
ATAGTTTTCTTCATTCCTTCAATAAGATTTCCGTGGTTACTTCTGGGTACAAAAA
>gi|547177816|gb|AWWX01000009.1| Bubalus bubalis breed Mediterranean WGS:AWWX01:contig8, whole genome shotgun sequence
CCTTCTTTTTCCTGAGACTCTCAGGAAATTCTCCTTGCCTCCATGAAGACTGTATACATTTTTCTAAAGT
TTCTTTGTAGGTAAATTTTCTCTATGGTTGCTTTTGTTTGTGGTATATTTTCTGTTACTTGGTGTAAGTA
ACTGTTGATTCTTCCAGACCAACTGGATGATTTTGCTGCCATGATCTTTTTATTATTCCAGATTTTTGAA
GTTTCTTACATGTTTGAATATTTCCTGTCCTGTTTTCTAAGATTTAATTCGAGAATCAAATTGTCATTGT
GATCTTTTGCTTTTCATTACTTCTGACTTTTATTCATTTGTCATTGTTGTATATAACTCTAGTGGCTATA
TGTACTT

file2 is result.ods file which is as :

gi|546687122|gb|AWWX01446731.1|	       13172	13194
gi|546693672|gb|AWWX01441057.1|	       6859	        6837
gi|546698969|gb|AWWX01436431.1|	       18753	18775
gi|546703077|gb|AWWX01432778.1|   	4132	        4154
gi|546670495|gb|AWWX01450063.1|	        4111	        4133
gi|546689695|gb|AWWX01444610.1|	       14602	14580
gi|546691352|gb|AWWX01443112.1|	       10073	10051
gi|546880329|gb|AWWX01275531.1|	       1158	        1136
gi|546670216|gb|AWWX01450333.1|	       10633	10655
gi|546678257|gb|AWWX01448205.1|	        1112	        1134
gi|546693672|gb|AWWX01441057.1|	        6832	        6854
gi|546704475|gb|AWWX01431394.1|	        18135	18113
gi|546699347|gb|AWWX01436084.1|	        26840	26862
gi|546702960|gb|AWWX01432895.1|	        13515	13493
gi|546833971|gb|AWWX01313367.1|	        615	        593
gi|546860287|gb|AWWX01291803.1|	         2188	2210
gi|546689115|gb|AWWX01445179.1|	        10761	10739
gi|546701370|gb|AWWX01434384.1|	        2616	        2638
gi|546694075|gb|AWWX01440674.1|	        9568	        9546
gi|546701082|gb|AWWX01434635.1|	        8423	        8445
gi|547071923|gb|AWWX01098172.1|	        135	        157
gi|546705086|gb|AWWX01430793.1|	        3181	        3203
gi|546704429|gb|AWWX01431440.1|	       1352	        1330
gi|546709146|gb|AWWX01426952.1|	       52962	52984

and we want a awk script which creates an output file which contain sequences extracted from these coordinates along with id with > symbol .
for example:

>gi|546709146|gb|AWWX01426952.1|
acctgctgcatgcgtgcgtggcgtgcaaaatgcagtcaaggcaggtcagtccatgcatgacgtgcaatgcattgcatggcgtgcaaaatgcaggcgtggcgtgcaaaatgcagtcaaaaaattgccgtgcaatgggcc

output file should be like this.

I hope now it is clear to you.

The comments by Don Cragun still apply: most of the sequences are shorter than the coordinates that you have in your file2, and there are coordinates that result in negative string lengths.
On top, none of the samples in file2 match any of those in file1.

How about posting some meaningful samples?

Please explain your example; it is not at all clear to me.

You say you want the output:

>gi|546709146|gb|AWWX01426952.1|
acctgctgcatgcgtgcgtggcgtgcaaaatgcagtcaaggcaggtcagtccatgcatgacgtgcaatgcattgcatggcgtgcaaaatgcaggcgtggcgtgcaaaatgcagtcaaaaaattgccgtgcaatgggcc

which seems to have been selected by the line:

gi|546709146|gb|AWWX01426952.1|	       52962	52984

from the file result.ods which I thought meant that you wanted characters 52962 through 52984 from the data following a line starting with:

>gi|546709146|gb|AWWX01426952.1|

in the file 1.fasta . But, that string does not appear in 1.fasta and, even if it did, the data from characters 52962 through 52984 would be 23 characters long; not the 138 characters you have said you want to have output.

What am I misunderstanding???

may be that entries would be at the bottom of the file actually sir, i just copied and posted very few entries from file1 and file2 may be matched would be skipped. i cant paste my whole data file here i have 4565005 sequence entries in my data.

and about negative value, please provide me the script for like 300 250 = 50
no problem if there is 250-300 =-50 i will correct it and will then apply script.

Well then, what be the result of applying Scrutinizer's proposals to your files?

yes sir exactly it is 23 characters long sequence not 138. 138 i wrote to make you understand that it is small sequence.

---------- Post updated at 05:30 AM ---------- Previous update was at 05:26 AM ----------

Rudic sir, Scrutinizers sir's script os not creating a new output file separately in the same folder. i want outpiy like:

>gi|546709146|gb|AWWX01426952.1|
acctgctgcatgcgtgcgtggcgtgcaaaatgcagtcaaggcaggtcagtccatgcatgacgt

in separate file i.e output_new.fasta

---------- Post updated at 05:31 AM ---------- Previous update was at 05:30 AM ----------

if you are not understanding then let me edit my file and then i will post my whole data over here.

We don't need you to post you whole data file. We need to you post two small sample input files, and the exact output that should be produced when given those two sample input files. If the Start and End positions are sometimes the End and Start positions instead, you need to make that explicit up front; not assume that we will guess that the data you're showing us is corrupt and that we are supposed to guess what should be done with that corrupt data.

Do not show us sample data that does not match the sample output you provide. Doing that just confuses anyone who might want to help you!

Telling us that you want exactly 23 characters and showing us 138 doesn't make us understand that it is a small sequence; it makes us understand that you are trying to confuse us OR that you can't be bothered to explain what you are trying to do.

If Scrutinizer's script is producing the output you want, but not redirecting it to the file in which you want that output saved, add the redirection operator:

 > output_new.fasta

to the end of the awk command he suggested!

Hi Don, I don't think this is the case on "most systems", but rather on some systems.

For awk, LINE_MAX is a minimum requirement specified by POSIX, but I found no systems with a limit equal to LINE_MAX. A few systems have a low limit, but higher than LINE_MAX and most awk implementations on various platforms have a much higher limit or perhaps no limit.

A small test on Solaris:

$ getconf LINE_MAX
2048
$ LANG=C tr -dc '[a-z]' < /dev/urandom | dd count=1000 2>/dev/null | nawk '{foo=substr($0,1,409600); print foo}' | wc -c
  409601
$

I found these case to have a high limit if any:

Linux      : gawk, mawk
AIX 7      : awk
Solaris 10 : nawk
OSX 10.10  : BSD awk, gawk, mawk

The lower limits I found were:

Solaris 10 : /usr/xpg4/bin/awk: 19999 Bytes
HPUX 11.11 : awk :               3000 Bytes
IRIX 6.5   : awk :               3000 Bytes

--
Interestingly on Solaris nawk has a high limit, whereas early POSIX compliant /usr/xpg4/bin/awk has a low limit.

2 Likes

Hi Scrutinizer,
Thanks for the information. I knew that the Solaris /usr/xpg4/bin/awk had a limit larger than LINE_MAX, but still "relatively" small. I didn't remember that nawk was unlimited.

The OS X 10.9 BSD-based awk also had a 3000 byte limit. I hadn't checked the limit lately not realizing that it had changed. Sometime between OS X version 10.9 and OS X Yosemite, version 10.10.4 that limit was raised considerably or removed. And, looking at the OS X awk man page, the usual BSD banner has disappeared. The command:

awk --version

now returns:

awk version 20070501

while the sed utility (whose man page still has the BSD General Commands Manual banner) command:

sed --version

still returns:

sed: illegal option -- -
usage: sed script [-Ealn] [-i extension] [file ...]
       sed [-Ealn] [-i extension] [-e script] ... [-f script_file] ... [file ...]

so I'm guessing that awk isn't from BSD anymore.

Hi Don, I am not sure about awk, on OS X, I seem to remember it always had that 20070501 version label. And to me it seems like it still behaves like before:

$ echo hello | awk 1 RS=el
h
llo

$

If I look at the man page of OS X 10.6.2, it looks like my current 10.10.4 man page, and there is no BSD label in there. It also looks identical to the FreeBSD 11.0 awk man page and the NetBSD 6.5 awk man page and they also do not have BSD banners..

Hi Scrutinizer,
The OS X 10.10.4 awk also still rejects -v options with the option-argument in the same argument as the option specifier. I.e., awk -v a="abc" sets the awk variable a to abc , but awk -va="abc" fails with the diagnostic:

awk: invalid -v option

The standards require conforming implementations of awk to accept both forms as valid ways to set a to abc .

I could swear that at some point in the past year, awk on OS X gave me a diagnostic and exited when it read a line from a file that was longer than 3000 bytes, when I tried to set a variable to a string longer than 3000 bytes, and when I tried to use print or printf to write more than 3000 bytes in a single call. But, I successfully read a line that contained more than 350Mb a few minutes ago. So, if it did have a lower limit before, it doesn't in OS X Yosemite, version 10.10.4.

Sorry for my confusion...

Some observations about thread-O/Ps problem description:

1) The FASTA format (at least the original Pearson format) uses series of lines up to 120 characters each. Normally they are considerably shorter (80 characters and less). The discussion about line length should be moot therefore and i suppose thread-O/P just has neglected to describe his problem properly insofar as he is talking about sequence offsets, not line offsets. "500" probably means "character 20 on the seventh line" (of 80 characters each).

Still, if this is the case is everybodies guess and he should perhaps explain what he really wants.

2) Speaking of offsets: the FASTA format allows for the following nucleic acid codes only in the sequence representation:

A 	A 				Adenine
C 	C 				Cytosine
G 	G 				Guanine
T 	T 				Thymine
U 	U 				Uracil
R 	A or G 				puRine
Y 	C, T or U 			pYrimidines
K 	G, T or U 			bases which are Ketones
M 	A or C 				bases with aMino groups
S 	C or G 				Strong interaction
W 	A, T or U 			Weak interaction
B 	not A (i.e. C, G, T or U) 	B comes after A
D 	not C (i.e. A, G, T or U) 	D comes after C
H 	not G (i.e., A, C, T or U) 	H comes after G
V 	neither T nor U (i.e. A, C or G) 	V comes after U
N 	A C G T U 			Nucleic acid
X 	masked 	
- 	gap of indeterminate length 	

plus the following codes for amino acids:

* 	translation stop
- 	gap of indeterminate length
A 	Alanine
B 	Aspartic acid (D) or Asparagine (N)
C 	Cysteine
D 	Aspartic acid
E 	Glutamic acid
F 	Phenylalanine
G 	Glycine
H 	Histidine
I 	Isoleucine
J 	Leucine (L) or Isoleucine (I)
K 	Lysine
L 	Leucine
M 	Methionine
N 	Asparagine
O 	Pyrrolysine
P 	Proline
Q 	Glutamine
R 	Arginine
S 	Serine
T 	Threonine
U 	Selenocysteine
V 	Valine
W 	Tryptophan
X 	any
Y 	Tyrosine
Z 	Glutamic acid (E) or Glutamine (Q)

Characters other than the mentioned are ignored. Is there a rule in place if ignored character (if there are any) should be removed from the character count too (will "10" mean "tenth non-ignored character" or "tenth character overall"?

2a) What about sequence alignment?

3) Is your FASTA format the original Pearson format or the (somewhat different) NCBI format? i.e. are multi-line comments introduced with the semicolon character at the beginning of the line to be expected or not?

(for reference see : The NCBI Handbook - Chapter 16The BLAST Sequence Analysis Tool)
bakunin