Challenging Awk array problem

Hi,

I rather have a very complicated awk problem here, at least to me. I have two files.

File 1:

607    687    174    0    0    chr1    3000001    3000156    -194195276    -    L1_Mur2    LINE    L1    -4310    1567    1413    1
607    917    214    114    45    chr1    3000237    3000733    -194194699    -    L1_Mur2    LINE    L1    -4488    1389    913    1
607    215    31    0    30    chr1    3000733    3000766    -194194666    +    (TTTG)n    Simple_repeat    Simple_repeat    2    33    0    2
607    845    233    76    114    chr1    3000766    3000792    -194194640    -    L1_Mur2    LINE    L1    -6816    912    887    1
607    621    250    65    37    chr1    3001287    3001583    -194193849    -    Lx9    LINE    L1    -1596    6048    5742    3
607    1320    197    332    7    chr1    3001722    3002005    -194193427    -    RLTR25A    LTR    ERVK    0    1028    625    4

File 2:

4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC
28|3721 + gi|149352351|ref|NC_000069.5|NC_000069  chr3  154935392 GAGTTTTACAGTCCA
28|3721 +  gi|149288852|ref|NC_000067.5|NC_000067 chr1  152633707 GAGTTTTACAGTCCA
28|3721  + gi|149361432|ref|NC_000073.5|NC_000073 chr7  86595415 GAGTTTTACAGTCCA
34|3145  - gi|149321426|ref|NC_000084.5|NC_000084 chr18  43464724 ACGGCTTACGA
34|3145  - gi|149354224|ref|NC_000071.5|NC_000071 chr5  37676290 ACGGCTTACGA

If field 6 of file 1 is same as field 4 of file 2, then see if field 5 of file 2 lies within the range specified by the fields 7 and 8 of file 1. If yes, extract the line from file 2 and add the fields 11, 12 and 13 of file 1 in to a separate file. Whew!

Ok for example - field 4 of file 2 i.e. chr1 is same as field 6 of file 1. Then see if field 5 of file 2 i.e.3000072 (which is always a number) lies in the range of fields 7 and 8 (3000001 3000156) of file 1. So, I need the output (the line from file 2 plus fields 11,12 and 13 of file 1) in a separate file as

4|17999 - gi|149361523|ref|NC_000074.5|NC_000074  chr1  3000072 TTTATCGTCATCGTC L1_Mur2    LINE    L1

Thank you very much in advance

What's the delimiter in file 2? The pipe, "|" ? If so, taking the first line in file 2, isn't field 4 "ref"? And field 5 is "NC_000074.5"?

Regards and welcome to the forum,
Alister

---------- Post updated at 05:52 PM ---------- Previous update was at 05:50 PM ----------

Nevermind. The pipes threw me. I see that it's whitespace delimited. Duh! Great first impression, huh? :wink:

1 Like

cracks me up when folks in the UK say: "Duh!"

Quick question for the OP...are the files sorted and the records are guaranteed in the same order? Otherwise, what's the key to tie the records? I ask since your initial evaluation seems to focus on flags like chr1...

1 Like

Assuming that the nth line in file1 corresponds to the nth line in file2:

paste -d\\n file1 file2 |
awk '{
    chr=$6; min=$7; max=$8; s=$11" "$12" "$13;
    getline;
    if (chr==$4 && $5>=min && $5<=max)
        print $0, s;
}'

Regards,
Alister

1 Like

Alister, another quick and dirty masterpiece! Works great, for me...

But I must ask: what if they wanted to iterate over the lines in the file? Or would you just wrap it into a while read...?

$ tr -s ' ' <Edit1
607 687 174 0 0 chr1 3000001 3000156 -194195276 - L1_Mur2 LINE L1 -4310 1567 1413 1
607 917 214 114 45 chr1 3000237 3000733 -194194699 - L1_Mur2 LINE L1 -4488 1389 913 1
607 215 31 0 30 chr1 3000733 3000766 -194194666 + (TTTG)n Simple_repeat Simple_repeat 2 33 0 2
607 845 233 76 114 chr1 3000766 3000792 -194194640 - L1_Mur2 LINE L1 -6816 912 887 1
607 621 250 65 37 chr1 3001287 3001583 -194193849 - Lx9 LINE L1 -1596 6048 5742 3
607 1320 197 332 7 chr1 3001722 3002005 -194193427 - RLTR25A LTR ERVK 0 1028 625 4

$ tr -s ' ' <Edit2
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1 3000072 TTTATCGTCATCGTC
28|3721 + gi|149352351|ref|NC_000069.5|NC_000069 chr3 154935392 GAGTTTTACAGTCCA
28|3721 + gi|149288852|ref|NC_000067.5|NC_000067 chr1 152633707 GAGTTTTACAGTCCA
28|3721 + gi|149361432|ref|NC_000073.5|NC_000073 chr7 86595415 GAGTTTTACAGTCCA
34|3145 - gi|149321426|ref|NC_000084.5|NC_000084 chr18 43464724 ACGGCTTACGA
34|3145 - gi|149354224|ref|NC_000071.5|NC_000071 chr5 37676290 ACGGCTTACGA

$ paste -d\\n Edit1 Edit2 |
> awk '{
>     chr=$6; min=$7; max=$8; s=$11" "$12" "$13;
>     getline;
>     if (chr==$4 && $5>=min && $5<=max)
>         print $0, s;
> }'
 L1_Mur2 LINE L1361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC
1 Like

Hi, curleb:

I'm not sure I understand what you mean by iterate over the lines of the file. Every line in both files is read by that solution; the paste merges them:

file1 line1
file2 line1
file1 line2
file2 line2
file1 line3
file2 line3
...etc

There just happens to be only one line that meets the requirements.

If I misundersood your question, please elaborate.

Cheers,
Alister

P.S. The last line in your output seems garbled. Perhaps a terminal hiccup? Or a carriage return somehwere? Here's a sample run from my system:

$ cat file1
607    687    174    0    0    chr1    3000001    3000156    -194195276    -    L1_Mur2    LINE    L1    -4310    1567    1413    1
607    917    214    114    45    chr1    3000237    3000733    -194194699    -    L1_Mur2    LINE    L1    -4488    1389    913    1
607    215    31    0    30    chr1    3000733    3000766    -194194666    +    (TTTG)n    Simple_repeat    Simple_repeat    2    33    0    2
607    845    233    76    114    chr1    3000766    3000792    -194194640    -    L1_Mur2    LINE    L1    -6816    912    887    1
607    621    250    65    37    chr1    3001287    3001583    -194193849    -    Lx9    LINE    L1    -1596    6048    5742    3
607    1320    197    332    7    chr1    3001722    3002005    -194193427    -    RLTR25A    LTR    ERVK    0    1028    625    4
$ cat file2
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC
28|3721 + gi|149352351|ref|NC_000069.5|NC_000069  chr3  154935392 GAGTTTTACAGTCCA
28|3721 +  gi|149288852|ref|NC_000067.5|NC_000067 chr1  152633707 GAGTTTTACAGTCCA
28|3721  + gi|149361432|ref|NC_000073.5|NC_000073 chr7  86595415 GAGTTTTACAGTCCA
34|3145  - gi|149321426|ref|NC_000084.5|NC_000084 chr18  43464724 ACGGCTTACGA
34|3145  - gi|149354224|ref|NC_000071.5|NC_000071 chr5  37676290 ACGGCTTACGA
$ paste -d\\n file1 file2 |
> awk '{
>     chr=$6; min=$7; max=$8; s=$11" "$12" "$13;
>     getline;
>     if (chr==$4 && $5>=min && $5<=max)
>         print $0, s;
> }'
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC L1_Mur2 LINE L1
1 Like

hmmm...I'm not seeing it do that, but I'm only playing on ksh93 (U/Win)...

Mocked it up as follows, so that the first and last lines match the scenario, but I only get the one:

$ tr -s ' ' <Edit1
607 687 174 0 0 chr1 3000001 3000156 -194195276 - L1_Mur2 LINE L1 -4310 1567 1413 1
607 917 214 114 45 chr1 3000237 3000733 -194194699 - L1_Mur2 LINE L1 -4488 1389 913 1
607 215 31 0 30 chr1 3000733 3000766 -194194666 + (TTTG)n Simple_repeat Simple_repeat 2 33 0 2
607 845 233 76 114 chr1 3000766 3000792 -194194640 - L1_Mur2 LINE L1 -6816 912 887 1
607 621 250 65 37 chr1 3001287 3001583 -194193849 - Lx9 LINE L1 -1596 6048 5742 3
607 1320 197 332 7 chr1 37600000 37676290 -194193427 - RLTR25A LTR ERVK 0 1028 625 4

$ tr -s ' ' <Edit2
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1 3000072 TTTATCGTCATCGTC
28|3721 + gi|149352351|ref|NC_000069.5|NC_000069 chr3 154935392 GAGTTTTACAGTCCA
28|3721 + gi|149288852|ref|NC_000067.5|NC_000067 chr1 152633707 GAGTTTTACAGTCCA
28|3721 + gi|149361432|ref|NC_000073.5|NC_000073 chr7 86595415 GAGTTTTACAGTCCA
34|3145 - gi|149321426|ref|NC_000084.5|NC_000084 chr18 43464724 ACGGCTTACGA
34|3145 - gi|149354224|ref|NC_000071.5|NC_000071 chr5 37676290 ACGGCTTACGA

$ paste -d\\n Edit1 Edit2 |awk '{chr=$6; min=$7; max=$8; s=$11" "$12" "$13; getline; if (chr==$4 && $5>=min && $5<=max) print $0;}'
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC
1 Like

The other condition that must be met isn't met: chr5 == chr1 (false) :wink:

1 Like

...doh!

time for bed! :smiley:

$ tr -s ' ' <Edit1
607 687 174 0 0 chr1 3000001 3000156 -194195276 - L1_Mur2 LINE L1 -4310 1567 1413 1
607 917 214 114 45 chr1 3000237 3000733 -194194699 - L1_Mur2 LINE L1 -4488 1389 913 1
607 215 31 0 30 chr1 3000733 3000766 -194194666 + (TTTG)n Simple_repeat Simple_repeat 2 33 0 2
607 845 233 76 114 chr1 3000766 3000792 -194194640 - L1_Mur2 LINE L1 -6816 912 887 1
607 621 250 65 37 chr1 3001287 3001583 -194193849 - Lx9 LINE L1 -1596 6048 5742 3
607 1320 197 332 7 chr1 37600000 37676290 -194193427 - RLTR25A LTR ERVK 0 1028 625 4

$ tr -s ' ' <Edit2
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1 3000072 TTTATCGTCATCGTC
28|3721 + gi|149352351|ref|NC_000069.5|NC_000069 chr3 154935392 GAGTTTTACAGTCCA
28|3721 + gi|149288852|ref|NC_000067.5|NC_000067 chr1 152633707 GAGTTTTACAGTCCA
28|3721 + gi|149361432|ref|NC_000073.5|NC_000073 chr7 86595415 GAGTTTTACAGTCCA
34|3145 - gi|149321426|ref|NC_000084.5|NC_000084 chr18 43464724 ACGGCTTACGA
34|3145 - gi|149354224|ref|NC_000071.5|NC_000071 chr1 37676290 ACGGCTTACGA

$ paste -d\\n Edit1 Edit2 |awk '{chr=$6; min=$7; max=$8; s=$11" "$12" "$13; getline; if (chr==$4 && $5>=min && $5<=max) print $0;}'
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC
34|3145  - gi|149354224|ref|NC_000071.5|NC_000071 chr1  37676290 ACGGCTTACGA
1 Like

///Quick question for the OP...are the files sorted and the records are guaranteed in the same order? Otherwise, what's the key to tie the records? I ask since your initial evaluation seems to focus on flags like chr1...///

Thank you very much guys. I did not expect such quick responses. Somehow I did not get email alerts too.

To answer your question, file 2 is sorted by 'chr' (ascending) and file 1 is sorted by field 1 (ascending). I have not tried the code given here yet. I will check it out.

Thanks a ton you guys - you rock.

---------- Post updated at 09:12 PM ---------- Previous update was at 09:08 PM ----------

oopsy - file 2 is sorted by field 1 and file 1 is sorted by chr (chr1 to chr19).

---------- Post updated at 09:47 PM ---------- Previous update was at 09:12 PM ----------

Sorry being so dumb but when i execute this code in my xp computer, i am getting all kinds of errors like " the system cant find the path specified", and at s=$11^ unexpected newline or end of a string found" . Can you please tell me where am I doing wrong? thanks.

That code you're using is incomplete. When reformating my code, curleb neglected (or chose) to omit ", s" after the "print $0". Take a look at the code as I posted to see what I'm talking about. Challenging Awk array problem Post: 302423713 Without that bit, the code will not append fields 11-13 when it should.

That shouldn't be the source of your errors, though. Have you run unix tools on this windows machine before? Can you confirm that you have paste and awk?

Regards,
Alister

1 Like

Neglected is more accurate. Intrigued by that use of paste. Sorry.

1 Like

I usually run simple scripts on my machine and they work fine. I generally run a program like this: gawk "code"

But I am bit confused by
paste -d\\n file1 file2 |

I know I should not type "paste" . The files 1 and 2 are in the install directory of gawk. So I guess i should not use \\n either. But then I still get "s=$11^ unexpected newline or end of string" error.

What do you mean you should not type "paste"? Or that you should not use \\n? "paste" is the name of the command. You must type it. "\\n" is an option argument to paste that tells it to use a newline when merging the two files; it is crucial that it is used. You should enter the code exactly as posted: "paste -d\\n file1 file2 | awk....". If the data files aren't named file1 and file2, then change the filenames to point to the correct locations, but nothing else.

1 Like

I tried pasting the code exactly, but it says the command paste is not recognized as external or internal command. Then I changed the single quotes to double quotes and that message wont show up, but i get the same s=$11^unexpected newline error.

Here's a simpler approach that only uses AWK (and which I should've suggested from the beginning):

awk '{chr=$6; min=$7; max=$8; s=$11" "$12" "$13; getline < f2; if (chr==$4 && $5>=min && $5<=max) print $0, s;}' f2='file2' file1

You can change what's in red to suit your needs, but leave the rest as is.

Changing the quotes as you say you did completely changes the meaning of the code; using double quotes at the outer level causes every instance of a dollar sign folowed by a digit to be expanded by the shell instead of being passed literally to AWK for its use. AWK will never see them. Also, you are altering what is quoted and what is not quoted by creating unintended quoted strings with the double quotes that were embedded in the single quotes which were removed.

Alister

1 Like

Alister, thanks for the code. But when I run it, it gives me an error message "the system can not find the file specified" even though the files are in the installation directory.

Also, while running other scripts I always use double quotes and it works fine. Single quotes doesn't work in my windows laptop. They work fine in my ubuntu office computer though.

It might be easier to run this in a script with -f option, but then the code might have to be changed a bit at the end so that the files will be given as in put in the command prompt. Here file 1 can be removed from the script and put in the command prompt but not sure how file2 should be accommodated.

If you prefer it that way, sure.

awk -f dna.awk f2='file2' file1

Where dna.awk contains:

{chr=$6; min=$7; max=$8; s=$11" "$12" "$13; getline < f2; if (chr==$4 && $5>=min && $5<=max) print $0, s;}

If you still have problems, copy-paste the commands exactly as you ran them and the error messages exactly as you see them. Perhaps that will enable someone with experience dealing with unix tools on windows to help out.

Alister

---------- Post updated at 01:19 PM ---------- Previous update was at 01:14 PM ----------

Or, simpler still, you can put it into a shell script:

#!/bin/sh

awk '{chr=$6; min=$7; max=$8; s=$11" "$12" "$13; getline < f2; if (chr==$4 && $5>=min && $5<=max) print $0, s;}' f2="$2" "$1"

Assuming the shell script file is named dna.sh, exists in the current directory, and is made executable, it can be run thusly:

$ ./dna.sh file1 file2
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC L1_Mur2 LINE L1

Or, if not an executable file:

$ sh dna.sh file1 file2
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC L1_Mur2 LINE L1
1 Like

I tried this, but the program just runs and prints nothing and goes back to command prompt. No error messages whatsoever.

I ran this:
awk -f dna.awk f2='2.txt' 1.txt

I use gawk, is there a significant difference between regular awk and gawk?

Aaahh! now its working. Ok I removed the single quotes on '2.txt' and its working. So the code that worked is awk -f dna.awk f2=2.txt 1.txt

I knew there is a simple mistake I have been making.

Thank you very much you all, particularly Alister. You rock.

Ok - Now I am in to another problem (life is tough!). May be I did not explain this properly and I am apologize for it. The code here seems to assume line to line matching of file 1 and file 2. But my actual files (which are very big) do not match line by line. For example let me re-frame the original files.

file 1 (THIS IS SAME AS ORIGINAL)

607    687    174    0    0    chr1    3000001    3000156    -194195276    -    L1_Mur2    LINE    L1    -4310    1567    1413    1
607    917    214    114    45    chr1    3000237    3000733    -194194699    -    L1_Mur2    LINE    L1    -4488    1389    913    1
607    215    31    0    30    chr1    3000733    3000766    -194194666    +    (TTTG)n    Simple_repeat    Simple_repeat    2    33    0    2
607    845    233    76    114    chr1    3000766    3000792    -194194640    -    L1_Mur2    LINE    L1    -6816    912    887    1
607    621    250    65    37    chr1    3001287    3001583    -194193849    -    Lx9    LINE    L1    -1596    6048    5742    3
607    1320    197    332    7    chr1    3001722    3002005    -194193427    -    RLTR25A    LTR    ERVK    0    1028    625    4

file 2

4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC
28|3721 + gi|149352351|ref|NC_000069.5|NC_000069  chr3  154935392 GAGTTTTACAGTCCA
28|3721 +  gi|149288852|ref|NC_000067.5|NC_000067 chr1  152633707 GAGTTTTACAGTCCA
28|3721  + gi|149361432|ref|NC_000073.5|NC_000073 chr1  3000073 GAGTTTTACAGTCCA
34|3145  - gi|149321426|ref|NC_000084.5|NC_000084 chr1 3000767 ACGGCTTACGA
34|3145  - gi|149354224|ref|NC_000071.5|NC_000071 chr5  37676290 ACGGCTTACGA

So the output should be,

4|17999 - gi|149361523|ref|NC_000074.5|NC_000074  chr1  3000072 TTTATCGTCATCGTC L1_Mur2    LINE    L1
28|3721  + gi|149361432|ref|NC_000073.5|NC_000073 chr1  3000073  GAGTTTTACAGTCCA     L1_Mur2    LINE    L1
34|3145  - gi|149321426|ref|NC_000084.5|NC_000084 chr1 3000767 ACGGCTTACGA     (TTTG)n    Simple_repeat    Simple_repeat

The code here seems to be matching the two files line to line. I tried editing the code to this purpose but of no avail. Please help me. thanks.