If field 6 of file 1 is the same as field 4 of file 2, then check whether field 5 of file 2 lies within the range given by fields 7 and 8 of file 1. If it does, write the line from file 2, followed by fields 11, 12 and 13 of file 1, to a separate file. Whew!
OK, for example: field 4 of file 2 (chr1) is the same as field 6 of file 1. Then check whether field 5 of file 2 (3000072, which is always a number) lies in the range given by fields 7 and 8 of file 1 (3000001 3000156). So I need the output (the line from file 2 plus fields 11, 12 and 13 of file 1) in a separate file as
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1 3000072 TTTATCGTCATCGTC L1_Mur2 LINE L1
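The rule described above could be sketched in awk roughly as follows. This is only an illustration under assumptions: fields are whitespace-separated, the range in fields 7 and 8 is inclusive, and the file 1 line is invented except for fields 6 to 8 and 11 to 13, which come from the example (the `x1`, `x2`, ... placeholders are hypothetical). The sketch stores the file 1 ranges per chromosome first, so it does not depend on the two files being in the same order.

```shell
tmp=$(mktemp -d)
# Hypothetical sample data: only fields 6-8 and 11-13 of file1 come from the example.
printf '%s\n' 'x1 x2 x3 x4 x5 chr1 3000001 3000156 x9 x10 L1_Mur2 LINE L1' > "$tmp/file1"
printf '%s\n' '4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1 3000072 TTTATCGTCATCGTC' > "$tmp/file2"

awk '
NR == FNR {                      # first operand (file1): remember every range
    n[$6]++                      # count of ranges seen for this chromosome
    lo[$6, n[$6]] = $7
    hi[$6, n[$6]] = $8
    extra[$6, n[$6]] = $11 " " $12 " " $13
    next
}
{                                # second operand (file2): test field 5 against the ranges
    for (i = 1; i <= n[$4]; i++)
        if ($5 + 0 >= lo[$4, i] + 0 && $5 + 0 <= hi[$4, i] + 0)
            print $0, extra[$4, i]
}' "$tmp/file1" "$tmp/file2" > "$tmp/output.txt"

cat "$tmp/output.txt"
```

With this sample data the output file holds the single line shown above. Every file 1 range is kept in memory, which may or may not be acceptable for very large inputs.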
Quick question for the OP... are the files sorted, and are the records guaranteed to be in the same order? Otherwise, what's the key to tie the records? I ask since your initial evaluation seems to focus on flags like chr1...
I'm not sure I understand what you mean by iterate over the lines of the file. Every line in both files is read by that solution; the paste merges them:
Thank you very much guys. I did not expect such quick responses. Somehow I did not get email alerts too.
To answer your question, file 2 is sorted by 'chr' (ascending) and file 1 is sorted by field 1 (ascending). I have not tried the code given here yet. I will check it out.
Thanks a ton you guys - you rock.
Oops, correction: file 2 is sorted by field 1 and file 1 is sorted by chr (chr1 to chr19).
Sorry for being so dumb, but when I execute this code on my XP computer, I get all kinds of errors, such as "the system can't find the path specified" and "s=$11^ unexpected newline or end of a string found". Can you please tell me what I am doing wrong? Thanks.
That code you're using is incomplete. When reformatting my code, curleb neglected (or chose) to omit ", s" after the "print $0". Take a look at the code as I posted it to see what I'm talking about (Challenging Awk array problem, post 302423713). Without that bit, the code will not append fields 11-13 when it should.
That shouldn't be the source of your errors, though. Have you run unix tools on this windows machine before? Can you confirm that you have paste and awk?
I usually run simple scripts on my machine and they work fine. I generally run a program like this: gawk "code"
But I am a bit confused by
paste -d\\n file1 file2 |
I know I should not type "paste". Files 1 and 2 are in the install directory of gawk, so I guess I should not use \\n either. But then I still get the "s=$11^ unexpected newline or end of string" error.
What do you mean you should not type "paste"? Or that you should not use \\n? "paste" is the name of the command. You must type it. "\\n" is an option argument to paste that tells it to use a newline when merging the two files; it is crucial that it is used. You should enter the code exactly as posted: "paste -d\\n file1 file2 | awk....". If the data files aren't named file1 and file2, then change the filenames to point to the correct locations, but nothing else.
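To see what the newline delimiter does, here is a small demonstration with two made-up two-line files (the file contents are hypothetical, and a POSIX shell with `paste` is assumed). In a POSIX shell, the unquoted `-d\\n` from the thread and the quoted `-d '\n'` used here both hand `\n` to paste.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'a1' 'a2' > "$tmp/file1"
printf '%s\n' 'b1' 'b2' > "$tmp/file2"

# With a newline as the delimiter, paste interleaves the files:
# line 1 of file1, line 1 of file2, line 2 of file1, line 2 of file2, ...
merged=$(paste -d '\n' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$merged"
```

This prints a1, b1, a2, b2 on four separate lines, so a program reading the merged stream sees each file 1 line immediately followed by the file 2 line with the same line number.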
I tried pasting the code exactly, but it says the command paste is not recognized as an internal or external command. Then I changed the single quotes to double quotes and that message no longer shows up, but I get the same "s=$11^ unexpected newline" error.
You can change what's in red to suit your needs, but leave the rest as is.
Changing the quotes as you say you did completely changes the meaning of the code; using double quotes at the outer level causes every instance of a dollar sign followed by a digit to be expanded by the shell instead of being passed literally to AWK for its use. AWK will never see them. Also, the double quotes that were embedded inside the single-quoted program now pair up with your new outer double quotes, creating unintended quoted strings and changing which parts are quoted and which are not.
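The difference is easy to reproduce in a POSIX shell. The following is a minimal sketch; the `set --` line only gives the shell its own `$1` and `$2` so the effect is visible, and the input line is made up.

```shell
set -- first second      # give the shell its own $1 and $2 for the demonstration

# Single quotes: the shell passes {print $2} to awk untouched.
single=$(echo 'a b' | awk '{print $2}')

# Double quotes: the shell substitutes $2 first, so awk receives {print second}
# and prints the (empty) awk variable named "second".
double=$(echo 'a b' | awk "{print $2}")

printf 'single=[%s] double=[%s]\n' "$single" "$double"
```

This prints `single=[b] double=[]`. Note that this describes a Unix-style shell; the Windows command prompt does not treat single quotes as quoting characters, which is consistent with single-quoted programs failing there.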
Alister, thanks for the code. But when I run it, it gives me the error message "the system cannot find the file specified" even though the files are in the installation directory.
Also, when running other scripts I always use double quotes and it works fine. Single quotes don't work on my Windows laptop. They work fine on my Ubuntu office computer though.
It might be easier to run this as a script with the -f option, but then the code might have to be changed a bit at the end so that the files are given as input at the command prompt. File 1 can be removed from the script and given at the command prompt, but I am not sure how file 2 should be accommodated.
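One possible way to accommodate the second file in a `-f` script, sketched here under assumptions and not taken from the code posted in this thread, is to pass its name in an awk variable and read it with `getline`. The script name `match.awk`, the variable name `f2`, and the sample data are all hypothetical; the sketch keeps the line-by-line pairing of the two files and treats the range as inclusive.

```shell
tmp=$(mktemp -d)
# Hypothetical sample data: only fields 6-8 and 11-13 of file1 come from the example.
printf '%s\n' 'x1 x2 x3 x4 x5 chr1 3000001 3000156 x9 x10 L1_Mur2 LINE L1' > "$tmp/file1"
printf '%s\n' '4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1 3000072 TTTATCGTCATCGTC' > "$tmp/file2"

# The awk program lives in its own file, so no command-line quoting touches it.
cat > "$tmp/match.awk" <<'EOF'
# For every line of file 1, read the corresponding line of the file named in f2.
{
    if ((getline line < f2) <= 0)      # second file has no more lines
        exit
    split(line, b, " ")                # b[4] and b[5] are fields 4 and 5 of file 2
    if (b[4] == $6 && b[5] + 0 >= $7 + 0 && b[5] + 0 <= $8 + 0)
        print line, $11, $12, $13
}
EOF

result=$(awk -v f2="$tmp/file2" -f "$tmp/match.awk" "$tmp/file1")
printf '%s\n' "$result"
```

With the program in a file, the command line shrinks to something like `gawk -v f2=file2 -f match.awk file1 > output.txt`, which contains no dollar signs or nested quotes for the command prompt to interfere with.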
If you still have problems, copy-paste the commands exactly as you ran them and the error messages exactly as you see them. Perhaps that will enable someone with experience dealing with unix tools on windows to help out.
Alister
Or, simpler still, you can put it into a shell script:
OK, now I have run into another problem (life is tough!). Maybe I did not explain this properly, and I apologize for it. The code here seems to assume line-by-line matching of file 1 and file 2, but my actual files (which are very big) do not match line by line. For example, let me re-frame the original files.