Beginner Questions.

vlay2 · June 2, 2010, 9:30pm

This is the Test_Data.snp file: MEGAUPLOAD - The leading online storage and file delivery service

The problem statement, all variables and given/known data:
Problem Set:

Before you get started working with these challenges, be aware that the first challenge is reformatting the test data file so that you get rid of the 'header' and get all of the columns delimited for working with in unix. (I'll give you another clue in addition to getting rid of the header, learn 'grep', 'cat', 'cut', 'awk', 'sed')

write a script to change the extension of your file : Test_Data.snp to Test_Data.txt
print all lines that have an 'A' base call either in the reference (column 2) or query (column 3) strain
print only column titled 'LEN R' to a new file called Reference_length.txt
sort the file by column 4 ( titled [P2])
print only the lines that have a basecall in columns 2 and 3 (under [SUB] headings) and sort by [LEN R] , output to new file called snp_report.txt

Relevant commands, code, scripts, algorithms:

I'm not sure what this means?

The attempts at a solution (include all code and scripts):

The only thing I know how to do is actually show the data set in the terminal window

Complete Name of School (University), City (State), Country, Name of Professor, and Course Number (Link to Course):

This is part of a learning scholarship over the summer. I am working with Dr. Mia Champion of TGEN North in Flagstaff. She recommended that I come here for help.

Thanks for any help you can provide. I literally just started learning this a day ago, so please bear with me.

curleb · June 2, 2010, 10:02pm

You might want to post a sample of the file layout next time, rather than asking us to download your whole file. Otherwise, the following should answer most, if not all, of the questions in order, but just so you're aware: there's always more than one way to do it.

It's now up to you to actually deconstruct them per your study guide(s) or texts. HTH.

mv Test_Data.snp Test_Data.txt

awk ' $2 ~ /A/ || $3 ~ /A/ { print $0;} ' Test_Data.snp

awk '{print $9;}' Test_Data.snp >Reference_length.txt

sort -k4,4n Test_Data.snp

awk ' $2 !~ /\./ && $3 !~ /\./ { print $0; }' Test_Data.snp | sort -k9,9n >snp_report.txt
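If the field numbers look arbitrary, it may help to have awk print every whitespace-separated field of one data line next to its index. This is only a sketch: the variable names line and fields are made up for illustration, and the line itself is copied from the file's data rows.

```shell
# One data line copied from Test_Data.snp. The | separators count as fields too.
line='     140   C T   1892730   |        2       90  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella'

# Print each field on its own line, prefixed with its awk field number.
fields=$(printf '%s\n' "$line" | awk '{ for (i = 1; i <= NF; i++) print i ": " $i }')
printf '%s\n' "$fields"
```

Because the two | characters land in fields 5 and 8, the [LEN R] value ends up in field 9, which is why $9 is used above.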

In case anyone else might want to offer something:

$ head -20 Test_Data.snp #|tail +6 |awk ' $2 !~ /\./ && $3 !~ /\./ { print $0; }'

NUCMER

    [P1]  [SUB] [SUB]  [P2]      |   [BUFF]   [DIST]  |  [LEN R]  [LEN Q]  | [FRM]  [TAGS]
========================================================================================
       7   A .   1892597   |        7        7  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
     140   C T   1892730   |        2       90  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
     142   T A   1892732   |        2       88  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
     153   A G   1892743   |       11       77  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
     213   A G   1892803   |       17       17  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
     630   T C   401       |      175      401  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
     805   G A   576       |      175      576  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
    1054   C T   825       |      249      825  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
    2960   . G   2732      |       77     2732  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
    3037   G A   2809      |       77     2809  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
    3329   A C   3101      |      104     3101  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
    4354   A G   1832816   |       67     4354  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
    4421   C A   1832883   |       27     4421  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
    4448   T C   1832910   |       27     4448  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
    4539   A G   1833001   |       17     4539  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella

vlay2 · June 3, 2010, 10:09pm

Thank you. I really appreciate it. I've been having a tough time not only figuring out the problem set, but asking for help on the forums. There's just a lot of jargon that I simply don't know. I appreciate your understanding and help.

---------- Post updated at 07:09 PM ---------- Previous update was at 01:38 AM ----------

I just got all the outputs I wanted except I'm still not sure how to "delimit" and remove the header so I can use the data in UNIX?

Can anyone help?

Thanks a lot!

dazdseg · June 4, 2010, 7:15am

Do you want to remove the header

NUCMER

from your columns?

Is this what you want (that is, is NUCMER your header)?

vlay2 · June 7, 2010, 3:00pm

Honestly, I don't know. I didn't get a lot of info on the problem. Sorry

---------- Post updated 06-07-10 at 12:00 PM ---------- Previous update was 06-06-10 at 09:03 PM ----------

My mentor told me that to get rid of the header I had to use

sed '1,5d'

but I'm not sure how to implement that for the problem.

curleb · June 7, 2010, 3:12pm

Pretty much the same way the following does it with the tail command:

head -20 Test_Data.snp |tail -n +6

In your case, you'd pipe it through to the following:

head -20 Test_Data.snp |sed '1,5d'

You could also use either directly, such as follows (which is actually more efficient):

tail -n +6 Test_Data.snp

sed '1,5d' Test_Data.snp
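As for the "delimit" part, one possible reading is that every column should be separated by a single tab instead of a run of spaces. Here is a rough sketch under that assumption. It builds a tiny stand-in file so it can be run anywhere; the names sample.snp and sample.tsv and the placeholder header lines are invented for the example, and only the two data lines come from the real file.

```shell
# Build a small stand-in for Test_Data.snp: 5 header lines, then 2 data lines.
tmp=$(mktemp -d)
cat > "$tmp/sample.snp" <<'EOF'
placeholder header line 1
placeholder header line 2
placeholder header line 3
placeholder header line 4
placeholder header line 5
     140   C T   1892730   |        2       90  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
     142   T A   1892732   |        2       88  |  1895994  1892819  |  1  1  LVS_Francisella   SchuS4_Francisella
EOF

# Drop lines 1-5, then let awk rebuild each record with a tab between fields.
# Assigning $1 to itself forces awk to rejoin the fields using OFS.
sed '1,5d' "$tmp/sample.snp" | awk -v OFS='\t' '{ $1 = $1; print }' > "$tmp/sample.tsv"

cat "$tmp/sample.tsv"
```

Once the file is tab-delimited, cut -f9 picks out the same column that awk's $9 does. The | characters are still there as fields 5, 8 and 11, so the field numbers used earlier in the thread keep working.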

Best to think of this effort as a sandbox and get dirty playing...not likely you're in a place to muck too much up.

dazdseg · June 11, 2010, 10:20am

sed '1,5d'

sed is another editor in its own right (a stream editor). This command deletes lines 1 through 5, counting from the top of the file, from what it prints; the file itself is left unchanged.
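A quick way to convince yourself of this is to try it on a throwaway file. The sketch below uses a hypothetical 8-line file called numbers.txt in place of the real data.

```shell
# numbers.txt is a made-up 8-line file standing in for the real data.
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 > "$tmp/numbers.txt"

# sed prints everything except lines 1-5; the input file is not modified.
kept=$(sed '1,5d' "$tmp/numbers.txt")
printf '%s\n' "$kept"

# To keep the result, redirect it to a new file.
sed '1,5d' "$tmp/numbers.txt" > "$tmp/no_header.txt"
```

The original numbers.txt still has all 8 lines afterwards; only no_header.txt holds the shortened copy.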

vlay2 · July 9, 2010, 3:48pm

I was hoping someone could sort of point me in the right direction on how to solve this problem.

What I have to do is compare two sets of numbers. What I need to find is:

The numbers that are the same between both sets
and the numbers that are unique to EACH set.

The two number sets are two different files also.

I've come a little farther than where I started, so I'm not totally oblivious to UNIX now, but this is seriously still the hardest question I've had, so I'm clearly not great.

Any help would be greatly appreciated!

Thanks
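One possible direction, assuming each file holds one number per line: sort both files and let comm split them into "only in the first", "only in the second" and "in both". This is a sketch, not the only way to do it, and the file names set_a.txt and set_b.txt and the numbers in them are invented for the example.

```shell
# set_a.txt and set_b.txt are hypothetical; one number per line is assumed.
tmp=$(mktemp -d)
printf '%s\n' 5 12 7 30 > "$tmp/set_a.txt"
printf '%s\n' 7 41 5 8 > "$tmp/set_b.txt"

# comm expects both inputs in plain (text) sort order, not sort -n order.
# -u also drops duplicate numbers within a file.
sort -u "$tmp/set_a.txt" > "$tmp/a.sorted"
sort -u "$tmp/set_b.txt" > "$tmp/b.sorted"

common=$(comm -12 "$tmp/a.sorted" "$tmp/b.sorted")  # numbers in both sets
only_a=$(comm -23 "$tmp/a.sorted" "$tmp/b.sorted")  # numbers unique to the first set
only_b=$(comm -13 "$tmp/a.sorted" "$tmp/b.sorted")  # numbers unique to the second set

printf 'in both:\n%s\n' "$common"
printf 'only in set A:\n%s\n' "$only_a"
printf 'only in set B:\n%s\n' "$only_b"
```

The results come out in text order (so 41 is listed before 8); piping any of them through sort -n afterwards puts them back in numeric order.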