Match partial text

cmccabe · March 7, 2015, 11:34am

I posted the incorrect files yesterday and apologize. I also modified the awk script but with no luck. There are two text files in the zip (name.txt and output.txt). I am trying to match $2 in name.txt with $1 in output.txt and if they match then $1 of name.txt is copied to $7 of output.txt. The tricky part (well at least for me is), that only part of $2 will match $1. Thank you :).

awk 'NR==FNR{A[$1]=$2; next}  A[$2]  {$2=$2 " " A[$4]}1' output.txt name.txt > output.txt

 Desired output.txt 
DTE3504500000001ref	34529	35031	1	DTE3504500000001	SeqRxn4  1
DTE3504500000001antiref	35031	34529	1

Scrutinizer · March 7, 2015, 12:33pm

Both files are in DOS format, not in Unix format.

Could you describe what should be matched? Apparently DTE3504500000001ref should match DTE3504500000001 , but DTE3504500000001antiref should not. What is the criterion?

cmccabe · March 7, 2015, 12:43pm

Sorry the correct files are attached. The DTE3504500000001 is the criterion to match so that both records will be assigned the same value. Also, the final output.txt needs to be delimiated so it can be opened in excel. I am not sure where to put the

 OFS="\t"

. Thank you :).

RudiC · March 7, 2015, 1:51pm

Try

awk 'NF == 2 {T[$2]=$1; next} {print $0, T[$5]}' FS="\t" OFS="\t" /tmp/name.txt /tmp/output.txt

Might be lengthy to read ALL the data from name into memory...

Scrutinizer · March 7, 2015, 2:16pm

cmccabe:

Sorry the correct files are attached. The DTE3504500000001 is the criterion to match so that both records will be assigned the same value. Also, the final output.txt needs to be delimiated so it can be opened in excel. I am not sure where to put the
 OFS="\t"
. Thank you :).

Then in your example, why does only the record with DTE3504500000001ref get an extra 1 at the end and why doesn't the one with DTE3504500000001antiref get one?

DTE3504500000001ref	34529	35031	1	DTE3504500000001	SeqRxn4  1
DTE3504500000001antiref	35031	34529	1

cmccabe · March 9, 2015, 9:51am

Thank you very much. I attached the combined.txt but forgot that $5 needs to be copied to $8 and the blank row in between the two lines removed. The combined.txt really only needs to loo like the below, but I'm not sure how to do this. Basically, I am going to importing the sheet into a SQL database and trying to format the data accordinly, the values in the sheet are combined (..... ref and .....antiref) and the chromosome is matched. Thank you very much :).

Code:

 awk 'NF == 2 {T[$2]=$1; next} {print $0, T[$5]}' FS="\t" OFS="\t" name.txt output.txt > combined.txt

1     DTE3504500000001
1     DTE3504500000002 
1     DTE3504500000003

RudiC · March 9, 2015, 1:08pm

You lost me. Does it work? Doesn't it? What's missing?

cmccabe · March 10, 2015, 12:49pm

It works great, I am just trying to have the combined.txt be the values in the output sheet combined (..... ref and .....antiref) and the chromosome matched. This way the row space between them can be removed. Thank you very much :).

 current combined.txt
DTE3504500000001ref         1
DTE3504500000001antiref   "space"
DTE3504500000002ref         1
DTE3504500000002antiref   "space"

 Desired combined.txt
1     DTE3504500000001
1     DTE3504500000002 
1     DTE3504500000003

Thank you for your help :).

---------- Post updated 03-10-15 at 10:49 AM ---------- Previous update was 03-09-15 at 12:16 PM ----------

Is this possible to do using awk, I don't always trust the output of excel. Thank you :).

---------- Post updated at 11:49 AM ---------- Previous update was at 10:49 AM ----------

Maybe something like:

 awk 'NF == 2 {T[$2]=$1; next} {print $0, T[$5]}' FS="\t" OFS="\t" name.txt output.txt > combined.txt | awk 'NF' combined.txt >combined2.txt

Not sure if thats right but I hope I am learning slowly. Thank you :).