awk incorrect format

I was wondering whether anyone has any idea what is happening here. I'm using simple code to compare 2 tab delimited files based on column 1 values. If the column1 value of file1 exists in file2, then I'm to print the column4 value in file2 in column3 of file1. Here is my code:

1st I have to produce file1 by concatenating columns 4&5 of the input file:

INPUT FILE:

1    52828739    rs12044739    C    T
1    52835713    rs72899818    T    C
1    52836736    rs10157619    G    A
1    52844478    rs6684941    A    G

Using this simple code, I reorder columns in the INPUT file and concatenate columns 4 &5:

awk -F '\t' ' {print $3,$1,$2,$4$5}' OFS="\t" INPUT > file1

This produces the following file1 which looks fine:

rs12044739    1    52828739    CT
rs72899818    1    52835713    TC
rs10157619    1    52836736    GA
rs6684941    1    52844478    AG

This is file2:

rs12044739    1    52828739    CC
rs72899818    1    52835713    TC
rs10157619    1    52836736    GG
rs6684941    1    52844478    AA

Using the code:

awk 'NR == FNR {REP [$1] = $4; next} $1 in REP {$3 = REP[$1]} 1' OFS="\t" file2 file1 > results

Yields the following output file, RESULTS, which so far seems to look fine:

rs12044739    1    CC    CT
rs72899818    1    TC    TC
rs10157619    1    GG    GA
rs6684941    1    AA    AG

For final processing we need to print all rows for which column3 = column4. So I used this simple code:

awk -F '\t' '{ if ($3 = $4) print $0}'  OFS="\t" RESULTS > RESULTS2

Thus RESULTS2 should look like this:

rs72899818    1    TC    TC

Instead what I get is this:

rs72899818    1    TC    

TC

Any ideas as to what is causing the column4 value to print to the following row?

I note that none of the data you posted looks like it came from strictly tab-separated files. I suspect rogue spaces and/or carriage returns in your data.

From your example file1 and file2 I get

rs12044739    1    52828739    CT
rs72899818      1       TT      TC
rs10157619    1    52836736    GA
rs6684941    1    52844478    AG
1 Like

Sorry, I forgot to add a step. I have edited my post to reflect this....

How about, hoping that any invisible surprises like the ones corona688 mentioned will have been taken care of, and untested, from a windows machine, so bear with me:

awk -F "\t" 'NR == FNR {REP[$3] = $4$5; next} REP[$1] == $4 {$3 = REP[$1]; gsub (/[^ACGT]/, "", $3); print}' OFS="\t" INPUT file2

BTW, if ($3 = $4) should read if ($3 == $4) , but this would not explain the additional line feed that you complain about.

1 Like

Thanks Rudi, but that didn't produce the desired output RESULT2. BTW, is there a quick way to clean up a file that may have hidden spaces and other sorts of things to make it a clean tab delimited file.

What Rudi & Corona are saying: UNIX text files have different carriage control characters from windows text files - like tab delimited Excel output.

cleanup windows files == dos2unix command:

carriage control for
UNIX: ASCII 10 written "\n", called a newline character.
Windows: ASCII 13 and ASCII 10 - "\r\n", called return and newline.

awk will misbehave on windows text files. Most decent editors let you change UNIX <-> Windows at will. The UNIX dos2unix command does what you need when the file got onto the Linux box with bad carriage control. unix2dos goes the other way for you. Windows does not like UNIX carriage control, either Tit for tat, I guess.

1 Like

That did the trick. One of the text files sent to me must have been processed with a windows machine...

1 Like