Paste columns based on common column: multiple files

Hi all,

I have multiple files, in this case 5, with space-separated columns. Each file has 12 columns and 300-400K lines.
I want output such that, if a value in column 2 is present in all the files, the full row for that value is taken from each file and the rows are printed side by side.

Desired output must have 60 columns.

File 1
head HGWAS1/merged_info_CHR1.info

snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10235:T:TA 10235 T TA 0.001 0.157 0.998 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.436 0.435 0.646 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.095 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.095 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.001 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.000 0.017 1.000 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.211 1.000 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.155 1.000 0 -1 -1 -1

File 2
head HGWAS2/merged_info_CHR1.info

snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.414 0.473 0.670 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.001 0.141 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.427 0.488 0.673 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.045 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.045 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.020 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.426 0.999 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.003 1.000 0 -1 -1 -1

File 3
head HGWAS3/merged_info_CHR1.info

snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.434 0.522 0.691 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.000 0.122 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.421 0.526 0.693 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.132 0.999 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.132 0.999 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.294 0.999 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.001 1.000 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.081 1.000 0 -1 -1 -1

File 4
head HGWAS4/merged_info_CHR1.info

snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.418 0.539 0.700 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.001 0.180 0.998 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.406 0.528 0.695 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.063 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.063 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.015 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.079 0.999 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.022 1.000 0 -1 -1 -1
--- 1:10579:C:A 10579 C A 0.000 0.007 1.000 0 -1 -1 -1

File 5
head HGWAS6/merged_info_CHR1.info

snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0
--- 1:10177:A:AC 10177 A AC 0.406 0.512 0.695 0 -1 -1 -1
--- 1:10235:T:TA 10235 T TA 0.000 0.115 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.407 0.522 0.695 0 -1 -1 -1
--- 1:10505:A:T 10505 A T 0.000 0.029 1.000 0 -1 -1 -1
--- 1:10506:C:G 10506 C G 0.000 0.029 1.000 0 -1 -1 -1
--- 1:10511:G:A 10511 G A 0.000 0.759 1.000 0 -1 -1 -1
--- 1:10539:C:A 10539 C A 0.001 0.205 0.999 0 -1 -1 -1
--- 1:10542:C:T 10542 C T 0.000 0.012 1.000 0 -1 -1 -1

Desired output: the order of the columns from each file is important.

snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 snp_id rs_id position a0 a1 exp_freq_a1 info certainty type info_type0 concord_type0 r2_type0 
--- 1:10235:T:TA 10235 T TA 0.001 0.157 0.998 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.001 0.141 0.999 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.000 0.122 0.999 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.001 0.180 0.998 0 -1 -1 -1 --- 1:10235:T:TA 10235 T TA 0.000 0.115 0.999 0 -1 -1 -1
--- rs145072688:10352:T:TA 10352 T TA 0.436 0.435 0.646 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.427 0.488 0.673 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.421 0.526 0.693 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.406 0.528 0.695 0 -1 -1 -1 --- rs145072688:10352:T:TA 10352 T TA 0.407 0.522 0.695 0 -1 -1 -1
--- 1:10505:A:T 10505 A T ... (similarly)
--- 1:10506:C:G 10506 C G ... (similarly)
--- 1:10539:C:A 10539 C A ... (similarly)

What I've done:
I used join, which leaves me with 24 columns whose column two holds the values present in all the files. Next I grep each of those values in all the files, and that is where I lose out: I am failing to put the grepped output into columnar format. Apparently not a good way to approach this problem :frowning:

You should post your full script and show your work.

Thanks.

Something along this line:

awk '
BEGIN           {TF = ARGC
                }

FNR == 1        {FC++
                 if (FC < TF) ARGV[ARGC++] = FILENAME
                }

FC < TF         {CNT[$2]++
                 next
                }

CNT[$2] == 5    {if (!LINE[$2]) SEQ[++SN] = $2
                 LINE[$2] = LINE[$2] $0 " "
                }

END             {for (s=1; s<=SN; s++) print LINE[SEQ[s]]
                }
'  HGWAS?/merged_info_CHR1.info
for i in {1..22}
do
    #--iterate over chromosomes
    SAVEtemp=""
    files_info="$(find $input_dir -name "*_CHR$i.info"  | sort )"
    files_list=""
    
    #---split by new lines and make it array---
    SAVEIFS=$IFS
    IFS=$'\n'
    files_info=($files_info)
    IFS=$SAVEIFS 
    
    join -j 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,2.10,2.11,2.12  ${files_info[0]}  ${files_info[1]}  > $output_dir/"tempCHR_"$i".info" 

    SAVEtemp=$output_dir/"tempCHR_"$i".info"
    printf "$i joined for first two files\n"

    for (( x=2;x<${#files_info[@]};x++ ))
    do
	join -j 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,2.10,2.11,2.12 $SAVEtemp  ${files_info[$x]}  > $output_dir/"tempchr"$i"_"$x".info" 

	SAVEtemp=$output_dir/"tempchr"$i"_"$x".info"
    done
    mv $SAVEtemp $output_dir/"joined_CHR""$i"".info"
    SAVEtemp=$output_dir/"joined_CHR""$i"".info"
    printf "CHR $i is done for joining\n"
    
    for w in ` awk '{print $2}' $SAVEtemp | grep -v "rs_id" `
    do
	st="" #start null string to concatenate
	
	for (( x=0;x<${#files_info[@]};x++ ))
	do
	    #--loop through files to grep the string
	    
	    temp_st=$(grep -w $w ${files_info[$x]}) 
	    st=$st" "$temp_st

	done
	echo "$st" >> $output_dir/"cols_joined_CHR"$i".info" 
    done

    printf "Proceseed files for $i chromosome!\n"
done

I left it running last evening and the script still has not finished chromosome 1. :frowning: Terrible.

Here is an approach in gawk:-

gawk '
        FNR > 1 {                                       # skip each file's header line
                R[$2]                                   # register this rs_id
                V[ARGIND FS $2] = $0                    # store the line, keyed by file index and rs_id
        }
        END {
                for ( k in R )                          # for every rs_id seen anywhere
                {
                        f = 0
                        for ( i = 1; i <= ARGIND; i++ ) # check it was stored for every file
                        {
                                if ( ! ( ( i FS k ) in V ) )
                                        f = 1
                        }
                        if ( f == 0 )                   # present in all files: print the stored
                        {                               # lines side by side, in file order
                                for ( i = 1; i <= ARGIND; i++ )
                                        printf "%s ", V[i FS k]
                                printf "\n"
                        }
                }
        }
' HGWAS{1..5}/merged_info_CHR1.info

Do you mind explaining what it does and how?

See also post#3 which I had to hide until you showed your work as requested by Neo.

@RudiC:
Yup, I saw that just now. I can't get my head around it. TF and FC have me absolutely stumped. Please explain.

awk '
BEGIN           {TF = ARGC - 1                                  # save initial total file count (minus 1 for $0) 
                }

FNR == 1        {FC++                                           # count files opened
                 if (FC <= TF) ARGV[ARGC++] = FILENAME          # if still in the original files 
                }                                               # list, append file name to argument list  
                                                                # so every file will be read twice
FC <= TF        {CNT[$2]++                                      # if still in org. files, count $2 occurrences
                 next                                           # and don''t proceed in script
                }

CNT[$2] == TF   {if (!LINE[$2]) SEQ[++SN] = $2                  # only for $2 that occurred file count times, increment
                                                                # sequence counter if 1. occurrence of $2 in 2. round
                 LINE[$2] = LINE[$2] $0 " "                     # collect lines of every file into a string array
                }

END             {for (s=1; s<=SN; s++) print LINE[SEQ[s]]    # print all collected (5 fold) lines in correct order
                }
'  HGWAS?/merged_info_CHR1.info

I modified this a bit to account for varying file argument counts. Please be aware that one of your input files has DOS <CR> (\r, ^M 0x0D) line terminators, distorting the screen output / result.
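
If that is the case, the carriage returns could be stripped up front, e.g. with something like this (a sketch; which file is affected is an assumption here, so adjust the name):

tr -d '\r' < HGWAS2/merged_info_CHR1.info > clean.info && mv clean.info HGWAS2/merged_info_CHR1.info   # assumed file name
# alternatively, add   { sub(/\r$/, "") }   as the first action inside the awk script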

Sorry RudiC, I still can't understand it.

FNR == 1

How does this help?

If the number of occurrences of $2 equals the file count:

CNT[$2] == TF   {if (!LINE[$2]) SEQ[++SN] = $2                  # only for $2 that occurred file count times, increment
                                                                # sequence counter if 1. occurrence of $2 in 2. round
                 LINE[$2] = LINE[$2] $0 " "                     # collect lines of every file into a string array
                }

if (!LINE[$2]) SEQ[++SN] = $2

What is this line doing?
LINE, SEQ, SN :frowning:

Edited: added occurrence

In awk, input lines are counted in the (internal) NR variable across ALL files, while FNR does the same but is reset for every new file. So FNR==1 indicates the beginning of a new file, and the (user-defined) file count FC is incremented.
I highly recommend doing some reading on awk, e.g. man awk. There, all system variables are listed, and you can tell them from user variables like TF and SEQ that are declared and allocated when needed.
The square brackets [...] enclose array indices, so LINE and SEQ are arrays, holding the output lines and the order in which to print them. SN is a scalar serial number.
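
A quick sketch to illustrate (file1 and file2 are placeholders for two of your .info files):

awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' file1 file2
# NR keeps counting across both files, while FNR restarts at 1 in file2,
# so the pattern FNR == 1 is true exactly once per input file.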

Well, not one of my magic moments. Try instead

awk '
BEGIN           {TF = ARGC - 1                                  # total number of input files
                }

                {if (!LINE[$2]) SEQ[++SN] = $2                  # remember the order in which $2 values first appear
                 LINE[$2] = LINE[$2] $0 " "                     # append the current line to that $2's output string
                 CNT[$2]++                                      # count how often $2 was seen
                }
END             {for (s=1; s<=SN; s++) if (CNT[SEQ[s]] == TF) print LINE[SEQ[s]]   # print only $2 values seen in every file
                }
'  HGWAS?/merged_info_CHR1.info

Only if you run out of memory with files that are too large or too many might you want to fall back to the post#9 version, which reads the files twice but saves some memory.

Hi RudiC,

Sorry for my delay in writing back.
I am still going through your code and am stuck with it.
I'm copying a few lines from post#9:

awk '
BEGIN           {TF = ARGC - 1                                  # save initial total file count (minus 1 for $0) 
                }
#------

FNR == 1 
#-------
FC <= TF
        
#---------
CNT[$2] == TF 

I'm surprised you haven't used an if condition for these lines. What I mean is:


awk '
BEGIN           {TF = ARGC - 1                                  # save initial total file count (minus 1 for $0) 
                }

#-----
if (FNR == 1 )

#----
if (FC <= TF        )

#-----
if (CNT[$2] == TF )

The code doesn't work if I use if with these checks either. My usual mindset would be to put an if around these conditions to confirm them and then proceed. I'm slow at catching up on awk. Sorry. :frowning:

The parts that are difficult for me to get hold of are:

1- The files are read twice. I cannot even see where in the code that is taken care of.
2- The variables SN and TF: they seem to act like built-in variables, present all the way through the code, yet they don't need to be initialized before use as in other languages (C, etc.).

A review of awk basics might help; see man awk:

An awk program is a sequence of pattern { action } pairs, where the pattern is any expression; FNR == 1 is such an expression, and if its evaluation yields TRUE, the action following it is executed. So NO if is needed in this awk construct (although the if statement is available for flow control inside the action parts).
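
For illustration (a minimal sketch; file1 and file2 are placeholders), these two commands do the same thing:

awk 'FNR == 1 { print "start of", FILENAME }' file1 file2
awk '{ if (FNR == 1) print "start of", FILENAME }' file1 file2
# The pattern form is the idiomatic awk way; the explicit if inside an
# unconditional action is equivalent.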

1- Reading the files twice has been eliminated in post#12. Nevertheless, the trick was to append the current file name to the awk script's argument list, as described in post#9's comments.
2- Yes, awk variables are created / initialized the first time they are referenced.
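
A tiny sketch for point 2, showing scalars and arrays springing into existence on first reference:

awk 'BEGIN {CNT["a"]++                            # array element created as 0, then incremented to 1
            print CNT["a"], X + 0, "<" S ">"      # X defaults to 0, S to the empty string
           }'
# prints: 1 0 <>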

@RudiC:
Thank you for your reply. I read the awk man page.

Apologies for repeating myself in this thread.
I know the code reads the files twice, but I'm wondering which lines manage that. :frowning:

Copying code from #9:

FNR == 1        {FC++                                           # count files opened
                 if (FC <= TF) ARGV[ARGC++] = FILENAME          # if still in the original files 
                }                                               # list, append file name to argument list  
                                                                # so every file will be read twice
FC <= TF        {CNT[$2]++                                      # if still in org. files, count $2 occurrences
                 next                                           # and don''t proceed in script
                }

Is it next that allows the files to be read once and then read again for the following code? I know next skips all the code below it in the script.

Or does the following line take care of reading the files twice?

           if (FC <= TF) ARGV[ARGC++] = FILENAME

next skips all remaining code for the current line, reads the next line and starts over from the top of the script.

Appending a string to the ARGV array will make awk interpret it as a file name and try to open and read it, so appending FILENAME will read that file another time (after all other ARGV elements have been depleted).
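
Putting both pieces together, here is a stripped-down sketch of the post#9 two-pass logic (file1 and file2 stand for the real input files; for brevity it just prints matching lines instead of pasting them side by side):

awk '
BEGIN    {TF = ARGC - 1}                             # number of files given on the command line
FNR == 1 {if (++FC <= TF) ARGV[ARGC++] = FILENAME}   # during pass 1, queue each file for a second read
FC <= TF {CNT[$2]++; next}                           # pass 1: only count $2; next stops processing here
CNT[$2] == TF                                        # pass 2: print lines whose $2 occurred in every file
' file1 file2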