awk common between files

genome · January 16, 2018, 2:35pm

Hello there:

I want to find the entries that are common to several files. Each file has a single column.

Format for data:

CEU_snp_CHR21.txt

21:10758305
21:10827533
21:10913441
21:10920098
21:10952160
21:10966322
21:10985991

NAT_CHR21_variants.txt

21:10971951
21:14601415
21:14640400
21:14687571
21:14768343
21:14771811

variants_YRI_CHR21.txt

21:10758305
21:10827533
21:10913441
21:10920098
21:10952160
21:10966322

21_common_batches

21:14449943
21:14586958
21:14600044
21:14600045
21:14603751

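Code:

awk 'BEGIN { TF = ARGC - 1 }          # TF = number of input files
{
    if (!($1 in LINE)) {              # first time this value is seen
        SEQ[++SN] = $1                # remember the first-seen order
        LINE[$1] = $1
    }
    CNT[$1]++                         # count every occurrence of the value
}
END {
    for (s = 1; s <= SN; s++) {
        if (CNT[SEQ[s]] == TF) {      # seen as many times as there are files
            print LINE[SEQ[s]]
        }
    }
}' CEU_snp_CHR21.txt NAT_CHR21_variants.txt variants_YRI_CHR21.txt 21_common_batches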

This code does not give me correct or convincing output, and I can't figure out why.
Does the order of the files on the command line matter for this awk code?

I took this code from one of my earlier posts.

RudiC · January 16, 2018, 3:04pm

So what exactly do you want for output? There's not a single entry that is common for all four files. Max is two...
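One way to check this on the full files is to count, for each entry, how many of the files it occurs in. The following is only a minimal sketch, assuming one entry per line in field 1; the function name files_per_entry is made up for illustration.

```shell
# Hypothetical helper: for each entry, count the number of distinct
# input files it occurs in, and list the most widely shared entries first.
files_per_entry() {
    awk '
        # Skip empty lines; count each entry at most once per file.
        NF && !seen[FILENAME, $1]++ { nfiles[$1]++ }
        END { for (k in nfiles) print nfiles[k], k }
    ' "$@" | sort -k1,1nr -k2,2
}
```

Called with the four file names from post #1, the first output line shows the highest file count reached by any entry. An entry is common to all files only if that count equals the number of files.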

genome · January 16, 2018, 3:20pm

Code has a bug somewhere. Pasting 20,000 wouldn't be possible.

FNR == 1

Should I be using this?

Don_Cragun · January 16, 2018, 7:13pm

Posting 20,000 what wouldn't be possible?

We have no idea. You usually use the condition FNR == 1 to cause the associated action to be executed on the first line read from each input file. Is there any reason why you need to care about which file contained an input record?
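As a small illustration of what FNR == 1 does (the function name, file names and values below are made up for the demonstration and are not taken from your data):

```shell
# Hypothetical helper: report the first record of every file given as argument.
first_records() {
    # FNR restarts at 1 for each input file; NR keeps counting across files.
    awk 'FNR == 1 { print FILENAME ": " $1 " (NR=" NR ")" }' "$@"
}

# Demonstration with two throwaway files holding made-up values.
tmp=$(mktemp -d)
printf '21:100\n21:200\n' > "$tmp/a.txt"
printf '21:300\n21:400\n' > "$tmp/b.txt"
first_records "$tmp/a.txt" "$tmp/b.txt"
```

The action runs once per file, on that file's first line, which is why it is mostly useful when the script needs to know where one file ends and the next begins.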

Start by telling us what you are trying to accomplish. Then tell us what is wrong with the output being produced by the code you've shown us in post #1. Then, maybe, we can suggest ways to fix your code to get what you want.

At first glance, the code you have shown us seems to be a slightly complicated way of removing duplicate field #1 values from the set of files you give your awk script as input, preserving the order in which the non-duplicated values were first seen. It also uses more memory than is needed to get that job done.
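If order-preserving de-duplication on field #1 really were the goal, a shorter sketch that keeps a single array would look like this (the function name unique_first_field is made up for illustration):

```shell
# Hypothetical helper: print each distinct field #1 value once,
# in the order in which it was first seen across all input files.
unique_first_field() {
    # seen[$1]++ is 0 (false) only the first time a value appears.
    awk '!seen[$1]++ { print $1 }' "$@"
}
```

This is only meant to show the idea; it does not look at how many files a value came from.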

How do you know that the output you have received is not correct? What would make the output convincing?