I have a folder with several files of which I want to eliminate all of the terms that they have in common using `awk`.
Here is the script that I have been using:
---------- Post updated at 17:14 ---------- Previous update was at 17:13 ----------
I'm afraid it will blow up your input file list...
---------- Post updated at 18:33 ---------- Previous update was at 17:14 ----------
OK, I've got it now. It appends every file name exactly once to the file list, so you can work on the file list again once the total number of duplicate words has been found.
Is the given algorithm correct?
If only the unique words per file should be printed, shouldn't it be
awk '
FNR==1 {
  # new input file: close the previous one to avoid running out of file descriptors
  if (NR!=1) close(fname)
  fname = FILENAME
}
{
  # count each word (field 2) globally and per file
  total[$2]++
  perfile[fname,$2]++
}
END {
  for (fw in perfile) {
    split(fw, idx, SUBSEP)
    f = idx[1]; w = idx[2]
    # a word is unique to file f if all of its occurrences are in f
    if (perfile[fw] == total[w]) print f, w
  }
}
' *
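For illustration, here is a minimal run of that script on two made-up frequency files. The /tmp paths and the sample words are my own assumptions, not from the thread; the input format ("count word" per line, as produced by e.g. `sort | uniq -c`) is assumed so that the word really sits in field $2:

```shell
# Hypothetical sample data: "count word" lines, the word in field $2
printf '3 apple\n1 pear\n' > /tmp/uniq_a.txt
printf '2 apple\n4 plum\n' > /tmp/uniq_b.txt

awk '
FNR==1 { if (NR!=1) close(fname); fname = FILENAME }
{ total[$2]++; perfile[fname,$2]++ }
END {
  for (fw in perfile) {
    split(fw, idx, SUBSEP)
    f = idx[1]; w = idx[2]
    # print only words whose every occurrence is in a single file
    if (perfile[fw] == total[w]) print f, w
  }
}
' /tmp/uniq_a.txt /tmp/uniq_b.txt | sort
# "apple" occurs in both files, so only the file-specific words print:
#   /tmp/uniq_a.txt pear
#   /tmp/uniq_b.txt plum
```

The `| sort` is only there because the order of `for (fw in perfile)` is unspecified in awk.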
The solution to the problem is the first block; in the following block, simply replace every FILENAME with fname.