Find common lines between one file and all of the files in another folder

Hi! I would like to

comm -12

with one file and all of the files in another folder that holds 100 files or more (that file is not in that folder) to find common text lines. Whenever two files have common lines, I would like those lines written to a separate output file, and the name of each output file should be the two file names that had common lines joined by a dash sign - for instance
filetobecompared-filethathadacommonline

Sincerely grateful if anyone can help!
I don't have Python; could it be done with awk or anything else that works?

Welcome to the forum.

Untested (due to lack of samples):

awk 'NR == FNR {CMP[$0]; next} $0 in CMP {print >> (TMPFN=ARGV[1] "-" FILENAME); close (TMPFN)} ' onefile anotherdir/*

You can drop the close() if the file count is less than the system parameter OPEN_MAX.
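
For instance, if anotherdir holds fewer files than OPEN_MAX, this equally untested variant without the close() should behave the same:

awk 'NR == FNR {CMP[$0]; next} $0 in CMP {print >> (ARGV[1] "-" FILENAME)}' onefile anotherdir/*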

Hi! Thank you for your help! All of the output lines are correct, but it puts all of them into one file. I hoped that each pair of files with common lines would get its own output file, named after the two file names joined by a dash sign.

for i in /notinthatfolder/*.* {
 comm -12 filetobecompared $i >filetobecompared-$i
 }

This can't possibly work... Your output filenames (not pathnames) contain at least two <slash> characters. And that still assumes that filetobecompared doesn't contain any <slash> characters, which isn't clear from the above code or from the specification of the problem in the first post in this thread.

Furthermore, comm is only specified to work if both input files being processed "are ordered in the current collating sequence" (i.e., in sorted order). And, despite the first post in this thread specifying comm, there is no indication that the input files being processed meet this requirement.
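
If your inputs are not already sorted, one way to meet that requirement for a single pair of files (assuming a shell that supports process substitution, such as ksh93 or bash; the file names here are just illustrations) would be:

comm -12 <(sort filetobecompared) <(sort anotherdir/somefile)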

The awk code RudiC suggested doesn't care about the input files being sorted, but it does have a problem with <slash> characters in the output file names. But, I'm confused by Eve's statement saying that RudiC's code puts all output in a single file. It doesn't; it creates an output file pathname for each output line based on the two input file pathnames. It should do exactly what was requested in post #1 if all of the files being compared are in the current working directory and no <slash> characters appear in any of the filename operands passed to his awk script.
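
An untested, minimal adjustment to that one-liner that strips the directory part from the names of the files in the second set (it still assumes the first operand contains no <slash> characters) might look something like:

awk 'NR == FNR {CMP[$0]; next}
FNR == 1 {TMPFN = FILENAME; sub(".*/", "", TMPFN); TMPFN = ARGV[1] "-" TMPFN}
$0 in CMP {print >> TMPFN; close(TMPFN)}' onefile anotherdir/*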

If you want a tested script that can create report files in a specified directory, where each report file contains lines that are common between one file (specified to be in any directory) and one or more other files (each specified to be in any directory), you could try something more like the following:

#!/bin/ksh
usage() {
	printf "$Usage\n" "$IAm" >&2
}

# Define script variables.
IAm=${0##*/}
Usage="Usage: %s output_directory initial_file file_to_compare..."

# Verify that we have at least 3 arguments.
if [ $# -lt 3 ]
then	printf '%s: Not enough operands\n' "$IAm" >&2
	usage
	exit 1
fi

# Create output directory if it doesn't already exist and error out if we can't
# create it.
REPORT_DIR=$1
if ! mkdir -p "$REPORT_DIR"
then	usage
	exit 2
fi
printf '%s: Reports will be created in the directory "%s"\n' \
    "$IAm" "$REPORT_DIR"

# Shift off the output directory operand and invoke awk with the remaining
# arguments.
shift
awk -v destdir="$REPORT_DIR" '
FNR == 1 {
	# Create output pathname in destdir based on basename of input filenames
	if(NR == 1) {
		# Set the 1st part of the output pathname based on destdir and
		# the basename of the first input pathname.
		path1 = FILENAME
		sub(".*/", "", path1)
		path1 = destdir "/" path1 "-"
	} else {# Set the new output pathname that will be used if any
		# differences are found in the current input file based on
		# path1 and the basename of the current input pathname.
		new = FILENAME
		sub(".*/", "", new)
		new = path1 new
	}
}
NR == FNR {
	# Grab contents of the 1st input file.
	CMP[$0]
	next
}
$0 in CMP {
	# A line in the current file matched a line in the first file...
	if(last != new) {
		if(last)
			close(last)	# close the previous output file
		# Set the name of the new output file.
		last = new
	}
	# Print this duplicated line into the current output file.
	print >> last
}' "$@"

This was written and tested using a Korn shell and also tested with bash. It should work with any shell that uses Bourne shell syntax and performs the parameter expansions required by the POSIX standards.
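
For example, assuming you save the script as commonlines and make it executable (the script name and the output directory reports are just placeholders), an invocation matching the request in post #1 might be:

chmod +x commonlines
./commonlines reports filetobecompared anotherdir/*

which should create a file named reports/filetobecompared-name for each file in anotherdir that has at least one line in common with filetobecompared.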

Note, however, that the output filenames use the basenames of the two input files found to have common lines. If files from different directories are processed in a single run and some of those files have identical basenames, it would be easy to also add a line in each output file naming the input file (or both input files) from which the following lines were copied; that wasn't done here because it wasn't requested.
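
If that were wanted, the $0 in CMP section above could be modified along the following (untested) lines so that the first line written to each report names both input pathnames:

$0 in CMP {
	# A line in the current file matched a line in the first file...
	if(last != new) {
		if(last)
			close(last)	# close the previous output file
		# Set the name of the new output file and start it with a
		# header line naming both input files.
		last = new
		print "# common lines: " ARGV[1] " & " FILENAME >> last
	}
	# Print this duplicated line into the current output file.
	print >> last
}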

If you want to try this on a Solaris/SunOS system, change awk in the script to /usr/xpg4/bin/awk or nawk.

Sorry that I didn't reply sooner. All of the files were previously sorted. In this case it was eventually acceptable for me that the outputs were all in one file; I could continue like that and didn't want to bother you any longer. I didn't get the very long code that Don Cragun offered working - my C Shell window closed by itself every time I tried to use it. And since I can also work with the case where all of the outputs are in one file, you don't need to find a solution for that any longer. I'm grateful to you and you were very helpful!

This is why it is crucial that you always start a thread in this forum by explaining the environment you're using. You might note that I explicitly stated the requirements for running the code I suggested:

The csh shell does not use Bourne shell syntax and does not perform any of the parameter expansions required by the POSIX standards. Therefore, you should have expected that it might not work unless you used a shell that met the requirements I specified. But, if you had stored my suggestion in a file and used csh to run that file, it should have given you syntax errors running that script; it should not have closed your window.
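
For example, even from a csh prompt, explicitly invoking a shell that meets those requirements on the saved script (again using the placeholder name commonlines) should have worked:

ksh commonlines reports filetobecompared anotherdir/*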

I am sorry that I wasted your time by trying to help you with a script that should have worked perfectly for you if you had saved it into a file and then run it with ksh or bash instead of csh.