Finding all common lines across 'n' files

Hi,

I have one situation. I have some 6-7 files in one directory, and I have to extract all the lines which exist in every one of these files. That is, I need to extract all common lines from these files and put them in a separate file.

Please help. I know it could be done with the cut, sort, and uniq commands, but that takes more time every time the script is executed. I am looking for a quicker method.

I am using ksh shell.

cat * | sort > /tmp/alllines
uniq -d /tmp/alllines > /tmp/repeatedlines

-mschwage
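For what it's worth, uniq -d reports any line that occurs at least twice in the combined input, even if every occurrence comes from a single file, so by itself it doesn't answer the "present in every file" question. A throwaway illustration (file names invented for the demo):

```shell
# Throwaway demo files; not from the thread.
printf '%s\n' a x x > f_a
printf '%s\n' a b   > f_b
# 'x' occurs twice in f_a alone, yet uniq -d still reports it,
# even though it never appears in f_b. Prints: a, x
cat f_a f_b | sort | uniq -d
rm -f f_a f_b
```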

Sorry Sir,
My requirement is that I want the lines which are present in all of these 'n' files. That is, every line in the output must be present in each of the 'n' files.

Please guide.

> comm_data.txt
for fname in 1.txt 2.txt 3.txt 4.txt
do
  if [ ! -s comm_data.txt ]
  then
    # First file: its sorted contents seed the running intersection.
    sort "$fname" > comm_data.txt
  else
    # comm -12 keeps only the lines common to both sorted inputs.
    sort "$fname" > tmp2
    comm -12 comm_data.txt tmp2 > tmp3
    mv tmp3 comm_data.txt
  fi
done
rm -f tmp2
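As a quick sanity check of the comm -12 step used in the loop above (file names invented for the demo): -1 suppresses lines unique to the first sorted input, -2 those unique to the second, leaving only the intersection.

```shell
printf '%s\n' a b c > f_a
printf '%s\n' b c d > f_b
# Both inputs must already be sorted; only the common lines
# survive the -12 suppression. Prints: b, c
comm -12 f_a f_b
rm -f f_a f_b
```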

If the contents of your files are already sorted, you can use the following:

> comm_data.txt
for fname in 1.txt 2.txt 3.txt 4.txt
do
  if [ ! -s comm_data.txt ]
  then
    cp "$fname" comm_data.txt
  else
    comm -12 comm_data.txt "$fname" > tmp3
    mv tmp3 comm_data.txt
  fi
done
cat comm_data.txt

If you have three files f1 f2 f3:

cat f1 f2 f3 | awk '{a[$0]++} END{for (i in a) if (a[i]==3) print i}'

Thanks ripat, it's working perfectly fine. But I have a small concern.
Actually, I won't know in advance how many files will get generated each time the script runs. So I will be storing the number of files in a variable, and in the code -

cat f1 f2 f3 | awk '{a[$0]++} END{for (i in a) if (a[i]==3) print i}'

I think I have to use the value of that variable in place of 3 here. I tried replacing 3 with the variable, but it does not seem to work. I even attempted the 'awk -v' option, but in vain.

Can you please help me out with this quickly? The requirement is quite urgent.

Thanks in advance for your help.

ls -1 | wc -l | read filecnt
awk -v filecnt="$filecnt" '{a[$0]++} END{for (i in a) if (a[i]==filecnt) print i}' *
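For reference, here is the same idea run end to end on two invented sample files. It needs an awk that supports -v (on Solaris that means nawk or /usr/xpg4/bin/awk, as noted later in the thread):

```shell
printf '%s\n' a b > t1
printf '%s\n' a c > t2
filecnt=2    # normally computed, e.g. with: ls -1 | wc -l
# A line is common to all files iff its total count equals
# the number of files (assuming no intra-file duplicates).
# Prints: a
awk -v filecnt="$filecnt" '
  { a[$0]++ }
  END { for (i in a) if (a[i] == filecnt) print i }
' t1 t2
rm -f t1 t2
```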

ls -1 | wc -l | read count

awk -v count=$count '{a[$0]++} END{for (i in a) if (a[i]==count) print i}'
awk: syntax error near line 1
awk: bailing out near line 1

I am getting this error and am not sure why.

Please help urgently.

awk 'END { for (r in _) if (_[r] == ARGC - 1) print r }
{ _[$0]++ }' filename1 [filename2 .. ]

Use nawk or /usr/xpg4/bin/awk on Solaris!

If you have GNU Awk, you could use nextfile for efficiency.

Of course, this will fail if the same record appears more than once in the same file.

Thank you very much.

Using nawk or /usr/xpg4/bin/awk works perfectly fine without making any change in the original code.

Thanks a lot again.

Hi.

I ran both awk solutions and they seemed to work. There is one aspect that may be troubling. If the files contain no duplicates, then all is well. However, here is an example where the trouble might occur. I am using radoulov's code since it is a bit shorter:

#!/usr/bin/env sh

# @(#) user2    Demonstrate finding lines in common.

#  ____
# /
# |   Infrastructure BEGIN

set -o nounset
echo

## The shebang using "env" line is designed for portability. For
#  higher security, use:
#
#  #!/bin/sh -

## Use local command version for the commands in this demonstration.

set +o nounset
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version =o $(_eat $0 $1) awk
set -o nounset

for file in f*
do
  echo
  echo " -- $file --"
  cat -n $file
done

# Use nawk or /usr/xpg4/bin/awk on Solaris.

# |   Infrastructure END
# \
#  ---

echo
echo " Results from awk:"
filecnt=$( ls -1 f* | wc -l )

awk '
END     { for (r in _)
                if (_[r] == ARGC - 1)
                        print r
        }
        { _[$0]++ }
' f*

exit 0

Producing:

% ./user2

(Versions displayed with local utility "version")
Linux 2.6.11-x1
GNU bash, version 2.05b.0(1)-release (i386-pc-linux-gnu)
GNU Awk 3.1.4

 -- f1 --
     1  a
     2  b
     3  x
     4  x

 -- f2 --
     1  a
     2  d
     3  y
     4  y

 Results from awk:
x
y
a

Note that "x" and "y" are not common to the files; only "a" is. In cases like this, more work would be necessary to ensure that a line is common to all files, and not simply replicated the appropriate number of times in total among some of the files ... cheers, drl

Yes,
it seems quite easy to fix:

awk 'END { 
  for (r in __) 
    if (__[r] == ARGC - 1) 
      print r 
  }
# Count each distinct line at most once per file.
!_[FILENAME,$0]++ { 
  __[$0]++ 
  }' filename1 [filename2 .. filenamen]