Searching for similar row(s) across multiple files

Hello Esteemed Members,

I need to write a script to find files that share one or more identical rows.
Please note that I am not searching for any specific pattern. The rows themselves can be anything; I just need to find records that appear in two or more files.
There are around 5000 files that I need to search through.

The files are scattered under the same parent directory but in different sub-directories.

$/abc/xyz/ap/*.prm
$/abc/xyz/dd/*.prm
$/abc/xyz/rt/*.prm

That is, the path up to xyz is the same, and I need to check all the .prm files in the sub-folders under ap, dd, rt and so on.

My basic criterion is to find all the .prm files that contain exactly the same rows.

As in the example below, two .prm files sit under different sub-directories, but they both contain the same row:

none >> host_buf.tpl      >> $CCWSCA_sbcdl/shr/dl_fnaccv_host_buf.raw
prompt>~/sbc_generated/v760 [49]> grep dl_fnaccv_host_buf */*.prm

cdw/fn_account_ca_view.prm:none >> host_buf.tpl      >> $CCWSCA_sbcdl/shr/dl_fnaccv_host_buf.raw

fn/fn_account_audit_view.prm:none >> host_buf.tpl    >> $CCWSCA_sbcdl/shr/dl_fnaccv_host_buf.raw

How do I find all such files?

I searched for answers to this kind of problem and came up with the following:

filecnt=$( find /sbc_generated/v77_0/*/*.prm -type f )
  awk -v 5000=$filecnt  ' 
          {arr[$0]++; next} 
          END{for (i in arr) { 
            if(arr==n) {
                     print arr
            }
          } '  $( find /sbc_generated/v77_0/*/*.prm -type f) > common_rec

but it is giving me the following error

exec(2): Could not load a.out due to swap reservation failure
or due to insufficient user stack size

Please help me rectify the above error and put together a working script in Korn shell (ksh).

Thank you!

Hi.

What are you expecting for the output format? ... cheers, drl

Hello drl,

Basically, I need to compare a large number of files and find all the records that exist in more than one file.

Is it possible to do this using sort, uniq and awk, or do I need to write a script for it? Any help would be highly appreciated.

Hi

With something like 5000 files, a construct like

filecnt=$( find /sbc_generated/v77_0/*/*.prm -type f )

can crash, because the resulting list of file names is huge.

Yes, I understand, Chirel. So what is the workaround?

What if I append the contents of all the files into one file, sort the records, and remove the unique records from the result?
Then I would need to trace every row that has multiple instances back to the original files it came from.

Will this work, and if yes, how do I write it?
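The closest I could come up with on my own is this rough sketch (the /sbc_generated/v77_0 path and the output file names are just placeholders carried over from my earlier attempt), but I am not sure it is the right way:

# collect every row, keep one copy of each row that occurs more than once
find /sbc_generated/v77_0 -name '*.prm' -type f -exec cat {} + |
    sort | uniq -d > dup_rows

# for every duplicated row, list the files that contain it
while IFS= read -r row; do
    find /sbc_generated/v77_0 -name '*.prm' -type f -exec grep -F -l -e "$row" {} +
done < dup_rows > files_with_common_rows

It would also re-scan all the files once per duplicated row, which I suspect will be painfully slow on 5000 files.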

That's the idea I'm trying to implement:

Prefix each line of every file with "filename : ", then sort from field 3 to the end,
then run uniq -d on the result, ignoring the first x characters at the start of each line.
The output will be something like

filename1 : pattern
filename2 : pattern
etc

But I'm having an issue with the uniq -d part.

Actually, I have

find . -name '*.prm' -exec  awk '{printf("%-50.50s : %s\n",FILENAME,$0);}' {} \;|sort -k3|uniq -s53 -D

Edit: well, no issue after all, it's working like a charm. :)


Thanks for helping me out, Chirel. Can you please explain what the awk part does?

awk '{printf("%-50.50s : %s\n",FILENAME,$0);}'

This will print each line of the file ($0), prefixed with the file name left-justified (and truncated, if needed) in a field of exactly 50 characters.
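For example, with a made-up input line and one of the file names from earlier in the thread:

echo 'none >> host_buf.tpl' |
    awk '{printf("%-50.50s : %s\n", "cdw/fn_account_ca_view.prm", $0)}'

The file name is padded out to 50 characters, then comes " : ", so the original line always starts at character 54. That is why the pipeline can use uniq -s53 (skip the first 53 characters when comparing) and sort -k3 (sort on the line content rather than the file name).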
