Sdiff doesn't try and compare to closest match

jamilpasha · February 27, 2017, 7:25am

In the example below I would want the extensions to match.

Is there any other utility or script to achieve this? Kindly help.

Example:

sdiff sourceFileNames targetFileNames
17021701P.blf | 17021901P.ibk
17021701P.chn | 17021901P.irk
17021701P.bmr | 17021901P.dyd
17021701P.dpf | 17021901P.blf
17021701P.dpi | 17021901P.blr
17021701P.drk | 17021901P.bmr
17021701P.gcd | 17021901P.dpf
17021701P.gcm | 17021901P.dpi
17021701P.gcp | 17021901P.drk
17021701P.gcr | 17021901P.gcd
17021701P.idx | 17021901P.stb
17021701P.ltm | 17021901P.stf
17021701P.mfd | 17021901P.tna
17021701P.ipf | 17021901P.gcm
17021701P.mgr | 17021901P.gcp
17021701P.mrl | 17021901P.gcr
17021701P.stb | 17021901P.idx
17021701P.stf | 17021901P.ltm
17021701P.tna | 17021901P.mfd
17021701P.ibk | 17021901P.ipf
16021701P.irk | 17021901P.mgr
17021701P.dyd | 17021901P.mrl

Corona688 · February 27, 2017, 10:01am

There is no 'strip file extensions then compare' utility that I know of. You may have to make the data match for a diff utility to count it as a match.

awk -F"." '{ print $1 > OUT }' OUT="/tmp/file1" sourceFileNames OUT="/tmp/file2" targetFileNames
sdiff /tmp/file1 /tmp/file2
rm -f /tmp/file1 /tmp/file2

RudiC · February 27, 2017, 10:41am

If compared line by line, none of your extensions would match. With a recent shell providing "process substitution" you could try

sdiff <(cut -d. -f2 file1 | sort) <(cut -d. -f2 file2 | sort)
blf                                blf
                                  >    blr
bmr                                bmr
chn                                  <
dpf                                dpf
dpi                                dpi
drk                                drk
dyd                                dyd
.
.
.
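If the full file names are wanted next to each other rather than just the extensions, the same idea can be sketched in awk. This is only a sketch under assumptions: pair_by_ext is a made-up helper name, each list is assumed to hold one file name per line, and each extension is assumed to occur at most once per list.

```shell
# pair_by_ext SRC TGT  (hypothetical helper, not an existing utility)
# Prints "source-name<TAB>target-name", pairing names on the text after the last dot.
pair_by_ext() {
    awk -F. '
        # First file: remember each source name under its extension.
        NR == FNR { src[$NF] = $0; next }
        # Second file: print the pair, or flag the source side as missing.
        {
            if (($NF) in src) { print src[$NF] "\t" $0; delete src[$NF] }
            else              { print "Missing\t" $0 }
        }
        # Whatever is left had no partner in the target list.
        # Note: awk does not guarantee the order of this loop.
        END { for (e in src) print src[e] "\tMissing" }
    ' "$1" "$2"
}
```

Called as pair_by_ext sourceFileNames targetFileNames, it prints the target names in their original order, so no sort is needed. An empty source list is not handled.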

drl · February 27, 2017, 11:07am

Hi.

Did you mean ignore rather than match?

If so, there are utilities that allow mismatches, e.g. agrep and cgrep, which could be useful.

However, if the expected output had been posted, it would probably have answered questions like this. Please do so now and in the future.

Best wishes ... cheers, drl

jamilpasha · February 27, 2017, 11:46am

Thank you for the prompt response guys!! I appreciate all your help!!

I should have been a little clearer about my requirement and the output that I am expecting!!

I am writing a script to compare the files using the sdiff or comp utility. Before doing that, I want the script to be generic: the user updates a config file that holds the file name patterns (e.g. files that start with 20170212 and end with .txt). This would be one set, which I would call the source. Similarly, the config is updated with the start and end pattern for the target (e.g. starts with 20170213 and ends with .txt).

I may get following files in the list-

Source-

20170212abc.txt
20170212xyz.txt
20170212jam.txt

Target-

20170213abc.txt
20150213xyz.txt (2015 intentional)
20170213pas.txt

I do not want to expect users to map these files manually before comparing them. I would want to map them based on nearest match and then run the comparison.

Expected map-

20170212abc.txt --> 20170213abc.txt
20170212xyz.txt --> 20150213xyz.txt
20170212jam.txt --> Missing
Missing --> 20170213pas.txt

Hope I was able to explain my requirement!!

You guys are awesome!! Thanks again!!

RudiC · February 27, 2017, 12:25pm

Please DON'T edit posts (here: post#1) after people have answered, pulling the rug from under their feet! And, seriously, start using code tags!

What is a nearest match in your definition? Do you want to ignore digits and compare the alphabetic part of the file names?
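If the answer is yes, a minimal sketch of that mapping could look like the following. The assumptions are mine, not the poster's: map_by_suffix is a hypothetical name, both lists hold one file name per line, and the "nearest match" key is simply the name with its leading run of digits removed.

```shell
# map_by_suffix SRC TGT  (hypothetical helper)
# Maps source names to target names whose non-digit tail is identical.
map_by_suffix() {
    awk '
        # key(): the file name with its leading run of digits removed.
        function key(name) { sub(/^[0-9]+/, "", name); return name }

        # First file: store source names by key, remembering input order.
        NR == FNR { k = key($0); src[k] = $0; sorder[++n] = k; next }
        # Second file: store target names the same way.
        { k = key($0); tgt[k] = $0; torder[++m] = k }

        END {
            # Every source name, paired with its target or flagged as missing.
            for (i = 1; i <= n; i++) {
                k = sorder[i]
                print src[k] " --> " ((k in tgt) ? tgt[k] : "Missing")
            }
            # Targets that no source name mapped to.
            for (i = 1; i <= m; i++)
                if (!(torder[i] in src)) print "Missing --> " tgt[torder[i]]
        }
    ' "$1" "$2"
}
```

With the source and target lists from the example, this pairs the abc and xyz files (including the 2015 one), reports 20170212jam.txt as having no target and 20170213pas.txt as having no source. It breaks as soon as two names in one list share the same tail, and it does not handle an empty source list.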

jamilpasha · February 27, 2017, 12:34pm

@RudiC- sorry for editing the original post!! It was for a reason!!

I'm new here and will ensure to use code tags wherever applicable..

Nearest match to me would be looking at some specific string that the file names have in common! Like the one I explained in the example above.

I am trying to build a generic utility that can work across projects, so I do not want to create additional filters to exclude and match!! Hope I was able to answer your query!!

Corona688 · February 27, 2017, 5:41pm

If users are arbitrarily tagging things in filenames, this will be very difficult. I've fought this problem several times in various guises (similar filenames, similar company names, similar phone numbers...).

You can make solutions which work "mostly right", catching 98 out of 100 so you can eyeball the rest, but you won't get it good enough to trust it to work without supervision. Anyone with an especially obnoxious work habit can throw it out of whack, too.