Compare Only "File Names" in 2 Files with file lists having different directory structure

I have a tar archive

arch_all.tar.gz

and 4 batched tar archives. These batches are supposed to contain all the files from arch1.all.tar.gz

arch1_batch1.tar.gz
arch1_batch2.tar.gz
arch1_batch3.tar.gz
arch1_batch4.tar.gz

My issue is that the directory structure in "arch_all.tar.gz" is different from the directory structure in batches 1, 2, 3 and 4. I need to find the files that are missing from batches 1, 2, 3 and 4.

example:

in arch1.all.tar.gz

-rw-r--r-- oracle/oracle 40203 2016-12-25 14:59 usr/data/output/export_12-25-2016/File_31339155.xml
-rw-r--r-- oracle/oracle 40203 2016-12-25 14:59 usr/data/output/export_12-25-2016/File_31339156.xml

The same file is named as

-rw-r--r-- oracle/oracle 40203 2016-12-26 13:21 export_12-26-2016_BATCH1/File_31339155.xml

I was able to create a combined file with the lists from batch1, batch2, batch3 and batch4.

QUESTION:

I need to write a shell script that can grep only the file names from these 2 files and show me the difference, if any.

arch_all.tar.gz has almost 2700 more files than all 4 batches combined.

Example:arch_all:

-rw-r--r-- oracle/oracle 40203 2016-12-25 14:59 usr/data/output/export_12-25-2016/File_31339155.xml
-rw-r--r-- oracle/oracle 40203 2016-12-25 14:59 usr/data/output/export_12-25-2016/File_31339156.xml

Combined file with files from batch1 batch2 batch3 and batch4:

-rw-r--r-- oracle/oracle 40203 2016-12-26 13:21 export_12-26-2016_BATCH1/File_31339155.xml

Since the combined file is missing file

File_31339156.xml

I expect to see "File_31339156.xml" as the output.

Can you please help?

Thanks

Hi,

can you try something like this ?

tar tf all.tar.gz | grep '\.xml$' > all-xml-file-list

# If you don't have a file list from the batches, create it
rm -f batch-file-list
for i in bat*.tar.gz
do
	tar tf "$i" | grep '\.xml$' >> batch-file-list
done

echo "get missing list"
grep -v -f batch-file-list all-xml-file-list

Note that I only look for XML files in the tar files; you might need to modify this a bit.

I don't see that greet_sed's suggestion makes any attempt to extract just the last component of any of the pathnames in your two files. You didn't show us how your tar archives are created and you haven't bothered to tell us what operating system or shell you're using. The following awk script should work even if your archives contain directories in addition to regular files, but if your archives only contain regular files, the code could be simplified:

awk -F/ '
!$NF {	next
}
NR == FNR {
	files[$NF]
	next
}
{	delete files[$NF]
}
END {	for(file in files)
		print file
}
' arch_all combined

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .
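
As a quick check of the script above, here is a small sketch that runs it against the sample listing lines from the first post. The temporary directory and the variable names tmp and missing are made up for this illustration; the file names arch_all and combined are the ones the script already assumes.

```shell
# Hypothetical scratch directory holding the two sample listings.
tmp=$(mktemp -d)

# Listing of the full archive (lines copied from the first post).
cat > "$tmp/arch_all" <<'EOF'
-rw-r--r-- oracle/oracle 40203 2016-12-25 14:59 usr/data/output/export_12-25-2016/File_31339155.xml
-rw-r--r-- oracle/oracle 40203 2016-12-25 14:59 usr/data/output/export_12-25-2016/File_31339156.xml
EOF

# Combined listing of the batches (line copied from the first post).
cat > "$tmp/combined" <<'EOF'
-rw-r--r-- oracle/oracle 40203 2016-12-26 13:21 export_12-26-2016_BATCH1/File_31339155.xml
EOF

# With "/" as the field separator, $NF is the last pathname component.
missing=$(awk -F/ '
!$NF {	next			# skip directories (lines ending in "/")
}
NR == FNR {
	files[$NF]		# 1st file: remember every file name
	next
}
{	delete files[$NF]	# 2nd file: forget names that are present
}
END {	for(file in files)
		print file
}
' "$tmp/arch_all" "$tmp/combined")

echo "$missing"
rm -rf "$tmp"
```

With these two sample listings the only name left over is File_31339156.xml, which is the output asked for in the first post.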

Small perhaps theoretical note:

!$NF {	next
}

is used to skip directories, but it would also skip files with names like 0 , 00 or +0 .

A safer method would be to use:

$NF=="" { 
  next
}
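
To make the difference between the two tests visible, here is a minimal sketch. The listing file and the names list, loose and strict are invented for this illustration; the listing contains a directory entry, a file named 0 and an ordinary file.

```shell
# Hypothetical three-line listing: a directory, a file named "0", a normal file.
list=$(mktemp)
printf '%s\n' 'dir/' 'dir/0' 'dir/a.xml' > "$list"

# !$NF is true for an empty last field, but also for one that looks like numeric zero.
loose=$(awk -F/ '!$NF { next } { print $NF }' "$list")

# $NF == "" is true only when the last field is really empty (a directory entry).
strict=$(awk -F/ '$NF == "" { next } { print $NF }' "$list")

echo "with !\$NF:"
echo "$loose"
echo "with \$NF == \"\":"
echo "$strict"
rm -f "$list"
```

The first test drops both the directory and the file named 0, while the second test drops only the directory.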

I am on Bash.

I am not a Unix expert. Can you please give me the code to print the final result with the difference between the two list files, assuming the first file is named all-xml-file-list.lst and the combined batch list file is named "batch-file-list.lst"? Please note that the *.lst files do have file names with the directory structure in them.

Thanks

---------- Post updated at 05:15 AM ---------- Previous update was at 04:58 AM ----------

Please ignore my previous update. I was able to pull only the file names using your AWK script. Now I am doing the comparison using grep with the -v -f options.

Thanks

---------- Post updated at 05:22 AM ---------- Previous update was at 05:15 AM ----------

grep -v -f batch-file-list batch-file-list.lst > /tmp/difference.lst

Am I using the correct command to print the difference in /tmp/difference.lst? It's been running for a while

I am completely at a loss from your above statements. In your first post in this thread you said you had two files (one that you referred to as arch_all and one that you described as "Combined file with files from batch1 batch2 batch3 and batch4", which my script assumed was named combined ). If you had given the names of those two files (in that order) as the names of the files on the last line of the script I provided, the output would have been the output you requested! I.e., the names of the files in the 1st input file (after discarding the directories in which those files were located) that were not found in the 2nd input file (after discarding the directories in which those files were located). So, what are you now trying to do with grep -v -f that wasn't already done by the code I provided???
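
For anyone who still wants to finish the job with grep, here is a sketch under some assumptions. It assumes the two .lst files hold one pathname per line, as produced by tar tf. The helper strip_dirs and the files all.names and batch.names are made up for this illustration. The -F option makes grep treat each pattern as a fixed string rather than a regular expression, and -x makes a pattern match only a whole line. Both are standard grep options, and with thousands of patterns fixed-string matching tends to be much quicker than regular-expression matching.

```shell
# Hypothetical scratch directory with tiny stand-ins for the two .lst files.
tmp=$(mktemp -d)
cat > "$tmp/all-xml-file-list.lst" <<'EOF'
usr/data/output/export_12-25-2016/
usr/data/output/export_12-25-2016/File_31339155.xml
usr/data/output/export_12-25-2016/File_31339156.xml
EOF
cat > "$tmp/batch-file-list.lst" <<'EOF'
export_12-26-2016_BATCH1/File_31339155.xml
EOF

# Drop directory entries (lines ending in "/"), then keep only the last
# pathname component of every remaining line.
strip_dirs() {
	sed -e '/\/$/d' -e 's|.*/||' "$1" | sort -u
}

strip_dirs "$tmp/all-xml-file-list.lst" > "$tmp/all.names"
strip_dirs "$tmp/batch-file-list.lst" > "$tmp/batch.names"

# Print the names in all.names that do not appear as a whole line in batch.names.
difference=$(grep -v -x -F -f "$tmp/batch.names" "$tmp/all.names")

echo "$difference"
rm -rf "$tmp"
```

The important step is stripping the directories from both lists before comparing them; without it, no line in one list can match a line in the other because the leading directories differ.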

Your script worked flawlessly. I was able to get the difference in 1 shot.

Thanks

---------- Post updated at 09:05 PM ---------- Previous update was at 09:04 PM ----------

Thanks for your help. I had to use Don's script as it handled stripping the directory component from the file names.