Compare two files when pattern matched

imranrasheedamu · August 19, 2016, 5:30am

I have two files, say FILE1 and FILE2.

FILE1 contains 80,000 filenames in sorted order, and FILE2 contains 6,000 filenames, also in sorted order.

I want to compare the filenames in the two files and copy each file whose name matches into a folder.

File1.txt contains 80,000 filenames:

./list1.txt
./list.txt
./temp.txt
./1_April_2011_Front0.txt
./1_April_2011_Front10.txt
./1_April_2011_Front11.txt
./1_April_2012_Front12.txt
./1_April_2011_Front13.txt
./1_April_2011_Front14.txt
./1_April_2011_Front15.txt
./1_April_2011_Front16.txt
./1_April_2011_Front17.txt
./1_April_2011_Front18.txt
./1_April_2011_Front19.txt
./1_April_2011_Front1.txt
./5_April_2012_Page323.txt
./6_August_2012_Page328.txt
./10_February_2014_Sportz6.txt
.....
.....

File2.txt contains 6,000 filenames without the extension (.txt):

1_April_2012_Front16
5_April_2012_Page323
6_August_2012_Page328
15_August_2012_Sportz10
10_February_2014_Sportz6
.....
.....

Files with matching names should be copied to a folder named "output".

Desired output:

5_April_2012_Page323.txt
6_August_2012_Page328.txt
10_February_2014_Sportz6.txt

I tried this code but did not get my desired output:

counter=0;
for file in `cat FILE1.txt | awk -F'[/_.]' '{print $3$4$5$6}'` 
do
x=`echo "$file"` 
while read eachline
do
y=`echo "$eachline" | cat temp.txt | awk -F'[/_.]' '{print $1$2$3$4}'`
if [ "$x"=="$y" ]
then
cp -v $file /home/imran/Script/data
counter=$((counter+1))
break
fi
done < FILE2.txt
echo $counter
done

I have also tried it this way:

counter=0;
for f in `awk 'NR>2{print}' FILE1.txt` 
   do
     f3=$(echo $f|awk -F'/' '{print $2}');
     f6=$(echo "${f3%%.*}");    
   for g in `awk 'NR>=1{print}' FILE2.txt`
        do
           if [ "$f"=="$g" ]
           then
           cp $f /home/imran/Script/data
           counter=$((counter+1))    
           break;
           fi
       done
             echo $counter
  done

Please help

RudiC · August 19, 2016, 5:36am

Does "similar" mean "identical except for the .txt ending"?
Will EVERY single entry in file2 exist in file1 (with leading "./" and trailing ".txt")?
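
One way to answer the second question is to list the file2 entries that have no counterpart in file1. The following is only a sketch on a few names taken from the sample data above; the temporary directory and the variable names (dir, seen, missing) are assumptions made for the demonstration, not part of the original setup.

```shell
# Sample data taken from the question; the temporary directory exists only for this demo.
dir=$(mktemp -d)
printf '%s\n' ./5_April_2012_Page323.txt ./6_August_2012_Page328.txt \
    ./10_February_2014_Sportz6.txt ./1_April_2011_Front16.txt > "$dir/file1"
printf '%s\n' 1_April_2012_Front16 5_April_2012_Page323 6_August_2012_Page328 \
    15_August_2012_Sportz10 10_February_2014_Sportz6 > "$dir/file2"

# Read file1 first and remember each name without its path and .txt suffix,
# then print every file2 entry that was not seen.
missing=$(awk '
NR == FNR { sub(/^.*\//, ""); sub(/\.txt$/, ""); seen[$0]; next }
!($0 in seen)
' "$dir/file1" "$dir/file2")

printf '%s\n' "$missing"
rm -r "$dir"
```

With this sample data the two entries printed are 1_April_2012_Front16 and 15_August_2012_Sportz10, which matches the desired output in the question leaving them out.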

RudiC · August 19, 2016, 5:49am

Provided my assumptions above apply, try

awk 'NR == FNR {T[$1]; next} {FN = $0; gsub (/^.*\/|\.txt$/, _)} $0 in T {system ("echo cp " FN " /some/where")}' file2 file1
cp ./5_April_2012_Page323.txt /some/where
cp ./6_August_2012_Page328.txt /some/where
cp ./10_February_2014_Sportz6.txt /some/where

If happy, remove the echo command from the system() call.

imranrasheedamu · August 19, 2016, 8:17am

Thank you so much RudiC Sir!!

RudiC Sir!! Could you please explain your command?

RavinderSingh13 · August 19, 2016, 8:50am

Hello imranrasheedamu,

Please let me know if the following explanation helps.

awk '
NR == FNR {                            #### NR and FNR are built-in awk variables. FNR is reset for every new input file while NR keeps increasing across all files, so NR == FNR is TRUE only while the first file (file2) is being read.
    T[$1]                              #### Create an element in the array T whose index is $1 (the first field).
    next                               #### next is an awk keyword; it skips all further statements for this line.
}
                                       #### The following statements are reached only while the second file (file1) is being read.
{
    FN = $0                            #### Create a variable named FN whose value is $0 (the complete line).
    gsub (/^.*\/|\.txt$/, _)           #### gsub() is the awk function for global substitution, here on the whole line. It replaces everything up to the last / and the trailing .txt with the empty string (_ is an unset variable).
}
$0 in T {                              #### TRUE if the line, as reduced by the substitution above, is an index in the array T (which was filled while file2 was being read).
    system ("echo cp " FN " /some/where")  #### system() executes a shell command from inside awk. Here it runs echo, which only prints the cp source_file target_directory command we eventually want to run.
}' file2 file1                         #### The input files, file2 first and then file1.

Thanks,
R. Singh

RudiC · August 19, 2016, 8:51am

awk '
NR == FNR       {T[$1]                                  # for the first file (NR id. to FNR), collect the names to search in T array
                 next                                   # stop processing this line; read next one
                }
                {FN = $0                                # second file only: save total file path in FN variable
                 gsub (/^.*\/|\.txt$/, _)               # remove leading path info and ".txt" ext. from file name
                }
$0 in T         {system ("echo cp " FN " /some/where")  # IF the reduced file name is found in pattern array T, run the 
                                                        # system command to cp FN (full file path) to destination (echo inserted for safety)
                }
' file2 file1
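
The NR == FNR test may be easier to follow with the two counters printed side by side. This is a small sketch on two throwaway files; the file names first and second and the variable counters are made up for the illustration.

```shell
# Two tiny input files in a temporary directory (names are hypothetical).
dir=$(mktemp -d)
printf '%s\n' a b > "$dir/first"
printf '%s\n' c d e > "$dir/second"

# Print both counters for every line; FNR restarts at 1 for the second file,
# so NR == FNR holds only while the first file is being read.
counters=$(awk '{ print $0, "NR=" NR, "FNR=" FNR, (NR == FNR ? "first file" : "later file") }' "$dir/first" "$dir/second")

printf '%s\n' "$counters"
rm -r "$dir"
```

The lines a and b are reported as "first file"; from c onwards FNR starts again at 1 while NR continues at 3, so the condition is false. Note that the idiom relies on the first file not being empty.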

Don_Cragun · August 19, 2016, 9:40am

Each call to system() in awk will invoke a shell which will then invoke cp. If there are 6000 files to be copied, invoking one shell for the copies instead of 6000 should be considerably faster. Consider this small change to RudiC's suggestion:

awk '
NR == FNR       {T[$1]
                 next
                }
                {FN = $0 
                 gsub (/^.*\/|\.txt$/, _)
                }
$0 in T         {print "cp", FN, "/some/where"
                }
' file2 file1 | sh

And, if the cp utility on your system has a -t destination_directory option (an extension not covered by the standards), you could gain even more by using xargs to greatly reduce the number of times cp is invoked:

awk '
NR == FNR       {T[$1]
                 next
                }
                {FN = $0 
                 gsub (/^.*\/|\.txt$/, _)
                }
$0 in T         {print FN
                }
' file2 file1 | xargs cp -t "/some/where"

RudiC · August 19, 2016, 12:14pm

I considered that as well. cp, at least in some versions, can copy multiple input files to a target directory. That could be done like this:

awk '
BEGIN           {printf "echo cp"                       # prepare shell statement
                } 
NR == FNR       {T[$1]                                  # for the first file (NR id. to FNR), collect the names to search in T array
                 next                                   # stop processing this line; read next one
                }
                {FN = $0                                # second file only: save total file path
                 gsub (/^.*\/|\.txt$/, _)               # remove leading path info and ".txt" ext. from file name
                }
$0 in T         {printf " %s ", FN                      # IF the reduced file name is found in pattern array T, print the FN (full file path)
                }
END             {print " /some/where"                   # finish shell statement
                }
' file2 file1 | sh
cp ./5_April_2012_Page323.txt ./6_August_2012_Page328.txt ./10_February_2014_Sportz6.txt /some/where

It may overrun system limits if too many files are to be copied, though.
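
The limit in question can usually be inspected with getconf. This is a minimal sketch; the variable name arg_max is only for illustration, and the value reported differs from system to system.

```shell
# Ask the system for the maximum combined size (in bytes) of the arguments
# and environment passed to a new process.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX is $arg_max bytes"
```

A single cp command line built from thousands of path names has to fit into this limit together with the environment variables.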

Don_Cragun · August 19, 2016, 4:46pm

We could squeeze a few more source file operands into a cp command if we drop the leading ./ from the file operands:

$0 in T         {printf " %s.txt", $0                      # IF the reduced file name is found in pattern array T, print the filename

If imranrasheedamu tells us that cp -t target is not available and the above script fails with E2BIG errors, we could also make some simple modifications to the above script to put no more than x source file operands in each cp command where x is 50, 100, or some other conservative number based on the maximum filename length, the size of the combined environment variables, and ARG_MAX on the system. But I don't see any reason to spend the time to do that unless imranrasheedamu lets us know that it is needed.
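
For illustration only, the batching described above might look roughly like this. It is a sketch under assumptions, not a tested solution for the real data: max (the number of source operands per cp command), dest, batch, n and the sample file names are all hypothetical, and path names containing spaces or shell metacharacters would need quoting that is not shown here.

```shell
# Throwaway sample lists in a temporary directory (names are hypothetical).
dir=$(mktemp -d)
printf '%s\n' ./a.txt ./b.txt ./c.txt ./d.txt ./e.txt > "$dir/file1"
printf '%s\n' a b c e > "$dir/file2"

# Collect matching paths and emit one cp command per max source operands.
cmds=$(awk -v max=3 -v dest=/some/where '
NR == FNR   { T[$1]; next }                           # first file: names to look for
            { FN = $0; gsub(/^.*\/|\.txt$/, "") }     # second file: keep path, reduce $0
$0 in T     { batch = batch " " FN                    # add the full path to the current batch
              if (++n == max) { print "cp" batch, dest; batch = ""; n = 0 }
            }
END         { if (n) print "cp" batch, dest }         # flush a final partial batch
' "$dir/file2" "$dir/file1")

printf '%s\n' "$cmds"    # pipe the awk output to sh instead once the commands look right
rm -r "$dir"
```

With max=3 the sample produces two commands, one with three source operands and one with the remaining operand. A conservative value for max keeps every generated command well below ARG_MAX.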