File comparison line by line

rkrish123 · June 17, 2013, 5:15pm

Hi,

I want to compare 2 files and write the output file into a separate folder.

Both file names change daily with a timestamp (e.g. file1_06_17_2013_0514), so I can't hard-code the file names in the script,

but I need to compare these 2 files daily and write the output to another folder.

Can you please help me prepare a script to get this done?

Thanks.

juzz4fun · June 17, 2013, 5:18pm

What have you tried so far?
Also, specify the input and output files (at least samples, if the files are big).

rkrish123 · June 17, 2013, 5:35pm

I have copied the files to the folder, and I do the comparison manually with the "comm" command and place the output file into another folder. But we can't do this manually every day; we need to automate fetching the files and comparing them.

RudiC · June 18, 2013, 4:16am

Try using find with time stamps.
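
As a rough illustration of that hint, the sketch below lists the .txt files under a folder that were modified within the last 24 hours. It assumes the two files are always delivered within a day of the run, which may not hold; recent_txt is a made-up helper name.

```shell
#!/bin/sh
# Sketch: list regular *.txt files under a directory that were modified
# less than 24 hours ago. recent_txt is a hypothetical helper name.
recent_txt() {
    # -mtime -1 selects files whose age, counted in whole 24-hour periods,
    # is less than 1. Note that find also descends into subdirectories.
    find "$1" -type f -name '*.txt' -mtime -1
}
```

For example, recent_txt /tmp/test would print one path per line for each recently modified file.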

rkrish123 · June 20, 2013, 12:23pm

Thanks for the reply, RudiC, but how would we find the date? The files can be from yesterday, today, or 3 days back.

It just has to take the 2 files in that folder and compare them.

Don_Cragun · June 20, 2013, 12:53pm

What we are missing is a description of how you determine which two files you want to compare when you do it manually.

If today's year, month, and day are 2013, 06, and 20, what are the names of the files you want to compare when you run your script today? Are there other files in your directory whose names might be "similar" to the names of the files you want to compare but should be ignored?

rkrish123 · June 20, 2013, 3:56pm

The files can be from any date; the names may contain today's or yesterday's date.

I can't hard-code the file names, as they change daily with new timestamps.

Aggr_New_06_15_2013_1944.txt
Aggr_New_06_19_2013_1944.txt

These 2 files are in the /tmp/test folder. I have to compare them and write the difference to another file, such as Aggr_New.txt, in the same folder or any other folder.

Aggr_New_06_15_2013_1944.txt has below records

ACCOUNT|497|1108000-new|Jun|2045.00
ACCOUNT|497|Mmnfy-new|Jun|1903.00
ACCOUNT|497|922857-new|Jun|2045.00

Aggr_New_06_19_2013_1944.txt has below records

ACCOUNT|497|1108000-new|Jun|2045.00
ACCOUNT|497|Mmnfy-new|Jun|1903.00
ACCOUNT|497|922857-new|Jun|2045.00
ACCOUNT|497|922865-new|Jun|4509.00
ACCOUNT|497|922987-new|Jun|3249.00
ACCOUNT|497|1|Jun|867.00

Output:

ACCOUNT|497|922865-new|Jun|4509.00
ACCOUNT|497|922987-new|Jun|3249.00
ACCOUNT|497|1|Jun|867.00

Don_Cragun · June 20, 2013, 4:20pm

Let me try one more time: I understand that your files' names change daily. I understand that you can't hard code the name of the files to be compared because you want to select a different pair of files to be compared each day. I assume that there could be lots of files in the /tmp/test directory.

Please explain how you select the two files you want to compare out of all of the files that are present in /tmp/test.

franksunnn · June 20, 2013, 4:30pm

I have a question.
e.g.
Assuming that there are 10 lines in file_A and 5 lines in file_B, with the content shown below:
file_A

file_B

Please tell me what your desired output is.

rkrish123 · June 20, 2013, 4:58pm

Don: I have only 2 files in that folder. These 2 files need to be compared, and the output written to another folder.

Frank: with that example, I need to get only "B05" as output.

franksunnn · June 20, 2013, 5:10pm

diff -c file1 file2 | gawk '/^\+ / {print $2}' >> file_diff

However, I feel it's weird. Maybe I don't understand your real need.

Don_Cragun · June 20, 2013, 7:38pm

If Aggr_New_06_15_2013_1944.txt and Aggr_New_06_19_2013_1944.txt are the only two files in this directory with names ending in .txt , running the following grep command:

grep -vF -f *.txt

produces the output you want. You can redirect the output to go wherever you want it. Of course if the two files had names that sorted in the opposite order, the results would be completely different; and, you still haven't specified how to determine which file comes first in the comparison.
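
A slightly more defensive sketch of the same idea follows. It assumes the file that sorts first by name is the one to use as the pattern list, refuses to run unless exactly two .txt files are present, and adds -x so that only whole-line matches count (plain -F also matches substrings). The helper name new_lines is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the lines of the second *.txt file (in name order) that do
# not appear as whole lines in the first. new_lines is a hypothetical name.
new_lines() {
    dir=$1
    # Expand the glob into the positional parameters, sorted by name.
    set -- "$dir"/*.txt
    # Refuse to guess unless exactly two existing files matched.
    if [ "$#" -ne 2 ] || [ ! -f "$1" ] || [ ! -f "$2" ]; then
        echo "expected exactly two .txt files in $dir" >&2
        return 1
    fi
    # -F fixed strings, -x whole-line match, -v invert the match,
    # -f read the patterns from the first file.
    grep -vxF -f "$1" "$2"
}
```

For example, new_lines /tmp/test > Aggr_New.txt, run from a different directory so the output file is not picked up by the glob on the next run.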

Hope this helps,
Don

rkrish123 · June 21, 2013, 1:34am

Thanks Don, it's working, but it takes a long time compared to the "comm" command.
Until now we have been manually running comm -2 -3 file1 file2 > file3
This command works faster than grep; our files are large.

Don_Cragun · June 21, 2013, 2:17am

I'm very glad that comm is working for you. And, you're very lucky that it is. The behavior of comm is only defined if both input files are sorted in collating order in the current locale. Neither of your sample input files is sorted, so if comm is producing the output you wanted, it is coincidence.

If you had said that your input files were large and sorted, I would have considered comm before fgrep; but since your sample input files were small and unsorted, I used grep -F instead of sorting both input files and then using comm on the resulting sorted files.

You still haven't specified how you decide which of the files in your directory is supposed to be "file1" and which is supposed to be "file2" in your example above. So, it is also just luck that *.txt happened to give the results you wanted when the results would have been very different if the order of the operands had been reversed.

So, with your real input files, is the command sequence:

for i in *.txt
do      sort -o "$i" "$i"
done
comm -2 -3 *.txt > SomeFile_In_SomeOtherDirectory

still faster than:

grep -vFf *.txt > SomeFile_In_SomeOtherDirectory

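Both of the above command sequences should produce the same output, as long as the comm options match the operand order: comm -2 -3 prints the lines found only in the first file, while grep -vFf prints the lines of the second file that match nothing in the first, which corresponds to comm -1 -3.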
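
A sketch of the sort-then-comm approach that leaves the original files untouched by sorting into temporary copies. It assumes the older file is passed first and that the lines wanted are those found only in the newer file, hence comm -1 -3. The helper name only_in_second is made up, and mktemp is assumed to be available on the system.

```shell
#!/bin/sh
# Sketch: print the lines that occur only in the second file, using sorted
# temporary copies and comm. only_in_second is a hypothetical helper name.
only_in_second() {
    tmp1=$(mktemp) || return 1
    tmp2=$(mktemp) || { rm -f "$tmp1"; return 1; }
    # LC_ALL=C makes sort and comm agree on a single collating order.
    LC_ALL=C sort "$1" > "$tmp1"
    LC_ALL=C sort "$2" > "$tmp2"
    # -1 drops lines unique to the first file, -3 drops lines common to both.
    LC_ALL=C comm -1 -3 "$tmp1" "$tmp2"
    rm -f "$tmp1" "$tmp2"
}
```

The output comes back in sorted order rather than in the order of the input file, which is one visible difference from the grep version.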

rkrish123 · July 3, 2013, 2:43pm

Thanks Don for your help...

I am using the "grep -vF -f *.txt" command to compare,

but there is an issue: the comparison is not working correctly. I want to compare file1 with file2 on columns 2 and 3 only; any difference should be written to file3.


See the example of what I am expecting:

ACCOUNT|497|1108000-new|Jun|2045.00
ACCOUNT|497|Mmnfy-new|Jun|1903.00
ACCOUNT|497|922857-new|Jun|2045.00

Aggr_New_06_19_2013_1944.txt has below records

ACCOUNT|497|1108000-new|Jun|2045.00
ACCOUNT|497|Mmnfy-new|Jun|1903.00
ACCOUNT|497|922857-new|Jul|2045.00
ACCOUNT|497|922865-new|Jun|4509.00
ACCOUNT|497|922987-new|Jun|3249.00
ACCOUNT|497|1|Jun|867.00

Right now I am getting this output:

ACCOUNT|497|922857-new|Jul|2045.00
ACCOUNT|497|922865-new|Jun|4509.00
ACCOUNT|497|922987-new|Jun|3249.00
ACCOUNT|497|1|Jun|867.00

But the expected output is:

ACCOUNT|497|922865-new|Jun|4509.00
ACCOUNT|497|922987-new|Jun|3249.00
ACCOUNT|497|1|Jun|867.00

The script should compare only columns 2 and 3 of file2. Please help.

Don_Cragun · July 3, 2013, 4:00pm

I give up. After two weeks of wasting our time, you now completely change the requirements used to determine the output you want. And you still haven't answered the basic question of how we are supposed to know which of the two files in a directory is the "1st" file and which is the "2nd" file.

If you want help on this new problem, start a new thread. In that thread:

Give a complete description of what your input files look like (expected sizes, field separators, sample contents, filename format).
Give a complete description of the processing that needs to be performed to produce the results you want.
Show sample output that should be produced when your input files are processed as you described.

rkrish123 · July 3, 2013, 4:49pm

1. Give a complete description of what your input files look like (expected sizes, field separators, sample contents, filename format).

One of our Unix scripts gets 2 files from a Windows server and places them in a Unix folder, folder1 (every day it cleans the folder and places only 2 files there). The names of these 2 files change daily because they contain a date timestamp (names like 'Jun_Agg_06_19_2013_1944.txt'). The file size depends on the data we get daily and varies from 5 MB to 60 MB. We need to compare these two files and write the output to a 3rd file.

The input files contain only 5 columns with a '|' delimiter. Columns 1, 4, and 5 can be the same; the differences between the input files are usually in columns 2 and 3. Based on columns 2 and 3, file1 should be compared with file2 and the result written to file3.

2. Give a complete description of the processing that needs to be performed to produce the results you want.

The complete process is explained above.

3. Show sample output that should be produced when your input files are processed as you described.

See the example below.

File1:

ACCOUNT|497|1108000-new|Jun|2045.00
ACCOUNT|497|Mmnfy-new|Jun|1903.00
ACCOUNT|497|922857-new|Jun|2045.00

File2:

ACCOUNT|497|1108000-new|Jun|2045.00
ACCOUNT|497|Mmnfy-new|Jun|1903.00
ACCOUNT|497|922857-new|Jul|2045.00
ACCOUNT|497|922865-new|Jun|4509.00
ACCOUNT|497|922987-new|Jun|3249.00
ACCOUNT|497|1|Jun|867.00

Right now I am getting this output:

ACCOUNT|497|922857-new|Jul|2045.00
ACCOUNT|497|922865-new|Jun|4509.00
ACCOUNT|497|922987-new|Jun|3249.00
ACCOUNT|497|1|Jun|867.00

But the actual output should be:

ACCOUNT|497|922865-new|Jun|4509.00
ACCOUNT|497|922987-new|Jun|3249.00
ACCOUNT|497|1|Jun|867.00

The files should be compared based on the 2nd and 3rd columns, with the output written to file3.

mjf · July 3, 2013, 10:08pm

$ awk -F"|" 'FNR==NR{a[$2$3]++;next}!a[$2$3]' file1.txt file2.txt

ACCOUNT|497|922865-new|Jun|4509.00
ACCOUNT|497|922987-new|Jun|3249.00
ACCOUNT|497|1|Jun|867.00
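
One caveat with a key built as $2$3 is that the two fields are glued together with nothing between them, so 497 followed by 1 and 49 followed by 71 would both become 4971. A sketch of a variant that keeps the fields apart is shown below; new_by_key is a made-up helper name, and the first argument is assumed to be the older file.

```shell
#!/bin/sh
# Sketch: print the lines of the second file whose (field 2, field 3) pair
# never occurs in the first file. new_by_key is a hypothetical helper name.
new_by_key() {
    awk -F'|' '
        # First file: remember each pair. The comma joins the fields with
        # SUBSEP, so "49","71" and "497","1" remain distinct keys.
        FNR == NR { seen[$2, $3]; next }
        # Second file: print only lines whose pair was not seen.
        !(($2, $3) in seen)
    ' "$1" "$2"
}
```

Usage would be new_by_key file1.txt file2.txt > file3.txt, with the same operand order as the one-liner.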

rkrish123 · July 4, 2013, 12:05am

Thanks mjf, it's working perfectly for the comparison when I run it manually.

But my requirement is to automate the file comparison. How can we automate this using the above script, given that I can't hard-code the file names because they change daily with the timestamp? Please help.

vidyadhar85 · July 4, 2013, 2:05am

Could you please give us example file names? Provide the names of both files and the part of the file name that changes daily.
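
If the names follow the pattern seen earlier in the thread (for example Aggr_New_06_19_2013_1944.txt, read here as prefix_MM_DD_YYYY_HHMM.txt, which is an assumption), the files could be ordered by the embedded timestamp rather than by name. The sketch below does that; by_stamp is a made-up helper name.

```shell
#!/bin/sh
# Sketch: print the *.txt files in a directory ordered by the timestamp
# embedded in their names, oldest first. Assumed (not confirmed) layout:
# <prefix>_MM_DD_YYYY_HHMM.txt. by_stamp is a hypothetical helper name.
by_stamp() {
    for f in "$1"/*.txt; do
        [ -f "$f" ] || continue
        # Rearrange MM_DD_YYYY_HHMM into a sortable YYYYMMDDHHMM key;
        # names that do not fit the layout yield an empty key.
        key=$(basename "$f" .txt |
            sed -n 's/.*_\([0-9][0-9]\)_\([0-9][0-9]\)_\([0-9]\{4\}\)_\([0-9]\{4\}\)$/\3\1\2\4/p')
        if [ -n "$key" ]; then
            printf '%s\t%s\n' "$key" "$f"
        fi
    done | LC_ALL=C sort | cut -f2-
}
```

by_stamp /tmp/test prints the matching paths oldest first; with exactly two files in the folder, the first line would play the role of file1.txt and the second that of file2.txt in the awk command earlier in the thread.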