Comparing two huge files

Hi,

I have two files, File A and File B. File A is an error file and File B is the source file. In the error file, the first line is the actual error and the second line gives information about the record (client ID) that throws the error. I need to compare the first field of File A (on the lines that don't start with '//') with the fifth field of File B. If the field values in File A and File B match, I need to write them to an output file as below.

File A
// 223 missing
223,Jan,ee,bla,bla

// data not found
254-11,Jan,ee,bla,bla

// data rejected
214-1,Jan,ee,bla,bla

File B
aaaa,bbbb,ccc,dddd,20054-11,fff,ggg...
aaaa,bbbb,ccc,dddd,254-11,fff,ggg...
aaaa,bbbb,ccc,dddd,2545456-1,fff,ggg...

output:
// data not found
254-11,Jan,ee,bla,bla

If the first field of File A matches the fifth field of File B (254-11), then I need to write the records from File A (the current line and the previous line) to an output file as above.

I could achieve this very easily using awk and grep with an if loop. The problem is that the files are huge: nearly 1 million records in each file, and the script runs for 3-4 hours. I would appreciate it if someone could help me with better logic or a better script that could complete the task in a few minutes.

Note: File A and File B are exactly in the format shown. Watch out for the blank lines in File A, and for the client ID format: 000, 000-0 or 000-00.

For comparison:
comm file1 file2

For differences:
diff file1 file2
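
Note that comm expects its inputs sorted; for example, to list only the lines common to both files (a quick sketch, bash syntax):

comm -12 <(sort file1) <(sort file2)    # -12 suppresses lines unique to either file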

Hi Rahul,

Thank you for your reply. The comm command with the above files gives the wrong output, because I need to compare field 1 of File A with field 5 of File B and output the current and previous line from File A.

The comm command compares the files line by line, and none of the whole lines will ever match; only field 1 of File A and field 5 of File B correspond.

Regards,
Mahesh K

By chance, I came on here with exactly the same problem.

I think that join may come in useful here:
> join -1 1 -2 5 -t, $fileA $fileB > $requiredFile

The thing is, the files need to be sorted on the fields you plan to join... and I can't sort my files :-s.

Don't know if this is of any use.
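
If sorting is an option, something like this should satisfy join (a sketch; the *.sorted names are just placeholders):

sort -t, -k1,1 fileA > fileA.sorted     # file A on field 1
sort -t, -k5,5 fileB > fileB.sorted     # file B on field 5
join -1 1 -2 5 -t, fileA.sorted fileB.sorted > requiredFile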

Hi,

Thank you for your suggestion. In File A, the actual record is the second line, with fields separated by ','. The first line is the error message, where words are separated by spaces, so I cannot simply say "first field" because the fields are not consistent.

Regards,
Mahesh K

Can you extract the required identifiers into a different file with awk and/or grep?

Yes buddy, I can do that. I have extracted field 1 to file C and field 5 to file D, then used the comm command to compare them and find the exact matches. It runs in just a few seconds.
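
Presumably something along these lines (a sketch reconstructed from the description; matched_ids is just a placeholder name):

awk -F, '!/^\/\// && NF { print $1 }' fileA | sort > fileC   # field 1, skipping '//' lines and blanks
awk -F, '{ print $5 }' fileB | sort > fileD                  # field 5
comm -12 fileC fileD > matched_ids                           # IDs common to both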

The output file is also quite big. I count the number of lines in the output file, then in a while loop I take the IDs from the output file one by one, grep File A for each, and generate the exact output.

This is my problem: the above task runs for 2-3 hours because of the big loop. I don't know how to overcome this and optimize my script.
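
For what it's worth, a single awk pass over both files can replace the whole loop; a minimal sketch, assuming File A looks exactly like the sample (error line, record line, blank line) and 'output' is just a placeholder name:

awk -F, '
    NR == FNR         { ids[$5]; next }   # reading fileB first: remember every field 5
    /^\/\//           { prev = $0; next } # fileA error line: remember it
    NF && ($1 in ids) {                   # fileA record whose ID appears in fileB
        print prev                        # the error line
        print                             # the record line itself
        print ""                          # blank separator between pairs
    }
' fileB fileA > output

This reads each file exactly once and keeps only File B's keys in memory, so a million records on each side should take seconds rather than hours.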

You don't need to touch file B.

Here's how I'd do it... I think it should be very quick.

osscl1head01 1447>cat fileA
// 223 missing
223,Jan,ee,bla,bla

// data not found
254-11,Jan,ee,bla,bla

// data rejected
214-1,Jan,ee,bla,bla
osscl1head01 1448>cat fileB
aaaa,bbbb,ccc,dddd,20054-11,fff,ggg...
aaaa,bbbb,ccc,dddd,254-11,fff,ggg...
aaaa,bbbb,ccc,dddd,2545456-1,fff,ggg...
osscl1head01 1449>grep . fileA | grep -v / | awk -F, '{print $1}' > fileC
osscl1head01 1450>cat fileC
223
254-11
214-1
osscl1head01 1451>join -1 1 -2 5 -t, fileC fileB > fileD
osscl1head01 1452>cat fileD
254-11,aaaa,bbbb,ccc,dddd,fff,ggg...
osscl1head01 1453>

EDIT: You need to sort both input files by the identifier before joining, but that *should* be straightforward enough.
sort -t, -k5,5 fileB > fileBsorted
sort fileC > fileCsorted

You can probably use awk to repair the structure of fileD if that is important.

Hey dude,
I appreciate your help. Once you have found that the ID '254-11' is common to both files, you need to grep File A to get the output below: the current line and the previous line of the match. You are joining the fields of the matched records from both files, and that is not what I need.

output:
// data not found ---- (previous line in file A)
254-11,Jan,ee,bla,bla ---- (current line in file A)

I am not touching file B after finding the match using comm.

I realize that's what you want to do, but with these large files grepping every query against every line isn't feasible (at least for my files). Even with a modest number of queries it is painfully slow.

I am joining the matching lines in the two files, but since fileC only contains the field to be matched, the output line is effectively the fileB line. Note that join only outputs lines that match, so it is what you (and I) need (I think).

The only problem is that the fields of the fileB line have been rearranged.
awk -F, '{print $2,$3,$4,$5,$1,$6....}' could sort this out. If you've got a very large number of fields in file B then I guess a perl or sed command could come in handy, but I don't know exactly how to write it.
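
A generic rotation along these lines might do it whatever the field count (a sketch, untested):

awk -F, 'BEGIN { OFS = "," }
         { key = $1                              # the join key that join moved to the front
           for (i = 1; i < 5; i++) $i = $(i+1)   # shift fields 2..5 one place left
           $5 = key                              # reinsert the key as field 5
           print }' fileD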

If this wouldn't result in your required output, then I'm afraid I'm misunderstanding your problem.

Sorry dude, I just reread your post and realised what output you're looking for.
You want the final output to be from File A. :o

Could you convert it to a single line format and then use join?

daisy 1860>perl -pe 's/\n/:/g' fileA | perl -pe 's/::/\n/g' | perl -pe 's/:$/\n/g' | awk -F: '{print $2":"$1}' | sort > fileAA
daisy 1861>awk -F, '{print $5}' fileB | sort > fileBB
daisy 1862>join -1 1 -2 1 -t, fileAA fileBB | awk -F: '{print $2":"$1}' | perl -pe 's/:/\n/' | perl -pe 's/^\//\n\//'

// data not found
254-11,Jan,ee,bla,bla
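
A commented restatement of that first pipeline, in case it helps; it assumes ':' never occurs in either file:

# 1. flatten fileA: every newline becomes ':'          -> one long line
# 2. '::' (the old blank lines) become real newlines   -> one error+record pair per line
# 3. strip the ':' left dangling after the last pair
# 4. swap the halves so the comma-separated ID leads, ready for join -t,
perl -pe 's/\n/:/g' fileA | perl -pe 's/::/\n/g' | perl -pe 's/:$/\n/g' |
    awk -F: '{print $2":"$1}' | sort > fileAA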

It worked great. Awesome, dude. You are really great, hats off to your brilliant brain. I am your fan from today, trust me...