Comparing two fixed-width files

Hi guys,

I have been searching the threads but cannot find an answer to my question.

I have two files: the first is a pattern file, and the second is the file I want to search. The output should be the lines of file2 that match a pattern line from file1.
File1:

P2797f12af                    44751228
P2b1204d0f                    33470964
P2b1205f76                    35815429
P2797f0250                    8219027

File2:

P2797ea6c0                    1942611  SAN   SAN
P2797f12af                    44751228 SAN   SAN
P2b1204d0f                    33470964 SAN   SAN
P2b1205f76                    35815429 SAN   SAN
P2797f0250                    8219027  SAN   SAN

Output:

P2797f12af                    44751228 SAN   SAN
P2b1204d0f                    33470964 SAN   SAN
P2b1205f76                    35815429 SAN   SAN
P2797f0250                    8219027  SAN   SAN

I am able to do this with the command below:

fgrep -f file1 file2

But it gives an out-of-memory error, as my file has more than 1 million lines; since fgrep -f has to load all the patterns from file1 into memory, its memory use grows with the pattern file.
I also tried splitting the pattern file:

split -l 10000 file1 file1.split.
for CHUNK in file1.split.* ; do
        fgrep -f "$CHUNK" file2
done
rm file1.split.*

But this takes a lot of time as well. The first iteration finishes quickly, but there is a long delay before the next one starts. :wall:
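
If it helps to see where the time goes, each pass can be timed; a rough sketch (the matches go to /dev/null so only the timings show):

for CHUNK in file1.split.* ; do
        echo "timing $CHUNK" >&2
        time fgrep -f "$CHUNK" file2 > /dev/null
done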

Can you please let me know if I am doing something wrong here, or provide an awk command that does this?

You guys are great... looking forward to your reply.

What is your file size?

fgrep is not working as the files are huge. I am getting the error below:
fgrep: not enough memory.

What are the sizes of your file1 and file2?

Both files contain around 400,000 lines.

Could you please provide their sizes? (Not their number of lines.)

file1: 99417680 bytes
file2: 20430220 bytes
File2 may also end up larger than file1.

If your fgrep is a 32-bit binary, it runs in a process that cannot address more than 2^32 bytes, which is around 4 GB.

And if it is coded with signed numbers, the range is still 4 G wide but runs from -2 G to +2 G, so the file size it can handle cannot exceed 2 GB.

# type fgrep
fgrep is /usr/bin/fgrep
# file /usr/bin/fgrep
/usr/bin/fgrep: ELF 32-bit MSB executable SPARC Version 1, dynamically linked, stripped
# bc
2^32
4294967296
# printf "%d\n" 4294967296 2>/dev/null
2147483647

So you should either split your big file into pieces of a size your fgrep can handle, or run a 64-bit fgrep on a 64-bit platform, where the process can address 2^64 bytes and deal with large files.
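
For example, to split by size rather than by line count (the 500m chunk size is just illustrative):

split -b 500m file1 file1.split.

Note that -b can cut a line in two at a chunk boundary, so for a pattern file, splitting with -l and a suitable line count is usually safer.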

Also see

man largefile

I did try splitting it. The first iteration worked fine, but it takes a long time before the second one starts.

split -l 10000 file1 file1.split.
for CHUNK in file1.split.* ; do
        fgrep -f "$CHUNK" file2
done
rm file1.split.*

Can you please check if I am doing something wrong here, or provide an awk command for this?

Give it a try with:

awk 'NR==FNR{a[$1$2];next}($1$2 in a)' File1 File2

Use nawk or /usr/xpg4/bin/awk instead of awk if you run SunOS or Solaris.
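
For reference, here is the same one-liner expanded with comments (the logic is unchanged, only the layout differs):

awk '
    NR==FNR {           # true only while reading the first file (File1)
        a[$1 $2]        # remember the composite key: field 1 + field 2
        next            # done with this File1 line; read the next
    }
    ($1 $2 in a)        # File2: a pattern with no action prints matching lines
' File1 File2

File1 is read once into an in-memory array (around 400,000 keys is no problem), and File2 is then streamed in a single pass, which is why this is so much faster than repeated fgrep runs.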

It worked... thanks a lot! :slight_smile: