Match col 1 of File 1 with col 1 of File 2 and create a 3rd file

sogi · June 29, 2009, 11:40pm

Hello,

I have a 1.6 GB file that I would like to modify by matching some ids in col1 with the ids in col 1 of file2.txt and save the results into a 3rd file.

For example:

File 1 has 1411 rows; I don't know how many columns it has (thousands)
File 2 has 311 rows, 1 column

Would like to create

File 3 with 311 rows (thousands of columns)

What is the fastest way to do this without consuming too much memory?

Thank you!

rakeshawasthi · June 30, 2009, 12:14am

The fastest way is syncsort, but I don't know if you have it.
Otherwise try grep. Don't use awk.

sogi · June 30, 2009, 1:31am

I used this:

grep -A1 -A1 -f file1.txt file2 > file3

but it is taking forever, and I don't know if the result will be correct in the end.
I don't know what -A1 -A1 means (I'm assuming it refers to col 1 of File 1 and col 1 of File 2).

Help please!
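
A note on -A1: in GNU and BSD grep, -A NUM prints NUM lines of trailing context after every matching line. It does not select columns, and giving it twice changes nothing. A minimal sketch with made-up sample files (ids.txt and data.txt are hypothetical names and contents, not the files from this thread):

```shell
# Hypothetical sample data, only to show what -A1 does.
tmp=$(mktemp -d)
printf 'MXY2\n' > "$tmp/ids.txt"
printf 'MXY1 0 1\nMXY2 1 1\nMXY3 2 0\n' > "$tmp/data.txt"

# -A1 prints each matching line plus the one line that follows it.
with_context=$(grep -A1 -f "$tmp/ids.txt" "$tmp/data.txt")

# Without -A1 only the matching line itself is printed.
without_context=$(grep -f "$tmp/ids.txt" "$tmp/data.txt")

printf 'with -A1:\n%s\n' "$with_context"
printf 'without -A1:\n%s\n' "$without_context"

rm -rf "$tmp"
```

So with -A1 the output can contain rows whose id is not in the list at all.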

rakeshawasthi · June 30, 2009, 1:38am

Give some sample input from both files, the desired output,
and the conditions under which the two files should be joined.
PS: Use code tags.

sogi · June 30, 2009, 2:10am

Both files have no headings

input of file 1 (has 1 column, as shown below):

MXY2344
MXY2455
.
.
.
.
.
.
.
MXY9150 <--- row #364

input of file 2 (this file has 2,498,588 columns with single digit numbers, starting with column 1 as shown below, each column is separated by a space)

MXY2344
MXY2455
.
.
.
.
.
.
.
MXY9150 <--- row #364
.
.
.
.
.
.
.
.
.
.
.
MXY9423 <--- row #1411

desired output file 3 (with only 364 rows, the ids matched between file1 and file2, and 2,498,588 columns)

MXY2344
MXY2455
.
.
.
.
.
.
.
MXY9150 <--- row #364

Thank you for any help!

---------- Post updated at 11:10 PM ---------- Previous update was at 11:03 PM ----------

I just checked the results I obtained with grep -A1 -A1 -f file1.txt file2 > file3

and they are wrong. Instead of getting only 364 rows, I get 367 and some of the ids of file 1 are missing in the output file 3. I want to match the ids from file1 (my "golden" list) in file2 and output that in file 3
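
Two things could plausibly explain those symptoms, though neither can be confirmed from what is posted here: -A1 adds the line after each match (extra rows), and an id list saved with Windows line endings would keep some patterns from ever matching (missing ids). A plain grep -f also treats every id as a regular expression that may match anywhere in a line. The sketch below, on made-up data, strips carriage returns and then matches the ids as fixed whole words. The file names and values are assumptions for illustration only.

```shell
# Hypothetical sample data; ids and values are made up.
tmp=$(mktemp -d)
printf 'MXY23\r\nMXY9\r\n' > "$tmp/ids.txt"   # id list saved with CRLF line endings
printf 'MXY23 0 1\nMXY234 1 1\nMXY9 2 0\nMXY5 0 0\n' > "$tmp/data.txt"

# Strip carriage returns so the ids can compare equal to column 1 of the data file.
tr -d '\r' < "$tmp/ids.txt" > "$tmp/ids.clean"

# -F: ids are fixed strings, not regular expressions.
# -w: match whole words only, so MXY23 does not also select MXY234.
matched=$(grep -w -F -f "$tmp/ids.clean" "$tmp/data.txt")
printf '%s\n' "$matched"

rm -rf "$tmp"
```

Note that -w matches the id as a word anywhere in the line, not strictly in column 1. That is harmless only if the other columns can never contain an id, as with single-digit values.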

sogi · June 30, 2009, 7:35pm

Both files have no headings

input of file1.txt (has 1 column, as shown below):

MXY2344
MXY2455
.
.
.
.
.
.
.
MXY9150 <--- row #364

input of file2.ped (this file has more than 2 million columns with single digit numbers, starting with column 1 as shown below, each column is separated by a space)

MXY2344
MXY2455
.
.
.
.
.
.
.
MXY9150 <--- row #364
.
.
.
.
.
.
.
.
.
.
.
MXY9423 <--- row #1411

desired output file 3 (with only 364 rows, the ids matched between file1 and file2, and 2,498,588 columns)

MXY2344
MXY2455
.
.
.
.
.
.
.
MXY9150 <--- row #364

Thank you for any help!

I used grep -A1 -A1 -f file1.txt file2 > file3 but that did not work.

I only got one reply for this thread yesterday saying to use grep, so that's why I'm posting this again in hopes somebody would help.

Thank you!

vidyadhar85 · June 30, 2009, 7:56pm

If you want to pull the rows of file2 whose ids are present in file1:

grep -f file1 file2 > file3
or
awk 'NR==FNR { ids[$1]; next }   # first file: remember every id
     $1 in ids                   # second file: print rows whose col 1 is a known id
' file1 file2 > file3

sogi · June 30, 2009, 11:13pm

I already tried that grep code and it did not work either. I don't know if it is because of the file extension of file2.ped (a .ped file is a text file that can hold millions of columns).

---------- Post updated at 08:13 PM ---------- Previous update was at 05:41 PM ----------

The memory is exhausted when using these command lines.
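
It is not clear from the thread which command ran out of memory or why. One guess is that splitting each line into roughly 2.5 million fields is expensive in some awk implementations. A minimal sketch that avoids field splitting altogether: it sets the field separator to a newline so each line stays a single field, and cuts the id out with index() and substr(). The file names and sample values are hypothetical, and this has not been tried on a file of that size.

```shell
# Hypothetical sample data standing in for the id list and the wide data file.
tmp=$(mktemp -d)
printf 'MXY1\nMXY3\n' > "$tmp/ids.txt"
printf 'MXY1 0 1 2\nMXY2 1 1 0\nMXY3 2 0 1\n' > "$tmp/data.ped"

# -F'\n' keeps every line as one field, so awk never splits the wide rows.
selected=$(awk -F'\n' '
  NR == FNR { ids[$0]; next }                    # first file: one id per line
  {
    i = index($0, " ")                           # position of the first space
    id = (i > 0) ? substr($0, 1, i - 1) : $0     # text before it is column 1
    if (id in ids) print                         # keep rows with a listed id
  }
' "$tmp/ids.txt" "$tmp/data.ped")
printf '%s\n' "$selected"

rm -rf "$tmp"
```

Only the id list is held in memory; the large file is read one line at a time. This assumes the id list has no trailing spaces or carriage returns.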