Common records

jacobs.smith · May 9, 2012, 1:53pm

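Hi, I have the following two files, and I want the records whose first two columns appear in both, with the remaining columns combined: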

A M 2 3
B E 4 5
C I 5 6
D O 4 5

A M 3 4
B E 5 2
F U 7 9
J K 2 3

OUTPUT
A M 2 3 3 4
B E 4 5 5 2

Thanks in advance.

bartus11 · May 9, 2012, 1:56pm

awk 'NR==FNR{a[$1"-"$2]=$0;next}$1"-"$2 in a{print a[$1"-"$2],$3,$4}' file1 file2
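For readers unfamiliar with the NR==FNR idiom, here is the same one-liner spread out with comments and run against the small sample from the first post. This is only a sketch: the temporary directory and the array name `seen` are illustrative choices, not part of the original answer.

```shell
# Recreate the two small sample files from the first post.
tmp=$(mktemp -d)
printf '%s\n' 'A M 2 3' 'B E 4 5' 'C I 5 6' 'D O 4 5' > "$tmp/file1"
printf '%s\n' 'A M 3 4' 'B E 5 2' 'F U 7 9' 'J K 2 3' > "$tmp/file2"

# Same logic as the one-liner, with one rule per line.
result=$(awk '
    # First file only (NR == FNR): store each whole line under its two-column key.
    NR == FNR { seen[$1 "-" $2] = $0; next }
    # Second file: if the key was stored, print the stored line plus columns 3 and 4.
    ($1 "-" $2) in seen { print seen[$1 "-" $2], $3, $4 }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
```

On the sample data this prints the two lines shown under OUTPUT in the first post.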

jacobs.smith · May 15, 2012, 1:46pm

Hi,

If I use the same command,

awk 'NR==FNR{a[$1"-"$2]=$0;next}$1"-"$2 in a{print a[$1"-"$2],$3,$4}' file1 file2

I get around 293 records. But when I swap the order of the files,

awk 'NR==FNR{a[$1"-"$2]=$0;next}$1"-"$2 in a{print a[$1"-"$2],$3,$4}' file2 file1

I get around 370 records.

My file1 has 8219 records and file2 has 762 records.

Corona688 · May 15, 2012, 1:53pm

Post some of your actual input data instead of a mockup sample; it may be different from what you expect.

jacobs.smith · May 15, 2012, 2:06pm

FILE1

0610009B14Rik	NR_037995	38	0
0610040J01Rik	NM_029554	21	0
1110012J17Rik	NM_001114098	394	0
1110017D15Rik	NM_001048005	95	0
1110032A04Rik	NM_001164210	147	0
1110059M19Rik	NM_026841	53	0
1190003J15Rik	NM_029821	40	0
1300014I06Rik	NM_025831	56	0
1300017J02Rik	NM_027918	3	0
1500009C09Rik	NR_037698	828	0
1500015O10Rik	NM_024283	366	0
1500016L03Rik	NR_038057	414	0
1600029D21Rik	NM_029639	15	0
1600029I14Rik	NR_028123	10	0
1700001C02Rik	NM_029285	24	0
1700001G11Rik	NR_038077	1	0
1700001L19Rik	NM_027035	406	0
1700003E16Rik	NM_027948	27	0
1700003M02Rik	NM_027041	2	0
1700007K13Rik	NM_027040	26	0
1700009J07Rik	NR_015547	4	0

FILE2

0610010O12Rik	NM_001081365	0	1
1300017J02Rik	NM_027918	0	17
1500015O10Rik	NM_024283	0	1
1700003G18Rik	NR_029433	0	1
1700011H14Rik	NM_025956	0	2
1700016D06Rik	NM_024271	0	3
1700047M11Rik	NR_015458	0	7
1700061J05Rik	NM_028522	0	1
1810010H24Rik	NM_001163473	0	4
2010005H15Rik	NM_029733	0	4
2010107G23Rik	NM_027251	0	23
2200002K05Rik	NM_026955	0	15
2310005G13Rik	NM_183281	0	6
2510049J12Rik	NM_001101431	0	12
2610034M16Rik	NM_027001	0	10
2610528J11Rik	NM_025572	0	6
4632428C04Rik	NR_033631	0	2
4930412F15Rik	NM_175517	0	4
4930511M11Rik	NM_029141	0	9
4930528F23Rik	NM_029197	0	9
4930555I21Rik	NM_030189	0	1
4930579C15Rik	NM_027089	0	1
4930579G22Rik	NM_026916	0	1
4931428L18Rik	NR_033445	0	4

bartus11 · May 15, 2012, 2:16pm

For those two sample files, my code outputs two lines regardless of which file comes first. Can you post some sample data for which my code does not work?
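One possible cause of the differing counts, which nobody in the thread confirmed, is a key that occurs more than once in one of the files. The array keeps only the last line stored for each key in the first file, while every matching line in the second file is printed, so the result depends on file order. The sketch below uses made-up data (not the poster's) to show the asymmetry and to list repeated keys.

```shell
tmp=$(mktemp -d)
# Hypothetical data: the key "A M" appears twice in file1 and once in file2.
printf '%s\n' 'A M 2 3' 'A M 9 9' 'B E 4 5' > "$tmp/file1"
printf '%s\n' 'A M 3 4' 'B E 5 2' > "$tmp/file2"

prog='NR==FNR{a[$1"-"$2]=$0;next}$1"-"$2 in a{print a[$1"-"$2],$3,$4}'

# file1 first: the second "A M" line overwrites the first in the array.
n12=$(awk "$prog" "$tmp/file1" "$tmp/file2" | awk 'END { print NR }')
# file2 first: every "A M" line in file1 matches and is printed.
n21=$(awk "$prog" "$tmp/file2" "$tmp/file1" | awk 'END { print NR }')
echo "file1 first: $n12 lines, file2 first: $n21 lines"

# List keys that occur more than once in file1, with their counts.
dups=$(awk '{ count[$1 "-" $2]++ }
            END { for (k in count) if (count[k] > 1) print k, count[k] }' "$tmp/file1")
echo "$dups"
```

Running the duplicate check on each real file would show whether repeated keys are present. Stray whitespace or Windows line endings are other things worth ruling out.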

jacobs.smith · May 15, 2012, 2:26pm

I am not sure which part of the input files is being read by your solution.

My files have around 8K records, which is too many to post here.

Thanks anyway.

drl · May 15, 2012, 2:38pm

Hi.

Using join (with the help of sed and sort) I also get 2 lines:

1300017J02Rik	NM_027918 3 0 0 17
1500015O10Rik	NM_024283 366 0 0 1

although I don't know whether they are the same lines bartus11 got, or whether they are correct, because no one posted any results, expected or obtained, for the second data sets.

I note that your first data sets had space delimiters, and the second sets had TABs ... cheers, drl
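The exact join commands were not posted, so the following is only one way such a pipeline might look, not necessarily what drl ran. It assumes the character "|" never appears in the data, and it uses awk rather than sed to fuse the two key columns. Only a few rows from the posted samples are included.

```shell
tmp=$(mktemp -d)
# A few rows from the posted samples (TAB-separated, as in the second post).
printf '%s\t%s\t%s\t%s\n' \
    1300014I06Rik NM_025831 56 0 \
    1300017J02Rik NM_027918 3 0 \
    1500015O10Rik NM_024283 366 0 > "$tmp/file1"
printf '%s\t%s\t%s\t%s\n' \
    0610010O12Rik NM_001081365 0 1 \
    1300017J02Rik NM_027918 0 17 \
    1500015O10Rik NM_024283 0 1 > "$tmp/file2"

# Fuse columns 1 and 2 into a single key so join can match on one field,
# then sort each file on that key (join requires sorted input).
prep() { awk '{ print $1 "|" $2, $3, $4 }' "$1" | LC_ALL=C sort -k1,1; }
prep "$tmp/file1" > "$tmp/f1.sorted"
prep "$tmp/file2" > "$tmp/f2.sorted"

# Join on the fused key, then split the key back into two columns.
joined=$(LC_ALL=C join "$tmp/f1.sorted" "$tmp/f2.sorted" | sed 's/|/ /')
printf '%s\n' "$joined"
```

Because awk splits on any run of blanks by default, this handles both the space-delimited and the TAB-delimited samples. The output is space-separated throughout.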