Help to retrieve data from two files matching a string

shaliniyadav · September 23, 2011, 4:14am

Hello Experts,

I have come back to this forum after a while now, since require a better way to get my result.. My query is as below..

I have 3 files -- 1 Input file, 2 Data files .. Based on the input file, data has to be retreived matching from two files which has one common key..
For EX:

Input file

 
919868191075
919868191028

Data File1

 
 
subsD,00 0A 02 48 4A 6B 92 01 58 06 5E 00,hlr,common,404686600492447,919868191075,919868191075,00000000,FALSE,FALSE,3,10,11|
subsD,01 40 46 85 50 47 15 87 40 00 00 00,hlr,common,404685504715874,919868191008,919868191008,00000000,FALSE,FALSE,1,10,11|
subsD,01 40 46 86 60 04 72 06 80 00 00 00,hlr,common,404686600473790,919868191085,919868191085,00000000,FALSE,FALSE,1,10,11|
subsD,00 0A 01 4A FB A6 C4 01 CD 0E 9A 00,hlr,common,404685505866829,919868191028,919868191028,00000000,FALSE,FALSE,6,10,11|

Data file2

imsiD,404686600667311,,7F0413BE=subsD|70400F2=00 0A 02 48 4A 6B 92 01 58 06 5E 00
imsiD,404686600369463,,7F0413BE=subsD|70400F2=01 40 46 85 50 47 15 87 40 00 00 00
imsiD,404685507343909,,7F0413BE=subsD|70400F2=01 40 46 86 60 04 72 06 80 00 00 00
imsiD,404685504666094,,7F0413BE=subsD|70400F2=01 40 46 85 50 46 66 09 40 00 00 00
imsiD,404686600708986,,7F0413BE=subsD|70400F2=00 0A 01 4A FB A6 C4 01 CD 0E 9A 00

Output should be like

 
Number Data1 Data2
919868191075 000A02484A6B920158065E00 404686600667311
919868191028 000A014AFBA6C401CD0E9A00 404686600708986

First column is the string from input file(919868191075) , next is the 2nd field in CSV file1 matching string in 6th field(000A02484A6B920158065E00 ), last one is 2nd field in csv file2 here matching string would be output retreived from first file(404686600667311
)...

Thanks in advance.. Looking forward for your help...

jayan_jay · September 23, 2011, 4:59am

for i in `cat inputfile`
do
        val1=`grep $i datafile1 | awk -F, '{print $2}'`
        echo "$i $val1 `grep "$val1" datafile2 | awk -F, '{print $2}'`"
done

ygemici · September 23, 2011, 5:07am

printf "%10s%25s%25s\n" "Number" "Data1" "Data2"
while read -r nrs ; do
for d1 in "$(sed -n "/$nrs/p" file1|sed 's/subsD,\([^,]*\),.*/\1/')"; do
d2=$(sed -n "/$d1/s/imsiD,\([^,]*\),.*/\1/p" file2);done
d1n=$(echo "$d1"|sed 's/ //g');printf "%10s %30s %20s\n" "$nrs" "$d1n" "$d2"
done<numbers
    Number                    Data1                    Data2
919868191075       000A02484A6B920158065E00      404686600667311
919868191028       000A014AFBA6C401CD0E9A00      404686600708986

shaliniyadav · September 23, 2011, 6:31am

@jayan_jay : Thanks a lot.. It does work for entries if exists in datafile, if it doesnt match then getting error --

 
919968531998 00 0A 02 48 4A 6B 92 01 58 06 5E 00 404686600667311
grep: RE error 41: No remembered search string.
7878787878787 
919968754673 00 0A 01 4A FB A6 C4 01 CD 0E 9A 00 404686600708986

In actual file their are millions of lines...

@ygemi: Thanks a lot , i wil check and get back to you

hartz · September 23, 2011, 8:05am

I think join is the easiest (and most elegant) solution for this. Join requires that the files be sorted by the joining column, so ... some pre-work.

#!/bin/sh
sort input > sorted.input
sort -t, -k6 file1 > sorted.file1
tr = , <file2 | sort -t, -k6 > sorted.file2
join -t, -o 1.1,2.2,2.5 -1 1 -2 6 sorted.input sorted.file1 | sort -t, -k2 > sorted.temp1
echo "Number\tData1\tData2"
join -t, -o 1.1,1.2,2.2 -1 2 -2 6 sorted.temp1 sorted.file2 | sed 's/ //g; s/,/\t/g'

The question is: how do you want to handle lines which do not match? You can handle this in different ways, depending on whether there is a line in "input" with no line in file1, and then again the other way round if there is no line in input but some lines in file1 or file2. If you are sure there will always be pairs, then my work here is done.

But you have already hinted at lines that wasn't atched, so have a look at the options for -e as well as -a 1 and/or -a 2 in the man page of the join command to see what is possible. It is called UNPAIRABLE lines.

If the options confuse you, explain what you need and someone will surely help.