awk, associative array, compare files

I have a file like this:

< '393200103052';'H3G';'20081204'
< '393200103059';'TIM';'20110111'
< '393200103061';'TIM';'20060206'
< '393200103064';'OPI';'20110623'
> '393200103052';'HKG';'20081204'
> '393200103056';'TIM';'20110111'
> '393200103088';'TIM';'20060206'

Now I have to generate a file with records like this:
For '393200103052' field2 differs: 'H3G' 'HKG'
I also have to generate one more file like the one below, containing the records that have no counterpart with the '>' symbol:

< '393200103059';'TIM';'20110111'
< '393200103061';'TIM';'20060206'
< '393200103064';'OPI';'20110623'

If anyone can help, thanks in advance.

It's better to post a descriptive topic than "help needed please".

What rationale picks out H3G and HKG as different? They look the same to me.

How should the program know which records don't exist? Is there some other file which contains all the records that must exist?

I have the following two files, file1.txt and file2.txt.
File1.txt

Field1|Field2|Field3
'393200103001';'TIM';'20080205'
'393200103017';'TIM';'20040521'
'393200103025';'OPI';'20041025'
'393200103032';'OPI';'20080218'
'393200103048';'OPI';'20101122'
'393200103052';'H3G';'20081204'
'393200103059';'TIM';'20110111'
'393200103061';'TIM';'20060206'
'393200103064';'OPI';'20110623'

File2.txt

Field1|Field2|Field3
'393200103001';'TIM';'20080205'
'393200103017';'TIM';'20040521'
'393200103025';'OPI';'20041025'
'393200103032';'OPI';'20080218'
'393200103048';'OPI';'20101122'
'393200103052';'HKG';'20081204'
'393200103056';'TIM';'20110111'
'393200103088';'TIM';'20060206'

My requirement is to generate three more files:
missed_file1.txt - the records that are in file2 but not in file1.
missed_file2.txt - the records that are in file1 but not in file2.
common.txt - which should look like this:
For '393200103052' field2 differs file1:'H3G' file2:'HKG'

I think that is clearer. I had done this already, but the problem is that my script works fine for small files, while the original files have millions of records, so it takes a very long time.

---------- Post updated at 12:26 PM ---------- Previous update was at 12:22 PM ----------

awk -F ';' '
        # load first file into an array indexed by field 1
        NR == FNR {
                for (i = 2; i <= NF; i++)
                        file1[$1, i] = $i
                # remember the number of fields for this key
                file1nf[$1] = NF
                next
        }
        # second file: key never seen in file1 -> missing record
        !($1 in file1nf) {
                print $1 ";" $2 ";" $3 > "missed_file1.txt"
        }
' file1.txt file2.txt

I used the same script to generate missed_file2.txt by changing the last line to this:
file2.txt file1.txt

awk -F \; 'NR==FNR{a[$1];next} !($1 in a)' file1.txt file2.txt > missed_file1.txt

'393200103056';'TIM';'20110111'
'393200103088';'TIM';'20060206'

awk -F \; 'NR==FNR{a[$1];next} !($1 in a)' file2.txt file1.txt > missed_file2.txt

'393200103059';'TIM';'20110111'
'393200103061';'TIM';'20060206'
'393200103064';'OPI';'20110623'

awk -F \; 'NR==FNR{a[$1]=$2;next} $1 in a && a[$1]!=$2{print $1, $2,a[$1]} '  file2.txt file1.txt  > common.txt

'393200103052' 'H3G' 'HKG'

It's working really well. But my question is: my original file1.txt and file2.txt have millions of records, so will this take a huge amount of time, or is it efficient enough?
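If the awk arrays turn out too big for memory at that scale, a sort-and-merge approach is the usual alternative. This is only a sketch, assuming the ';'-separated layout and file names from the samples above:

```shell
# Extract the key column, sort it, and let comm split the key sets.
# comm -13 prints keys only in its second input, -23 keys only in its first.
cut -d';' -f1 file1.txt | sort > keys1
cut -d';' -f1 file2.txt | sort > keys2
comm -13 keys1 keys2 > keys_only_in_file2   # candidates for missed_file1.txt
comm -23 keys1 keys2 > keys_only_in_file1   # candidates for missed_file2.txt
```

sort(1) spills to temporary files instead of holding everything in RAM, which is why this can scale past what an in-memory awk array handles comfortably.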

---------- Post updated at 11:21 PM ---------- Previous update was at 11:12 PM ----------

I need to compare field3 the same way as well. I am a Java developer with no shell scripting experience, so please help me with this too. I mean that if field3 or field2 changes, I need to show both changes.
File1.txt
'393200103088';'TIM';'20060207'
File2.txt
'393200103088';'TIM';'20060208'
common.txt (field3 has changed for this field1)
'393200103088' '20060207' '20060208'
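The earlier common.txt one-liner can be extended to check field3 the same way as field2. A sketch along the same lines (the exact output wording is only an illustration):

```shell
awk -F \; '
    # first pass: remember field2 and field3 from file2, keyed on field1
    NR == FNR { f2[$1] = $2; f3[$1] = $3; next }
    # second pass (file1): report each field that differs for a common key
    $1 in f2 && f2[$1] != $2 {
        print "For " $1 " field2 differs file1:" $2 " file2:" f2[$1]
    }
    $1 in f3 && f3[$1] != $3 {
        print "For " $1 " field3 differs file1:" $3 " file2:" f3[$1]
    }
' file2.txt file1.txt > common.txt
```

A key that differs in both fields produces two lines in common.txt, one per field.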

---------- Post updated at 11:39 PM ---------- Previous update was at 11:21 PM ----------

I generated file3.txt with the following command:
diff file1.txt file2.txt > file3.txt
If you look into file3.txt:
The records missing from file2 start with the '<' symbol.
The records missing from file1 start with the '>' symbol.
It also has common records (records with a common first field). Using this, I have to generate the same three files I mentioned in my previous post:
1. missed_file1.txt
2. missed_file2.txt
3. common.txt
< '393200103052';'H3G';'20081204'
< '393200103059';'TIM';'20110111'
< '393200103061';'TIM';'20060206'
< '393200103064';'OPI';'20110623'
> '393200103052';'HKG';'20081204'
> '393200103056';'TIM';'20110111'
> '393200103088';'TIM';'20060206'
I think the '<' and '>' symbols here are helpful for generating the required files.
Please help me with this.
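Since file3.txt already contains only the differing lines, a single awk pass over it can produce all three files. A sketch, assuming file3.txt is raw `diff file1.txt file2.txt` output as shown above (hunk headers like `6,9c6,8` and `---` lines are simply ignored, since only `< ` and `> ` lines match):

```shell
awk '
    # "<" lines come from file1: hold them, keyed on their first field
    /^< / { line = substr($0, 3); split(line, f, ";"); left[f[1]] = line; next }
    # ">" lines come from file2: either a changed record or one missing from file1
    /^> / {
        line = substr($0, 3); split(line, f, ";")
        if (f[1] in left) {
            split(left[f[1]], g, ";")
            if (g[2] != f[2])
                print "For " f[1] " field2 differs file1:" g[2] " file2:" f[2] > "common.txt"
            if (g[3] != f[3])
                print "For " f[1] " field3 differs file1:" g[3] " file2:" f[3] > "common.txt"
            delete left[f[1]]
        } else
            print line > "missed_file1.txt"
    }
    # whatever is left had no ">" counterpart, i.e. missing from file2
    END { for (k in left) print left[k] > "missed_file2.txt" }
' file3.txt
```

Only the diff lines are held in memory, which should be far smaller than the full inputs when the files mostly agree.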

---------- Post updated 10-19-11 at 12:23 AM ---------- Previous update was 10-18-11 at 11:39 PM ----------

My original file sizes:
file1.txt - 604 MB
file2.txt - 422 MB
I tested your code with the original files and it is taking a very long time to process them. Please help me optimize the time.

How long is it taking? You aren't going to process 600 MB of data in a split second.

No, I am not able to process it in a split second. I started the process an hour ago and it is still running. Even the generation of missed_file1.txt alone is still in progress.

awk -F \; 'NR==FNR{a[$1];next} !($1 in a)' file1.txt file2.txt > missed_file1.txt

Can you please look at file3.txt? Can we generate the three files using file3.txt?

Processing the large files just to generate missed_file1.txt took 9 hours, so I need a very efficient process using file3.txt. Can you please help if possible?

You'll have to sort them first, then compare them line by line. Does the order of the output have to be the same as the order of the input?

Working on something.

---------- Post updated at 12:04 PM ---------- Previous update was at 10:29 AM ----------

$ cat missing.sh

#!/bin/sh

FA="data1"
FB="data2"

# Create temp files data1.1/data2.1 containing only the first column.
# This lets us feed them into the 'comm' utility, which produces
# output we can quickly and easily process in awk.
# NOTE: comm requires sorted input, so data1/data2 must already be
# sorted on field 1 (sort them first otherwise).
awk -v FS=";" '{ print $1 > (FILENAME ".1") }' ${FA} ${FB}
comm ${FA}.1 ${FB}.1 |
        awk -v FA="${FA}" -v FB="${FB}" -f missing.awk

# Delete temporary files
rm -f ${FA}.1 ${FB}.1


$ cat missing.awk
# Two tabs means third column, $1 is a token common to both files
/^\t\t\047/     {
                        getline AS<FA;  split(AS, A, ";");
                        getline BS<FB;  split(BS, B, ";");

                        for(N=2; N<=3; N++)
                        if(A[N] != B[N])
                        printf("for %s field%d differs %s:%s %s:%s\n",
                                A[1], N, FA, A[N], FB, B[N]) > "common.txt"
                }
# One tab means second column, $1 is only found in FB
/^\t\047/       {       if(getline <FB) print > ("missed_" FB)     }
# No tabs means first column, $1 is only found in FA
/^\047/         {       if(getline <FA) print > ("missed_" FA)     }

$ ./missing.sh

$ cat missed_data1

'393200103059';'TIM';'20110111'
'393200103061';'TIM';'20060206'
'393200103064';'OPI';'20110623'

$ cat missed_data2

'393200103056';'TIM';'20110111'
'393200103088';'TIM';'20060206'

$ cat common.txt

for '393200103052' field2 differs data1:'H3G' data2:'HKG'

$