Slow Running Script (Reading 8000 lines)

u20sr · July 12, 2013, 12:19pm

Slow running script. The problem seems to be the sed calls.
In summary, the script reads a list of users from file1. For each
username it searches two files (file2 and file3) for the username
and gets the value after "=" on the following line, then compares
the two values.

If they are the same, the username is written to one file; if not, to another.
For approx. 8000 lines in file1 it takes approx. 15 minutes to run,
which is not very good. Any suggestions on removing the bottleneck?

Script:

#!/bin/ksh
compareusernames() {
p_name="$1"
result_one=`sed -n "/$p_name/{n;p;}" $file2 | cut -d= -f2 | tr -d ' '`
result_two=`sed -n "/$p_name/{n;p;}" $file3 | cut -d= -f2 | tr -d ' '`
if [[ "$result_one" = "$result_two" ]] then
        echo "$p_name" >> matches.out
else
        echo "$p_name" >> no_matches.out
fi
 
i=0
while read v_name
do
compareusernames "$v_name"
((i=$i+1))
done < $file1


}

File1:

user1
user2
user3
user4

File2:

name=user1
gud=100
name=user2
gud=200
name=user3
gud=300

File3:

name=user1
gud=100
name=user2
gud=xxx
name=user3
gud=xxx

Yoda · July 12, 2013, 1:06pm

Try using this awk program instead:

awk '
        BEGIN {
                F = "file1"
                while (( getline line < F ) > 0 )
                {
                        A1[line]
                }
                close (F)

                F = "file2"
                while (( getline line < F ) > 0 )
                {
                        n = split ( line, V, "=" )
                        if ( V[2] in A1 )
                        {
                                i = V[2]
                                getline line < F
                                n = split ( line, V, "=" )
                                A2[i] = V[2]
                        }
                }
                close (F)

                F = "file3"
                while (( getline line < F ) > 0 )
                {
                        n = split ( line, V, "=" )
                        if ( V[2] in A1 )
                        {
                                i = V[2]
                                getline line < F
                                n = split ( line, V, "=" )
                                A3[i] = V[2]
                        }
                }
                close (F)
        }
        END {
                for ( k in A1 )
                {
                        if ( A2[k] == A3[k] && A2[k] && A3[k] )
                                print k > "matches.out"
                        else
                                print k > "no_matches.out"
                }

        }
' /dev/null

Let me know how long it took to complete execution.

Corona688 · July 12, 2013, 1:14pm

For every name in file1 you are reading and processing two data files through six processes. Not very efficient.

I'd try a language like awk to make looking up the data much easier, but the format of your data files makes that harder than it needs to be. Is that format fixed? Whichever way you choose, it would be much easier to work with files that have one "username gud" line per user.
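A minimal sketch of that flattening, assuming the name= and gud= lines always come in pairs and in that order. The file names file2.sample and file2.flat are made up for illustration:

```shell
# Sample input in the two-line format from the first post
# (file2.sample is a hypothetical name).
cat > file2.sample <<'EOF'
name=user1
gud=100
name=user2
gud=200
EOF

# Remember the most recent name= value and print it next to
# the value of the gud= line that follows it.
awk -F= '
$1 == "name" { u = $2; next }
$1 == "gud"  { print u, $2 }
' file2.sample > file2.flat

cat file2.flat
```

Each user then sits on one line together with its value, which is far easier to feed to sort, join, or a single awk pass.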

Don_Cragun · July 12, 2013, 2:08pm

I note that the original ksh script (when fixed to remove the syntax errors and properly terminate the function) includes user4 in matches.out and Yoda's awk script includes user4 in no_matches.out. When an entry in File1 does not appear in File2 or File3, should that entry be:

added to matches.out,
added to no_matches.out,
ignored, or
reported with a diagnostic saying the entry was not found?

I don't understand why Yoda didn't use FS="=" instead of splitting lines after reading them, but until I know how to handle the issue above, I'm not going to post my awk script.
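To illustrate the point about FS, here is a small sketch (the sample input is invented for the demo) showing that -F= yields the same fields as calling split() on every line:

```shell
input='name=user1
gud=100'

# awk splits the fields itself because FS is "=".
with_fs=$(printf '%s\n' "$input" | awk -F= '{ print $1 ":" $2 }')

# Same result, splitting each line by hand.
with_split=$(printf '%s\n' "$input" |
        awk '{ split($0, V, "="); print V[1] ":" V[2] }')

printf '%s\n' "$with_fs" "$with_split"
```

Both commands print the same two lines, so the explicit split() calls only add code.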

u20sr · July 23, 2013, 7:39am

Don_Cragun: Good point. These should be sent to a separate file, something like no_results.out.

Also, if the value after the = in file2 is empty, the name should be sent to no_gud.out.

MadeInGermany · July 23, 2013, 12:46pm

step 1

result_one=`sed -n "/$p_name/ {n;p;q;}" $file2 | cut -d= -f2 | tr -d ' '`

step 2

result_one=`awk -F= 'm==1 {print $2; exit} $2~/'$p_name'/ {m=1}' $file2 | tr -d ' '`
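One thing to watch with a regular-expression match on the name: a search for user1 also matches user10. A possible variant, as a sketch only, passes the name in with -v and compares it exactly. The file name file2.sample and the variable n are made up for this demo:

```shell
# Sample data with two names that share a prefix
# (file2.sample is a hypothetical name).
cat > file2.sample <<'EOF'
name=user1
gud=100
name=user10
gud=999
EOF

p_name=user1

# m is set when a name= line matches exactly; the next line
# read is then printed and awk exits without reading further.
result_one=$(awk -F= -v n="$p_name" '
m { print $2; exit }
$1 == "name" && $2 == n { m = 1 }
' file2.sample | tr -d ' ')

echo "$result_one"
```

This still starts one awk per user, so it shortens each lookup but does not remove the per-user cost of rereading the file.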

Don_Cragun · July 23, 2013, 4:16pm

Your requirements are still ambiguous. The following awk script puts a name found in File1 into one of four files:

in no_results.out if the name does not appear in File2 and does not appear in File3,
in no_gud.out if the name appears in File2 or File3 but not in both, or if the gud=value line in either file has an empty value string,
in no_matches.out if the name appears in File2 and File3 and the value in gud=value in both files is not empty but the values are different, or
in matches.out if the name appears in both files, neither value is empty, and both values are identical.

If this isn't what you want, please restate your exact requirements.

awk -F= '
FILENAME != lf {
        f++
        lf = FILENAME
}
$1 == "name" {
        u = $2
        next
}
$1 == "gud" {
        f == 1 ? r1[u] = $2 : r2[u] = $2
}
f == 3 {if(!($1 in r1) && !($1 in r2)) print > "no_results.out"
        else    if(!($1 in r1) || r1[$1] == "" ||
                   !($1 in r2) || r2[$1] == "") print > "no_gud.out"
        else    if(r1[$1] != r2[$1]) print > "no_matches.out"
        else    print > "matches.out"
}' File2 File3 File1

As always, if you try running this on a Solaris/SunOS system, use /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk instead of /bin/awk or /usr/bin/awk .
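For anyone who wants to check the four-way classification against the sample data from the first post, here is a minimal sketch. It is a restatement of the same rules using if/else and an FNR == 1 file counter (which assumes no input file is empty), not the script above verbatim, and it writes its files into the current directory:

```shell
# Sample data copied from the first post.
printf '%s\n' user1 user2 user3 user4 > File1
printf '%s\n' name=user1 gud=100 name=user2 gud=200 name=user3 gud=300 > File2
printf '%s\n' name=user1 gud=100 name=user2 gud=xxx name=user3 gud=xxx > File3
rm -f matches.out no_matches.out no_gud.out no_results.out

awk -F= '
FNR == 1     { f++ }                  # a new input file has started
$1 == "name" { u = $2; next }         # remember the current user
$1 == "gud"  { if (f == 1) r1[u] = $2; else r2[u] = $2; next }
f == 3 {                              # lines of File1: one user per line
        if (!($1 in r1) && !($1 in r2))
                out = "no_results.out"
        else if (!($1 in r1) || r1[$1] == "" || !($1 in r2) || r2[$1] == "")
                out = "no_gud.out"
        else if (r1[$1] != r2[$1])
                out = "no_matches.out"
        else
                out = "matches.out"
        print > out
}' File2 File3 File1

# Show whichever output files were created.
for o in matches.out no_matches.out no_gud.out no_results.out
do
        [ -f "$o" ] && { echo "== $o"; cat "$o"; }
done
```

With that data, user1 lands in matches.out, user2 and user3 in no_matches.out, and user4 in no_results.out; no_gud.out is not created.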

u20sr · July 24, 2013, 9:25am

Working flawlessly. Running time is under a second.

Thank you.

I can see the bare-bones logic but am struggling a bit with the syntax. Could you explain a bit more about what each line is actually doing?

Don_Cragun · July 24, 2013, 11:40am

Here is a fully commented version of the script:

awk -F= '               # Set field separator to "="
# 1st two input files consist of pairs of lines using the format:
#               name=user
#               gud=value
# in this order.
# So with FS set to "=", $1 will be "name" or "gud" and $2 will be the name of
# the user or the value associated with that user name.

# 3rd input file consists of lines using the format:
#               user

# f     number of files seen
# lf    name of last file seen
# u     user from name=user line in 1st two input files
# r1[u] recorded value from 1st file from gud=value line for name u
# r2[u] recorded value from 2nd file from gud=value line for name u
FILENAME != lf {        # If this is the first time we have a line from this file
        f++             # increment the number of files seen and
        lf = FILENAME   # save the name of the current file for comparison.
}
$1 == "name" {          # If this is a name= line in 1 of the 1st 2 files,
        u = $2          # save the user name, and
        next            # skip to next input line.
}
$1 == "gud" {           # If this is a gud= line in 1 of the 1st 2 files...
        # If we are processing a line from the 1st file, set r1[] for the
        # current user name to the value found on this line from the 1st input
        # file; otherwise set r2[] for the current user name to the value found
        # on this line from the 2nd input file.
        f == 1 ? r1[u] = $2 : r2[u] = $2
}
f == 3 {                # If this line is from the 3rd input file...
        # If the user on this line was not in either of the 1st 2 files, save
        # the name in no_results.out.
        if(!($1 in r1) && !($1 in r2)) print > "no_results.out"
        # otherwise,
        # if the user on this line was not in one of the 1st 2 files or if the
        # value associated with this user was an empty string in one of the
        # files, save the name in no_gud.out.
        else    if(!($1 in r1) || r1[$1] == "" ||
                   !($1 in r2) || r2[$1] == "") print > "no_gud.out"
        # otherwise,
        # if the value for the user is different in the 1st 2 files, save the
        # name in no_matches.out.
        else    if(r1[$1] != r2[$1]) print > "no_matches.out"
        # otherwise,
        # the user is in both files and the values match; save the name in
        # matches.out.
        else    print > "matches.out"
}' File2 File3 File1    # The input files are File2, File3, and File1
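The file-counting rule is probably the least obvious part, so here is a tiny sketch of it in isolation. The two demo files are made up for illustration:

```shell
# Two throwaway input files (hypothetical names).
printf '%s\n' a1 a2 > demo_a.txt
printf '%s\n' b1 > demo_b.txt

# f goes up by one each time FILENAME differs from the last
# file name seen; every line is printed with its file number.
out=$(awk '
FILENAME != lf { f++; lf = FILENAME }
{ print f, FILENAME, $0 }
' demo_a.txt demo_b.txt)

printf '%s\n' "$out"
```

Lines from demo_a.txt are tagged with 1 and the line from demo_b.txt with 2, which is how the script tells File2, File3, and File1 apart in a single pass.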

u20sr · July 24, 2013, 12:51pm

Extremely helpful - much appreciated.