Performance problem in Shell Script

Hi,
I am a shell script beginner.
I wrote a shell script that takes each line of file1, searches for it in file2, and outputs the lines that do not exist in file2.
I wrote it using a nested while loop, but the problem is that it runs forever. Is there a way I can improve the performance of the script?
Both files contain 700K records each.
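
In outline (simplified, not my exact code), the script does something like this:

# Outer loop: read file1 line by line
while IFS= read -r line1
do
    found=0
    # Inner loop: scan all of file2 looking for a match
    while IFS= read -r line2
    do
        if [ "$line1" = "$line2" ]
        then
            found=1
            break
        fi
    done < file2
    # Print the line only if no match was found in file2
    if [ "$found" -eq 0 ]
    then
        echo "$line1"
    fi
done < file1

So for every line of file1 it reads all of file2 again.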

Hi sakthisivi, welcome to the forums.

To help improve your script, please share exactly what you have tried so far.

Greetings

Hi,

There is no need for a shell script loop to do such a thing; grep can do it alone.

Also, please share small but representative samples of the files.

[edit] I answered the wrong thread: :o

Can you tell me how I can do it with a grep command?

Typically, one would use:

grep -vxFf file2 file1

or try awk:

awk 'NR==FNR{A[$0]; next} !($0 in A)' file2 file1
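
In case it helps, here is the same awk one-liner spread out with comments (identical logic, just formatted for reading):

awk '
    NR == FNR {      # NR==FNR only while reading the first file, file2
        A[$0]        # remember each line of file2 as an array key
        next         # done with this line; read the next one
    }
    !($0 in A)       # now reading file1: select lines not stored above;
                     # no action block means the default action, print
' file2 file1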

These two approaches only work if the lines are exactly the same, with no leading or trailing whitespace in one file that is missing in the other...

--
Otherwise, you could try this adaptation of the awk approach, which normalizes the whitespace before comparing:

awk '{p=$0; $1=$1} NR==FNR{A[$0]; next} !($0 in A){print p}' file2 file1
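
For example, with two small made-up files, where file1 carries a trailing space after "beta":

printf 'alpha\nbeta \ngamma\n' > file1
printf 'alpha\nbeta\n' > file2
awk '{p=$0; $1=$1} NR==FNR{A[$0]; next} !($0 in A){print p}' file2 file1

prints only

gamma

whereas the strict versions above would also print "beta " because of the trailing space.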

--
On Solaris use /usr/xpg4/bin/grep and /usr/xpg4/bin/awk

from man grep:

GREP(1)                     General Commands Manual                    GREP(1)



NAME
       grep, egrep, fgrep - print lines matching a pattern

SYNOPSIS
       grep [OPTIONS] PATTERN [FILE...]
       grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]

...

       -F, --fixed-strings
              Interpret  PATTERN  as  a  list  of  fixed strings, separated by
              newlines, any of which is to be matched.

...

       -f FILE, --file=FILE
              Obtain patterns  from  FILE,  one  per  line.   The  empty  file
              contains zero patterns, and therefore matches nothing.

...

       -v, --invert-match
              Invert the sense of matching, to select non-matching lines.

grep -F or fgrep:

fgrep -xvf file2 file1

For an unknown reason (hash implementation?), awk is faster than grep here:

awk 'NR==FNR {s[$0]; next} !($0 in s)' file2 file1
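
If you want to verify that on your own data, you can time both variants (results will depend on your system and the grep/awk implementations):

time grep -vxFf file2 file1 > /dev/null
time awk 'NR==FNR {s[$0]; next} !($0 in s)' file2 file1 > /dev/null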

For 700K records, either will do. Handy as awk is, using grep doesn't require one to learn an entire new programming language.


I think the above grep command compares the first line with the first line, the second line with the second line, and the third line with the third line in both files. But I need to compare the first line of the first file with all the lines in the second file, the second line of the first file with all the lines in the second file, and so on, and print only the lines in the first file that do not match any line in the second file.
Thanks for your help on this.

You are mistaken: it tries to match every pattern (or entire line, in case you use fgrep) in the pattern file (the one supplied to the -f option) against every line in the files you present as "targets". But you must make sure that the patterns are formed in a way that they can be matched in the target files. Here, e.g., DOS line terminators can be a killer!
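
You can see that with a tiny made-up example where the line order differs between the two files:

printf 'a\nb\nc\n' > file1
printf 'c\na\n' > file2
grep -vxFf file2 file1

prints only

b

so "a" is matched even though it is line 1 in file1 and line 2 in file2. And if one of the files came from Windows, remove the DOS line terminators first, e.g.:

tr -d '\r' < file2 > file2.unix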


Thank you.
The awk command worked like a charm.
Are there any links where I can learn more about these?


@RudiC:
Thanks for the explanation.