Removing lines if value exists in first file

I tried a few ways to resolve this using a bash script with a loop, but no luck.

File1: roughly 6,000 account numbers such as:
1111
1512
1113
123

I also have a dozen or so CSV files, with the account number in the 4th field. What I would like to do is remove every line whose account number is contained in File1.

So a file looking like this:
1,55,John,2000,abc
1,55,Jane,1113,abc
1,55,Smit,9565,def
1,55,Phil,123,adf
1,55,John,205480,sdf

Would end up looking like this:
1,55,John,2000,abc
1,55,Smit,9565,def
1,55,John,205480,sdf

Any help would be appreciated!

cat filter.sh

#!/usr/bin/bash

INFILE=${1}

IFS=","

cat ${INFILE} | \
while read A B C ID D
do
  grep "^${ID}" excludes > /dev/null
  if [ ${?} -ne 0 ]
  then
    echo "${A},${B},${C},${ID},${D}"
  fi
done

cat filter2.sh

#!/usr/bin/bash

INFILE=${1}

cat ${INFILE} | \
while read LINE
do
  ID=`echo "${LINE}" | cut -f4 -d","`

  grep "^${ID}" excludes > /dev/null
  if [ ${?} -ne 0 ]
  then
    echo "${LINE}"
  fi
done

cat excludes

1111
1512
1113
123

cat infile

1,55,John,2000,abc
1,55,Jane,1113,abc
1,55,Smit,9565,def
1,55,Phil,123,adf
1,55,John,205480,sdf

Execute:

filter.sh infile
or
filter2.sh infile

or

filter.sh infile > outfile
or
filter2.sh infile > outfile

Works great - I used your 2nd suggestion. It processed huge files a lot faster than I would have thought.

You really helped me out of a jam! Thanks again.

No way would I use a shell for that job! The following Perl script is probably a hundred times faster and more efficient!

Put the following into a file and add execute permission. Run the script and give it two filenames. The first should be the list of account numbers to exclude and the second should be the list that has the account number in field 4.

#!/usr/bin/perl

my (@a, %exclude);
my $file = shift;
open(EXCLUDE_LIST, "< $file") or die;
chomp( @a=<EXCLUDE_LIST> );
close(EXCLUDE_LIST);
@exclude{@a}=@a;

while (<>) {
    print unless exists $exclude{ (split(/,/))[3] };
}
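
For example, if you saved it as filter.pl (the name is just for illustration):

chmod +x filter.pl
./filter.pl excludes infile > outfile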

Tried the Perl version - yeah, it does work a LOT faster. I'm not familiar with Perl, so I had one question: if the account number changes position, I know I can edit the [3] to, let's say, [0] if it is located in the first column.

My question is: how can I make that account number position a variable so that I can pass it at the same time I'm specifying the file names? Unfortunately, in some of the files I am scrubbing against, the account number changes position.

Thanks again!

awk alternative:

awk -F',' 'NR==FNR{_[$0]=1}NR>FNR&&!_[$4]{print}' exclude infile

For the awk option - how do I put it into a bash script? When run manually, it works great. But as soon as I put that same line into a script, the $0 and $1 don't use the arguments specified in that line of code.

One possibility:

filter.sh

#!/bin/bash
awk -F',' 'NR==FNR{_[$0]=1}NR!=FNR&&!_[$4]{print}' $1 $2  > $3 
$ filter.sh exclude infile outfile
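
If the account-number column should also be an argument (tying back to the earlier question about the field moving around), awk's -v option passes a shell value into the program explicitly. A sketch, assuming a comma delimiter throughout; the script name filter_col.sh and the variable name col are mine:

#!/bin/bash
# usage: filter_col.sh excludeFile fieldNumber infile > outfile
# The shell's $1/$2/$3 stay outside the single-quoted awk program;
# the field number is handed to awk as the awk variable "col".
awk -F',' -v col="$2" 'NR==FNR{seen[$0]=1;next} !seen[$col]{print}' "$1" "$3"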

Or, again with awk...

awk -F "," 'NR==FNR{a[$1]=$1;next} !($4 in a) {print $0}' file1 file2

---------- Post updated at 09:59 AM ---------- Previous update was at 09:43 AM ----------

awk alternative...

awk -F "," 'NR==FNR{a[$1];next} !($4 in a) {print $0}' file1 file2

Indeed. Corrected version:

awk -F',' 'NR==FNR{_[$0]=1;next}!_[$4]{print}' exclude infile

Thanks.


Now you're starting to get tricky. :slight_smile:

How about a command line that looks like this:

filter excludeFile 4 file1 file2 file3...

Would that work? The exclude file comes first, followed by the numeric field number (starting at 1 for the first field), followed by a list of one or more files that use that particular field number. If you use a negative field number, it will count from the end of the line instead of the front, so a field of -2 would mean the second to the last field on every line, even if each line had a different number of fields.

In the code below I have told Perl to rename the original files so that they end in .bak and then write the changes to the original name. For the command above, you'd end up with file1 and file1.bak for example.

If that works for you, try the following. Note the extra -i.bak option on the first line and the extra $field variable.

#!/usr/bin/perl -i.bak

my (@a, %exclude);
my $file = shift;
open(EXCLUDE_LIST, "< $file") or die;
chomp( @a=<EXCLUDE_LIST> );
close(EXCLUDE_LIST);
@exclude{@a}=@a;

my $field = shift;
if ($field !~ /^-?\d+$/) {
    $field = 4;
}
die "Field specifier may not be zero.\n" unless $field;
$field-- if $field > 0;
while (<>) {
    print unless exists $exclude{ (split(/,/))[$field] };
}
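
For instance (the data file names are made up), scrubbing two files whose account number is the second-to-last field would look like this, with the originals preserved as .bak copies:

./filter excludeFile -2 file1 file2     # file1.bak and file2.bak keep the originals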

If there are other options you want to add (such as using a different delimiter between fields), then it's time to start using Getopt::Std and specifying options using the same techniques other commands use: a dash followed by a letter.

@ripat: That's a cute trick with the NR==FNR for awk. I'm going to have to remember that one. Only useful for a single file, but still... (The file handling in awk is terrible!)
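
For anyone new to the idiom, here is that one-liner spread out with comments (just my reading of it; the array name seen is mine):

awk -F',' '
    NR == FNR {        # true only while the first file (exclude) is being read
        seen[$0] = 1   # remember each account number
        next           # nothing else to do with exclude lines
    }
    !seen[$4]          # second file: the default action prints lines whose 4th field was never seen
' exclude infile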

---------- Post updated at 04:00 PM ---------- Previous update was at 03:52 PM ----------

It's a really bad idea to use variables without putting double quotes around them! I can screw up that awk command pretty badly by passing the script a filename with a space or wildcard character in it, especially as the third parameter.

Please put double quotes around ALL variable substitutions. Out of a thousand uses it will only be wrong 3-4 times, so you've got a 99.6% chance of getting it right. Those are pretty good odds. :wink:
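
For example, the wrapper above with every substitution quoted (a sketch, using the corrected one-liner):

#!/bin/bash
# usage: filter.sh exclude infile outfile
awk -F',' 'NR==FNR{_[$0]=1;next}!_[$4]{print}' "$1" "$2" > "$3"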

While I agree that Perl is usually well suited for this type of application, I do not think this generalization is accurate. The shell scripts above are fine, but there is room for some significant speed optimizations. If we use ksh (ksh93s+) instead of bash and a method that resembles the one in your Perl script, I think there would not be a really big difference in speed.

filter.ksh93

#!/usr/bin/ksh
typeset -A EXCLUDED
EXCLUDE_LIST=$(< $1)
INFILE=$2
for excl in $EXCLUDE_LIST; do
  EXCLUDED[$excl]=1
done
IFS=","
while read a b c id d; do
  if [[ ${EXCLUDED[$id]} -ne 1 ]]; then
    echo "${a},${b},${c},${id},${d}"
  fi
done < $INFILE

./filter.ksh93 excludes infile > outfile

Hmm. Let's take a look at your script and its efficiency/performance and compare that to the Perl script, shall we?

First, the perl script loses big time in terms of startup cost; initializing the interpreter and compiling the script are overhead that can never be reclaimed (although it can be amortized if the data files are large enough). The perl script also loses (slightly) in that it's less readable to people unfamiliar with the language (although the OP was able to correctly determine how to change the field used for his particular case). The final lossage comes from the wordiness of my perl example -- it could've been done more concisely but I was at least partially concerned about the OP being able to understand its overall operation.

(I'm modifying your Korn shell script to add some performance and usage benefits, but it remains essentially the same.) Your Korn shell script does not have the startup cost, but as a true interpreter it will have to constantly be reparsing the loop body every time through the loop, so if there are a significant number of iterations it will be a performance problem. There's also the problem of single and double quotes occurring in the input; the Korn shell's read will handle paired quotes correctly (as it interprets the quotes) while perl will need help from a regular expression to do the work (or the Text::Balanced module). The reason I mention this as a problem is that a single apostrophe will screw up the Korn script but have no impact on the perl script (as the perl script ignores the issue entirely!).

In any case, there is no comparison between the two languages when processing more than a few hundred lines of data. I wrote a Korn script to do some text processing for a client (similar to this task) that took 28+ minutes to process 300k records. The same task in Perl took a little over 2 minutes. That's 10k records per minute for the shell script and 150k records per minute for the perl script. :slight_smile: I attribute the difference to the efficiencies of pseudo-compiling and the nature of the I/O between the two scripts (the perl script was in "paragraph" mode, reading 10-20 lines at a time while the shell script had to do one line at a time and maintain an FSM).

@ Azhrei

Check your ksh snippet as it throws an error with my ksh93 when evaluating your conditional expression:

if (( ${EXCLUDED[${fields[0]}]} != 1 )); then

error:

./ex.korn: line 8:   != 1 : arithmetic syntax error

Which is normal, as it tries to evaluate a string (an empty string) in an arithmetic expression. Try with:

if [[ ${EXCLUDED[${fields[0]}]} == "" ]]; then

which works well.

Talking about performance, I did a test on large sample files:
excluded (cardinality: 50000 lines)
infile (cardinality: 29000 lines)

Results:

jeanluc@ibm:~/scripts/test$ time ./ex.pl excluded infile > /tmp/out.pl
real	0m0.214s
user	0m0.176s
sys	0m0.032s

jeanluc@ibm:~/scripts/test$ time ./ex.korn excluded infile > /tmp/out.korn
real	0m1.154s
user	0m1.060s
sys	0m0.088s

jeanluc@ibm:~/scripts/test$ time ./ex.awk excluded infile > /tmp/out.awk
real	0m0.093s
user	0m0.072s
sys	0m0.016s

As is often the case in data-file crunching, awk is fast and terse. :wink:

Yep, you're right. Putting double quotes around the EXCLUDED expression should fix it, and since it's a numeric comparison that's probably a better solution than using the [] as a string comparison. But I suppose they're about the same in this case.

That seems a pretty reasonable size, although the OP didn't say how many records of each he actually has.

While playing with my ksh script I was surprised to find that the associative arrays in ksh don't have the content limit that indexed arrays have, i.e. an indexed array can only have up to 4096 elements but I was able to put 500k elements into an associative array without problems...
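
A quick way to reproduce that observation (a throwaway sketch; the element count and names are mine):

#!/usr/bin/ksh
typeset -A big
typeset -i i=0
while (( i < 500000 )); do
    big[$i]=1
    (( i++ ))
done
print "stored ${#big[@]} keys"    # should report 500000 if the shell copes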

I'm guessing that each of these test times is a "second run" test so that the I/O of bringing the executable and libraries into memory has been factored out. If that's the case, I must admit that I'm quite surprised by the awk results. I've not seen awk produce faster runtimes compared to perl when large datasets are processed, yet your numbers show a 2:1 advantage for awk.

I wonder if it would be faster to remove the chomp() in the first perl loop and simply do the lookup by adding a \n in the main loop? I have to assume that concatenating a newline on each record of input would have to be slower than stripping the newline from the exclude list, except I notice that you've got an exclude list 60% larger than the input file so maybe that's a factor?

Very interesting. Thank you for posting those numbers as I now have something to investigate between other projects! :slight_smile:

Thank you too, Azhrei, that was an interesting discussion.
I ran some tests as well against a large input file, BTW.

Seconds    terminal    file     /dev/null
-----------------------------------------
ksh93       27,15      30,23     25,03
perl        12,89       3,54      3,29
awk          8,15       1,75      1,69

You were right, Perl is much faster with bigger inputs, although not 100 times faster. I thought the differences would be smaller. Well, they were small with terminal output, but not with file output. I think the difference between terminal output and output to a file is remarkable in the case of Perl. I tried shcomp, which compiles ksh93 scripts, but that did not make it any faster.

Interesting numbers. Thanks. But you didn't say what your input file sizes were for your tests...?

The terminal output will be constrained by the efficiency of the terminal driver and, assuming you're using a network connection, the efficiency of your network stack. The output to /dev/null is a great test since it eliminates output costs from the processing entirely, although the OP still must perform output to save his data. It gets ugly to test I/O to a filesystem, however, since a fragmented fs will be slower. Ideally, multiple runs would be performed with the first set of results thrown away, and with successive executions overwriting the original data in the file (preventing new data block allocations from being required). Such testing isn't appropriate for the problem at hand, although they might make for some more interesting discussion. :slight_smile:
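
Something along these lines would do it (a rough sketch; the script and data file names are the ones from the earlier timings, and the run counts are arbitrary):

#!/bin/bash
# Time each filter several times; throw away the first (cold-cache) run and
# send output to /dev/null so we measure processing rather than disk writes.
for script in ./ex.pl ./ex.korn ./ex.awk; do
    echo "== $script =="
    "$script" excluded infile > /dev/null 2>&1        # warm-up run, not timed
    for run in 1 2 3; do
        ( time "$script" excluded infile > /dev/null ) 2>&1 | grep real
    done
done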

And none of the above factor in usability-related issues, such as the Perl script automatically creating .bak files from the input files.

I think it's time to look at the specifics of each execution platform. For example, what version of Korn shell are you using (for example, turn on vi editing mode and type <Esc>^V to see the version number), what version of Perl (just perl -v), and what version of awk (not sure how to check that one?). Then I'm going to look into this some more when I get some time later in the week. :slight_smile:

Interesting thread indeed.

Simply awk --version. And while you are at it, give mawk a try if it's available on your system. The mawk interpreter is often much faster than straight awk.

awk --version doesn't work on Mac OS 10.4.11, but running strings on the binary found version 20040207; it's probably the BSD version on my machine while you've got the GNU version. I wonder if that's a significant factor? I'd be willing to bet that no one has tweaked the BSD version of awk since BSD 4.4Lite was released.

I've got Korn shell "M 1993-12-28 p", and my Perl is 5.8.6 with some security patches. But no answer from anyone else on versions. :frowning:

I used 425412 infile records against 215 excludes (on a Pentium III 850 MHz laptop).

Perl: v5.8.8 built for i486-linux-gnu-thread-multi
Ksh: 93s+20071105-1 The real, AT&T version of the Korn shell
awk: mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan

compiled limits:
max NF 32767
sprintf buffer 1020