Incredibly inefficient cat | grep script

Hi there,

I have 2 files that I am trying to work on.

File 1 contains a reference list of unique subscriber numbers (7 million entries in total).

File 2 contains a list of the subscriber numbers and their tariff (15 million entries in total). This file is in the production system and hasn't had old subscribers removed for some time, so more than half of the entries need to be removed.

I created the following couple of lines to try to extract the 7 million active subscriber numbers and tariffs from the behemoth 15 million-entry list:

 cat accurate_list.csv | while read ref
> do
> grep $ref production_list.csv >> new_msisdn_list.csv
> done

While this is actually working, it's only producing around 20 entries per second, which will take days to complete.

I'm afraid I'm a complete noob and can't come up with anything more inventive. I'm sure awk/sed/perl or probably any number of other languages would be perfect for something like this.

Anyway any suggestions are gratefully received.

Thanks
Cludgie

What are the exact contents of each file?

Right now, you're literally running millions of greps, comparing each ID to every other ID. That's a huge waste of time, obviously. You only really need to compare each ID to itself.

It seems to me that you're just looking for IDs that are in both files. If each ID only occurs once in each file, pull the IDs out of each file, combine them into one file, sort them, and keep only the IDs that are duplicated in the combined file.
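
For instance (untested, and assuming the ID is the first comma-separated field in both files):

cut -d, -f1 accurate_list.csv production_list.csv | sort | uniq -d > ids_in_both.txt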

Or you can figure out how to use the "join" utility. That may not be any faster than what you're already doing, but it does have the huge advantage of not having to fork() and exec() a new process millions of times.

Normally,

grep -f accurate_list.csv  production_list.csv

would be ideally suited to the job, but I'm afraid the sheer numbers will overwhelm it.
How about using split to produce smaller accurate lists (or, if both files are sorted, even smaller production files) and giving it a try?
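
Something like this might work (untested; -F treats the references as fixed strings, and the chunk_ prefix and the 100000-line chunk size are just placeholders):

split -l 100000 accurate_list.csv chunk_
for f in chunk_*
do
    grep -F -f "$f" production_list.csv
done > new_msisdn_list.csv
rm -f chunk_*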

The other question is, how do you want your output?

Right now, the way you're doing it, the matches are going to come out bundled in groups per ID. If you do it in "bulk", it may not end up sorted that politely.

If that's not a problem, I wrote this perl script to do 'split'-like things without the temp-file mess:

#!/usr/bin/perl

# Die screaming instead of silently if child fails
$SIG{PIPE} = sub { die("SIGPIPE"); };

my $lines=1000;

my ($running, $ccount) = (1, 0);

while($#ARGV >= 0)
{
        if($ARGV[0] eq "-l") {
                $lines=$ARGV[1]+0;
                if($lines <= 0)
                {
                        printf(STDERR "Invalid size\n", $ARGV[0]);
                        exit(1);
                }
                shift;  shift;
                next;
        }

        last;
}

if($#ARGV < 0)
{
        printf(STDERR "pipesplit.pl:  Designed to read from standard input, \n");
        printf(STDERR "and split into multiple streams, running one command per loop\n");
        printf(STDERR "and writing into its STDIN.  \@FNAME\@ is available as an\n");
        printf(STDERR "sequence number if needed.\n\n");
        printf(STDERR "syntax:  pipesplit.pl [-l lines] command arguments \@FNAME\@ ...\n");
        exit(1);
}

#print $ARGV[0], "\n";
#exit(0);

my @l=@ARGV;

while($running) {
        my $n;
        my $fname=sprintf("%08d", $ccount++);
        my $fr=\$fname;

        # Use given arguments as a command, with @FNAME@ substituted
        # for an incrementing number like 00000001

        open(OUT, "|-",
                map { my $v=$_; $v =~ s/\@FNAME\@/${$fr}/; $v } @l
        );

        for($n=0; $n<$lines; $n++)
        {
                my $line=<STDIN>;

                if(!defined($line)) {
                        $running=0;
                        last;
                }

#               print STDERR "chunk $ccount line $line";

                print OUT $line;
        }

        close(OUT);
}

printf(STDERR "Wrote %d chunks of %d lines\n", $ccount-1, $lines);

You would use it like

./linesplit.pl -l 5000 grep -F -f - production_list.csv < accurate_list.csv > new_msisdn_list.csv

I'm not sure what these files look like. But if they are sorted on the id field, it sounds like this is just what the "join" command does. Here is a sample run:

$
$
$ cat file1
8  user 092 kjhuhggty
4  user 343 nbvnvcvc
9  user 391 jllklklkj
6  user 549 rewrewer
2  user 654 kjlkjl
7  user 760 jbjftgd
1  user 777 hkhghgh
3  user 888 hghfgfhgf
5  user 984 nbvnbvmn
$
$
$ cat file2
391
654
760
777
888
999
$
$
$ join -1 3 -2 1 -o "1.1 1.2 1.3 1.4"  file1 file2
9 user 391 jllklklkj
2 user 654 kjlkjl
7 user 760 jbjftgd
1 user 777 hkhghgh
3 user 888 hghfgfhgf
$
$

Off topic a little (apologies to the OP), but I love this util Corona688!

I was wondering if it could be done as a bash script. I came up with the script below.

While testing I did discover that if the input splits evenly, it runs the command one more time than needed, with empty input (e.g. -l 10 for a 100-line file).

#!/bin/bash
lines=1000
ccount=0

while getopts l: opt 2> /dev/null
do
    case $opt in
        l) lines=$OPTARG ;;
        *) echo "Illegal option -$opt" >&2
           exit 1
        ;;
    esac
done
shift $((OPTIND-1))

if [ $# -le 0 ]
then
    cat >&2 <<EOF
linesplit:  Designed to read from standard input
            and split into multiple streams, running one command per loop
            and writing into its STDIN.  @FNAME@ is available as a
            sequence number if needed.

syntax:  linesplit [-l lines] command arguments @FNAME@ ...
EOF
    exit 1
fi

while [[ ${PIPESTATUS[0]} != 33 ]]
do
    ((ccount++))
    printf -v fname "%08d" $ccount
    for((n=0; n<lines; n++))
    do
        read line && echo "$line" || exit 33
    done | eval ${@//@FNAME@/$fname}
done

printf "Wrote %'d full chunks of %'d lines\n" $((ccount-1)) $lines >&2

eg:

$ printf "%s\n" $(seq 1 100) | ./linesplit.pl -l 10 wc -l
10
10
10
10
10
10
10
10
10
10
0
Wrote 10 chunks of 10 lines

$ printf "%s\n" $(seq 1 100) | ./linesplit.pl -l 3000 wc -l
100
Wrote 0 chunks of 3000 lines

@achenle - The smaller reference file contains the 12-digit numeric reference for every active subscriber we have (format=000000000000). The much larger production file contains the 12-digit reference and an alphanumeric tariff code, separated by a comma. Tariff codes have one of the following formats: A000000, AA00000, A000000-AA00000, or AA00000-AA00000. The longer tariff codes indicate an additional service or add-on. So the format is 000000000000,A000000 or any other tariff/add-on variant.

My aim is to extract the reference and tariff info from the production file, using the references contained in the accurate list.

@corona688 - The output should be in the same format as the production file, which is reference,tariff i.e. 000000000000,AA00000. I don't think sorting is an issue at this point.

Guys thanks very much, I'll give it a go.

With your new information, you can speed up Corona's perl solution with awk:

pipesplit.pl -l 5000 awk 'FILENAME=="-" {A[$1]; next} ($1 in A)' - FS="," production_list.csv < accurate_list.csv

The hash lookups are faster than an RE that can match anywhere in the lines.
Also, increasing the 5000 by a factor of 10 will increase speed (and memory consumption) by the same factor.
BTW I find the following order more intuitive:

< accurate_list.csv pipesplit.pl -l 5000 awk 'FILENAME=="-" {A[$1]; next} ($1 in A)' - FS="," production_list.csv

Thank you! Not my first crack at that problem, just the first one good enough to bother reusing.

I wrote it in Perl since a while read LINE loop isn't terribly efficient when thousands to millions of lines are involved. [edit: Actually, I wrote perl since the original read raw binary...] This probably doesn't matter when the grep is going to be taking so much longer anyway. :-) I see what you mean about that bug. I should be opening the process after a line is read, not before.

That's a nice translation, and putting the args in getopt makes it look so much more standard/official.

I really don't like the look of that eval, though. Eval would have made my life easier in perl too, but I worked hard to avoid it, to keep things safe and sane -- it will do weird things like eating quotes, evaluating accidental expressions, mangling things containing $, executing backticks, stopping at # because it treats it as a comment, etc. Even if it takes 30 lines of code to do the same thing without it, that'd be better. Or maybe the value should be exported to the environment. I'll see what I can do...
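
For a contrived illustration of the quote-eating and backtick problems (just a demo, not part of the script):

set -- printf '%s,' 'a  b' '`date`'
"$@"        # runs printf directly and prints: a  b,`date`,
eval "$@"   # re-parsed: the double space collapses and `date` actually runs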

This replaces the eval with a safer text replacement loop. I've also turned fname into FNAME since I'm exporting it to the environment as an afterthought. Otherwise your code looks terrific.

#!/bin/bash
lines=1000
ccount=0

# Convert "command" "-flag" "@FNAME@" into "command" "-flag" "000001"
# uses $FNAME external.  Result is in ARGS array.
parsecmd() {
        local N=0
        ARGS=()
        while [ "$#" -gt 0 ]
        do
                ARGS[$((N++))]="${1//@FNAME@/${FNAME}}"
                shift
        done
}

while getopts l: opt 2> /dev/null
do
    case $opt in
        l) lines=$OPTARG ;;
        *) echo "Illegal option -$opt" >&2
           exit 1
        ;;
    esac
done
shift $((OPTIND-1))

if [ $# -le 0 ]
then
    cat >&2 <<EOF
linesplit:  Designed to read from standard input
            and split into multiple streams, running one command per loop
            and writing into its STDIN.  @FNAME@ is available as a
            sequence number if needed.

syntax:  linesplit [-l lines] command arguments @FNAME@ ...
EOF
    exit 1
fi

while [[ ${PIPESTATUS[0]} != 33 ]]
do
    ((ccount++))
    printf -v FNAME "%08d" $ccount
    export FNAME
    parsecmd "$@" # Convert @FNAME@ into $FNAME, put in ${ARGS[@]}

    for((n=0; n<lines; n++))
    do
        read line && echo "$line" || exit 33
    done | "${ARGS[@]}"
done

printf "Wrote %'d full chunks of %'d lines\n" $((ccount-1)) $lines >&2

I was thinking of the [command] parameter as being a lot like the one implemented by ssh, which also has all the eval issues you mentioned. I find the double expansion of ssh a pain, but at least there is a precedent for this sort of thing (su -c also comes to mind).

Perhaps I'm missing something, but without the eval how would you do this:

$ printf "%s\n" {00..99} | ./linesplit -l 10 wc -l \> out_@FNAME@.txt

I also find it very cool that you can do stuff like this:

./linesplit 'tr -d '^M' | grep -vi ignore > out_@FNAME@.txt'

Can you explain what the parsecmd is for? I can't see how it would be different from:

done | ${@//@FNAME@/$FNAME}

Edit: Another thought: the read should include the -r option to stop it treating backslashes in the input specially.

It's very traditional to get a shell when you do a shell login, but that's kind of a self-fulfilling prophecy.

sudo does not perform an old-fashioned tty LOGIN process and does not behave like that:

sudo 'echo $HOSTNAME'

sudo:  echo $HOSTNAME: command not found

Further, most common "command modifying" utilities like nice, nohup, xargs, env, etc. don't. The only exception I can think of off-hand is 'watch'.

Furthermore, permitting expansion to happen in the same shell is asking for trouble. Expansion inside ssh won't cause ssh itself to blow up from syntax errors... Kind of bad form. I'd want to put it in an external shell at the very least.
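
For example (just a sketch, and assuming the bash version above is saved as ./linesplit), the out_@FNAME@.txt case could be handed to an external shell like this:

./linesplit -l 5000 bash -c 'grep -vi ignore > out_@FNAME@.txt'

The whole command string is a single argument, @FNAME@ gets substituted into it, and any re-parsing of pipes and redirections happens inside bash -c rather than in the calling shell.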

That problem (writing each chunk to its own out_@FNAME@.txt) would be more easily solved with standard split, I think.

Why not put tr before linesplit? I see what you're getting at, but it'd be simple enough to put it in an awk command.

The awk command suggested by MadeInGermany would mess up when re-parsed unless you re-quote and escape it extravagantly.

parsecmd substitutes inside the individual tokens instead of cramming them into one string, substituting, and splitting it back apart. This preserves the original word splitting.
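
A quick way to see the difference (a contrived demo, assuming the parsecmd function above is defined in the current shell):

FNAME=00000001
set -- printf '%s,' 'out @FNAME@.txt'
${@//@FNAME@/$FNAME}    # unquoted expansion re-splits the argument: prints out,00000001.txt,
parsecmd "$@"           # substitutes inside each token but keeps it whole
"${ARGS[@]}"            # prints: out 00000001.txt,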

Hello all,

I finally got it working. I used this:

gawk 'BEGIN { FS=","} NR == FNR{a[$0];next} $1 in a' new_msisdn_list.csv production_list.csv > output.txt

real   0m36.061s
user  0m30.694s
sys   0m0.704s

Thank you all who replied. I appreciate your help.
