File 1 contains a reference list of unique subscriber numbers (7 million entries in total).
File 2 contains a list of the subscriber numbers and their tariffs (15 million entries in total). This file is in the production system and hasn't had old subscribers removed for some time, so more than half of the entries need to be removed.
I created the following couple of lines to try to obtain the 7 million active subscriber numbers and tariffs from the behemoth 15-million-entry list:
cat accurate_list.csv | while read ref
> do
> grep $ref production_list.csv >> new_msisdn_list.csv
> done
While this is actually working, it's only producing around 20 entries per second, which will take days to complete.
I'm afraid I'm a complete noob and can't come up with anything more inventive. I'm sure awk/sed/perl or any number of other languages would be perfect for something like this.
Right now, you're literally running millions of greps, comparing each ID to every other ID. That's a huge waste of time, obviously. You only really need to compare each ID to itself.
Because it seems to me that you're just looking for IDs that are in both files. If each ID only occurs once in each file, pull the IDs out of each file, combine them into one file, sort them, and only count IDs that are duplicated in the combined file.
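That "combine, sort, count duplicates" idea might be sketched like so (the filenames, and the assumption that the production file is id,tariff CSV while the reference list is bare IDs, are taken from the rest of the thread):

```shell
# Pull the ID column out of the production file, merge it with the
# reference IDs, sort the lot, and keep only IDs that appear twice,
# i.e. in both files (assumes each ID occurs at most once per file).
cut -d, -f1 production_list.csv |
    sort - accurate_list.csv |
    uniq -d > common_ids.txt
```

Note this gives you the common IDs only; a second pass would still be needed to pull the tariffs back out.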
Or you can figure out how to use the "join" utility. That may or may not be faster than what you're already doing, but it does have the huge advantage of not having to fork() and exec() a new process millions of times.
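A join run for this job could look roughly like this (a sketch: the filenames and the comma-separated, id-first layout are assumptions; join needs both inputs sorted on the join field):

```shell
# join requires both files sorted on the join field (field 1 here).
sort accurate_list.csv > refs.sorted
sort -t, -k1,1 production_list.csv > prod.sorted
# -t, makes join comma-aware; the default join field is field 1 of each.
join -t, refs.sorted prod.sorted > new_msisdn_list.csv
```

The output lines come out as id,tariff, sorted by id rather than in the original production order.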
Something like grep -f accurate_list.csv production_list.csv would be ideally fitted for the job, but I'm afraid the sheer numbers will blast it.
How about using split to get to smaller accurate lists, or, if both files are sorted, even smaller production files, and give it a try?
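The split idea might look something like this sketch (filenames assumed; -F keeps grep in fast fixed-string mode so the references aren't treated as regular expressions):

```shell
# Split the 7M references into 100000-line pattern files, then run
# one grep per chunk instead of one grep per reference.
split -l 100000 accurate_list.csv ref_chunk.
for chunk in ref_chunk.*
do
    grep -F -f "$chunk" production_list.csv
done > new_msisdn_list.csv
rm -f ref_chunk.*
```

Since the references are fixed-length 12-digit strings and the tariff codes are only 7-8 characters, a reference shouldn't be able to false-match inside the tariff field, but that's worth verifying on real data.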
The other question is: how do you want your output?
Right now, the way you're doing it, the output comes out grouped by ID. If you do it in bulk, it may not end up sorted that politely.
If that's not a problem, I wrote this perl script to do 'split' like things without the temp file mess:
#!/usr/bin/perl
# Die screaming instead of silently if child fails
$SIG{PIPE} = sub { die("SIGPIPE"); };
my $lines=1000;
my ($running, $ccount) = (1, 0);
while($#ARGV >= 0)
{
if($ARGV[0] eq "-l") {
$lines=$ARGV[1]+0;
if($lines <= 0)
{
printf(STDERR "Invalid size: %s\n", $ARGV[1]);
exit(1);
}
shift; shift;
next;
}
last;
}
if($#ARGV < 0)
{
printf(STDERR "pipesplit.pl: Designed to read from standard input, \n");
printf(STDERR "and split into multiple streams, running one command per loop\n");
printf(STDERR "and writing into its STDIN. \@FNAME\@ is available as a\n");
printf(STDERR "sequence number if needed.\n\n");
printf(STDERR "syntax: pipesplit.pl [-l lines] command arguments \@FNAME\@ ...\n");
exit(1);
}
#print $ARGV[0], "\n";
#exit(0);
my @l=@ARGV;
while($running) {
my $n;
my $fname = sprintf("%08d", $ccount++);
my $fr=\$fname;
# Use given arguments as a command, with @FNAME@ substituted
# for an incrementing number like 00000001
open(OUT, "|-",
map { my $v=$_; $v =~ s/\@FNAME\@/${$fr}/; $v } @l
);
for($n=0; $n<$lines; $n++)
{
my $line=<STDIN>;
if(!defined($line)) {
$running=0;
last;
}
# print STDERR "chunk $ccount line $line";
print OUT $line;
}
close(OUT);
}
printf(STDERR "Wrote %d full chunks of %d lines\n", $ccount-1, $lines);
I'm not sure what these files look like, but if they are sorted on the id field, it sounds like this is just what the "join" command does. Here is a sample run:
$ cat file1
8 user 092 kjhuhggty
4 user 343 nbvnvcvc
9 user 391 jllklklkj
6 user 549 rewrewer
2 user 654 kjlkjl
7 user 760 jbjftgd
1 user 777 hkhghgh
3 user 888 hghfgfhgf
5 user 984 nbvnbvmn
$ cat file2
391
654
760
777
888
999
$ join -1 3 -2 1 -o "1.1 1.2 1.3 1.4" file1 file2
9 user 391 jllklklkj
2 user 654 kjlkjl
7 user 760 jbjftgd
1 user 777 hkhghgh
3 user 888 hghfgfhgf
Off topic a little (apologies to the OP), but I love this util Corona688!
I was wondering if it could be done as a bash script. I came up with the script below.
While testing I did discover that if the lines split evenly, it runs the command one more time than needed with empty input (e.g. -l 10 for a 100-line file).
#!/bin/bash
lines=1000
ccount=0
while getopts l: opt 2> /dev/null
do
case $opt in
l) lines=$OPTARG ;;
*) echo "Illegal option -$opt" >&2
exit 1
;;
esac
done
shift $((OPTIND-1))
if [ $# -le 0 ]
then
cat >&2 <<EOF
linesplit: Designed to read from standard input
and split into multiple streams, running one command per loop
and writing into its STDIN. @FNAME@ is available as a
sequence number if needed.
syntax: linesplit [-l lines] command arguments @FNAME@ ...
EOF
exit 1
fi
while [[ ${PIPESTATUS[0]} != 33 ]]
do
((ccount++))
printf -v fname "%08d" $ccount
for((n=0; n<lines; n++))
do
read line && echo "$line" || exit 33
done | eval ${@//@FNAME@/$fname}
done
printf "Wrote %'d full chunks of %'d lines\n" $((ccount-1)) $lines >&2
@achenle - The smaller reference file contains the 12-digit numeric reference for every active subscriber we have (format=000000000000). The much larger production file contains the 12-digit reference and an alphanumeric tariff code separated by a comma. Tariff codes take one of the following forms: A000000, AA00000, A000000-AA00000 or AA00000-AA00000. The longer tariff codes indicate an additional service or add-on. So the format is 000000000000,A000000 or any other tariff/add-on variant.
My aim is to extract the reference and tariff info from the production file, using the references contained in the accurate list.
@corona688 - The output should be the same format as the production file, which is reference,tariff i.e. 000000000000,AA00000. I don't think sorting is an issue at this point.
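Given that format, the whole job can be done in one pass over each file with a hash lookup, e.g. in awk (a sketch, with the thread's filenames assumed):

```shell
# First pass (NR==FNR, i.e. still reading the first file) loads every
# reference into the 'want' hash; second pass prints production lines
# whose first comma-separated field is in the hash.
awk -F, 'NR==FNR { want[$1]; next } $1 in want' \
    accurate_list.csv production_list.csv > new_msisdn_list.csv
```

This preserves the production file's original order and its reference,tariff format, and avoids starting millions of processes.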
The hash lookups are faster than an RE that can match anywhere in the lines.
Also, increasing the 5000 by a factor of 10 will increase speed (and memory consumption) by the same factor.
BTW I find the following order more intuitive:
Thank you! Not my first crack at that problem, just the first one good enough to bother reusing.
I wrote it in Perl since a while read LINE loop isn't terribly efficient when thousands to millions of lines are involved. [edit: Actually, I wrote perl since the original read raw binary...] This probably doesn't matter when the grep is going to be taking so much longer anyway. I see what you mean about that bug. I should be opening the process after a line is read, not before.
That's a nice translation, and putting the args in getopt makes it look so much more standard/official.
I really don't like the look of that eval, though. Eval would have made my life easier in perl too, but I worked hard to avoid it, to keep things safe and sane -- it will do weird things like eating quotes, evaluating accidental expressions, mangling things containing $, executing backticks, stopping at # considering it a comment, etc. Even if it takes 30 lines of code to do the same thing without it, that'd be better. Or maybe the value should be exported to the environment. I'll see what I can do...
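For a quick made-up illustration of the kind of mangling meant here:

```shell
# Two words of argv: "echo" and the single argument "a # b".
set -- echo 'a # b'
eval "$@"    # eval re-parses the joined string; '#' starts a comment, prints: a
"$@"         # argv executed intact, quoting preserved, prints: a # b
```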
This replaces the eval with a safer text replacement loop. I've also turned fname into FNAME since I'm exporting it to the environment as an afterthought. Otherwise your code looks terrific.
#!/bin/bash
lines=1000
ccount=0
# Convert "command" "-flag" "@FNAME@" into "command" "-flag" "000001"
# uses $FNAME external. Result is in ARGS array.
parsecmd() {
local N=0
ARGS=()
while [ "$#" -gt 0 ]
do
ARGS[$((N++))]="${1//@FNAME@/${FNAME}}"
shift
done
}
while getopts l: opt 2> /dev/null
do
case $opt in
l) lines=$OPTARG ;;
*) echo "Illegal option -$opt" >&2
exit 1
;;
esac
done
shift $((OPTIND-1))
if [ $# -le 0 ]
then
cat >&2 <<EOF
linesplit: Designed to read from standard input
and split into multiple streams, running one command per loop
and writing into its STDIN. @FNAME@ is available as an
sequence number if needed.
syntax: linesplit [-l lines] command arguments @FNAME@ ...
EOF
exit 1
fi
while [[ ${PIPESTATUS[0]} != 33 ]]
do
((ccount++))
printf -v FNAME "%08d" $ccount
export FNAME
parsecmd "$@" # Convert @FNAME@ into $FNAME, put in ${ARGS[@]}
for((n=0; n<lines; n++))
do
read line && echo "$line" || exit 33
done | "${ARGS[@]}"
done
printf "Wrote %'d full chunks of %'d lines\n" $((ccount-1)) $lines >&2
I was thinking of the [command] parameter as being a lot like the one implemented by ssh, which also has all the eval issues you mentioned. I find the double expansion of ssh a pain, but at least there is a precedent for this sort of thing (su -c also comes to mind).
Perhaps I'm missing something, but without the eval how would you do this:
It's very traditional to get a shell when you do a shell login, but that's kind of a self-fulfilling prophecy.
sudo does not perform an old-fashioned tty LOGIN process and does not behave like that:
sudo 'echo $HOSTNAME'
sudo: echo $HOSTNAME: command not found
Further, most common "command modifying" utilities like nice, nohup, xargs, env etc. don't. The only exception I can think of off-hand is 'watch'.
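For instance (a trivial demonstration, any literal string would do):

```shell
# env, like nice/nohup/xargs, execs its command argv directly;
# nothing is handed back to a shell for re-parsing or expansion:
env echo 'echo $HOSTNAME'    # prints the literal: echo $HOSTNAME
```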
Furthermore, permitting expansion to happen in the same shell is asking for trouble. Expansion inside ssh won't cause ssh itself to blow up from syntax errors... Kind of bad form. I'd want to put it in an external shell at the very least.
That problem would be more easily solved with standard split I think.
Why not put tr before linesplit? I see what you're getting at, but it'd be simple enough to put it in an awk command.
The awk command suggested by MadeInGermany would mess up when re-parsed unless you re-quote and escape it extravagantly.
It substitutes inside individual tokens instead of cramming everything into one string, substituting, and splitting it back apart. This preserves word splitting.