I have a file named records.txt containing a large number of records (around 0.5 million) in the format below:
28433005 1 1 3 2 2 2 2 2 2 2 2 2 2 2
28433004 0 2 3 2 2 2 2 2 2 1 2 2 2 2
...
Another file is a key file, named key.txt, which lists some of the numbers from the first column of records.txt:
28433004
28815001
...
There are about 0.2 million numbers in key.txt. Now I am trying to pick out the matching records from records.txt based on key.txt. I tried the script below:
pick_records.s
foreach line (`cat key.txt`)
    awk -v key="$line" '$1 == key {print; exit}' records.txt
end
I ran the script with: source pick_records.s > output.txt
The script did the job but ran slowly. I am wondering if there is a more efficient way to achieve this task.
Thanks.
awk 'FNR==NR {keys[$1];next} $1 in keys' key.txt records.txt
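For readers unfamiliar with the idiom: FNR == NR is true only while awk is reading the first file (key.txt), so every key is stored as an array index in a single pass; records from the second file are then printed whenever their first field is in the array, giving one hash lookup per record instead of a full scan per key. A minimal, self-contained sketch using tiny made-up sample data in the thread's format:

```shell
#!/bin/sh
# Hypothetical sample files mirroring the formats described in the thread.
printf '28433005 1 1 3\n28433004 0 2 3\n28815001 2 2 2\n' > records.txt
printf '28433004\n28815001\n' > key.txt

awk '
    FNR == NR { keys[$1]; next }   # first file: remember each key as an array index
    $1 in keys                     # second file: print records whose first field is a key
' key.txt records.txt
# prints:
# 28433004 0 2 3
# 28815001 2 2 2
```

Note the file order on the command line matters: key.txt must come first, or the FNR == NR test stores the wrong file.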
sea
April 27, 2015, 7:22pm
3
Hello and welcome to the forum.
Please use code tags for code block examples, as required by the forum rules.
Not sure on the script, as I don't know csh or ksh.
In bash I'd do:
while read line; do
    awk -v key="$line" '$1 == key' records.txt
done < key.txt
Then execute it:
$SHELL pick_records.s > output.txt
Note that I find "pick_" quite irritating, since you don't pick (as in: select a single entry) anything; you run through everything (every line) found.
Half a million (or even just 200,000) lines do take some time to process.
You could measure the difference by adding time in front of the script:
time $SHELL pick_records.s
Hope this helps (hth)
Hi vgersh99 and sea,
Thanks for your solutions.
I tried vgersh99's one-line awk command. It worked great and solved my problem in seconds. Amazing. Thanks again.
If you can edit your key.txt to prefix each record with ^ and suffix each with $, then maybe you can use it with grep and the appropriate flags:-
grep -f key.txt records.txt
Does this help? There might be problems with the size of key.txt.
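One caveat with a plain grep -f is that an unanchored key can match a substring anywhere in a record. A variant worth trying, without editing key.txt at all, is grep's -F (fixed strings) and -w (whole-word match) flags; this is only a sketch on made-up sample data, and -w could still match a key appearing in a later column:

```shell
#!/bin/sh
# Hypothetical sample data in the thread's format.
printf '28433005 1 1 3\n28433004 0 2 3\n' > records.txt
printf '28433004\n' > key.txt

# -F: treat each key as a fixed string (no regex metacharacters)
# -w: match only whole words, so a key cannot match inside a longer number
# -f: read the patterns from key.txt
grep -Fw -f key.txt records.txt
# prints: 28433004 0 2 3
```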
Robin
Hi rbatte1,
I tried the grep method too, but it didn't work well. Thanks.