I have a file named records.txt containing a large number of records (around 0.5 million) in the format below:
28433005 1 1 3 2 2 2 2 2 2 2 2 2 2 2
28433004 0 2 3 2 2 2 2 2 2 1 2 2 2 2
...
Another file is a key file, named key.txt, which lists some of the numbers from the first column of records.txt:
28433004
28815001
...
There are about 0.2 million numbers in key.txt. Now I am trying to pick out the matching records from records.txt based on key.txt. I tried the script below:
pick_records.s
foreach line (`cat key.txt`)
    awk -v key="$line" '$1 == key {print; exit}' records.txt
end
I ran the script with: source pick_records.s > output.txt
The script did the job but ran slowly. I am wondering if there is a more efficient way to achieve this task.
Thanks.
awk 'FNR==NR {keys[$1];next} $1 in keys' key.txt records.txt
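For readers unfamiliar with the idiom: FNR == NR is true only while awk is reading the first file (key.txt), so every key is stored as an array index in a single pass; records from the second file are then printed whenever their first field is in the array, giving one hash lookup per record instead of a full scan per key. A minimal, self-contained sketch using tiny made-up sample data in the thread's format:

```shell
#!/bin/sh
# Hypothetical sample files mirroring the formats described in the thread.
printf '28433005 1 1 3\n28433004 0 2 3\n28815001 2 2 2\n' > records.txt
printf '28433004\n28815001\n' > key.txt

awk '
    FNR == NR { keys[$1]; next }   # first file: remember each key as an array index
    $1 in keys                     # second file: print records whose first field is a key
' key.txt records.txt
# prints:
# 28433004 0 2 3
# 28815001 2 2 2
```

Note the file order on the command line matters: key.txt must come first, or the FNR == NR test stores the wrong file.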
sea
April 27, 2015, 7:22pm
3
Hello and welcome to the forum.
Please use code tags for code block examples, as required by the forum rules.
Not sure on the script, as I don't know csh or ksh.
In bash I'd do:
while read line; do
    awk -v key="$line" '$1 == key' records.txt
done < key.txt
Then execute it:
$SHELL pick_records.s > output.txt
Note that I find "pick_" quite irritating, since you don't pick (as in: select a single entry) anything; you run through everything (every line) found.
Half a million (or even just 200,000) lines do take some time to process.
You could measure the difference by adding time in front of the script:
time $SHELL pick_records.s
Hope this helps (hth)
Hi vgersh99 and sea,
Thanks for your solutions.
I tried vgersh99's one-line awk command. It worked great and solved my problem in seconds. Amazing. Thanks again.
If you can edit your key.txt to prefix each record with ^ and suffix each with $, then maybe you can use it with grep and the appropriate flags:-
grep -f key.txt records.txt
Does this help? There might be problems with the size of key.txt.
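One caveat with a plain grep -f is that an unanchored key can match a substring anywhere in a record. A variant worth trying, without editing key.txt at all, is grep's -F (fixed strings) and -w (whole-word match) flags; this is only a sketch on made-up sample data, and -w could still match a key appearing in a later column:

```shell
#!/bin/sh
# Hypothetical sample data in the thread's format.
printf '28433005 1 1 3\n28433004 0 2 3\n' > records.txt
printf '28433004\n' > key.txt

# -F: treat each key as a fixed string (no regex metacharacters)
# -w: match only whole words, so a key cannot match inside a longer number
# -f: read the patterns from key.txt
grep -Fw -f key.txt records.txt
# prints: 28433004 0 2 3
```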
Robin
Hi rbatte1,
I tried the grep method too, but it didn't work well. Thanks.