How to use grep in a loop using a bash script?

Dear all,

Please help with the following.

I have a file, let's call it data.txt, that has 3 columns and approx 700,000 lines, and looks like this:

rs1234  A  C
rs1236  T  G
rs2345  G  T

I have a second file, called reference.txt, which has one column with about 500,000 lines and contains some, but not all, of the values in column 1 of data.txt, e.g.:

rs1234
rs2345
...

I want to 'grep' out all the lines in data.txt that have a match in reference.txt, so that I end up with:

rs1234  A  C
rs2345  G  T

I have tried:

cat data.txt | grep -f reference.txt > output.txt 

But this was taking far too long.

I therefore thought I might need to loop it using a bash script. I had a go, but got nowhere with the following:

for i in reference.txt; do
grep "$i" data.txt
done

I am sure that this must be quite simple to do, but would be grateful for your help with this.

Thank you,

AB

Using awk:-

awk 'NR==FNR{A[$1];next}$1 in A' reference.txt data.txt
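To show what the awk one-liner is doing, here is a small self-contained sketch. The temporary directory and the names workdir, ids and awk_result are only for illustration, and the sample rows are the ones from the question. NR==FNR is true only while awk reads the first file named on the command line, so reference.txt is loaded into an array first, and data.txt is then checked against it in a single pass.

```shell
# Build small sample files mirroring the question (illustrative data only).
workdir=$(mktemp -d)
printf 'rs1234  A  C\nrs1236  T  G\nrs2345  G  T\n' > "$workdir/data.txt"
printf 'rs1234\nrs2345\n' > "$workdir/reference.txt"

# While reading the first file (NR==FNR): store each ID as an array key, then skip to the next line.
# While reading the second file: print the line only if its first column is a stored key.
awk_result=$(awk 'NR==FNR { ids[$1]; next } $1 in ids' "$workdir/reference.txt" "$workdir/data.txt")
printf '%s\n' "$awk_result"
```

This should print the rs1234 and rs2345 lines only. Because the lookup is an exact match on column 1, an ID such as rs123 would not pick up rs1234. One caveat: if reference.txt is empty, NR==FNR also holds for data.txt, so nothing is printed.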

Using grep (no need to use cat):-

grep -f reference.txt data.txt
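If plain grep -f is still too slow, it may help to tell grep that the patterns are fixed strings rather than regular expressions (-F) and that they must match whole words (-w). This is a sketch under the assumption that your grep supports -w (GNU and BSD grep do). The sample reference file below deliberately contains rs123, which is not in the question, to show the effect of -w.

```shell
# Build small sample files (illustrative data only; rs123 is added to show what -w does).
workdir=$(mktemp -d)
printf 'rs1234  A  C\nrs1236  T  G\nrs2345  G  T\n' > "$workdir/data.txt"
printf 'rs123\nrs2345\n' > "$workdir/reference.txt"

# -F: treat each pattern as a literal string, not a regular expression.
# -w: match whole words only, so rs123 does not match rs1234 or rs1236.
# -f: read the patterns from reference.txt, one per line.
grep_result=$(grep -F -w -f "$workdir/reference.txt" "$workdir/data.txt")
printf '%s\n' "$grep_result"
```

Here only the rs2345 line is printed. Note that grep searches the whole line rather than just column 1, which should be fine for data like this but is worth keeping in mind. A loop that runs one grep per ID would rescan data.txt about 500,000 times, so it is unlikely to be faster than either single-pass approach.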