Extracting lines from a big csv file based on two columns in another file

samuel1 · October 19, 2024, 1:38pm

I have a big csv file with 70 columns (file1). I would like to extract the lines from file1 based on the two columns (column2 and column4 in file1) of another file (file2).
fle1.csv

11021A/G   120423  TAATTCCGG   25  0  1  1  1  1
11904C/G    178485  TGGCC           14   1   -   0  -   0
11890A/T    12785     TCCAATTGG  13   1 -    0   1  1

file2.csv

178485        14
2346784       3
34532          14

Desired output

11904C/G    178485  TGGCC     14   1   -   0  -   0

could you please help me?

vbe · October 19, 2024, 2:21pm

Welcome on board!

I suppose there are many ways to solve such task, mostly depending on your background and knowledge, we are here to help yo, not do the work for you... So what have you tried so far?
If you don't know how to begin, well the first thing is to say what you know in shel scripting or programming language, then think how do you think it could be solved?
E.g. I extract from file 1, column 2 and 4 and add a column for line number then compare with file 2... those matching I have the line numbers to print from the original file...
How do you see things with your knowledge ?

Paul_Pedant · October 19, 2024, 5:18pm

The example data is flawed because it works equally well if $2 matches, regardless of the value of $4. Presumably you intent that both column 2 and column 4 must match, but there is no example like 178485 .. 13 not being matched.

There is also the minor point that neither of the example files is actually CSV format.

samuel1 · October 19, 2024, 11:07pm

In the second row, both the second and fourth column matches.

I tried the following code. There is no data in the output file. The code is taken from linux - Extracting lines from a file based on two columns in another text file - Unix & Linux Stack Exchange

awk 'FILENAME == "file2" { remember[$1 $2]=1 ;} FILENAME != "file2" { if ( $1 $2 in remember ) print ; } ' file2 file1 > file3

Paul_Pedant · October 20, 2024, 8:51am

So it does. But 120423 and 34532 do not appear in file2 at all, so the 14 does not make any difference. You need to be less vague with "based on the two columns", maybe instead "and exactly matching both the corresponding columns", and also provide an example where 178487 matches and 14 does not.

The code you got (or adapted) from another forum is wrong for your case. It stores $1 $2 from file2, but for your line layouts it needs to check $2 $4 from file1.

It is wrong in another way too. If file2 had 234 56 and file1 had 23 456, it would still match because the code just runs the values together both times as 23456. The indexes should be $1,$2 and $2,$4, which inserts a separator between the two values which makes them different in combination.

You might look at the feedback on posts: the one you reference is a rare sight -- an accepted answer with no upvotes. Mind you, in this case the "already answered" thread is also buggy -- que sera, sera.

Usually, you would do { remember[$1 $2]=1 ; next; }, which saves you testing for the opposite condition in the line after.

It is clumsy to put the filenames into the awk code and also have to specify them on the command line. Maintainers will usually change one and not the other. The usual idiom is to make the test on the first line as FNR == NR.

This works because FNR is the line number in the current file, and NR is the total lines read so far. If these are the same, you must be reading from the first file (except a corner case if the first file is empty).

The default action for a condition is to print, so my take on this is:

awk 'FNR == NR { ++Hit[$1,$2]; next; } ($2,$4) in Hit' file2 file1

MadeInGermany · October 20, 2024, 10:49am

By default awk has a "whitespace" (multi space or tab) field separator that fits well here.

The Hit[$1,$2] looks like a two-dimensional array, but behind the scenes Hit is one-dimensional, and its index is a concatenated string $1 SUBSEP $2

It is not necessary to do something with Hit[$1,$2], just a reference defines(makes) it.
And that's enough for the ($2,$4) in Hit condition.

awk '(FNR == NR) { Hit[$1,$2]; next; } (($2,$4) in Hit)' file2 file1

For clarity I put the two conditions in parentheses.

samuel1 · October 20, 2024, 11:20am

Thank you for your code. But unfortunately, both of the codes do not given output.

samuel1 · October 20, 2024, 11:27am

Thank you for your code. Unfortunately, there is no output.

Neo · October 20, 2024, 11:48am

macos~ $ cat file1.csv
11021A/G   120423  TAATTCCGG   25  0  1  1  1  1
11904C/G    178485  TGGCC           14   1   -   0  -   0
11890A/T    12785     TCCAATTGG  13   1 -    0   1  1

macos ~ $ cat file2.csv
178485        14
2346784       3
34532          14

macos~ $ cat filter_csv.rb
require 'set'

# Initialize a set to store criteria from file2.csv
criteria = Set.new

# Read file2.csv and add (column2, column4) pairs to the set
File.foreach('file2.csv') do |line|
  columns = line.split
  # Ensure the line has at least two columns
  if columns.length >= 2
    column2, column4 = columns[0].strip, columns[1].strip
    criteria.add([column2, column4])
  end
end

# Read file1.csv and print lines that match the criteria
File.foreach('file1.csv') do |line|
  columns = line.split
  # Ensure the line has at least 5 columns
  if columns.length >= 5
    column2, column4 = columns[1].strip, columns[3].strip
    if criteria.include?([column2, column4])
      puts line
    end
  end
end

macos ~ $ ruby filter_csv.rb
11904C/G    178485  TGGCC           14   1   -   0  -   0
macos ~ $

Paul_Pedant · October 20, 2024, 12:37pm

Both my code, and @MadeInGermany , produce the required output, using the code we posted. I have no idea how this does not work for you.

I note your original code source redirects the output to > file3. Are you by any chance still doing that? It would obviously send the results to that file, and not to the terminal.

$ cat file2
178485        14
2346784       3
34532          14

$ cat file1
11021A/G   120423  TAATTCCGG   25  0  1  1  1  1
11904C/G    178485  TGGCC           14   1   -   0  -   0
11890A/T    12785     TCCAATTGG  13   1 -    0   1  1

$ awk '(FNR == NR) { Hit[$1,$2]; next; } (($2,$4) in Hit)' file2 file1
11904C/G    178485  TGGCC           14   1   -   0  -   0

$ awk 'FNR == NR { ++Hit[$1,$2]; next; } ($2,$4) in Hit' file2 file1
11904C/G    178485  TGGCC           14   1   -   0  -   0
$