Comparing two one-line files and selecting what does not match

maya3 · August 15, 2018, 12:45pm

I have two files. One consists of a single line, with numbers separated by spaces and each number appearing only once.
The other consists of a single column over multiple lines, where some numbers can appear more than once.
It looks something like this:

file 1:

20 700 15 30

file2:

(The files are a result of some other processing and scripts so there could be some extra spaces or tabs that I cannot easily influence/remove)

I would like to print the lines from file2 that do not have a match in file1. It is very important that in case there aren't any lines in file2 that do not have a match in file1 (i.e. when the file2 doesn't contain any numbers that aren't already in file1), I get a completely empty file, and not spaces or any other characters.

I have found some ways to do it when both files are columns, but not when one of them is a single line. When I tried transforming the one-line file into a one-column file, I got some unwanted spaces in the output.

Thank you!

RudiC · August 15, 2018, 12:51pm

Welcome to the forum.

Please show the attempts you made and where you got stuck.

maya3 · August 15, 2018, 1:33pm

I tried with turning file1 into a column file with:

tr ' ' '\n' < file1 | awk '{print $0}' > file1_new

and then solving it by working with columns

awk '{k = $1} NR==FNR{a[k]; next} !(k in a)' file1_new  file2

However, I then got an empty line as the output (instead of the wanted empty file) when both files contained the same numbers (as described at the end of my original post). I would like to solve it without modifying file1, but I don't know how to approach it or where to start.

kenshinhimura · August 15, 2018, 1:39pm

comm -23 will work for you
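
For reference, comm compares sorted files line by line, so both inputs would first have to be one value per line and sorted the same way. A minimal sketch, assuming work copies are acceptable: the names f1.sorted, f2.sorted and only_in_file2 and the sample data are made up for illustration (the real file2 is not shown in this thread), and note that sort -u gives up both the duplicates and the original order of file2.

```shell
# Hypothetical sample data written to a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"
printf '20 700 15 30 \n' > file1
printf '20\n10\n 700\n10\n200\n\n15\n200\n50\t\n30\n' > file2

# One value per line, blank lines dropped, sorted the way comm expects.
tr -s ' \t' '\n' < file1 | grep . | sort > f1.sorted
awk 'NF {print $1}' file2 | sort -u > f2.sorted

# -2 hides lines found only in f1.sorted, -3 hides lines common to both,
# so what remains are the values that appear only in file2.
comm -23 f2.sorted f1.sorted > only_in_file2
cat only_in_file2
```

With this sample data the result should be the three values 10, 200 and 50, each on its own line; if every value of file2 were also in file1, only_in_file2 would be zero bytes long.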

maya3 · August 15, 2018, 2:19pm

Thanks, I've tried that now, but I still have the same problem as with the code from my second post.
With the files containing different numbers as in my first post, I get empty lines as the first and last lines.

Since the data in file1 is in one line rather than in a column, and the files are produced inside a loop in a csh script, it would be difficult to be sure they never contain any extra characters. I would rather keep file1 as one line instead of converting it to a column.

Is there a way to use indices with lines as with the columns in awk?

wbport · August 15, 2018, 5:04pm

You don't need to change the original files, but you can do whatever is needed to your own work files.

You were getting close by breaking many numbers on one line to one per line. From there, sort copies of both files the same way (file 2 may need a unique sort) and then run them through diff. This process won't work if diff outputs "c" lines with both "<" and ">", but if not you can take out lines containing a or d, then take out the first two characters of all other lines. For example:

diff file1 file2 |grep -v d |  sed 's/..//' >outputfile

maya3 · August 15, 2018, 5:28pm

Sorry, I don't understand what you mean by c lines and lines containing a or d.

Also, this code gave me output whose second line has two numbers separated by a comma that are in neither of the files. Is that some counter of data entries?

RudiC · August 15, 2018, 6:13pm

How about

awk 'NR == 1 {for (n=split($0, T); n; n--) F1[T[n]]; next} !($1 in F1)' file[12]
10
10
200
200
50
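
To check the empty-file requirement against this one-liner, something like the following could be used. It is only a sketch: the sample data and the output names are invented, and the NF guard and the print $1 action are additions, not part of the command above. The guard skips lines of file2 that hold nothing but spaces or tabs (they would otherwise be printed as blank lines), and print $1 writes the value without the surrounding blanks.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf ' 20 700\t15 30 \n' > file1        # stray blanks and a tab
printf '20\n 700\n\n15 \n30\n' > file2    # every value is also in file1
printf '20\n 10\n\n15 \n' > file2b        # 10 is not in file1

# Same idea as the one-liner, with the two additions described above.
prog='NR == 1 {for (n = split($0, T); n; n--) F1[T[n]]; next}
      NF && !($1 in F1) {print $1}'

awk "$prog" file1 file2  > unmatched
awk "$prog" file1 file2b > unmatched_b

# -s is true only for a file that exists and is larger than zero bytes.
if [ -s unmatched ]; then echo "unmatched values found"; else echo "output is empty"; fi
cat unmatched_b
```

Here unmatched should come out as a zero-byte file, while unmatched_b should contain the single line 10.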

wbport · August 15, 2018, 6:35pm

diff reports differences between two files and what has to happen to change the first file into the second. If a record appears in the first file but not the second, diff reports the line number(s) in the first file, a d, and the line in the second file after which the records would have appeared.

For a record in the second file but not the first, diff reports the line number in the first file, an a, and the line number(s) in the second file, followed by the records added.

If file 1 contains (actually 1 per line) 1 3 5 9 10 11 12 23 48 and
file 2 contains (actually 1 per line) 2 4 6 8 9 yy 10, the output from diff will be

1,3c1,4
< 1
< 3
< 5
---
> 2
> 4
> 6
> 8
4a6
> yy
6,9d7
< 11
< 12
< 23
< 48

Only lines starting with ">" or "<" appear in one file but not the other. You should never have a change where you have both symbols and a "---" line.
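
Given output in that shape, one way to keep just the records that exist only in the second file would be to select the ">" lines and strip the marker, rather than deleting lines by letter. A rough sketch using the sample values above; the file names f1, f2 and only_in_f2 are placeholders, and it assumes both files are in a consistent order, since diff is sensitive to line order.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' 1 3 5 9 10 11 12 23 48 > f1
printf '%s\n' 2 4 6 8 9 yy 10 > f2

# Print only the lines that start with "> ", with that prefix removed.
# The "1,3c1,4"-style headers, the "<" lines and "---" are all dropped.
# diff exits with status 1 when the files differ, which is expected here.
diff f1 f2 | sed -n 's/^> //p' > only_in_f2
cat only_in_f2
```

For these two files that leaves 2, 4, 6, 8 and yy, one per line, and it produces a zero-byte file when diff finds nothing to add.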

maya3 · August 16, 2018, 8:35am

@wbport thanks for the explanation, but this solution would require more data handling so I'll go with the other one

@RudiC, this works but may I ask if I understood the code well:

NR == 1 ... while it is reading the first line of the first file do everything in the curly brackets

The for loop changes the value of n from the total number of pieces resulting from split to 0

F1 is an associative array containing different pieces from array T as it goes through the loop, i.e. all the numbers from file1 that I need

next tells it to go to the next line, which ends the NR == 1 condition, and starts reading file2 since there is only one line in file1

It then reads file2 where it checks for every line if it does not match any of the elements of array F1

RudiC · August 16, 2018, 9:07am

NR == 1 ... while it is reading the first line of the first file do everything in the curly brackets - YES
The for loop changes the value of n from the total number of pieces resulting from split to 0 - YES
F1 is an associative array containing different pieces from array T as it goes through the loop, i.e. all the numbers from file1 that I need - YES - in its index
next tells it to go to the next line, which ends the NR == 1 condition, and starts reading file2 since there is only one line in file1 - YES
It then reads file2 where it checks for every line if it does not match any of the elements of array F1 - YES
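
Written out with comments, the same logic might look like this. The layout is only for readability; the contents of file2 and the name result are invented, chosen so that the result matches the output shown earlier in the thread.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf '20 700 15 30\n' > file1
printf '20\n10\n700\n10\n200\n15\n200\n50\n30\n' > file2   # hypothetical data

awk '
NR == 1 {                  # only the very first line read, i.e. file1
    n = split($0, T)       # T[1..n] holds the numbers, n is their count
    for (; n; n--)         # count n down until it reaches 0
        F1[T[n]]           # merely referencing an index creates it
    next                   # skip the test below for this line
}
!($1 in F1)                # print lines whose first field is not an index of F1
' file1 file2 > result
cat result
```

Because split uses the default field separator, runs of spaces or tabs in file1 do not create empty entries in T.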