I have two files. One consists of a single line of numbers separated by spaces, with each number appearing only once.
The other consists of a single column with multiple lines, and some numbers can appear more than once.
It looks something like this:
file1:
20 700 15 30
file2:
10
10
200
200
700
700
700
20
30
30
50
(The files are a result of some other processing and scripts so there could be some extra spaces or tabs that I cannot easily influence/remove)
I would like to print the lines from file2 that do not have a match in file1. It is very important that when every number in file2 already appears in file1, I get a completely empty file, with no spaces or any other characters.
I have found some ways to do it when both files are columns, but not when one of them is a single line. When I tried transforming the one-line file into a one-column file, I got some unwanted spaces in the output.
However, I then got an empty line as the output (instead of the wanted empty file) when both files contained the same numbers (as described at the end of my original post). I would like to solve it without modifying file1, but I don't know how to approach it or where to start.
Thanks, I've tried that now, but I still have the same problem as when using the code from my second post.
With the files containing different numbers, as in my first post, I get empty lines as the first and last lines.
The data in file1 is on one line rather than in columns, and the files are part of a C shell script and are produced as results in a loop, so it would be difficult to be sure that they will never contain any extra characters. I would rather keep file1 as one line instead of converting it to a column.
Is there a way to use indices with lines as with the columns in awk?
You don't need to change the original files, but you can do whatever is needed to your own work files.
You were getting close by breaking the many numbers on one line into one per line. From there, sort copies of both files the same way (file 2 may need a unique sort) and then run them through diff. This process won't work if diff outputs "c" lines with both "<" and ">", but if it doesn't, you can take out the lines containing a or d, then take out the first two characters of all other lines. For example:
diff file1 file2 | grep -v '[ad]' | sed 's/..//' >outputfile
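The work-file approach described above can be sketched roughly as follows. This is only a sketch, assuming a POSIX shell with tr, sort, and comm available. It swaps diff for comm -13, which prints only the lines unique to the second input, so there are no a, c, or d lines to filter out. The scratch directory and the work-file names (f1.sorted, f2.sorted) are made up for illustration, and the sample file1 has extra blanks and a tab added on purpose.

```shell
#!/bin/sh
# Work in a scratch directory so the original files are never modified.
work=$(mktemp -d)

# Sample data from the first post (file1 is one line, file2 is one column).
printf '20 700  15\t30 \n' > "$work/file1"
printf '10\n10\n200\n200\n700\n700\n700\n20\n30\n30\n50\n' > "$work/file2"

# Turn blanks and tabs into newlines, drop empty lines, then sort uniquely.
tr -s ' \t' '\n\n' < "$work/file1" | grep . | sort -u > "$work/f1.sorted"
tr -s ' \t' '\n\n' < "$work/file2" | grep . | sort -u > "$work/f2.sorted"

# comm -13 hides lines found only in f1 and lines found in both,
# leaving the lines that appear only in f2.
comm -13 "$work/f1.sorted" "$work/f2.sorted" > "$work/outputfile"
cat "$work/outputfile"
```

Because of the unique sort, each unmatched number is listed once and the original order of file2 is not kept. If every number in file2 is also in file1, comm writes nothing and outputfile is zero bytes long.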
Sorry, I don't understand what you mean by c lines and lines containing a or d.
Also, this code gave me output whose second line has two numbers separated by a comma that are in neither of the files. Is that some counter of data entries?
diff reports the differences between two files and what has to happen to change the first file into the second. If a record appears in the first file but not the second, diff reports the line(s) in the first file, a d, and the line in the second file where they would have been.
For a record in the second file but not the first, diff reports the line number in the first file followed by an a, and then the records added.
If File 1 contains (actually 1 per line) 1 3 5 9 10 11 12 23 48 and
File 2 contains (actually 1 per line) 2 4 6 8 9 yy 10, the output from diff will be
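To look at that output yourself, a small sketch like the one below recreates the two example files and runs diff on them. It assumes a POSIX shell and a diff that writes the default ("normal") output format. The scratch directory and the diff.out file name are only for illustration.

```shell
#!/bin/sh
demo=$(mktemp -d)

# The two example files, one value per line.
printf '%s\n' 1 3 5 9 10 11 12 23 48 > "$demo/file1"
printf '%s\n' 2 4 6 8 9 yy 10 > "$demo/file2"

# diff exits with status 1 when the files differ; that is expected here.
diff "$demo/file1" "$demo/file2" > "$demo/diff.out"
cat "$demo/diff.out"

# The command lines (the ones with a, c, or d) start with a line number.
echo "command lines only:"
grep '^[0-9]' "$demo/diff.out"
```

In the full output, lines starting with "<" are records from the first file and lines starting with ">" are records from the second file. The numbers separated by a comma on a command line are a range of line numbers, not data from either file.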
NR == 1 ... while it is reading the first line of the first file do everything in the curly brackets - YES
The for loop changes the value of n from the total number of pieces resulting from split to 0 - YES
F1 is an associative array containing different pieces from array T as it goes through the loop, i.e. all the numbers from file1 that I need - YES - in its index
next tells it to go to the next line, which ends the NR == 1 condition, and starts reading file2 since there is only one line in file1 - YES
It then reads file2 where it checks for every line if it does not match any of the elements of array F1 - YES
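The awk script being walked through is not shown in this part of the thread, so the following is only a reconstruction from the description above, not necessarily the exact code that was posted. It assumes file1 has exactly one non-empty line. The names T, F1, and n come from the walkthrough. The unmatched function, the scratch directory, and the file2b test file are made up for illustration.

```shell
#!/bin/sh
dir=$(mktemp -d)
printf '20 700  15\t30 \n' > "$dir/file1"
printf '10\n10\n200\n200\n700\n700\n700\n20\n30\n30\n50\n' > "$dir/file2"

# unmatched FILE1 FILE2: print the numbers in FILE2 that are not in FILE1.
unmatched() {
    awk '
    NR == 1 {                  # the single line of the first file
        n = split($0, T)       # split on runs of blanks and tabs into T[1..n]
        for (; n > 0; n--)     # count n down from the number of pieces to 0
            F1[T[n]] = 1       # each number becomes an index of F1
        next                   # skip the rule below and read the next line
    }
    # Every later line comes from the second file. Skip blank lines and
    # print the first field only if it is not an index of F1.
    NF && !($1 in F1) { print $1 }
    ' "$1" "$2"
}

unmatched "$dir/file1" "$dir/file2" > "$dir/unmatched"
cat "$dir/unmatched"

# A second file containing only numbers that are already in file1.
printf '700\n 20\n30\t\n' > "$dir/file2b"
unmatched "$dir/file1" "$dir/file2b" > "$dir/empty_case"
wc -c < "$dir/empty_case"
```

Printing $1 rather than the whole line drops stray spaces and tabs around each number, and the NF test skips blank lines, so when nothing is unmatched awk prints nothing and the output file is completely empty. file1 is only read, never changed.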