jl487
September 11, 2013, 7:41am
1
Hello All,
I have two VERY large .csv files that I want to compare values based on substrings. If the lines are unique, then print the line.
For example, if I run a
diff file1.csv file2.csv
I get results similar to
+_id34,brown,car,2006
+_id1,blue,train,1985
+_id73,white,speed_boat,1990
-_id34,brown,car,2006
-_id72,white,plane,2010
-_id73,white,speed_boat,1990
I want to compare the ids (string between "_" and ",") and if it's unique, then print the line so my output would be like the following:
Output:
+_id1,blue,train,1985
-_id72,white,plane,2010
I was thinking I could use the cut command and delimit on the first "_", but I didn't know how to compare all the values up until the first comma.
Any suggestions?
Hi,
An awk command would be better, but you can try:
$ cat comp.txt
+_id34,brown,car,2006
+_id1,blue,train,1985
+_id73,white,speed_boat,1990
-_id34,brown,car,2006
-_id72,white,plane,2010
-_id73,white,speed_boat,1990
$ sed 's/[+-]_\([^,]*,\).*/\1/' comp.txt | sort | uniq -u | grep -f - comp.txt
+_id1,blue,train,1985
-_id72,white,plane,2010
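For anyone following along, the intermediate stage of that pipeline can be inspected by stopping before the final grep (comp.txt here is just the sample file from this thread):

```shell
# Extract the "id<n>," key from each line, then keep only the keys
# that occur exactly once across both sides of the diff.
sed 's/[+-]_\([^,]*,\).*/\1/' comp.txt | sort | uniq -u
# → id1,
# → id72,
```

grep -f - then treats those two keys as substring patterns and prints the matching lines from comp.txt.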
Regards.
Ygor
September 11, 2013, 8:19am
3
Try...
diff file[12].csv | awk -F '_|,' '{a[$2]=$0;b[$2]++}END{for(i in a)if(b[i]==1)print a[i]}'
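That one-liner is dense; spelled out as a commented script it reads like this (comp.txt stands in for the saved diff output, so the filename is only illustrative):

```shell
# Split each line on "_" or "," so that $2 is the id, remember the full
# line per id, count occurrences, and print the lines whose id is unique.
awk -F '_|,' '
    { line[$2] = $0; count[$2]++ }
    END { for (i in count) if (count[i] == 1) print line[i] }
' comp.txt
```

Note that `for (i in count)` iterates in no particular order, so pipe the result through sort if ordering matters.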
jl487
September 13, 2013, 11:03am
5
I think I may have found a flaw in this code. It seems that if there is a space anywhere in a line, that line is ignored/thrown out, even if its id is unique.
Can someone please help me?
apmcd47
September 13, 2013, 11:46am
6
This appears to work for the example text:
fgrep -v -f<(sed 's/^._\(id[0-9][0-9]*\).*$/\1/' < comp.txt | sort | uniq -d) comp.txt
so
fgrep -v -f file input
lists the lines from input that don't match any of the lines in file (each line in file is treated as a substring, obviously)
<(...)
is a process substitution, allowing the output of another command to be used in place of a file
sed 's/^._\(id[0-9][0-9]*\).*$/\1/'
gives us the list of ids (idxx)
uniq -d
lists the duplicated lines (unique lines are thrown away).
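Putting it all together on the sample data from this thread (bash is assumed, since <(...) is a bashism, and comp.txt is just the sample file):

```shell
# Build the list of ids that appear on both sides of the diff (uniq -d),
# then drop every line of comp.txt containing one of those ids.
fgrep -v -f <(sed 's/^._\(id[0-9][0-9]*\).*$/\1/' comp.txt | sort | uniq -d) comp.txt
# → +_id1,blue,train,1985
# → -_id72,white,plane,2010
```

(fgrep is the historical spelling of grep -F, i.e. fixed-string matching.)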
Does this work for you?
Andrew