jl487
September 11, 2013, 7:41am
1
Hello All,
I have two VERY large .csv files that I want to compare values based on substrings. If the lines are unique, then print the line.
For example, if I run a
diff file1.csv file2.csv
I get results similar to
+_id34,brown,car,2006
+_id1,blue,train,1985
+_id73,white,speed_boat,1990
-_id34,brown,car,2006
-_id72,white,plane,2010
-_id73,white,speed_boat,1990
I want to compare the ids (string between "_" and ",") and if it's unique, then print the line so my output would be like the following:
Output:
+_id1,blue,train,1985
-_id72,white,plane,2010
I was thinking I could use the cut command and delimit on the first "_", but I didn't know how to compare all the values up until the first comma.
Any suggestions?
Hi,
An awk command would be better, but you can try:
$ cat comp.txt
+_id34,brown,car,2006
+_id1,blue,train,1985
+_id73,white,speed_boat,1990
-_id34,brown,car,2006
-_id72,white,plane,2010
-_id73,white,speed_boat,1990
$ sed 's/[+-]_\([^,]*,\).*/\1/' comp.txt | sort | uniq -u | grep -f - comp.txt
+_id1,blue,train,1985
-_id72,white,plane,2010
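For anyone following along, the intermediate stage of that pipeline can be inspected by stopping before the final grep (comp.txt here is just the sample file from this thread):

```shell
# Extract the "id<n>," key from each line, then keep only the keys
# that occur exactly once across both sides of the diff.
sed 's/[+-]_\([^,]*,\).*/\1/' comp.txt | sort | uniq -u
# → id1,
# → id72,
```

grep -f - then treats those two keys as substring patterns and prints the matching lines from comp.txt.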
Regards.
Ygor
September 11, 2013, 8:19am
3
Try...
diff file[12].csv | awk -F '_|,' '{a[$2]=$0;b[$2]++}END{for(i in a)if(b[i]==1)print a[i]}'
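That one-liner is dense; spelled out as a commented script it reads like this (comp.txt stands in for the saved diff output, so the filename is only illustrative):

```shell
# Split each line on "_" or "," so that $2 is the id, remember the full
# line per id, count occurrences, and print the lines whose id is unique.
awk -F '_|,' '
    { line[$2] = $0; count[$2]++ }
    END { for (i in count) if (count[i] == 1) print line[i] }
' comp.txt
```

Note that `for (i in count)` iterates in no particular order, so pipe the result through sort if ordering matters.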
jl487
September 13, 2013, 11:03am
5
I think I may have found a flaw in this code. It seems that if there is a space anywhere in a line, that line is ignored/thrown out, even if its id is unique.
Can someone please help me?
apmcd47
September 13, 2013, 11:46am
6
This appears to work for the example text:
fgrep -v -f<(sed 's/^._\(id[0-9][0-9]*\).*$/\1/' < comp.txt | sort | uniq -d) comp.txt
so
fgrep -v -f file input
lists the lines from input that don't match any of the lines in file (each line in file is treated as a substring, obviously)
<(...)
is a process substitution, allowing the output of another command to be used in place of a file
sed 's/^._\(id[0-9][0-9]*\).*$/\1/'
gives us the list of ids (idxx)
uniq -d
lists the duplicated lines (unique lines are thrown away).
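Putting it all together on the sample data from this thread (bash is assumed, since <(...) is a bashism, and comp.txt is just the sample file):

```shell
# Build the list of ids that appear on both sides of the diff (uniq -d),
# then drop every line of comp.txt containing one of those ids.
fgrep -v -f <(sed 's/^._\(id[0-9][0-9]*\).*$/\1/' comp.txt | sort | uniq -d) comp.txt
# → +_id1,blue,train,1985
# → -_id72,white,plane,2010
```

(fgrep is the historical spelling of grep -F, i.e. fixed-string matching.)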
Does this work for you?
Andrew