String compare

sbasetty · January 26, 2007, 8:08pm

Hi Friends,

Can anyone help me compare the records in two files?

I have two files (csv)

FILE1:

1023,SMITH JAMES , (203) 789-1249
1023,HARRY POTTER , (213) 789-1249
1023,JONES D, (903) 789-1249

FILE2:

1023,SMITH ,2037891249
1023,HARRY , 2137891249
1023,JONES, 7037891249

It should return only one row, i.e. 1023,JONES, 7037891249, because the phone numbers are different. It has to suppress the "(" characters and the blanks.

Thanks in advance for your help.

S

I have to compare values such as (203) 789-1249 and 2037891249.

ghostdog74 · January 26, 2007, 8:43pm

here's an idea to start with:

[root@localhost test]# echo "1023,SMITH , (203) 789-1249" |sed 's/[-() ]//g'
1023,SMITH,2037891249

All "(", ")", "-" and space characters are stripped. You can do this for both files and then compare them.

[root@localhost test]# sed -i 's/[-() ]//g' file
[root@localhost test]# sed -i 's/[-() ]//g' file2
[root@localhost test]# diff file file2
3c3
< 1023,JONES,9037891249
---
> 1023,JONES,7037891249
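
A side note on the commands above: sed -i rewrites the files in place, so the original formatting is lost. The sketch below is one way to get the same diff while leaving the inputs untouched. The scratch directory and the .norm file names are assumptions made for illustration, and the sample rows are the single-word-name version of the data used in this post.

```shell
# Work in a scratch directory (hypothetical file names) so nothing is overwritten.
dir=$(mktemp -d)
printf '%s\n' '1023,SMITH , (203) 789-1249' \
              '1023,HARRY , (213) 789-1249' \
              '1023,JONES, (903) 789-1249' > "$dir/file"
printf '%s\n' '1023,SMITH ,2037891249' \
              '1023,HARRY , 2137891249' \
              '1023,JONES, 7037891249' > "$dir/file2"

# Write normalized copies instead of editing in place with -i.
sed 's/[-() ]//g' "$dir/file"  > "$dir/file.norm"
sed 's/[-() ]//g' "$dir/file2" > "$dir/file2.norm"

# diff exits with status 1 when the inputs differ, hence the "|| true".
diff_out=$(diff "$dir/file.norm" "$dir/file2.norm" || true)
printf '%s\n' "$diff_out"
```

The diff still reports only line 3, and the original files keep their spaces and parentheses.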

radoulov · January 27, 2007, 8:56am

If I understand correctly:

awk 'NR==FNR{gsub(/[ \(\)-]/,"");x[$0];next}
{gsub(/ /,"")}!($0 in x)' file1 file2

Use nawk on Solaris.

nervous · January 27, 2007, 10:02am

Can you please explain your code.

Thanks in advance.
An awk student.

radoulov · January 27, 2007, 1:18pm

awk '
# If NR==FNR this is the first file, so get rid of 
#+ the "(",")","-"," " characters ("gsub" is global substitution),
#+ and populate the x array: x[$0].
NR==FNR{gsub(/[ \(\)-]/,"");x[$0];next}
# Otherwise, it's the second file, so
#+ remove the spaces. Now we have
#+ the right formatting.
{gsub(/ /,"")}
# If the current record is not
#+ previously stored in the x array,
#+ print it (default action).
!($0 in x)' file1 file2

AnOTHER awk student
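
For anyone following along, the two-file idiom in the explanation above can be tried on its own with toy data. This is only a sketch: the file names seen and candidates and their contents are made up for illustration.

```shell
dir=$(mktemp -d)
printf '%s\n' 'a' 'b' 'c' > "$dir/seen"
printf '%s\n' 'b' 'd' 'a' > "$dir/candidates"

# NR==FNR holds only while the first file is read: store each line as an array key.
# For the second file, the bare pattern !($0 in x) prints lines that were never stored.
only_in_second=$(awk 'NR==FNR { x[$0]; next } !($0 in x)' "$dir/seen" "$dir/candidates")
printf '%s\n' "$only_in_second"
```

Only d is printed, since a and b were stored from the first file. One caveat: if the first file is empty, NR==FNR is also true while the second file is read, so nothing gets printed.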

sbasetty · January 29, 2007, 2:45pm

Thank you all,

Small clarification on this:
How can we use sed on a particular column (the third column in this example)?
sed 's/[-() ]//g' is processing all the columns.
I have two files to compare.

[root@localhost test]# echo "1023,SMITH , (203) 789-1249" |sed 's/[-() ]//g'
1023,SMITH,2037891249

sbasetty · January 29, 2007, 4:23pm

Hi Radoulov,

Will this command display only the changed data? Can you please explain?
When I run it, it displays the same file with all its data.
Thanks

radoulov:

awk '
# If NR==FNR this is the first file, so get rid of 
#+ the "(",")","-"," " characters ("gsub" is global substitution),
#+ and populate the x array: x[$0].
NR==FNR{gsub(/[ \(\)-]/,"");x[$0];next}
# Otherwise, it's the second file, so
#+ remove the spaces. Now we have
#+ the right formatting.
{gsub(/ /,"")}
# If the current record is not
#+ previously stored in the x array,
#+ print it (default action).
!($0 in x)' file1 file2

AnOTHER awk student

radoulov · January 29, 2007, 5:20pm

Because the input data I was reading while writing the script
was different (the post was modified;
ghostdog74's post is showing the original sample).
Try this:

awk 'NR==FNR{ gsub(/[ \(\)-][A-Z]*/,"");x[$0];next}
{gsub(/ /,"")}!($0 in x)' file1 file2
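
To see what the extra [A-Z]* does, it may help to run each gsub on a single sample row. This is a sketch using rows from the first post. The bracket expression is written without backslashes here, which are not needed inside a bracket expression.

```shell
# A separator character followed by any run of capitals is removed,
# so " JAMES" disappears together with the spaces, parentheses and dash.
first=$(printf '%s\n' '1023,SMITH JAMES , (203) 789-1249' |
        awk '{ gsub(/[ ()-][A-Z]*/, ""); print }')

# The second file only needs its spaces removed.
second=$(printf '%s\n' '1023,SMITH ,2037891249' |
         awk '{ gsub(/ /, ""); print }')

printf '%s\n%s\n' "$first" "$second"
```

Both rows come out as 1023,SMITH,2037891249, so they match. Note that this compares only the first word of the name, and only words written in capitals are dropped.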

sbasetty · January 30, 2007, 6:22pm

thank you very much

sbasetty · February 6, 2007, 2:48pm

Hi Radoulov,

How can we restrict gsub so that it starts from a certain position?
Can we use substr in conjunction with gsub?

nawk 'NR==FNR{ gsub(/[ -]/,"");x[$0];next}
{gsub(/[ \(\-]/,"")}!($1 in x)' file11.csv file22.csv

This is replacing every "-" in the record; as a result,
a "-" in the first column is lost as well.

It displays 1023,JONES D,7037891249 for the example below.

FILE1:

1-023,SMITH JAMES, (203) 789-1249
10-23,HARRY POTTER, (213) 789-1249
1-023,JONES D, (903) 789-1249

FILE2:

1-023,SMITH JAMES,2037891249
10-23,HARRY POTTER, 2137891249
1-023,JONES D,7037891249

Output should be:

1-023,JONES D,7037891249

As the phone number is different.

Thanks a lot for your help

S

radoulov · February 6, 2007, 3:59pm


$ cat file1
1-023,SMITH JAMES, (203) 789-1249
10-23,HARRY POTTER, (213) 789-1249
1-023,JONES D, (903) 789-1249

$ cat file2
1-023,SMITH JAMES,2037891249
10-23,HARRY POTTER, 2137891249
1-023,JONES D,7037891249

$ nawk 'NR==FNR{gsub(/[ \(\)-]/,"",$3);x[$0];next}
> {sub(/ /,"",$3)}!($0 in x)'  OFS="," FS="," file1 file2
1-023,JONES D,7037891249
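
The key point is the third argument to gsub: with $3 as the target, only the phone column is changed and the "-" in the first column survives. The sketch below prints the normalized file1 records to make that visible. It is a variant, not the command above: it uses -F and -v instead of trailing assignments, applies the same gsub to both files, and works in a scratch directory, all of which are choices made for illustration. It assumes plain comma-separated fields with no quoted commas.

```shell
dir=$(mktemp -d)
printf '%s\n' '1-023,SMITH JAMES, (203) 789-1249' \
              '10-23,HARRY POTTER, (213) 789-1249' \
              '1-023,JONES D, (903) 789-1249' > "$dir/file1"
printf '%s\n' '1-023,SMITH JAMES,2037891249' \
              '10-23,HARRY POTTER, 2137891249' \
              '1-023,JONES D,7037891249' > "$dir/file2"

# Show the keys stored for file1: only field 3 is altered, "1-023" keeps its "-".
keys=$(awk -F, -v OFS=, '{ gsub(/[ ()-]/, "", $3); print }' "$dir/file1")
printf '%s\n' "$keys"

# Normalize field 3 the same way in both files,
# then print the file2 records that were not seen in file1.
changed=$(awk -F, -v OFS=, '
    { gsub(/[ ()-]/, "", $3) }
    NR == FNR { x[$0]; next }
    !($0 in x)
' "$dir/file1" "$dir/file2")
printf '%s\n' "$changed"
```

Setting OFS to a comma matters: when a field is modified, awk rebuilds the record with OFS, and the default (a space) would make the rebuilt records differ from the untouched ones.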

sbasetty · February 6, 2007, 4:38pm

It worked like magic

Thanks a lot

sbasetty · February 6, 2007, 4:47pm

Can you please explain what you are doing in the "sub" to compare?

tayyabq8 · February 7, 2007, 1:32am

sub is not there for comparison. It substitutes text, in the form sub(/MatchPattern/, "Replacement", target). In this case sub(/ /,"",$3) replaces the first space in the third column with "", which removes it. gsub works the same way but replaces every match: gsub(/[ \(\)-]/,"",$3) matches a space, a "(", a ")" or a "-" in the third column and replaces each one with "" (the empty string), which removes it. The actual comparison is done through the array, as Radoulov described earlier.

Regards,
Tayyab

radoulov · February 7, 2007, 4:24am

Check the manual for the differences between sub and gsub.
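
As a quick illustration of that difference, here is a small sketch on a made-up input line. sub replaces only the first match, while gsub replaces every match and returns how many replacements it made.

```shell
line='a b c d'

# sub: only the first space is removed.
with_sub=$(printf '%s\n' "$line" | awk '{ sub(/ /, ""); print }')

# gsub: every space is removed; n holds the number of replacements.
with_gsub=$(printf '%s\n' "$line" | awk '{ n = gsub(/ /, ""); print n ":" $0 }')

printf '%s\n%s\n' "$with_sub" "$with_gsub"
```

This prints "ab c d" for sub and "3:abcd" for gsub. In the script above, sub is enough for the second file because its third column has at most one leading space.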