Matrix with Percentage

Hi ALL,

I have below example

INPUT 1 (i/p 1)|INPUT 2 (i/p 2)|OUTPUT (o/p)
Bharat Bazar|Bharat Bazar|True Positive
Binny's Sales|<BLANK>|False Negative
<BLANK>|Binny's|False Positive
<BLANK>|<BLANK>|True Negative
Bharat bazar|Bharat|True Positive
binny's|binny|True Positive

[/CODE]

where in the o/p should be either of the 4 based on the scenarios on first 2 columns which is self explainatory.

If the column 1 sub string is present in column 2 then it should be True Positive.

After all of these, It should give me the Percentage of all the scenarios based on the total number of records

My first 2 columns i.e. i/p1 and i/p2 are the inputs and the o/p is the result i'm generating based on i/p1 and i/p2.

i/p can be any of the strings as well as blank, similarly i/p2 can be string, sub string of i/p1 or blank.

o/p column should be generate by us. based on i/p1 and i/p2

<BLANK> is just a blank space, I gave it for meaningful representation, | is the delimiter used.

Is this a homework assignment? Homework and coursework questions can only be posted in this forum in a particular format described in special homework rules.

If this is homework, please repost it in the Homework & Coursework forum in the required format.

If not, please show us a sample of your input. Does it come from a file, is data manually entered, or what is the source? What is the format of the input data? Is it a CSV file using the pipe symbol as the field separator or is it some other format? Is there a heading line in the input?

Exactly what output you are trying to produce from your sample intermediate output (with the percentages you said you want, but did not describe).

What is your definition of a "blank space"? Is it an empty string? Is it a single <space> character? Is it a single character in the current locale's blank character class? Is it a string of one or more <space> characters? Is it a string of one or more characters in the current locale's blank character class? Is it a string of zero or more <space> characters? Is it a string of zero or more characters in the current locale's blank character class?

Please explain the logic used to determine when the output should be TRUE or FALSE and the logic used to determine when the output should be Positive or Negative.

If blank space is an empty string or a single <space>, why isn't the line in your intermediate output:

<BLANK>|<BLANK>|True Negative

be:

<BLANK>|<BLANK>|True Positive

instead. Shouldn't identical strings be Positive???

If blank space is a single <space> character, why isn't the line your intermediate output:

Binny's Sales|<BLANK>|False Negative

be:

Binny's Sales|<BLANK>|True Positive

instead. Shouldn't the <space> in the second field be considered a substring of the 1st field (matching the space in the middle of the 1st field?

What operating system and shell are you using?

What have you tried to solve this problem on your own?

1 Like

Don,

It's not home work assignment.

It is a requirement my company is having.

Lemme be more clear with my i/p and o/p

This is the i/p to my script

INPUT 1 (i/p 1)|INPUT 2 (i/p 2)
Bharat Bazar|Bharat Bazar
Binny's Sales| 
 |Binny's
 | 
Bharat bazar|Bharat
binny's|binny

[/CODE]

I'm expecting my o/p to be as shown below

INPUT 1 (i/p 1)|INPUT 2 (i/p 2)|OUTPUT (o/p)
Bharat Bazar|Bharat Bazar|True Positive
Binny's Sales| |False Negative
 |Binny's|False Positive
 | |True Negative
Bharat bazar|Bharat|True Positive
binny's|binny|True Positive

[/CODE]

As informed earlier, <BLANK> is just a empty space, I gave it initially. Now i hv replace with just space.

I have written below code for percentage calcualations, but m stuck with the logic for col1 and col2 with the scenarios below

  1. if col1 string is similar col2 string or even the substring then it shul be TRUE POSITIVE
  2. if col1 string is present and col2 string is not then it shul be false negative
  3. if col1 string is not present and col2 string is present then it shul be false positive
  4. if col1 string and col2 string not present then true negative

My 3rd o/p column shul be generated based on the i/p of col1 and col2 as shown above

My percentage calculation code is as shown below

 for (i=1 ; i<cc; i++) {
      p=(c/NR)*100;
      r=(i == 1) ? "" p : r OFS p;
   }
   print r

[/CODE]

I said:

but you decided not to show us what output you want for percentages. The code that you say produces the percentages you want is not at all helpful since the code depends on three variables ( cc , the array c , and the variable r ) none of which have defined values that you have shown us. If this is awk code and these variables haven't been otherwise defined, this code goes through the for loop zero times and prints an empty line. I assume that you do not want an empty line, but I still have no idea what output you do want (other than that your percentages are to be calculated to two decimal places and that you want to include the line containing the headings in your input file in your calculations).

As I said before, there is not such thing as an empty space. From your latest sample input, we see that your use of the term empty space should be treated as a synonym for single space character, but if that is the case, you still need to explain why a single space in field two is not a subset of field 1 in the line:

Binny's Sales| 

where the 8th character in field 1 is a single space character and field 2 is a single space character.

And, you still have not defined what string is similar means in your statement:

Please show us the EXACT output you are hoping to get from your script (including the percentages).

Please explain how we determine whether or not two strings are similar.

Please explain why a field that is a single space character is not a subset of a field containing a space character>.

And, please explain why two non-empty strings (each containing a single space) that are identical are classified under your rule #4 instead of being classified under your rule #1.

Don,

Sorry for the confusion created.

INPUT 1 (i/p 1)|INPUT 2 (i/p 2)|OUTPUT (o/p)
Bharat Bazar|Bharat Bazar|True Positive
Binny's Sales| |False Negative
 |Binny's|False Positive
 | |True Negative
Bharat bazar|Bharat|True Positive
binny's|binny|True Positive

[/CODE]

If col1's any string is present in col2 string then it is true positive

as in example above ,bharat is present in bharat bazar hence it is True Positive similarly binny is present in binny's hence true positive.

Plz ignore my percentage program for time being, If possible can u plz incorporate the % program along with the same.

I urgently encourage you to think twice before posting your original problems and formulate with great care, supplying meaningful details. (Which, BTW, you have been asked before, answering:

Step back for a second, forget all the context that you seem to imply, and look at your post#1 specification like the forum members do, not having any background on your problem. Do you think anybody could precisely understand what you request?

Rudi,

I'm sorry for this, Can you plz help ?

My code is as shown below

awk -F "|" '
BEGIN{IGNORECASE=1}
{ if ($2 == $3 && $2 != "" && $3 != "" ) { print $2 "|" $3 "|TRUE POSITIVE"; }
   else if ($3 == "" && $2 != "" && $2 != $3 ){print $2"|"$3"|FALSE NEGATIVE";}
   else if ($2 == "" && $3 != "" ){print $2"|""|FALSE POSITIVE";}
   else if ($2 == "" && $3 == "") { print $2 "|" $3"|TRUE NEGATIVE"; }
   else {print $2 "|" $3 "|TRUE POSITIVE";}
}' $1

[/CODE]

When i Execute my above script , I get below o/p which is incorrect.

Because state is not at all substring of country, If they are different completely it should be false positive

INPUT 1 (i/p 1)|INPUT 2 (i/p 2)|TRUE POSITIVE
Bharat Bazar|Bharat Bazar|TRUE POSITIVE
Binny's Sales| |TRUE POSITIVE
|Binny's|FALSE POSITIVE
||TRUE NEGATIVE
Bharat bazar|Bharat|TRUE POSITIVE
binny's|binny|TRUE POSITIVE
state|country|TRUE POSITIVE

[/CODE]

Let me know if there is any other clarrifications required.

OK. Seeing your code and the output it produces makes it very clear that none of the sample input files you have shown us in the past in this thread were used as input to this script. You have never said that your input files contain an additional field at the start of each line that is to be ignored by your script. Your descriptions up until now have said that you are looking at the stings in the first two fields in your input, but your script never looks at the first field in your input file; it only looks at the second and third fields in your input.

You have now shown us your script and the output it produces. Thank you for both.

Now, PLEASE show us three more things:

  1. the input that was given to your script to produce the output you supplied in post #7 in this thread,
  2. the output that you WANT your script to produce (including the percentages that you want to be produced), and
  3. a clear description of what determines whether or not the string in field three of your input file is "similar" to the string in field two of your input file. (Is the requirement that the string in field 3 is a substring of field 2 where the comparison performed is case insensitive?)
INPUT 1 (i/p 1)|INPUT 2 (i/p 2)
Bharat Bazar|Bharat Bazar
Binny's Sales| 
|Binny's
|
Bharat bazar|Bharat
binny's|binny
state|country

[/CODE]

Above one is my input.

o/p i want

INPUT 1 (i/p 1)|INPUT 2 (i/p 2)
Bharat Bazar|Bharat Bazar|TRUE POSITIVE
Binny's Sales| |FALSE NEGATIVE
|Binny's|FALSE POSITIVE
||TRUE NEGATIVE
Bharat bazar|Bharat|TRUE POSITIVE
binny's|binny|TRUE POSITIVE
state|country|FALSE NEGATIVE

[/CODE]

Don ,

my filed 3 should be sub string of field 2 and vice-versa, it should be case insensitive

Plz let me know of any more queries which needs to be answered, will be glad to do the same.

This is becomig ridiculous. Your script from post#7 applied to sample in post#9 yields

||FALSE NEGATIVE
||FALSE NEGATIVE
||FALSE NEGATIVE
||FALSE NEGATIVE
||FALSE NEGATIVE
||FALSE NEGATIVE
||FALSE NEGATIVE
||FALSE NEGATIVE

I'm not prepared to continue following this incredibly incomplete thread.

Rudi,

PFB the stuff, I have copy pasted my whole project, may be when u copy pasted with extra space or something

[sdp@blr-qe101 TDE]$ ./confusionmatrix1.sh test.txt 
INPUT 1 (i/p 1)|INPUT 2 (i/p 2)|TRUE POSITIVE
Bharat Bazar|Bharat Bazar|TRUE POSITIVE
Binny's Sales| |TRUE POSITIVE
|Binny's|FALSE POSITIVE
||TRUE NEGATIVE
Bharat bazar|Bharat|TRUE POSITIVE
binny's|binny|TRUE POSITIVE
state|country|TRUE POSITIVE
[sdp@blr-qe101 TDE]$ 
[sdp@blr-qe101 TDE]$ 
[sdp@blr-qe101 TDE]$ cat confusionmatrix1.sh 
awk -F "|" '
BEGIN{IGNORECASE=1} 
{ if ($1 == $2 && $1 != "" && $2 != "" ) { print $1 "|" $2 "|TRUE POSITIVE"; } 
   else if ($2 == "" && $1 != "") {print $1"|"$2"|FALSE NEGATIVE";} 
   else if ($1 == "" && $2 != "" ){print $1"|"$2"|FALSE POSITIVE";} 
   else if ($1 == "" && $2 == "") { print $1 "|" $2"|TRUE NEGATIVE"; } 
   else {print $1 "|" $2 "|TRUE POSITIVE";}
 
}' $1
[sdp@blr-qe101 TDE]$ cat test.txt 
INPUT 1 (i/p 1)|INPUT 2 (i/p 2)
Bharat Bazar|Bharat Bazar
Binny's Sales| 
|Binny's
|
Bharat bazar|Bharat
binny's|binny
state|country

[/CODE]

If you look at the code in your post #11 (which is included above) and compare it to the code you showed us in your post #7, you might note that (except for the lines:

awk -F "|" '
BEGIN{IGNORECASE=1} 
}' $1

(i.e., the 1st two lines and the last line of your code), EVERYTHING is different. The code above references fields 1 and 2, while the code in post #7 references fields 2 and 3. The differences in YOUR code between posts #7 and #11 are not differences in spacing. They are differences in YOUR code.

When you originally posted your data, empty fields were shown as the literal text <BLANK> , later posts of your data change <BLANK> to a single <space> character, and your latest posts have used empty fields. The following code assumes that they are empty fields, but I have no reason to believe that that assumption is correct. (That assumption just makes more sense in my view of what might be reasonable.)

The title of this thread is "Matrix with Percentage" and you have asked for us to provide code that computes percentages in three of your posts. You have been asked to show us sample output containing percentages as you want them to be displayed three times, but you still have not shown us any example of what you want. You did show us sample code. And, if that code were added to the code you have shown us, it would not produce any output at all. If you refuse to clearly explain what you want and you refuse to show us any example of what you want, how do you expect us to be able to produce code that will produce anything remotely similar to what you want?

Some of your sample output uses mixed case in the 3rd output field (i.e., True Positive , True Negative , False Positive , and False Negative . Your descriptions of the output that should be produce sometimes use all lowercase output, sometimes use mixed case output, and sometimes use all uppercase output. The following code uses all uppercase since that is what is used in your latest sample outputs and in all of the code you have shown us.

The awk that I'm using doesn't have an IGNORECASE variable that your code depends upon (although, if I understand what you're trying to do, case insensitivity would not matter with any of your sample data). The code below only uses features of awk that are required by the POSIX standards.

The descriptions you have supplied for what should appear in the 3rd output field still are not clear. The following code assumes that what you want(except for the header line in the output) is two words of output in the 3rd field where the 1st word is "TRUE" if the 2nd input field is equal to the 1st input field or if the 2nd input field is a non-empty substring of the 1st input field (when performing a case insensitive match); otherwise it is the word "FALSE". And the 2nd word of the 3rd output field is to be "POSITIVE" if the 2nd input field is not an empty string, or "NEGATIVE" if it is an empty string. For the output header line, the following code assumes that you want the output shown in your post #1 (again, because that kind of made sense to me when the output header in most of your other posts did not).

So, even though this code produces output that is not at all similar to the output you said you wanted in post #9 in this thread, maybe the following code with give you something you can work with to get what you want:

awk '
FNR == 1 {
	FS = OFS = "|"
	print $0, "OUTPUT (o/p)"
	next
}
{	
	f3 = (($1 == $2 || ($2 != "" && index(tolower($1), tolower($2)))) ? \
		"TRUE " : "FALSE ")
	f3 = f3 ($2 == "" ? "NEGATIVE" : "POSITIVE")
	print $0, f3
}' "$1"

which, when given the name of your sample input file you provided in post #11 in this thread as a command-line argument, produces the output:

INPUT 1 (i/p 1)|INPUT 2 (i/p 2)|OUTPUT (o/p)
Bharat Bazar|Bharat Bazar|TRUE POSITIVE
Binny's Sales| |TRUE POSITIVE
|Binny's|FALSE POSITIVE
||TRUE NEGATIVE
Bharat bazar|Bharat|TRUE POSITIVE
binny's|binny|TRUE POSITIVE
state|country|FALSE POSITIVE

I believe the output here is correct even though it differs from the output you said you wanted for the following reasons:

  • Line 1 includes a title for the 3rd output field (as shown in some of your requested output, but not in post #9).
  • Line 3 reports TRUE POSITIVE (instead of FALSE NEGATIVE because the <space> character in field 2 on that line is a substring of field 1 on that line AND field 2 is not an empty string.
  • And the last line of the output reports FALSE POSITIVE instead of FALSE NEGATIVE because the string country is not an empty string and is not a substring of the string state .

You have also been asked what operating system and shell you're using, but you haven't answered that question either. If you are trying to run this on a Solaris/SunOS system, change awk in the above script to /usr/xpg4/bin/aw or nawk .

Don,

I tried on my whole file after executing your script, It is not able to replace true positive in every place properly, one of the small examples shown below.
after executing your script

i/p

TRADER JOESE|Trader Joe's 
TRADER JOESE|Trader Joe's 
TARGET CORPORATION|Target 
IN-N-OUT-BURGER|In-N-Out Burger 
WALMART|Vudu 
ROSS STORES, INC|Ross Stores Inc. 
CHIPOTLE MEXICAN GRILL, INC|Chipotle 
SPORTS AUTHORITY|The Sports Authority 

[/CODE]

TRADER JOESE|Trader Joe's|FALSE POSITIVE 
TRADER JOESE|Trader Joe's|FALSE POSITIVE 
TARGET CORPORATION|Target|TRUE POSITIVE 
IN-N-OUT-BURGER|In-N-Out Burger|FALSE POSITIVE 
WALMART|Vudu|FALSE POSITIVE 
ROSS STORES, INC|Ross Stores Inc.|FALSE POSITIVE 
CHIPOTLE MEXICAN GRILL, INC|Chipotle|TRUE POSITIVE 
SPORTS AUTHORITY|The Sports Authority|FALSE POSITIVE

[/CODE]

trader joese does match with trader joe's but it shows FALSE POSITIVE not the TRUE POSITIVE as in AMAZON shown below

AMAZON.COM, INC|Amazon.com|TRUE POSITIVE
AMAZON.COM, INC|Amazon.com|TRUE POSITIVE
AMAZON.COM, INC|Amazon.com|TRUE POSITIVE
AMAZON.COM, INC|Amazon.com|TRUE POSITIVE

[/CODE]

I apologize for wasting your time trying to help you.

Where in the string TRADER JOESE do you find the apostrophe (or single-quote) that is in Trader Joe's ? There is no match for that character, so the field 2 value can't possibly be a substring of the field 1 value.

Similarly for:

IN-N-OUT-BURGER|In-N-Out Burger|

the <space> before Burger in the 2nd field is not present in the 1st field; so it is not a substring of the 1st field. A <space> character is not a match for a <hyphen> character.

None of the lines you have shown in post #13 meet the criteria you specified except the two lines that show the 3rd output field being TRUE POSITIVE (i.e., they do not have strings in field 2 that are substrings of field 1 with a case insensitive match).

If you want to change your requirements again as to what constitutes a match, YOU HAVE to ACTUALLY state your exact requirements instead of assuming that we can read your mind and guess what all of your unstated requirements might be.

Don,

I apologise for that, I just gave the part of my file as sample file, My whole file is of around 2 GB, This was quite unusual, dint expected.
Can you plz help with this?

My code is working perfectly for the criteria you gave. I have repeatedly asked you to clearly state the criteria that specifies when field 1 is similar to field 2. You have repeatedly ignored that request. I clearly stated the assumptions I used to determine when my code would consider field 1 similar to field 2. Saying that my code produces the wrong results without clearly explaining what rules need to be followed to get the answer that you will consider correct doesn't tell us what we have to do to satisfy you.

Don,

To tell u frankly even i was not aware of the same until i ran your script that it won't work on that part of data.
Can you plz check and help on the same ?

My file is of huge size and it is impossible to share that huge data here. I really apologise for all the chaos caused.

Nikhil,
As I have repeatedly stated. We don't need to see all of your data. (Even if we did see all of your data, we apparently have no idea what the results are supposed to be.) There is nothing for me to check. The code I supplied is working exactly as I intended it to work based on your varying descriptions of the results you want. If you won't describe what needs to be done differently (instead of just repeating that my code is not working), I have nothing to check and I cannot help you.

Unless you post precise directions that we can follow that accurately describe what steps need to be taken to get the results you want, there is no possible way for any of us to help you. I could make some more wild guesses that would change the results that you say are wrong, but doing so would change results in previous posts that you seem to believe are correct as they now stand.

PLEASE post precise directions that we can follow to accurately determine what results you want to be produced based on the contents of input fields 1 and 2! If you haven't done that in the next 72 hours, this thread will be closed. (I am allowing 72 hours instead of 24 since we are rapidly approaching a weekend.)

Moderator comments were removed during original forum migration.