Remove duplicate lines after ignoring case and spaces between

kraljic · August 7, 2015, 5:16am

Oracle Linux 6.5

$ cat someStrings.txt
GRANT select on MANHPRD.S_PROD_INT TO OR_PHIL;
GRANT select on  MANHPRD.S_PROD_INT TO OR_PHIL;
GRANT select on SCOTT.emp to JOHN;
grant select on scott.emp to john;
grant select on scott.dept to hr;

If you ignore case and the amount of whitespace between words, there are only 3 distinct lines in the above file:

### Distinct output
GRANT select on MANHPRD.S_PROD_INT TO OR_PHIL;
GRANT select on SCOTT.emp to JOHN;
grant select on scott.dept to hr;

How can I remove the duplicate lines, ignoring case and whitespace differences, and get the distinct output shown above?

RudiC · August 7, 2015, 5:41am

Any attempts from your side?

Anyway, try:

awk '
{(gsub(/ +/," "))}
!T[toupper($0)]++
' file
GRANT select on MANHPRD.S_PROD_INT TO OR_PHIL;
GRANT select on SCOTT.emp to JOHN;
grant select on scott.dept to hr;

kraljic · August 7, 2015, 10:10am

Thank you very much, RudiC. Your command works (although I didn't understand anything in it).
I need to do some googling on the basics of awk.

It can also be put on one line, as shown below, right?

# awk '{(gsub(/ +/," "))}!T[toupper($0)]++' someStrings.txt
GRANT select on MANHPRD.S_PROD_INT TO OR_PHIL;
GRANT select on SCOTT.emp to JOHN;
grant select on scott.dept to hr;

Scrutinizer · August 7, 2015, 10:54am

Yes, and it can be shortened a little further, in a way that also handles TAB characters:

awk '{$1=$1}!A[toupper($0)]++' file
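
To see why the `$1=$1` form copes with TABs while the `gsub(/ +/," ")` form does not, here is a rough Python sketch of the two normalisations. The function names and the sample strings are made up for illustration, and Python's `split()` only approximates awk's default field splitting.

```python
import re

def squeeze_spaces(line):
    # Roughly gsub(/ +/," "): only runs of the space character are collapsed.
    return re.sub(r" +", " ", line)

def rebuild_fields(line):
    # Roughly $1=$1 with default FS/OFS: split on runs of blanks (spaces or
    # TABs), drop leading/trailing blanks, and rejoin with single spaces.
    return " ".join(line.split())

a = "grant select on scott.emp to john;"
b = "grant\tselect  on scott.emp to john;"   # hypothetical line containing a TAB

print(squeeze_spaces(a) == squeeze_spaces(b))   # the TAB survives, so the lines still differ
print(rebuild_fields(a) == rebuild_fields(b))   # both lines normalise to the same text
```

Note that rebuilding the fields also strips leading and trailing blanks, which the space-only substitution leaves in place.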

Aia · August 8, 2015, 8:16pm

This displays each line in its original form the first time it is encountered (the /r modifier needs Perl 5.14 or later):

perl -ne 'print unless $seen{uc(s/\s+//gr)}++' someStrings.txt
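
One difference worth knowing: this key deletes all whitespace rather than collapsing it, so two lines that differ only in whether a space is present at all are also treated as duplicates. A small Python sketch of the two kinds of key (the function names and the second sample line are invented for the example):

```python
import re

def key_strip_all(line):
    # Roughly uc(s/\s+//gr): delete every whitespace character, then upper-case.
    return re.sub(r"\s+", "", line).upper()

def key_collapse(line):
    # Alternative: collapse runs of whitespace to one space, then upper-case.
    return " ".join(line.split()).upper()

x = "grant select on scott.emp to john;"
y = "grantselect on scott.emp to john;"   # hypothetical line with a missing space

print(key_strip_all(x) == key_strip_all(y))   # same key: y would be dropped as a duplicate
print(key_collapse(x) == key_collapse(y))     # different keys: both lines would be kept
```

For the sample file in this thread both kinds of key give the same three lines.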

summer_cherry · August 10, 2015, 3:47am

seen = set()
with open("someStrings.txt") as f:
    for line in f:
        line = line.rstrip("\n")
        # Key: lower-cased line with every run of whitespace collapsed to one space
        key = " ".join(line.lower().split())
        if key not in seen:
            seen.add(key)
            print(line)  # print the first occurrence in its original form

RavinderSingh13 · August 10, 2015, 4:28am

Hello Kraljic,

Here is an explanation of the command posted by RudiC.

awk '
# gsub substitutes text: every run of one or more spaces is replaced by a
# single space, so lines that differ only in spacing (such as line 2 of the
# input) end up identical.
{(gsub(/ +/," "))}
# toupper converts the whole line to upper case. That upper-cased line is the
# index into an array named T, and T[...]++ counts how often it has been seen.
# The leading ! makes the condition true only while the count is still 0,
# that is, the first time a line occurs. awk works in condition/action pairs:
# when a condition is TRUE its action runs, and since no action is given
# here, the default action (printing the line) is used.
!T[toupper($0)]++
' file   # the input file name
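
If Python is easier to read, the same logic can be sketched roughly as follows. This is only an illustration of the idea, not a replacement for the awk command; the function name `dedupe` and the in-memory `sample` list are assumptions made for the example.

```python
import re
from collections import defaultdict

def dedupe(lines):
    """Mimic: awk '{gsub(/ +/," ")} !T[toupper($0)]++'."""
    count = defaultdict(int)                # plays the role of the awk array T
    out = []
    for line in lines:
        line = re.sub(r" +", " ", line)     # gsub(/ +/," "): squeeze runs of spaces
        key = line.upper()                  # toupper($0)
        first_time = count[key] == 0        # !T[key] is evaluated first ...
        count[key] += 1                     # ... then the post-increment applies
        if first_time:
            out.append(line)                # default awk action: print the line
    return out

sample = [
    "GRANT select on MANHPRD.S_PROD_INT TO OR_PHIL;",
    "GRANT select on  MANHPRD.S_PROD_INT TO OR_PHIL;",
    "GRANT select on SCOTT.emp to JOHN;",
    "grant select on scott.emp to john;",
    "grant select on scott.dept to hr;",
]

for line in dedupe(sample):
    print(line)
```

The key point is the order of operations: the count is tested before it is incremented, so only the first occurrence of each normalised line passes.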

Hope this helps.

Thanks,
R. Singh