Formatting a file - Remove Duplicate

freakygs · June 7, 2011, 1:24am

Hi, I have a file in the following format. Basically, the file contains table names and their aliases:

TABLE1
TABLE1 A
TABLE2
TABLE2 B
TABLE3
TABLE4
TABLE4 C
TABLE4

I get this output when formatting an SQL statement.

Problem: Whenever a table name appears with an alias, it has two entries in the file: one without the alias and one with it. I want to remove the first occurrence (the one without the alias). However, some table names appear without any alias at all, and I don't want to delete those. The same table can also be repeated, once with an alias and once without.

Solution: Basically, I just want to delete any line that immediately precedes the same table name with an alias, so the output should be:

TABLE1 A
TABLE2 B
TABLE3
TABLE4 C
TABLE4

Hope I am clear enough.
Your help would be much appreciated. Thanks.

ni2 · June 7, 2011, 1:59am

Try sort -u

shell> less file
TABLE1
TABLE1 A
TABLE2
TABLE2 B
TABLE3
TABLE4
TABLE4 C
TABLE4

shell> sort -u file

TABLE1
TABLE1 A
TABLE2
TABLE2 B
TABLE3
TABLE4
TABLE4 C

I noticed that in your output you dropped the bare TABLE1 line but kept a bare TABLE4 line. Was that intentional?

itkamaraj · June 7, 2011, 2:28am

try this.. not tested..

awk '{ while ((getline nxt) > 0) { if (index(nxt, $0 " ") != 1) print; $0 = nxt } print }' filename

freakygs · June 7, 2011, 2:46am

Yes. The second reference to TABLE4 has no alias, so that entry shouldn't be deleted.

michaelrozar17 · June 7, 2011, 2:49am

try sed..

  sed '$!N; /^\(.*\)\n\1 /D; P; D' inputfile

ni2 · June 7, 2011, 3:23am

Try the sed example michaelrozar17 posted. Looks like it meets your requirement.
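
For anyone checking a candidate answer against the sample data, the rule from the question (drop a bare table name only when the very next line is the same name followed by an alias) can also be sketched in awk by holding each line back by one. This is a minimal sketch, not a tested production script: the function name drop_bare_before_alias is hypothetical, and it assumes the alias is separated from the table name by a single space, as in the sample.

```shell
#!/bin/sh
# drop_bare_before_alias is a hypothetical helper name, not from the thread.
# It reads lines on stdin and drops a line only when the very next line
# is the same text followed by a space (assumed alias separator).
drop_bare_before_alias() {
    awk '
        # Hold each line back by one so it can be compared with its successor.
        NR > 1 && index($0, prev " ") != 1 { print prev }
        { prev = $0 }
        # The last line has no successor, so it is always kept.
        END { if (NR > 0) print prev }
    '
}

# Sample input from the question.
printf '%s\n' TABLE1 'TABLE1 A' TABLE2 'TABLE2 B' TABLE3 TABLE4 'TABLE4 C' TABLE4 |
    drop_bare_before_alias
```

Because only adjacent lines are compared, the original order is preserved and a repeated bare entry such as the final TABLE4 survives, which sort -u alone would not guarantee.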