Removing Duplicate Variables : SED?

Blue_Solo · October 14, 2011, 5:56pm

I have a file in which I need to index and remove all duplicate variables (keeping the first of each set of duplicates). Then, in another file, I need to use that index to find every removed variable and replace it with its primary variable (the one we kept). I think this is a sed question, but if there are better solutions I would greatly appreciate hearing them.

Example:

newmtl m2
Kd 1.000000 1.000000 1.000000;
Ka 0 0 0
illum 2
Ns 64
d 1.000000
map_Kd 951E01Other4040.bmp

newmtl m3
Kd 1.000000 1.000000 1.000000;
Ka 0 0 0
illum 2
Ns 64
d 1.000000
map_Kd 951E01Other4040.bmp

newmtl m4
Kd 1.000000 1.000000 1.000000;
Ka 0 0 0
illum 2
Ns 64
d 1.000000
map_Kd 951E01Other4040.bmp

newmtl m5
Kd 1.000000 1.000000 1.000000;
Ka 0 0 0
illum 2
Ns 64
d 1.000000
map_Kd 951E01Other4040.bmp

Notice how m2, m3, m4, and m5 are the same. I need to delete m3, m4, and m5 and keep m2. Then, in another file, I need to find all references to m3, m4, and m5 and replace them with m2, because they no longer exist.

newmtl mXXX is the name, and map_Kd is what determines whether it is a duplicate. In this case we have map_Kd 951E01Other4040.bmp; in others we have map_Kd A1C039w20h200x1CCOLORS.bmp, and there are over 20 other values for map_Kd.

Hope I explained that in enough understandable detail. Let me know your thoughts.
Thank you!

vgersh99 · October 14, 2011, 6:09pm

for starters:

$ nawk '!a[$NF]++' RS='' myFile
newmtl m2
Kd 1.000000 1.000000 1.000000;
Ka 0 0 0
illum 2
Ns 64
d 1.000000
map_Kd 951E01Other4040.bmp

Blue_Solo · October 14, 2011, 6:18pm

I am not familiar with this. How do I use this and what will it do?

vgersh99 · October 14, 2011, 6:23pm

Run it as posted and see the output (it should be similar to what's been posted).
If 'nawk' is not available, use either awk or gawk.
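
For anyone unsure what the one-liner does, here is a small self-contained sketch. It assumes blocks are separated by truly empty lines and that map_Kd is the last line of each block. The temp file and the sample data are made up for illustration, and it prints only the kept names ($2) rather than the whole blocks.

```shell
#!/bin/sh
# Hypothetical sample data: three materials, two of them sharing a texture.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
newmtl m2
Ns 64
map_Kd image.bmp

newmtl m3
Ns 64
map_Kd image.bmp

newmtl m4
Ns 64
map_Kd other.bmp
EOF

# RS='' turns on paragraph mode: each blank-line-separated block is one record.
# $NF is then the last field of the block, i.e. the texture name.
# a[$NF]++ is 0 (false) the first time a texture is seen, so !a[$NF]++ is
# true only for the first block with that texture.
kept=$(awk '!a[$NF]++ { print $2 }' RS='' "$tmp")
printf '%s\n' "$kept"
rm -f "$tmp"
```

Only the awk line needs to be typed at the prompt; the rest just builds a throwaway input file.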

Blue_Solo · October 14, 2011, 6:35pm

It is running each line as a command:

MacBook-Pro:~ user$ newmtl m2
-bash: newmtl: command not found
MacBook-Pro:~ user$ Kd 1.000000 1.000000 1.000000;
-bash: Kd: command not found
MacBook-Pro:~ user$ Ka 0 0 0
-bash: Ka: command not found
MacBook-Pro:~ user$ illum 2
-bash: illum: command not found
MacBook-Pro:~ user$ Ns 64
-bash: Ns: command not found
MacBook-Pro:~ user$ d 1.000000
-bash: d: command not found
MacBook-Pro:~ user$ map_Kd 951E01Other4040.bmp

vgersh99 · October 14, 2011, 7:31pm

Hmmmm... Show exactly what you're running and how you're running it, please!
Follow the directions posted previously!

Blue_Solo · October 14, 2011, 9:11pm

I opened terminal.app and pasted:

awk '!a[$NF]++' RS='' /Users/user/Desktop/LevelIndices2.mtl 
newmtl m2 
Kd 1.000000 1.000000 1.000000; 
Ka 0 0 0
illum 2
Ns 64 
d 1.000000 
map_Kd 951E01Other4040.bmp

vgersh99 · October 14, 2011, 10:13pm

Just copy/paste the awk line. The rest was provided as an illustration of the output for your sample input.

Blue_Solo · October 14, 2011, 11:01pm

I installed nawk and ran this in terminal:

nawk '!a[$NF]++' RS='' /Users/user/Desktop/LevelIndices2.mtl

It returned every line of text that was in the file LevelIndices2.mtl. Am I supposed to fill in NF or RS='' with something?

By the way thanks for trying to help so far!

UPDATE:
Now I see what is going on. I ran this:

nawk '!a[$NF]++' /Users/user/Desktop/LevelIndices2.mtl

And I turned up these results:

newmtl m2
map_Kd 951E01Other4040.bmp
newmtl m3
newmtl m4
newmtl m5
newmtl m6
newmtl m7
newmtl m8
newmtl m9
map_Kd 952C57w20h200xC6COLORS.bmp
newmtl m10
map_Kd A1C039w20h200x1CCOLORS.bmp
newmtl m11
newmtl m12
newmtl m13
newmtl m14
newmtl m15
map_Kd 946418w20h200x87COLORS.bmp
newmtl m16
newmtl m17

and so on throughout the whole document. So the first newmtl mXXX before a .bmp is the one that needs to be kept (m2, m9, m10, m15...).

- The next step from here is to keep those newmtl mXXX entries; the rest are the ones I need to delete.
- Then, in another file, every mXXX I deleted needs to be found and replaced by the one that was kept in this file.

UPDATE 2:
There is an imperfection: when a block with a different .bmp sits between two blocks that share the same .bmp, the later duplicate is not grouped with the first one.
Example:

newmtl m2
Kd 1.000000 1.000000 1.000000
Ka 0 0 0
illum 2
Ns 64
d 1.000000
map_Kd image.bmp

newmtl m3
Kd 1.000000 1.000000 1.000000
Ka 0 0 0
illum 2
Ns 64
d 1.000000
map_Kd anotherImage.bmp

newmtl m4
Kd 1.000000 1.000000 1.000000
Ka 0 0 0
illum 2
Ns 64
d 1.000000
map_Kd image.bmp

When the command is run it will show:
newmtl m2
map_Kd image.bmp
newmtl m3
map_Kd anotherImage.bmp
newmtl m4

Instead it should show:
newmtl m2
map_Kd image.bmp
newmtl m4
newmtl m3
map_Kd anotherImage.bmp

This is because there is a separate .bmp in between them.

Blue_Solo · October 15, 2011, 5:52pm

Is there a way to order all the newmtl mXXX based on map_Kd XXX.bmp so that they are all grouped together and right next to each other? This should fix the error with:
nawk '!a[$NF]++' /Users/user/Desktop/LevelIndices2.mtl

If it is any help I am attaching the files as .txt; Step1.txt is what should be done first, Step2.txt is what needs to be edited second.
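
One way to get that grouping, as a rough sketch: read the file block by block and collect the names under each texture, instead of deduplicating line by line. This assumes empty lines between blocks and map_Kd as the last line of each block. The sample data and variable names are invented for illustration.

```shell
#!/bin/sh
# Hypothetical sample mirroring the m2/m3/m4 example above.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
newmtl m2
d 1.000000
map_Kd image.bmp

newmtl m3
d 1.000000
map_Kd anotherImage.bmp

newmtl m4
d 1.000000
map_Kd image.bmp
EOF

# Paragraph mode (RS=''): one record per material block.
# names[tex] collects every material that uses texture tex;
# order[] remembers textures in first-seen order so the output is stable.
groups=$(awk '
$1 == "newmtl" {
    tex = $NF
    if (!(tex in names)) { order[++n] = tex; names[tex] = $2 }
    else                 { names[tex] = names[tex] " " $2 }
}
END {
    for (i = 1; i <= n; i++) print order[i] ": " names[order[i]]
}' RS='' "$tmp")
printf '%s\n' "$groups"
rm -f "$tmp"
```

The first name on each output line is the one to keep; the names after it are the duplicates that would be replaced.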

binlib · October 15, 2011, 8:27pm

According to your original post, this may be what you wanted:

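awk '!RS {
  # First file (step1): RS is empty, so each blank-line-separated block is one record.
  if ($1 != "newmtl") next
  if ($NF in p) {
    # Duplicate texture: map this name to the one kept earlier.
    d["\\<" $2 "\\>"] = p[$NF] #gawk
    #d[$2] = p[$NF] # too loose
  } else {
    # First block with this texture: remember its name and keep the block.
    p[$NF] = $2
    print $0 "\n" > "step1.new"
  }
  next
}
{
  # Second file (step2): RS is a newline, so rewrite each line.
  for (i in d) gsub(i, d[i])
  print
}' RS='' step1 RS='\n' step2 > step2.new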

By the way, the files you attached were DOS files. You may need to convert them into Unix files.
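
To illustrate why the line endings matter, here is a minimal sketch using a made-up two-block file and `tr` to strip the carriage returns (`dos2unix` would also work if it is installed). With CRLF endings the separator line is not really empty, so paragraph mode sees the whole file as one record. That may also be why the RS='' version earlier in the thread printed the entire file.

```shell
#!/bin/sh
# Hypothetical two-block file written with DOS (CRLF) line endings.
dos=$(mktemp)
unix=$(mktemp)
printf 'newmtl m2\r\nmap_Kd image.bmp\r\n\r\nnewmtl m3\r\nmap_Kd image.bmp\r\n' > "$dos"

# With CRLF, the "blank" separator line still holds a carriage return,
# so paragraph mode never sees an empty line and reads one big record.
dos_records=$(awk 'END { print NR }' RS='' "$dos")

# Strip the carriage returns, then count the records again.
tr -d '\r' < "$dos" > "$unix"
unix_records=$(awk 'END { print NR }' RS='' "$unix")

printf 'records before: %s, after: %s\n' "$dos_records" "$unix_records"
rm -f "$dos" "$unix"
```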

Blue_Solo · October 15, 2011, 9:03pm

Oh wow, thanks! The tip about the DOS to Unix was great.
So that deleted all the duplicates, but it did not rename the duplicates in step2 to the ones that were kept.

Thanks so far!

binlib · October 16, 2011, 10:51am

I guess you are not using gawk. The problem with the replacement is that you can't just replace m2 with m1, for example, because it will replace m20 with m10. Since your step2 file always has the names as the last field, you can change

d["\\<" $2 "\\>"] = p[$NF] #gawk
to
d[" " $2 "$" ] = " " p[$NF]

for other awks that don't have \< and \> for beginning/end of word.
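
A small sketch of the difference, with the mapping hard-coded for illustration. It assumes m3 was removed as a duplicate of m2. The `usemtl` lines are only a guess at what step2 contains; the point is that the name is the last field on the line.

```shell
#!/bin/sh
# Hypothetical step2 content; "usemtl" lines are an assumption about its format.
step2=$(mktemp)
cat > "$step2" <<'EOF'
usemtl m2
usemtl m3
usemtl m30
EOF

# The key " m3$" only matches the name when it is the whole last field,
# so "m30" is left alone; the naive key "m3" would turn it into "m20".
safe=$(awk 'BEGIN { d[" m3$"] = " m2" }
{ for (k in d) gsub(k, d[k]); print }' "$step2")
naive=$(awk '{ gsub("m3", "m2"); print }' "$step2")

printf 'anchored:\n%s\n\nnaive:\n%s\n' "$safe" "$naive"
rm -f "$step2"
```

Note that the anchored form relies on there being nothing after the name, so trailing spaces or leftover carriage returns would stop it from matching.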

Blue_Solo · October 16, 2011, 5:49pm

Thank you so much! It finally works!