Difficulty cleaning references to duplicated images in HTML code

Hi,

I need to search and replace references to duplicated images in HTML code. There are several groups of duplicated images, which are visually the same, but with different filenames. I managed to find the duplicated files themselves, but now I need to clean the code too. I have a CSV file with each group of duplicated images organized:

Group ID,Duplicated image filename,Number of duplicates
0,13429.png,3
0,18064.png,3
0,25025.png,3
1,14136.png,4
1,17382.png,4
1,19243.png,4
1,25389.png,4
2,21560.png,2
2,5529.png,2
3,3523.png,2
3,4811.png,2

and so on...

The references to duplicated images are scattered throughout hundreds of HTML files. The task is to make every <img> tag that references a duplicate point to just one unique image in each group. I'm wondering if some script magic could get it done easily.

HTML (before): different files, same visual appearance

<!-- group 0 -->
<img src="13429.png" />...text...<img src="18064.png" />...text...<img src="18064.png" />

<!-- group 1 -->
<img src="14136.png" />...text...<img src="17382.png" />...text...<img src="19243.png" />...text...<img src="25389.png" />

<!-- group 2 -->
<img src="21560.png" />...text...<img src="5529.png" />

HTML (after): unique file in each group

<!-- group 0 -->
<img src="13429.png" />...text...<img src="13429.png" />...text...<img src="13429.png" />

<!-- group 1 -->
<img src="14136.png" />...text...<img src="14136.png" />...text...<img src="14136.png" />...text...<img src="14136.png" />

<!-- group 2 -->
<img src="21560.png" />...text...<img src="21560.png" />

I searched for some solutions here in the forum, with no success.

Any help you can give would be greatly appreciated.

Not sure I understand what you want to accomplish. Can I paraphrase it like so: in all selected files, replace every occurrence of the second and following members of a group with the first, i.e. 18064.png, 25025.png with 13429.png; 17382.png, 19243.png, 25389.png with 14136.png and so on?

@RudiC: Yes, that's correct. Sorry if I wasn't very clear.

OK, try this very crude approach, which may need serious polishing:

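awk -F, 'NR==FNR {Ar[$1]=Ar[$1](Ar[$1]?"|":"")$2;   # CSV: collect each group as name1|name2|...
                  if (!Rr[$1])Rr[$1]=$2; next}      # CSV: remember the first name of each group
         {for (i in Ar) gsub (Ar[i], Rr[i])}        # HTML: replace any group member by the first
         1
        ' file file1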
<!-- group 0 -->
<img src="13429.png" />...text...<img src="13429.png" />...text...<img src="13429.png" />

<!-- group 1 -->
<img src="14136.png" />...text...<img src="14136.png" />...text...<img src="14136.png" />...text...<img src="14136.png" />

<!-- group 2 -->
<img src="21560.png" />...text...<img src="21560.png" />
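For readers who want to see the moving parts, here is a more defensive sketch of the same idea. It is not the script posted above: it skips the CSV header row, makes the dots in the filenames literal, and only matches a filename that is enclosed in double quotes, so that 5529.png does not also hit 15529.png. The file names groups.csv and page.html and the sample data are made up for the demonstration, and the sketch assumes the src attributes hold bare filenames without a directory prefix.

```shell
# Work in a scratch directory so nothing real is touched.
cd "$(mktemp -d)" || exit 1

# Hypothetical sample data, modelled on the CSV from the question.
cat > groups.csv <<'EOF'
Group ID,Duplicated image filename,Number of duplicates
0,13429.png,3
0,18064.png,3
0,25025.png,3
2,21560.png,2
2,5529.png,2
EOF

cat > page.html <<'EOF'
<img src="13429.png" />...text...<img src="18064.png" />
<img src="21560.png" />...text...<img src="5529.png" /><img src="15529.png" />
EOF

awk -F, '
    NR == FNR {                          # first file: the CSV
        if (FNR == 1) next               # skip the header row
        if (!($1 in keep)) {             # first name seen in a group is the one to keep
            keep[$1] = $2
            next
        }
        name = $2
        gsub(/\./, "[.]", name)          # make the dot literal inside the regex
        pat[$1] = pat[$1] (pat[$1] ? "|" : "") name
        next
    }
    {                                    # remaining files: the HTML
        for (g in pat)                   # replace "dup.png" by "kept.png", quotes included
            gsub("\"(" pat[g] ")\"", "\"" keep[g] "\"")
        print
    }
' groups.csv page.html > page.html.new

cat page.html.new
```

The quote anchoring is the main design choice here: it trades generality (paths such as images/5529.png would not be matched) for protection against partial filename matches.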

Thanks, that worked! Sorry for the newbie question, but how can I run it on more than one file at once?

You can, but how you do it depends on some other factors, like how the input files are collected or found, and whether the output should be concatenated or written to separate files.
If all files are in the same directory which is your working directory, this will do:

awk '...' file.csv *.html

If you have them in a file.txt, try

awk '...' file.csv $(cat file.txt)

(not sure if this is a UUOC, and there's a better way)
If you need the output separated, try replacing the singular 1 in line 4 by

{print > (FILENAME "new")}
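Put together, a run over several files with one output file per input might look like the sketch below. The sample files (file.csv without a header row, a.html, b.html) are invented for the demonstration. Storing the output name in a variable sidesteps any ambiguity in how different awk versions parse a concatenation after ">", and closing each output file when the next input starts is an added precaution against running out of file descriptors when there are hundreds of inputs.

```shell
# Work in a scratch directory with made-up sample files.
cd "$(mktemp -d)" || exit 1

printf '%s\n' '0,13429.png,3' '0,18064.png,3' > file.csv
printf '%s\n' '<img src="18064.png" />' > a.html
printf '%s\n' '<p>no duplicates here</p><img src="13429.png" />' > b.html

awk -F, '
    NR == FNR { Ar[$1] = Ar[$1] (Ar[$1] ? "|" : "") $2   # build name1|name2|... per group
                if (!Rr[$1]) Rr[$1] = $2                 # first name of the group is kept
                next }
    FNR == 1  { if (out) close(out)                      # done with the previous output file
                out = FILENAME "new" }                   # e.g. a.html -> a.htmlnew
              { for (i in Ar) gsub(Ar[i], Rr[i])
                print > out }
' file.csv a.html b.html

ls
```

Each input then has a sibling whose name ends in "new", and the originals are left untouched.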

Brilliant, RudiC, this is going to be extremely useful!

---------- Post updated 01-31-13 at 12:15 AM ---------- Previous update was 01-30-13 at 06:46 PM ----------

I managed to output the results in a new file with

{print >> "new"}

Is there a way to just overwrite the original files? It's necessary to replace them with the results anyway.

Files don't really work that way. It's also a big risk to overwrite your originals: a program bug would wipe out both your input and your output.

Corona688 has already said it: throw away your originals only after being 101% sure your results are what they are supposed to be.

Once you are indeed sure you want to replace your originals use "mv" to move the results over the originals:

find /some/path/to/start -type f -name "*new" -print | while read -r file ; do
     mv "$file" "${file%???}"     # strip the trailing "new" from the name
done

This moves all files whose names end in "new" to the same name less "new", i.e. "filenamenew" -> "filename".
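If you would rather keep a safety net, one possible variation is to set each original aside before moving the result over it. The sketch below is only an illustration: the directory name site, the sample files and the ".bak" suffix are assumptions, not something from this thread.

```shell
# Scratch directory with one made-up original and its processed copy.
cd "$(mktemp -d)" || exit 1
mkdir site
printf 'old\n'   > site/page.html
printf 'fixed\n' > site/page.htmlnew

find site -type f -name '*new' -print | while IFS= read -r new; do
    orig=${new%new}                  # "page.htmlnew" -> "page.html"
    mv "$orig" "$orig.bak" &&        # keep the original under a .bak name
    mv "$new" "$orig"                # then put the result in its place
done

ls site
```

Once the results have been checked, the ".bak" files can be deleted in a separate step.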

I hope this helps.

bakunin

@Corona688, @bakunin: Thanks for the clarification.