Difficulty cleaning references to duplicated images in HTML code


I need to search and replace references to duplicated images in HTML code. There are several groups of duplicated images, which are visually the same, but with different filenames. I managed to find the duplicated files themselves, but now I need to clean the code too. I have a CSV file with each group of duplicated images organized:

Group ID,Duplicated image filename, Number of duplicates

and so on...

The references to duplicated images are scattered throughout hundreds of HTML files. The task is to make every <img> tag that references a duplicate point to just one unique image in each group. I'm wondering if some script magic could get it done easily.

HTML (before): different files, same visual appearance

<!-- group 0 -->
<img src="13429.png" />...text...<img src="18064.png" />...text...<img src="18064.png" />

<!-- group 1 -->
<img src="14136.png" />...text...<img src="17382.png" />...text...<img src="19243.png" />...text...<img src="25389.png" />

<!-- group 2 -->
<img src="21560.png" />...text...<img src="5529.png" />

HTML (after): unique file in each group

<!-- group 0 -->
<img src="13429.png" />...text...<img src="13429.png" />...text...<img src="13429.png" />

<!-- group 1 -->
<img src="14136.png" />...text...<img src="14136.png" />...text...<img src="14136.png" />...text...<img src="14136.png" />

<!-- group 2 -->
<img src="21560.png" />...text...<img src="21560.png" />

I searched for some solutions here in the forum, with no success.

Any help you can give would be greatly appreciated.

Not sure I understand what you want to accomplish. Can I paraphrase it like so: in all selected files, replace every occurrence of the second and following members of a group with the first, i.e. 18064.png, 25025.png with 13429.png; 17382.png, 19243.png, 25389.png with 14136.png, and so on?

@RudiC: Yes, that's correct. Sorry if I wasn't very clear.

OK, try this very crude approach, which may need serious polishing:

awk -F, 'NR==FNR {Ar[$1]=Ar[$1](Ar[$1]?"|":"")$2;
                  if (!Rr[$1])Rr[$1]=$2; next}
         {for (i in Ar) gsub (Ar[i], Rr[i])}
         1' file file1
<!-- group 0 -->
<img src="13429.png" />...text...<img src="13429.png" />...text...<img src="13429.png" />

<!-- group 1 -->
<img src="14136.png" />...text...<img src="14136.png" />...text...<img src="14136.png" />...text...<img src="14136.png" />

<!-- group 2 -->
<img src="21560.png" />...text...<img src="21560.png" />
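
One caveat with the regex approach, offered as a sketch rather than a tested fix for your data: the filenames are used as patterns, so 5529.png would also match inside a longer name such as 15529.png, and the dot matches any character. The variant below only rewrites exact src="..." values. It assumes the CSV has the group ID in field 1 and one filename per row in field 2, with no surrounding spaces. The names dupes.csv and page.html and the sample rows are made up for illustration.

```shell
#!/bin/sh
# Sketch: map every duplicate filename to the first filename of its group,
# then rewrite only exact src="..." values (filenames are never used as regexes).
# dupes.csv, page.html and their contents are made-up sample data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > dupes.csv <<'EOF'
0,13429.png,2
0,18064.png,2
2,21560.png,1
2,5529.png,1
EOF

cat > page.html <<'EOF'
<img src="13429.png" />...text...<img src="18064.png" />
<img src="21560.png" />...text...<img src="5529.png" />...text...<img src="15529.png" />
EOF

awk -F, '
NR == FNR {                              # first file: the CSV
    if (!($1 in keep)) keep[$1] = $2     # first name seen in a group is kept
    map[$2] = keep[$1]                   # every name maps to the kept one
    next
}
{                                        # remaining files: HTML
    out = ""
    rest = $0
    while (match(rest, /src="[^"]*"/)) {
        name = substr(rest, RSTART + 5, RLENGTH - 6)   # text between the quotes
        if (name in map) name = map[name]
        out = out substr(rest, 1, RSTART - 1) "src=\"" name "\""
        rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
}' dupes.csv page.html > page.fixed.html

cat page.fixed.html
```

Here 15529.png is not in the CSV, so it is left alone, while 5529.png is replaced by 21560.png.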

Thanks, that worked! Sorry for the newbie question, but how can I run it on more than one file at once?

You can, but how you do it depends on some other factors, like how you collect or find the input files, and whether the output should be concatenated or written to separate files.
If all files are in the same directory which is your working directory, this will do:

awk '...' file.csv *.html

If you have them in a file.txt, try

awk '...' file.csv $(cat file.txt)

(not sure if this is a UUOC, and there's a better way)
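
On the UUOC question, one possible alternative (a sketch, not the only way) is to let xargs read the list and append the names after the CSV. This assumes the names in file.txt contain no whitespace or quotes. The script file name fix.awk and the sample files are made up for illustration. If xargs has to split a very long list into several awk runs, each run still gets file.csv first.

```shell
#!/bin/sh
# Sketch: feed a list of HTML files from file.txt to awk via xargs.
# fix.awk, a.html, b.html and combined.out are hypothetical names.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' '0,13429.png,2' '0,18064.png,2' > file.csv
printf '%s\n' '<img src="18064.png" />' > a.html
printf '%s\n' '<img src="13429.png" /><img src="18064.png" />' > b.html
printf '%s\n' a.html b.html > file.txt

# The awk program from above, stored in a file so it can be passed with -f.
cat > fix.awk <<'EOF'
NR == FNR { Ar[$1] = Ar[$1] (Ar[$1] ? "|" : "") $2
            if (!Rr[$1]) Rr[$1] = $2
            next }
{ for (i in Ar) gsub(Ar[i], Rr[i]) }
1
EOF

# xargs appends the names read from file.txt after "file.csv".
xargs awk -F, -f fix.awk file.csv < file.txt > combined.out

cat combined.out
```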
If you need the output separated, try replacing the lone 1 in line 4 with

{print > (FILENAME "new")}
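
With a few hundred input files it may be worth closing each output file before opening the next, since some awk implementations limit how many files can be open at once. A sketch along those lines, using made-up sample files (file.csv, a.html, b.html) and the same "new" suffix:

```shell
#!/bin/sh
# Sketch: one output file per input file, closed as soon as the next one starts.
# All file names and contents here are sample data for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' '0,13429.png,2' '0,18064.png,2' > file.csv
printf '%s\n' '<img src="18064.png" />' > a.html
printf '%s\n' '<img src="13429.png" /><img src="18064.png" />' > b.html

awk -F, '
NR == FNR { Ar[$1] = Ar[$1] (Ar[$1] ? "|" : "") $2   # alternation of all names in the group
            if (!Rr[$1]) Rr[$1] = $2                 # first name is the replacement
            next }
FNR == 1  { if (out) close(out)                      # done with the previous output file
            out = FILENAME "new" }
          { for (i in Ar) gsub(Ar[i], Rr[i])
            print > out }
' file.csv *.html

ls
```

This should leave a.htmlnew and b.htmlnew next to the untouched originals.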

Brilliant, RudiC, this is going to be extremely useful!

---------- Post updated 01-31-13 at 12:15 AM ---------- Previous update was 01-30-13 at 06:46 PM ----------

I managed to output the results in a new file with

{print >> "new"}

Is there a way to just overwrite the original files? It's necessary to replace them with the results anyway.

Files don't really work that way: you can't safely read a file while overwriting it. It's also a big risk to overwrite your originals, because a program bug would wipe out both your input and your output.


Corona688 has already said it: throw away your originals only after being 101% sure your results are what they are supposed to be.

Once you are indeed sure you want to replace your originals use "mv" to move the results over the originals:

find /some/path/to/start -type f -name "*new" -print | while IFS= read -r file ; do
     mv "$file" "${file%???}"
done

This moves all files whose names end in "new" to the same name less "new", i.e. "filenamenew" -> "filename".
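
If you want a little more safety, something like the following sketch keeps a backup of each original and skips results that came out empty. The .bak suffix, the *.htmlnew pattern, and the sample files are assumptions for illustration; adjust them to your own naming.

```shell
#!/bin/sh
# Sketch: replace originals with their "new" counterparts, keeping a backup.
# start_dir and its contents are sample data created only for this demo.
start_dir=$(mktemp -d)
printf 'old\n'   > "$start_dir/a.html"
printf 'fixed\n' > "$start_dir/a.htmlnew"
printf 'old\n'   > "$start_dir/b.html"
: > "$start_dir/b.htmlnew"            # an empty result, as a failed run might leave

find "$start_dir" -type f -name '*.htmlnew' | while IFS= read -r new; do
    orig=${new%new}                   # strip the trailing "new"
    if [ ! -s "$new" ]; then          # -s is true only for non-empty files
        echo "skipping empty result: $new" >&2
        continue
    fi
    # Back up the original first; only move if the backup succeeded.
    cp -p "$orig" "$orig.bak" && mv "$new" "$orig"
done

ls "$start_dir"
```

Once you have checked the results, the .bak files can be removed.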

I hope this helps.



@Corona688, @bakunin: Thanks for the clarification.