I have an XML file that contains an image tag as follows:
<image><img src="wstc_0007_0007_0_img0001.jpg" width="351" height="450" alt="This is the cover page. Brazil � Japan � Korea � Mexico � Singapore � Spain" type="photograph" orient="portrait"/></image>
Now I want to write a script that checks whether the alt attribute contains more than 250 characters and, if it does, truncates the attribute value to its first 250 characters.
Could anybody show me a way to do this?
This truncates the alt attribute item by item (it drops everything from the last " &" onward until the text fits), rather than by characters:
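altcheck() #@ USAGE: altcheck [max]
{ #@ Truncate the alt attribute in $line to at most $max characters
  max=${1:-250}
  right=${line#*alt=\"}    # text after the opening quote of alt
  left=${line%"$right"}    # everything up to and including alt="
  alt=${right%%\"*}        # the alt value itself
  right=${right#"$alt"}    # closing quote and the rest of the line
  while [ ${#alt} -gt "$max" ]
  do
    case $alt in
      *" &"*) alt=${alt% &*} ;;  # drop the last " &..." item
      *) break ;;                # nothing left to drop; do not loop forever
    esac
  done
  line=$left$alt$right
}
while IFS= read -r line
do
  case $line in
    *"<img"*alt=*) altcheck ;;
  esac
  printf "%s\n" "$line"
done < "$FILE"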
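If a hard cut at a character count is what you need, the same splitting can be reused with a different loop. This is only a sketch: altcut is a made-up name, it assumes a single alt="..." per line with no double quote inside the value, and it trims one character per pass, so a cut may land in the middle of a word or of an entity such as &amp;.

```shell
altcut() # hypothetical helper; USAGE: altcut [max]
{ # Truncate the alt attribute in $line to at most $max characters
  max=${1:-250}
  right=${line#*alt=\"}    # text after the opening quote of alt
  left=${line%"$right"}    # everything up to and including alt="
  alt=${right%%\"*}        # the alt value itself
  right=${right#"$alt"}    # closing quote and the rest of the line
  while [ ${#alt} -gt "$max" ]
  do
    alt=${alt%?}           # drop the last character
  done
  line=$left$alt$right
}

# Small demonstration with a made-up line and a limit of 4
line='<img src="a.jpg" alt="abcdefghij" type="photograph"/>'
altcut 4
printf '%s\n' "$line"
```

With the demonstration line above, the alt value is cut down to abcd and the closing quote stays in place. To use it on a file, call altcut instead of altcheck in the read loop.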
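I duplicated some of the text inside the alt attribute on a second line so that I could show it being trimmed, and then trimmed it at 150 characters.
Note that the cut does not necessarily fall on a word boundary, that the 150 characters are counted from the start of alt=" rather than from the start of the value, and that the closing quotation mark (") is lost when the value is truncated.
However, this logic does trim that field: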
> cat file164
<image><img src="wstc_0007_0007_0_img0001.jpg" width="351" height="450" alt="This is the cover page. Brazil � Japan � Korea � Mexico � Singapore � Spain" type="photograph" orient="portrait"/></image>
<image><img src="wstc_0007_0007_0_img0001.jpg" width="351" height="450" alt="This is the cover page. Brazil � Japan � Korea � Mexico � Singapore � Spain Brazil � Japan � Korea � Mexico � Singapore � Spain" type="photograph" orient="portrait"/></image>
> sed "s/alt/~alt/g" file164 | sed "s/type/~type/g" | awk -F"~" '{print $1,substr($2,1,150),$3}'
<image><img src="wstc_0007_0007_0_img0001.jpg" width="351" height="450" alt="This is the cover page. Brazil � Japan � Korea � Mexico � Singapore � Spain" type="photograph" orient="portrait"/></image>
<image><img src="wstc_0007_0007_0_img0001.jpg" width="351" height="450" alt="This is the cover page. Brazil � Japan � Korea � Mexico � Singapore � Spain Brazil � Japan � Kor type="photograph" orient="portrait"/></image>
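One way to keep the closing quote and count only the characters of the value is to let awk locate the attribute with match(). Treat this as a sketch under assumptions: trim_alt is a made-up name, the value must not contain a double quote, only the first alt="..." on each line is handled, and whether length() counts bytes or characters depends on the awk implementation and locale.

```shell
trim_alt() # hypothetical helper; USAGE: trim_alt [max] < file
{
  awk -v max="${1:-250}" '
  {
    # Locate alt="..." on the line
    if (match($0, /alt="[^"]*"/)) {
      val = substr($0, RSTART + 5, RLENGTH - 6)    # text between the quotes
      if (length(val) > max)
        # prefix up to the opening quote + cut value + closing quote onward
        $0 = substr($0, 1, RSTART + 4) substr(val, 1, max) substr($0, RSTART + RLENGTH - 1)
    }
    print
  }'
}

# Small demonstration with a made-up line and a limit of 4
printf '%s\n' '<img src="a.jpg" alt="abcdefghij" type="photograph"/>' | trim_alt 4
```

For the sample file the call would be trim_alt 150 < file164. Lines without an alt attribute pass through unchanged.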