Truncate the content of the alt attribute to the first 250 characters

I have an XML file that contains an image tag as follows:

<image><img src="wstc_0007_0007_0_img0001.jpg" width="351" height="450" alt="This is the cover page. Brazil � Japan � Korea � Mexico � Singapore � Spain" type="photograph" orient="portrait"/></image>

Now I want to write a script that checks whether the alt attribute contains more than 250 characters and, if it does, truncates the attribute value to its first 250 characters.

It would be really nice if anybody could show me how to do this.

Note that if the cut lands in the middle of an � or some such, the result could be invalid HTML.

This truncates the alt attribute by item, rather than characters:

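altcheck() #@ USAGE: altcheck [max]
{ #@ Truncate the alt attribute in $line to at most $max characters
  max=${1:-250}
  right=${line#*alt=\"}    # everything after alt="
  left=${line%"$right"}    # everything up to and including alt="
  alt=${right%%\"*}        # the attribute value itself
  right=${right#"$alt"}    # closing quote and the rest of the line
  while [ "${#alt}" -gt "$max" ]
  do
    case $alt in
      *" &"*) alt=${alt% &*} ;; # drop the last " &...;" item
      *) alt=${alt%?} ;;        # no item left to drop: remove one character
    esac
  done
  line=$left$alt$right
}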

while IFS= read -r line
do
  case $line in
    *"<img"*alt=*) altcheck ;;
  esac
  printf "%s\n" "$line"
done < "$FILE"
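If a plain character cut is wanted instead of the item-based one above, the concern about splitting an entity can be handled by backing off whenever the cut leaves a dangling "&..." with no closing ";". The following is only a minimal sketch under that assumption; the function name truncate_alt is made up for illustration and is not part of the script above. Note that ${#text} counts characters in bash under a UTF-8 locale, while some other shells count bytes.

```sh
# truncate_alt TEXT [MAX]  (hypothetical helper, not from the thread)
# Print TEXT cut to at most MAX characters without leaving half an entity.
truncate_alt() {
  text=$1
  max=${2:-250}
  # Short enough already: print unchanged.
  if [ "${#text}" -le "$max" ]; then
    printf '%s\n' "$text"
    return
  fi
  # Drop one character at a time until the text fits.
  while [ "${#text}" -gt "$max" ]
  do
    text=${text%?}
  done
  # If the part after the last "&" has no ";", the cut split an entity.
  case $text in
    *"&"*)
      case ${text##*&} in
        *";"*) ;;                 # last entity is complete, keep it
        *) text=${text%&*} ;;     # remove the dangling "&..." fragment
      esac ;;
  esac
  printf '%s\n' "$text"
}
```

Inside altcheck, the while loop could then be replaced with something like alt=$(truncate_alt "$alt" "$max"), at the cost of one subshell per matching line.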

Hi Johnson!

Thanks for your help, but the above script is not working. Please help! Thanks.

I duplicated some of the text inside the alt value so that I could show it being trimmed, and then trimmed the field at 150 characters.
Note that the cut does not necessarily fall at a clean break, and it drops the closing quotation mark (") of the attribute.
However, this logic does trim that field.

> cat file164
<image><img src="wstc_0007_0007_0_img0001.jpg" width="351" height="450" alt="This is the cover page. Brazil � Japan � Korea � Mexico � Singapore � Spain" type="photograph" orient="portrait"/></image>
<image><img src="wstc_0007_0007_0_img0001.jpg" width="351" height="450" alt="This is the cover page. Brazil � Japan � Korea � Mexico � Singapore � Spain Brazil � Japan � Korea � Mexico � Singapore � Spain" type="photograph" orient="portrait"/></image>

> sed "s/alt/~alt/g" file164 | sed "s/type/~type/g" | awk -F"~" '{print $1,substr($2,1,150),$3}'
<image><img src="wstc_0007_0007_0_img0001.jpg" width="351" height="450"  alt="This is the cover page. Brazil � Japan � Korea � Mexico � Singapore � Spain"  type="photograph" orient="portrait"/></image>
<image><img src="wstc_0007_0007_0_img0001.jpg" width="351" height="450"  alt="This is the cover page. Brazil � Japan � Korea � Mexico � Singapore � Spain Brazil � Japan � Kor type="photograph" orient="portrait"/></image>
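To keep the closing quote, one option is to let awk locate the quoted alt value itself and cut only the text between the quotes. This is a rough sketch, not a tested replacement for the pipeline above: the function name trim_alt_awk is hypothetical, it assumes one alt="..." per line with no embedded double quotes, and whether length() and substr() count characters or bytes depends on the awk implementation and locale.

```sh
# trim_alt_awk [MAX]  (hypothetical helper) reads lines on stdin and
# truncates the value of the first alt="..." on each line to MAX characters.
trim_alt_awk() {
  awk -v max="${1:-250}" '
    match($0, /alt="[^"]*"/) {
      # alt=" is 5 characters long and the closing quote is 1
      val = substr($0, RSTART + 5, RLENGTH - 6)
      if (length(val) > max)
        $0 = substr($0, 1, RSTART + 4) substr(val, 1, max) substr($0, RSTART + RLENGTH - 1)
    }
    { print }
  '
}
```

It would be run as trim_alt_awk 150 < file164. Lines without an alt attribute, or with a short one, pass through unchanged.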

What does "not working" mean? What actually happens?

Where does the script fail?

Are there any error messages?