RPM Repo Cleanup

Greetings all,

I have inherited this offline Red Hat YUM repo that contains over 42000 packages. You read that right. There are 71 kernels alone. The process that I've inherited has us reposync on an Internet-connected server, then sneaker-net the delta to our offline repo where we do a yum --update.

The offline repo is massive and takes too much time just to update.

I'm trying to develop a script that will process the offline repo and keep perhaps the most recent 5 or 6 packages. For example, if I have:

kernel-version-1 through kernel-version-71, I want to strip out the older ones and keep kernel-version-67 through kernel-version-71. Then, of course, there are all the other packages like openssl, samba, poppler, etc.

I've been experimenting on the command line with what ls and stat can do using their time arguments, along with the standard processing that grep and awk can do, throwing in an xargs here and there, but I haven't hit upon the golden nugget yet that makes me feel comfortable with proceeding.

I'm worried about accidentally removing critical RPM dependencies, so that is why I want to keep 5 or 6 versions (maybe more) back, and if there are fewer than 5 or 6 versions ... say only 3 ... then keep all 3.

I've also thought about exploring what the rpm and yum commands can do by having them tell me what their dependencies are then keep those dependencies along with the RPM they support while at the same time making sure those RPMs are the most recent 5 or 6.

Any ideas or thoughts on approaching this issue?

---------- Post updated at 10:19 AM ---------- Previous update was at 09:58 AM ----------

Umm. Just stumbled across something called repomanage. This might be the ticket. I will explore this also.

Hi,

As a general approach, you could try something like this:

  1. Identify the first part of each filename before the version number and architecture (e.g. for foo-1.3.2-1024.1.rhel6.x86_64.rpm, you'd want to just capture the foo- part).

  2. For each base filename determined in Step 1, do an ls on all of them in chronological or alphabetical order (or whatever order you desire).

  3. From the output of Step 2, keep the last X lines of the output.

  4. Keep only those files you listed in Step 3, and remove/archive every other foo-*rpm file that was not on the list.

That's the first idea that springs to mind, anyway.
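
The four steps above can be sketched roughly as the shell function below. This is only a sketch under some assumptions: list_old_rpms is a made-up helper name, the files follow the usual name-version-release.arch.rpm naming, the paths contain no tabs or newlines, and GNU sort is available for its -V version sort. Note that sort -V only approximates how rpm itself compares versions (it knows nothing about epochs, for instance), so treat the output as a list of candidates to review rather than something to delete blindly.

```shell
# Print the RPM files that fall outside the newest N per package name and arch.
# Assumes name-version-release.arch.rpm file names and GNU sort (-V).
# The body runs in a subshell so its variables do not leak into the caller.
list_old_rpms() (
    dir=$1
    keep=$2
    tab=$(printf '\t')
    for f in "$dir"/*.rpm; do
        [ -e "$f" ] || continue          # directory holds no RPMs
        base=${f##*/}                    # foo-1.3.2-1024.1.rhel6.x86_64.rpm
        noext=${base%.rpm}               # foo-1.3.2-1024.1.rhel6.x86_64
        arch=${noext##*.}                # x86_64
        nvr=${noext%.*}                  # foo-1.3.2-1024.1.rhel6
        name=${nvr%-*-*}                 # foo (Step 1)
        ver=${nvr#"$name"-}              # 1.3.2-1024.1.rhel6
        printf '%s\t%s\t%s\n' "$name.$arch" "$ver" "$f"
    done |
    # Step 2: group by name.arch, newest version first within each group.
    sort -t "$tab" -k1,1 -k2,2Vr |
    # Steps 3 and 4: skip the first $keep lines of each group, print the rest.
    awk -F '\t' -v keep="$keep" '++seen[$1] > keep { print $3 }'
)
```

Running list_old_rpms /path/to/rpms 5 would print only the files beyond the newest five of each package, and a package with three versions prints nothing at all. Once the list looks right, it could be fed to rm or mv.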

The second method I can think of (and the one I'd probably use myself in this type of situation) is to use find. If you could clearly identify a cut-off point in time beyond which you wanted to discard all RPMs, then something like this might work:

find /path/to/rpms -type f -name "*.rpm" -mtime +31 -exec rm -fv \{\} \;

This particular example would remove all files with names matching the pattern *.rpm older than 31 days. Note again that this will only be safe if you are sure that every part of this is safe to do: i.e. that there's nothing else you could accidentally match with *.rpm beneath /path/to/rpms/ other than the RPMs you want to remove; and that 31 days is an acceptable and safe cut-off point beyond which it is definitely safe to delete things.
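
Before running anything with rm in it, it may be worth previewing what the same test would match. Here is a minimal sketch: preview_old_rpms is a hypothetical helper name, and it uses the same find tests as above with -print in place of -exec rm. Bear in mind that find counts age in whole 24-hour periods, so -mtime +31 matches files last modified at least 32 days ago.

```shell
# Preview which RPMs a given age cut-off would match, without deleting anything.
# preview_old_rpms is a hypothetical helper name, not an existing tool.
preview_old_rpms() {
    # $1 = directory to search, $2 = age threshold in days
    find "$1" -type f -name '*.rpm' -mtime +"$2" -print
}
```

For example, preview_old_rpms /path/to/rpms 31 lists the files the rm version would remove, and piping that into wc -l gives a quick count to sanity-check the cut-off.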

Anyway, there are a couple of possibilities for you to consider.

Thanks drysdalk for the tips. That is indeed sound reasoning. I did find, however, that repomanage from the yum-utils RPM works wonders. This did the trick as far as I can tell. I'm still evaluating the results.

cp $(repomanage --keep=10 --new /path/to/the-large-repo) /path/to/new_repo 
createrepo /path/to/new_repo

But my new repo may still be too large, though it is now down to 38000+ packages vice 42000+. I think I will try --keep=5 as I originally intended.