What Cinderella did - sorting the bytes

Hi,

I messed up a gzip'ed tarball with tar's verbose output:

tar -cvzf /dev/stdout /home | split -d -b 4000m - backupPART

I partitioned the hard disk and installed Fedora 8. My 6 GB '/home' directory is gone. All I have left is a messy tarball.

I hope I can do the Cinderella job: separate the good bytes (gzip'ed data) from the bad bytes (tar's verbose listing of copied full-path file names, each starting with /home/az and ending at <LINEFEED>).

I'm a newbie, and I think the textutils are not appropriate here. Then I found tr and dd.

I need a little help to realise a nice command or script. My idea is:

read next byte from backup.mess
if byte == '/' then
    if next 7 bytes == 'home/az' then
        delete all bytes up to and including <LINEFEED>

But I don't know how to script that. I'll be glad if you give me a hint. Thank you.
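The idea above can be rendered as a one-pass substitution over the whole file. This is only a hedged sketch: the /tmp paths and the sample bytes are made up, and it assumes every listing line landed in the file intact and that "/home/az" never occurs by chance inside the compressed data (assumptions the replies below call into question).

```shell
# Delete every run of bytes from "/home/az" up to and including the
# next linefeed, treating the whole file as one binary string.
printf '\037\213/home/az/some/file\n\010\000rest' > /tmp/mess
perl -0777 -pe 's|/home/az[^\n]*\n||g' /tmp/mess > /tmp/cleaned
wc -c < /tmp/mess     # 27 bytes before the filter
wc -c < /tmp/cleaned  # 8 bytes after: the listing line is gone
```

The `-0777` switch makes perl slurp the file whole instead of line by line, so the match can cross anything except the terminating linefeed.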

You should be OK. I create tarballs almost like that somewhat frequently. (That f and /dev/stdout was basically a no-op; by default, stdout is where the output goes.) The v sends the listing to stderr, so it should have been displayed while the tar command was in progress. Meanwhile, the archive was going to stdout and should have been fine. If this were not the case, every tar archive ever created with a v would have the same problem, and the solution, if any, would be well known. If no solution had been found, tar would have been rewritten decades ago to ignore the v during a c.

Hi.

Using tar benignly to obtain a table of contents should tell you if the tar file is broken ... cheers, drl
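That benign check is tar's -t (table of contents) mode. A small self-contained sketch (the /tmp paths are made up for illustration):

```shell
# Build a tiny known-good archive, then list it benignly with -t:
# exit status 0 means every header parsed cleanly.
mkdir -p /tmp/toctest && echo ok > /tmp/toctest/f
tar -czf /tmp/toctest.tgz -C /tmp toctest
tar -tzf /tmp/toctest.tgz > /dev/null && echo "table of contents readable"
```

Running the same `tar -tzf` against the damaged backup should fail with "This does not look like a tar archive" if the listing bytes really are interleaved.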

I'm using:
# tar --version
tar (GNU tar) 1.17
Copyright (C) 2007 Free Software Foundation, Inc.
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by John Gilmore and Jay Fenlason.

and I can replicate that error:

# tar -cvzf /dev/stdout /etc | split -d -b 10m - etc.tgz.messy
tar: Entferne führende „/“ von Elementnamen
tar: Entferne führende „/“ von Zielen harter Verknüpfungen
# head -2 etc.tgz.messy00
/etc/
/etc/redhat-lsb/
# tar -czf /dev/stdout /etc | split -d -b 10m - etc.tgz.clean
tar: Entferne führende „/“ von Elementnamen
tar: Entferne führende „/“ von Zielen harter Verknüpfungen
# xxd etc.tgz.clean00 |head -2
0000000: 1f8b 0800 ac84 5147 0003 ec5c 7b73 dbb6 ......QG...\{s..
0000010: 96ef bfd6 a740 e4ec c876 2459 a41e 769c .....@...v$Y..v.
#

It seems tar is using stdout instead of stderr for the verbose listing. Tar's error messages ('Entferne' is German for 'remove') went to stderr and did not get copied into the split (messed-up) tarball.
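A likely explanation, hedged: GNU tar sends the verbose listing to stderr only when it knows the archive itself is on stdout, i.e. when the archive name is `-`. Given `/dev/stdout` as an ordinary file name, tar doesn't make that connection, so the listing can go to stdout and land in the archive. A sketch of the difference (paths are made up; the mixing behavior is what the transcript above suggests, not something guaranteed on every system):

```shell
mkdir -p /tmp/vtest && echo data > /tmp/vtest/file
# Archive name "-": tar knows stdout carries the archive, the listing
# goes to stderr, and the result starts with the gzip magic bytes 1f 8b.
tar -cvzf - -C /tmp vtest > /tmp/clean.tgz 2> /dev/null
head -c 2 /tmp/clean.tgz | od -An -tx1
# Archive name /dev/stdout: tar treats it as a regular file, so the
# verbose listing may be written to stdout and mix into the archive.
tar -cvzf /dev/stdout -C /tmp vtest > /tmp/messy.tgz 2> /dev/null
head -c 2 /tmp/messy.tgz | od -An -tx1
```

Checking the first two bytes for 1f 8b, as the xxd output above does, is a quick way to tell which case you have.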

Is there a utility for easily removing the file listings from the corrupted gzip'ed tarball?

Thank you.

Yes, it's corrupt (and the only backup I have):

# tar -xf backup
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Error exit delayed from previous errors
#

I found this in the info page for gnu tar:

So leaving off the f option and the /dev/stdout would still have sent the archive to the pipeline and the listing to the terminal. So yes, you do have a garbled archive. I don't see an obvious way to script your proposed solution. But worse, I see a couple of potential problems with it...

Problem one: the byte sequence "/home" probably appears somewhere in the good (compressed) part of the archive just by chance, and your solution would then drop real archive bytes.

Problem two: the output of the listing was probably buffered because it was going to a non-tty, so blocks of listing may be interspersed and lines may be split between blocks. "/home/this/that" might not have fit in the buffer: "/home/th" was put in and the buffer was flushed, then "is/that{lf}" was placed in the next buffer. Meanwhile, output buffers of the archive were being written in between.

Sorry for the bad news, but I doubt that the archive can be salvaged.

Info is nice, I didn't know Info:-)

Actually I applied gzip (the 'z' option), so I have a gzip'ed archive salted with file listings at random. A gzip'ed archive is like any gzip'ed file, and tar's listings didn't flow through the gzipper, I think. I assume the gzipped (tar) file data is in good order, but plain-text tar listings are randomly spread over the gzipped file.

Yes, I understand. The tar listing entries may be cut, i.e. '/home/az/...[a-Z]...<LF>' can't be used as a search string.

On the other hand, the problem is reduced to identifying the bad bytes, namely tar's interspersed file names, in my messy backup file. I don't see why it should be impossible in principle. I don't know the gzip algorithm or the gzip file format; maybe there are checksums, byte-range constraints, or something similar that would be useful for recovery.
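There are indeed checksums: each gzip member ends with a CRC-32 of the uncompressed data plus its length, and `gzip -t` verifies them. A small sketch (file names made up; this checks integrity, it does not by itself locate or repair the bad bytes):

```shell
# Each gzip member carries a CRC-32 and the uncompressed size in its
# trailer; gzip -t verifies both without writing any output.
echo "sample data" | gzip > /tmp/sample.gz
gzip -t /tmp/sample.gz && echo "stream intact"
# Chop off the tail (deflate data and trailer) and the check fails.
head -c 12 /tmp/sample.gz > /tmp/truncated.gz
gzip -t /tmp/truncated.gz 2> /dev/null || echo "stream damaged"
```

So the checksum can confirm whether a candidate cleaned-up file is byte-perfect, which is valuable feedback for any recovery attempt.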

However, are there any compression recovery kits available (based on zlib)? I've found 'grzrecover', but it crashes on my file:-)

Thank you.

I think you then also need a z to extract it. You might try that. I rather doubt it will work, but it's easy enough to try.

Examine the octal dump to find out how the listing is clustered. Is it arranged as lines? Is it always a buffer of exactly 8196 characters? You might be able to develop a workable algorithm with more information.
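One way to do that survey, sketched on a hand-made mix (the /tmp paths and the interleaving pattern here are fabricated for illustration, not taken from the real backup):

```shell
# Simulate a messy file: listing text interleaved with compressed data.
{ printf '/home/az/file1\n'
  head -c 32 /dev/urandom | gzip
  printf '/home/az/file2\n'; } > /tmp/mix
# od -c shows where the printable runs sit among the binary bytes.
od -c /tmp/mix | head -4
# grep -abo prints the byte offset of every listing fragment it finds.
grep -abo '/home/az' /tmp/mix
```

On the real file, the offsets from `grep -abo` would reveal whether the listing fragments fall on regular buffer-sized boundaries.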

"Reading compressed archive is even simpler: you don't need to specify any additional options as GNU tar recognizes its format automatically." (GNU tar manual)

All right, that will be my Sunday amusement tomorrow. Again, thank you for your support.

Maybe you could take a test directory of some data and tar it correctly; and then, using the same test directory, tar it incorrectly.

Then, use a file comparison tool to see exactly how the erroneous tar file was structured and whether it is feasible to repair.
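That comparison can be sketched with cmp. Here the "incorrect" file is only simulated by gluing a listing line onto the front of a clean archive (a stand-in for the real interleaving; all paths are made up):

```shell
# Build a clean archive, then a "messy" copy with a listing line
# prepended, simulating tar's leaked verbose output.
mkdir -p /tmp/cmptest && echo data > /tmp/cmptest/f
tar -czf /tmp/good.tgz -C /tmp cmptest
{ printf '/home/az/\n'; cat /tmp/good.tgz; } > /tmp/bad.tgz
# Plain cmp reports the first byte at which the two diverge;
# cmp -l would list every differing byte for mapping the damage.
cmp /tmp/good.tgz /tmp/bad.tgz || echo "archives differ"
```

Note that two "correct" runs of tar -z won't be byte-identical either (the gzip header embeds a timestamp), so compare a correct archive against its own messed-up counterpart rather than two separate correct runs.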