Fixing corrupted vcard files.

dotancohen · October 13, 2008, 8:34pm

KDE's Kontact PIM breaks quoted-printable vcard files because it
inserts line breaks in the middle of a word. Take this text for example:

NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=A9=D7=95=D7=A8=D7=94 =D7=A
 8=D7=90=D7=A9=D7=95=D7=A0=D7=94.\n=D7=94=D7=A9=D7=95=D7=A8=D7=94 =D7=94=D7=
 A9=D7=A0=D7=99=D7=94 =D7=9B=D7=\n

The whole thing should be on one line, and the spaces at the beginning
of each line shouldn't be there at all. I have a directory with 422
files corrupted like this.

Can a shell script go through a directory of files and replace each instance
of "newline-space" with nothing? The system is Ubuntu 8.04 with KDE if
it matters. Thanks.

Annihilannic · October 13, 2008, 10:42pm

Try this:

perl -pi.bak -e 'BEGIN { $/=""; } s/\n //gm' *.vcard

It should save backups of the files as filename.vcard.bak.

dotancohen · October 14, 2008, 7:56am

Thanks. I am trying to see what happens here:
perl: this is obvious
-pi.bak: simply copy the current file to its name + .bak?
-e: there is no mention of this in man perl.
'BEGIN { $/=""; } s/\n //gm': the actual regex. I don't quite get it
*.vcard: go through all these files?

I actually need to change the regex so that it not only removes the space at the beginning of a line, but removes the newline character before it as well. The only newline characters that should remain are those not followed by a space. In PHP that would be str_replace("\n ", "", $string); but I cannot figure out how to modify the perl regex to do the same. And regexes are hard to google for!
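
For comparison, the same "remove every newline-space pair" operation can be sketched in Python. This is only an illustration of the intended behaviour, not part of the perl solution; the function name unfold and the sample string are made up for the example.

```python
def unfold(text: str) -> str:
    """Remove every newline that is immediately followed by a space."""
    return text.replace("\n ", "")


# Hypothetical sample: one wrapped NOTE line followed by a normal line.
sample = "NOTE:The Second Line i\n s long\nUID:frh74xvYZ9\n"
print(unfold(sample))
```

Newlines that are not followed by a space, such as the one before UID, are left alone.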

I do appreciate the code example, but I am also trying to learn a bit (unusual, I know). I very much appreciate your assistance and patience.

Annihilannic · October 14, 2008, 6:35pm

As far as I can see my solution does what you describe:

$ cat testfile.vcard
NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=A9=D7=95=D7=A8=D7=94 =D7=A
 8=D7=90=D7=A9=D7=95=D7=A0=D7=94.\n=D7=94=D7=A9=D7=95=D7=A8=D7=94 =D7=94=D7=
 A9=D7=A0=D7=99=D7=94 =D7=9B=D7=\n
NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=A9=D7=95=D7=A8=D7=94 =D7=A
 8=D7=90=D7=A9=D7=95=D7=A0=D7=94.\n=D7=94=D7=A9=D7=95=D7=A8=D7=94 =D7=94=D7=
 A9=D7=A0=D7=99=D7=94 =D7=9B=D7=\n
NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=A9=D7=95=D7=A8=D7=94 =D7=A
 8=D7=90=D7=A9=D7=95=D7=A0=D7=94.\n=D7=94=D7=A9=D7=95=D7=A8=D7=94 =D7=94=D7=
 A9=D7=A0=D7=99=D7=94 =D7=9B=D7=\n
$ perl -pi.bak -e 'BEGIN { $/=""; } s/\n //gm' *.vcard
$ cat testfile.vcard
NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=A9=D7=95=D7=A8=D7=94 =D7=A8=D7=90=D7=A9=D7=95=D7=A0=D7=94.\n=D7=94=D7=A9=D7=95=D7=A8=D7=94 =D7=94=D7=A9=D7=A0=D7=99=D7=94 =D7=9B=D7=\n
NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=A9=D7=95=D7=A8=D7=94 =D7=A8=D7=90=D7=A9=D7=95=D7=A0=D7=94.\n=D7=94=D7=A9=D7=95=D7=A8=D7=94 =D7=94=D7=A9=D7=A0=D7=99=D7=94 =D7=9B=D7=\n
NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=A9=D7=95=D7=A8=D7=94 =D7=A8=D7=90=D7=A9=D7=95=D7=A0=D7=94.\n=D7=94=D7=A9=D7=95=D7=A8=D7=94 =D7=94=D7=A9=D7=A0=D7=99=D7=94 =D7=9B=D7=\n

There aren't any hidden/funny characters in the input files, are there? Check with cat -vet.
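
If a scripted check is more convenient than reading cat -vet output, something like the following Python sketch counts the kinds of line endings in a file's bytes. The function name describe_line_endings and the sample data are assumptions made for illustration.

```python
def describe_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, bare LFs and bare CRs in raw file contents."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf_only": data.count(b"\n") - crlf,
        "bare_cr": data.count(b"\r") - crlf,
    }


# Hypothetical sample in DOS format; a real file would be read with
# open(path, "rb").read() so that no newline translation takes place.
sample = b"BEGIN:VCARD\r\nFN:First Last\r\nEND:VCARD\r\n"
print(describe_line_endings(sample))
```

A non-zero "crlf" count means the file uses DOS line endings.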

man perlrun is the page you really need to look at for the command-line options. -p and -i are separate options that I combined for the sake of brevity.

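-p wraps the code in a loop that reads each input record, runs the code on it and prints the result, much like sed or awk. Like awk, perl also supports a BEGIN block that runs before any input is processed. In that block I've redefined the input record separator ($/) to be an empty string. Strictly speaking this selects "paragraph mode": perl reads the input in chunks separated by blank lines rather than line-by-line (to slurp the entire file in one go you would undefine $/ or use the -0777 switch). Either way each record spans several lines, which allows us to do regex matches across line breaks. The BEGIN block is separate from the actual s/// command that does the search and replace.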
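
Why the record size matters can be sketched in Python (an illustration only; the variable names and sample data are made up). When the input is handled one line at a time, each chunk ends at its newline, so the newline and the space that follows it are never in the same chunk and the pattern cannot match.

```python
import re

# Hypothetical sample: one wrapped line followed by a normal line.
data = "NOTE:wrapped i\n s long\nUID:abc\n"

# Line by line: every chunk ends with "\n", so "\n " never occurs inside one.
per_line = "".join(
    re.sub(r"\n ", "", line) for line in data.splitlines(keepends=True)
)

# Whole input as one string: the newline and the following space are adjacent.
whole = re.sub(r"\n ", "", data)

print(repr(per_line))
print(repr(whole))
```

The line-by-line result is identical to the input, while the whole-input result has the wrapped line joined.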

s/// is documented on the man perlop page.

I'm glad to see you don't just want spoonfeeding (all too common around here!).

dotancohen · October 15, 2008, 7:49am

Thanks. I'm going through the docs as we speak. Perl is _complicated_! That does not seem to be only my own opinion, either. Googling some examples leads me to lots of frustrated people!

In any case, I probably should have posted the entire vcard file. Here it is, along with the results of the code:

hardy2@hardy2-laptop:~/test$ cat test.vcf
BEGIN:VCARD
FN:First Last
N:Last;First;;;
NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:First Line.\nThe Second Line i
 s long so that it will wrap. Long\, long\, and wrapping!=\n\nThird Line.\n
UID:frh74xvYZ9
VERSION:2.1
END:VCARD

BEGIN:VCARD
FN;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=90=D7=90=D7=A4=D7=A8=D7=98=D
 7=99 =D7=9E=D7=A9=D7=A4=D7=97=D7=94
N;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=9E=D7=A9=D7=A4=D7=97=D7=94;=D
 7=90=D7=90=D7=A4=D7=A8=D7=98=D7=99;;;
NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=A9=D7=95=D7=A8=D7=94 =D7=A
 8=D7=90=D7=A9=D7=95=D7=A0=D7=94.\n=D7=A9=D7=95=D7=A8=D7=94 =D7=A9=D7=A0=D7=
 99=D7=94 =D7=94=D7=99=D7=90 =D7=\n=90=D7=A8=D7=95=D7=9B=D7=94\, =D7=9B=D7=9
 3=D7=99 =D7=A9=D7=A0=D7=A8=D7=90=\n =D7=90=D7=95=D7=AA=D7=94 =D7=92=D7=95=D
 7=9C=D7=A9=D7=AA. =D7=90=D7=A8=D7=\n=95=D7=9B=D7=94\, =D7=90=D7=A8=D7=95=D7
 =9B=D7=94\, =D7=95=D7=92=D7=95=D7=9C=\n=D7=A9=D7=AA!\n=D7=A9=D7=95=D7=A8=D7
 =94 =D7=A9=D7=9C=D7=99=D7=A9=D7=99=D7=AA.\n
UID:KqbQKbfBaF
VERSION:2.1
END:VCARD

hardy2@hardy2-laptop:~/test$ perl -pi.bak -e 'BEGIN { $/=""; } s/\n //gm' *.vcf
hardy2@hardy2-laptop:~/test$ cat test.vcf
BEGIN:VCARD
FN:First Last
N:Last;First;;;
s long so that it will wrap. Long\, long\, and wrapping!=\n\nThird Line.\ni
UID:frh74xvYZ9
VERSION:2.1
END:VCARD

BEGIN:VCARD
7=99 =D7=9E=D7=A9=D7=A4=D7=97=D7=94INTABLE:=D7=90=D7=90=D7=A4=D7=A8=D7=98=D
7=90=D7=90=D7=A4=D7=A8=D7=98=D7=99;;;ABLE:=D7=9E=D7=A9=D7=A4=D7=97=D7=94;=D
=94 =D7=A9=D7=9C=D7=99=D7=A9=D7=99=D7=AA.\nA9=D7=AA!\n=D7=A9=D7=95=D7=A8=D7
UID:KqbQKbfBaF
VERSION:2.1
END:VCARD

hardy2@hardy2-laptop:~/test$

As can be easily seen, the lines still wrap, and worse, critical parts of the file are destroyed. I have been playing around with the line of code, but it is slow going and I could really use a hand with this. I do appreciate your patience and willingness to teach a noob.

Annihilannic · October 15, 2008, 7:17pm

I suspect there are some funny line terminators in this file. Can you post the output of cat -vet test.vcf?

I agree about perl, it looks pretty horrible and I was a very slow adopter; but its brevity, power and ubiquity make it difficult to live without. I generally use awk when I can, but perl is ideal for this problem due to its convenient handling of multi-line regex.

dotancohen · October 16, 2008, 4:57am

hardy2@hardy2-laptop:~$ cat -vet test.vcf
BEGIN:VCARD^M$
FN:First Last^M$
N:Last;First;;;^M$
NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:First Line.\nThe Second Line i^M$
 s long so that it will wrap. Long\, long\, and wrapping!=\n\nThird Line.\n^M$
UID:frh74xvYZ9^M$
VERSION:2.1^M$
END:VCARD^M$
^M$
BEGIN:VCARD^M$
FN;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=90=D7=90=D7=A4=D7=A8=D7=98=D^M$
 7=99 =D7=9E=D7=A9=D7=A4=D7=97=D7=94^M$
N;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=9E=D7=A9=D7=A4=D7=97=D7=94;=D^M$
 7=90=D7=90=D7=A4=D7=A8=D7=98=D7=99;;;^M$
NOTE;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:=D7=A9=D7=95=D7=A8=D7=94 =D7=A^M$
 8=D7=90=D7=A9=D7=95=D7=A0=D7=94.\n=D7=A9=D7=95=D7=A8=D7=94 =D7=A9=D7=A0=D7=^M$
 99=D7=94 =D7=94=D7=99=D7=90 =D7=\n=90=D7=A8=D7=95=D7=9B=D7=94\, =D7=9B=D7=9^M$
 3=D7=99 =D7=A9=D7=A0=D7=A8=D7=90=\n =D7=90=D7=95=D7=AA=D7=94 =D7=92=D7=95=D^M$
 7=9C=D7=A9=D7=AA. =D7=90=D7=A8=D7=\n=95=D7=9B=D7=94\, =D7=90=D7=A8=D7=95=D7^M$
 =9B=D7=94\, =D7=95=D7=92=D7=95=D7=9C=\n=D7=A9=D7=AA!\n=D7=A9=D7=95=D7=A8=D7^M$
 =94 =D7=A9=D7=9C=D7=99=D7=A9=D7=99=D7=AA.\n^M$
UID:KqbQKbfBaF^M$
VERSION:2.1^M$
END:VCARD^M$
^M$
hardy2@hardy2-laptop:~$

Annihilannic · October 16, 2008, 6:57pm

The ^M characters are carriage returns, which means this file is in Windows/DOS format, where each line ends with a Carriage Return followed by a Line Feed. Unix and Unix-like operating systems use just a Line Feed. My original solution removed only the Line Feed and left the Carriage Return behind. When you catted the file, the terminal returned to the beginning of the line at each CR, and the text that followed overwrote what was already printed on that line.
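
The leftover carriage return is easy to see in a small Python sketch (illustrative only; the sample string is made up). Removing just the newline-space pair from CRLF data leaves a stray "\r" in the middle of the joined line, whereas matching the "\r" as well joins the line cleanly.

```python
# Hypothetical sample in DOS format: every line ends with "\r\n".
crlf = "NOTE:wrapped i\r\n s long\r\nUID:abc\r\n"

# What the original substitution did: strip "\n " only.
lf_only_fix = crlf.replace("\n ", "")

# Matching the carriage return too removes the whole line break.
crlf_fix = crlf.replace("\r\n ", "")

print(repr(lf_only_fix))
print(repr(crlf_fix))
```

repr() is used so that the control characters show up as \r and \n instead of moving the cursor.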

Do VCF files need to be in DOS format? If so, use the following to remove only the line breaks that are followed by a space, as you originally described:

perl -pi.bak -e 'BEGIN { $/=""; } s/\r\n //g;' *.vcf

Otherwise, if you want to convert the files to Unix format in the process, add an extra search and replace to remove the CRs:

perl -pi.bak -e 'BEGIN { $/=""; } s/\r//g; s/\n //g;' *.vcf

I also removed the m from the s/// operators. It only changes how ^ and $ match, and these patterns use neither, so it was unnecessary.
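
For anyone who would rather not use perl, the whole job (walk a directory, keep a .bak copy, join the wrapped lines, optionally drop the carriage returns) might look roughly like this in Python. It is a sketch under assumptions, not a tested replacement for the one-liners above: the names unfold_bytes and unfold_directory, the "*.vcf" default and the to_unix flag are all made up for illustration.

```python
import re
import shutil
from pathlib import Path

# A line break (LF, optionally preceded by CR) followed by one space.
FOLD = re.compile(rb"\r?\n ")


def unfold_bytes(data: bytes, to_unix: bool = False) -> bytes:
    """Join wrapped lines; optionally strip all carriage returns as well."""
    data = FOLD.sub(b"", data)
    if to_unix:
        data = data.replace(b"\r", b"")
    return data


def unfold_directory(directory: str, pattern: str = "*.vcf",
                     to_unix: bool = False) -> int:
    """Fix every matching file in place, saving the original as NAME.bak.

    Returns the number of files that were changed.
    """
    changed = 0
    for path in sorted(Path(directory).glob(pattern)):
        original = path.read_bytes()  # bytes, so CR and LF arrive untouched
        fixed = unfold_bytes(original, to_unix)
        if fixed != original:
            shutil.copy2(path, path.with_name(path.name + ".bak"))
            path.write_bytes(fixed)
            changed += 1
    return changed
```

Calling unfold_directory(".") would correspond to the first command (DOS line endings kept), and unfold_directory(".", to_unix=True) to the second. Unlike perl -i.bak, this sketch only writes a backup for files it actually changes.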