Why Parsing Can't be Done With sed ( or similar tools)

bakunin · September 18, 2015, 8:57am

Regularly we have questions like: i have an XML (C, C++, ...) file with this or that property and i want to extract the content of this or that tag (function, ...). How do i do it in sed ?

Yes, in some (very limited) cases this is possible, but in general this can't be done. That is: you can do it, but whatever your solution will look like there will be a chance that it fails. In a theoretical sense this impossible and hence every solution will have one (or more) shortcomings. It will - in a strict sense - not be a solution, just an approximation thereof.

Here is why:

sed (and similar pattern recognition engines) searches for text patterns. Let us do an example: this regexp searches for text in (round) brackets:

([^)]*)

It would find (this text) and (that text too), even this: (). We could with some effort tweak it to search for the pattern spanning multiple lines (like
this) or even

(
     like
     this
)

if we want.

Now, to some extent, language features are patterns too, no? Consider the last text example and the similarity strikes:

(                 if <clause> ; then
     this              this
     that              that
)                 fi

Yes... but the problem is that language constructs (unlike patterns) can contain themselves as parts - they can be nested:

if <clause> ; then
     this              
     if <other clause> ; then
          something_else
     fi
     that              
fi

What would our text-in-brackets regexp do with this input:

text (more text (and even more) text) and so on

Answer: it would collect the text from the first opening bracket up to the first closing bracket: (more text (and even more) . In all likeliness this is not what we want. We would probably want to get either the content of the outer bracket pair or the content of the inner bracket pair.

Exactly this ability - to understand and follow the level of nesting - is what sets apart a parser from a regexp engine.

In fact it would be possible to put provisions for a certain amount of nesting into the regexp. But we would have to anticipate the maximum level of nesting and so our (theoretical) regexp parser would still be limited to this maximum. Real world parsers work recursively. This means, they would work similar to a regexp engine, but when they encounter a nested construct (like the second opening bracket in the input stream above) they halt and spawn another instance of themselves which will parse the nested part. Once this spawned instance is finished they resume their work.

If you are interested in the theoretical background of all this: the theory of formal languages knows four types of languages (the so-called Chomsky hierarchy or "Chomsky-Sch�tzenberger hierarchy", after its developers Noam Chomsky and Marcel-Paul Sch�tzenberger).

Natural languages like English are of the type 0. It is possible to create a set of rules which will produce all sorts of sentences: they will always be grammatically correct although they might not say something meaningful or even coherent. Chomsky himself gave

Colorless green ideas sleep furiously.

as an example for such a sentence, which is (grammatically) correct, but meaningless.

The subsequent types 1,2 and 3 are continually more restricted than the type 0 which has no restrictions and each type adding some restrictions to the ones alreadyin place. Each member of a hierarchy can do approximately the same (you can say the same things in any language, for instance) but as you descend down the hierarchies the possibilities become more and ore limited.

Programming languages (well, most of them, there are exceptions) are of type 2 while regular expressions (or their counterpart "regular grammars") are of type 3. Because it is possible to something further up inthe hierarchy to create something lower in that hierarchy one can write books (type 0) about programming (type 2) but not a program which writes books and for the same reasons you canuse a programming language (type 2) to create a regexp engine (type 3) but not vice versa a regexp engine that reads (and "understands") a programming language.

bakunin