Developers use compilers more than almost any other tool. Every time you tell Visual Studio to build your code, you're invoking csc.exe, the C# compiler. Without compilers, your C# code would be worthless. In this section, you'll gain an understanding of what compilers do, how they've been designed in the .NET world, and how they have changed in .NET 4.6.
What Do Compilers Do?
It's almost a tradition in the developer's world to have a program print "Hello world" to get familiar with the fundamentals of a language, so that's where we'll start our discussion of compilers. Here's code that will do just that:
using System;

namespace HelloWorld
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.Out.WriteLine("Hello world");
        }
    }
}
Figure 1-1 shows what you'll see when you run the program.
Figure 1-1. Running a simple Hello World program
Of course, your computer didn't execute that text. There's a translation step that, most of the time, you probably don't think about, and that's what the compiler does. It's easy to say that you've compiled your code, but there's a lot that a compiler has to do to make your code actually execute. Let's take a simplified look at a compiler's workflow to get a better understanding of its machinery.
First, the compiler scans your text and identifies all the tokens it contains. Tokens are the individual textual pieces of code that have meaning according to a language's specification. These can be member names, language keywords, and operators. For example, Figure 1-2 shows what the line of code that prints "Hello world" looks like when it's tokenized.
Figure 1-2. Breaking code into separate tokens
The compiler will find everything it can about that line of text and break it up into separate chunks. That includes the period between Console and Out, the tabs before the Console token, and the semicolon at the end of the line. The compiler also has to be smart enough to recognize problems and report meaningful errors once it has finished, rather than stopping at the first error, because there may be more issues in the code.
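To make the idea of tokenizing concrete, here is a minimal sketch of a hand-written tokenizer applied to the line that prints "Hello world". It is a toy, not how csc.exe works: the ToyTokenizer class, the Tokenize method, and the four token kind names are all invented for this illustration, and it ignores escape sequences, comments, numbers, and multi-character operators.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy tokenizer; the names and token kinds are assumptions
// made for this sketch and do not reflect the real C# compiler.
static class ToyTokenizer
{
    // Splits a line into (kind, text) pairs, keeping whitespace as a token.
    public static List<KeyValuePair<string, string>> Tokenize(string line)
    {
        var tokens = new List<KeyValuePair<string, string>>();
        int i = 0;
        while (i < line.Length)
        {
            int start = i;
            char c = line[i];
            string kind;
            if (char.IsWhiteSpace(c))
            {
                // Consume a run of tabs or spaces as one token.
                while (i < line.Length && char.IsWhiteSpace(line[i])) i++;
                kind = "Whitespace";
            }
            else if (char.IsLetter(c) || c == '_')
            {
                // Consume letters, digits, and underscores as one name.
                while (i < line.Length && (char.IsLetterOrDigit(line[i]) || line[i] == '_')) i++;
                kind = "Identifier";
            }
            else if (c == '"')
            {
                i++; // opening quote
                while (i < line.Length && line[i] != '"') i++;
                if (i < line.Length) i++; // closing quote, if present
                kind = "StringLiteral";
            }
            else
            {
                // Everything else is treated as single-character punctuation.
                i++;
                kind = "Punctuation";
            }
            tokens.Add(new KeyValuePair<string, string>(kind, line.Substring(start, i - start)));
        }
        return tokens;
    }

    static void Main()
    {
        foreach (var token in Tokenize("\t\tConsole.Out.WriteLine(\"Hello world\");"))
        {
            Console.Out.WriteLine("{0,-13} [{1}]", token.Key, token.Value);
        }
    }
}
```

For this input the sketch produces ten tokens: the leading tabs, three identifiers, two periods, the two parentheses, the string literal, and the semicolon. A real compiler would also record each token's position so that it can report errors and keep going.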
But the complexities of tokenizing code don't stop here. Now the compiler needs to figure out what those tokens really mean. A tab isn't important from an execution standpoint, but it may matter if you're debugging your code, as the compiler needs to make sure the debugging information ignores that whitespace correctly when a developer creates breakpoints in code. A semicolon means that the statement is complete, so that's important to know, although you're not really doing any execution with that character.

But what does the period mean? It may mean that you're trying to access a property on an object, or call a method. Which one is it? And if it's a method, is it an extension method? If so, where does that extension method exist? Is there an appropriate using directive in the file that will help the compiler figure out where that method is? Or is the developer using a new feature in C# 6, like using static, which needs to be accounted for? The compiler needs to figure out semantics for these tokens based on the rules of the C# language, and if you've ever read the C# specification, you know that this can be an extremely difficult endeavor.
Note
You'll find the C# specification at https://www.microsoft.com/en-us/download/details.aspx?id=7029, although at the time of this writing, it was at version 5; C# 6 features are not included.
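The questions about the period are easier to appreciate with a small example. The following sketch shows three member accesses that look the same at the token level but mean different things, plus a call that relies on the C# 6 using static feature. The PeriodDemo namespace and the Shout extension method are hypothetical names created only for this illustration.

```csharp
using System;
using static System.Math; // C# 6: static members of Math can be used unqualified

namespace PeriodDemo
{
    // Hypothetical extension method, defined only for this illustration.
    static class StringExtensions
    {
        public static string Shout(this string text)
        {
            return text.ToUpperInvariant() + "!";
        }
    }

    static class Program
    {
        static void Main()
        {
            string greeting = "Hello world";

            int length = greeting.Length;     // the period accesses a property
            string trimmed = greeting.Trim(); // the period calls an instance method
            string loud = greeting.Shout();   // the period calls an extension method
            double root = Sqrt(16);           // no period: resolved through using static

            Console.Out.WriteLine("{0} {1} {2} {3}", length, trimmed, loud, root);
        }
    }
}
```

In each case the tokenizer sees only an identifier, a period, and another identifier. It is the semantic phase that has to consult the type of greeting, the extension methods in scope, and the using directives to decide which member is meant.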
Finally, the last job of the compiler is to take all the information it's assembled and actually generate a .NET assembly. This assembly, in turn, contains what's known as Intermediate Language (IL), which the Common Language Runtime (CLR) executes (typically by just-in-time compiling it to native code), along with metadata, such as the names of types and methods. Transforming tokens into IL is a nontrivial job. If you've spent any time working with members in the System.Reflection.Emit namespace, you know it's not easy to encode a method correctly. Forget just one IL instruction and you may end up creating an assembly that will crash horribly at runtime.
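As a rough illustration of what emitting IL by hand involves, the sketch below uses DynamicMethod and ILGenerator from System.Reflection.Emit to build a method that adds two integers. The EmitDemo class and the BuildAdd method are names assumed for this example; a real compiler emits a full assembly with metadata rather than a single in-memory method.

```csharp
using System;
using System.Reflection.Emit;

// Hypothetical demo class; only the System.Reflection.Emit calls are real APIs.
static class EmitDemo
{
    // Builds the equivalent of: static int Add(int a, int b) { return a + b; }
    public static Func<int, int, int> BuildAdd()
    {
        var method = new DynamicMethod("Add", typeof(int), new[] { typeof(int), typeof(int) });
        ILGenerator il = method.GetILGenerator();

        il.Emit(OpCodes.Ldarg_0); // push the first argument onto the evaluation stack
        il.Emit(OpCodes.Ldarg_1); // push the second argument
        il.Emit(OpCodes.Add);     // pop both values and push their sum
        il.Emit(OpCodes.Ret);     // return the value on top of the stack

        return (Func<int, int, int>)method.CreateDelegate(typeof(Func<int, int, int>));
    }

    static void Main()
    {
        Func<int, int, int> add = BuildAdd();
        Console.Out.WriteLine(add(2, 3));
    }
}
```

Calling the returned delegate as add(2, 3) yields 5. Leaving out a single instruction, such as the final OpCodes.Ret, still compiles without complaint, but the runtime will typically reject the method with an InvalidProgramException when it is invoked, which is the kind of failure the paragraph above warns about.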
To summarize, Figure 1-3 illustrates what a compiler does with code, although keep in mind that this is a rudimentary view of a compiler's internal components.
Figure 1-3. General steps that a compiler takes to produce executables
Here's a brief description of each step:
Parsing finds each token in code and classifies it.
Semantics provides meaning to each token (e.g., is the token a type name or a language keyword?).
Emitting produces an executable based on the semantic analysis of the tokens.
Compilers are complex beasts. Whenever I've done a talk on the Compiler API and asked the audience how many people have created or worked on a compiler, I rarely see even one hand go up. Most developers do not spend a significant amount of time developing and maintaining a compiler. They might have written one in a college class, but writing compilers is not something most developers do on a day-to-day basis. Developers are typically more concerned with creating applications for customers. Plus, creating a compiler that handles the specification of a given programming language is typically difficult. It's a challenge even for two implementations of a compiler for the same language, written by two different teams, to behave exactly the same. Therefore, developers who use a programming language will gravitate to a very small set of compiler implementations to reduce the chances of discrepancies.
Note
If you're interested in learning more about compilers, check out Modern Compiler Design (Springer, 2012) at http://www.springer.com/us/book/9781461446989.