Programming Language Processors

Introduction

Nowadays, most programs are written in a high-level language such as C, Java, or Python. These languages are designed for people rather than machines: they hide the hardware details of a specific computer from the programmer.

Simply put, high-level languages simplify the job of telling a computer what to do. However, since computers only understand instructions in machine code (in the form of 1's and 0's), we cannot communicate with them without some sort of translator.

This is why language processors exist.

The language processor is a special translator system used to turn a program written in a high-level language, which we call "source code", into machine code, which we call "object program" or "object code".

In order to design a language processor, a precise description of the lexicon, syntax, and semantics of the language is needed.

There are three types of language processors:

  • Assembler
  • Interpreter
  • Compiler

In the next few sections we'll go over each of these types of processors and discuss their purpose, differences, etc.

Assembly Languages and the Assembler

Most assembly languages are very similar to machine code (which is why they are specific to a computer architecture or operating system), but instead of using binary numbers to describe an instruction, they use mnemonic symbols.

Each mnemonic symbol represents an operation code or instruction, and we typically need several of them in conjunction to do anything useful. These instructions can be used to move values between registers (in the x86-64 architecture this instruction is MOV), to do basic arithmetic on values such as addition, subtraction, multiplication, and division (ADD, SUB, MUL, DIV), as well as basic operations like shifting a number left or right, or negating it (SHL, SHR, NEG). Assembly also offers unconditional and conditional jumps, which are used to implement a "for" loop, a "while" loop, or an "if" statement (JMP, JE, JLE...).

For example, if the processor interprets the binary command 10110 as "move from one register into another register", an assembly language would replace it with a command, such as MOV.

Each register also has a binary identifier, such as 000. This can also be replaced with a more "human-like" name, such as EAX, which is one of the general-purpose registers in x86.

If we, say, wanted to move a value into a register, the machine code would look something like:

00001 000 00001010
  • 00001: Is the move command
  • 000: Is the register's identifier
  • 00001010: Is the value we want to move

In an assembly language, this can be written as something like:

MOV EAX, 0xA
  • MOV is the move command
  • EAX is the register's identifier
  • 0xA is the hexadecimal value we want to move (10 in decimal)

If we wanted to write down a simple expression EAX = 7 + 4 - 2 in machine code, it would look something like this:

00001 000 00000111
00001 001 00000100
00010 000 001
00001 001 00000010
00011 000 001
  • 00001 is the "move" command
  • 00010 is the "addition" command
  • 00011 is the "subtraction" command
  • 000, 001 are the registers' identifiers
  • 00000111, 00000100, 00000010 are the integer values we are using in this expression

In assembly, this bunch of binary numbers would be written as:

MOV EAX, 7
MOV ECX, 4
ADD EAX, ECX
MOV ECX, 2
SUB EAX, ECX
  • MOV is the move command
  • ADD is the addition command
  • SUB is the subtraction command
  • EAX, ECX are the registers' identifiers (000 and 001 in the binary version)
  • 7, 4, 2 are the integer values we are using in this expression

Although not as readable as a high-level language, this is still far more readable than the binary commands. The CPU's operations and registers are abstracted behind names instead of raw bit patterns.

This makes it easier for a programmer to write source code, since there is no need to manipulate raw numbers in order to program. Translation to object code in machine language is simple and straightforward, and is done by an assembler.

Since the source code is already pretty similar to machine code, there's no need to compile or interpret the code - it's assembled as is.
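To make that translation step concrete, here is a minimal sketch of an assembler written in Python. It only understands the simplified binary format used in the examples above (which is an illustration, not real x86 machine code), and the mapping of the name ECX to the identifier 001 is an assumption made for this sketch.

```python
# Toy encoding taken from the simplified examples above; not real x86 machine code.
OPCODES = {"MOV": "00001", "ADD": "00010", "SUB": "00011"}
REGISTERS = {"EAX": "000", "ECX": "001"}  # ECX -> 001 is an assumption of this sketch


def assemble_line(line):
    """Translate one assembly line into the toy binary format."""
    mnemonic, operands = line.split(maxsplit=1)
    dest, src = [part.strip() for part in operands.split(",")]
    if src in REGISTERS:
        src_bits = REGISTERS[src]              # register operand: 3 bits
    else:
        src_bits = format(int(src, 0), "08b")  # immediate operand: 8 bits
    return f"{OPCODES[mnemonic]} {REGISTERS[dest]} {src_bits}"


def assemble(source):
    """Assemble a whole program, one line at a time."""
    return [assemble_line(line) for line in source.strip().splitlines()]


program = """
MOV EAX, 7
MOV ECX, 4
ADD EAX, ECX
MOV ECX, 2
SUB EAX, ECX
"""
for word in assemble(program):
    print(word)
```

Each mnemonic and register name is looked up in a table and replaced with its bit pattern, which is why assembling is so much simpler than compiling: there is an almost one-to-one mapping between a line of source code and a machine instruction.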

Interpreted Languages and the Interpreter

Every program has a translating phase, and an execution phase. In interpreted languages, these two phases are intertwined - instructions written in a high-level programming language are directly executed without being previously converted to object code or machine code.

Both of the phases are done by an interpreter - a language processor that translates a single statement (line of code), executes it immediately and then moves on to the next line. If faced with an error, an interpreter terminates the translating process at that line and displays an error. It cannot move on to the next line and execute it unless the previous error is removed.
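The behavior described above can be sketched with a tiny interpreter in Python. The four-command language it understands (set, add, sub, print) is hypothetical and was invented for this illustration. Note how the statements before the faulty line have already been executed by the time the error is reported.

```python
def interpret(source):
    """Translate and execute a toy language one statement at a time."""
    variables, output = {}, []
    for number, line in enumerate(source.strip().splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        command, args = parts[0], parts[1:]
        if command == "set":
            variables[args[0]] = int(args[1])
        elif command == "add":
            variables[args[0]] += int(args[1])
        elif command == "sub":
            variables[args[0]] -= int(args[1])
        elif command == "print":
            output.append(variables[args[0]])
        else:
            # Stop at the first bad statement; later lines are never reached.
            output.append(f"error on line {number}: unknown command {command!r}")
            break
    return output


program = """
set x 7
add x 4
sub x 2
print x
jump x
print x
"""
print(interpret(program))  # [9, "error on line 5: unknown command 'jump'"]
```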

Interpreters have been used since 1952, and their job was to ease programming within the limitations of computers at the time (for example, there was significantly less storage space in the first generation of computers than there is now). The first high-level interpreted language was Lisp, first implemented in 1958 on an IBM 704 computer.

The most common interpreted programming languages nowadays are Python, Perl, and Ruby.

Compiled Languages and the Compiler

Unlike in interpreted programming languages, the translating phase and the execution phase in compiled programming languages are completely separated, and the translation is done by a compiler.

The compiler is a language processor that reads the complete source code written in a high-level language and translates it into an equivalent object code as a whole. Typically, this object code is stored in a file. If there are any errors in the source code, the compiler specifies them at the end of compilation, along with the lines in which the errors were found. After their removal, the source code can be recompiled.
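As a rough sketch of this workflow, the Python example below "compiles" a hypothetical four-command toy language (set, add, sub, print, invented for this illustration) into a list of instruction tuples that stands in for object code. The whole source is checked first and every error is reported together; only an error-free program produces object code, which can then be run without the source.

```python
VALID_COMMANDS = {"set": 2, "add": 2, "sub": 2, "print": 1}  # command -> argument count


def compile_program(source):
    """Translate the whole program first; report every error at the end."""
    object_code, errors = [], []
    for number, line in enumerate(source.strip().splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        command, args = parts[0], parts[1:]
        if VALID_COMMANDS.get(command) != len(args):
            errors.append(f"line {number}: invalid statement {line.strip()!r}")
        else:
            object_code.append((command, *args))
    if errors:
        raise SyntaxError("\n".join(errors))
    return object_code


def run(object_code):
    """Execute previously compiled object code; the source is no longer needed."""
    variables, output = {}, []
    for command, *args in object_code:
        if command == "set":
            variables[args[0]] = int(args[1])
        elif command == "add":
            variables[args[0]] += int(args[1])
        elif command == "sub":
            variables[args[0]] -= int(args[1])
        else:  # "print"
            output.append(variables[args[0]])
    return output


compiled = compile_program("set x 7\nadd x 4\nsub x 2\nprint x")
print(run(compiled))  # [9]
```

Calling compile_program on a source with several mistakes raises a single SyntaxError listing all of the offending lines, and nothing is executed until they are fixed and the program is recompiled.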

Languages that give the programmer close control over hardware components like memory or the CPU are usually compiled, since they are translated directly into machine code.

The first high-level compiled programming language was FORTRAN, made in 1957 by a team led by John Backus at IBM.

The most common compiled languages nowadays are C++, Rust, and Haskell.

Bytecode Languages

Bytecode languages, also called "portable code" or "p-code" languages, fall under the categories of both interpreted and compiled languages, since they make use of both compilation and interpretation when translating and executing the code.

Bytecode is, simply put, a program code that has been compiled from source code into low-level code designed for a software interpreter. After the compilation (from source code to bytecode), it can be compiled further into machine code, which is recognized by the CPU, or it can be executed by a virtual machine, which then acts as the interpreter.
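This two-step process can be observed in CPython, the reference implementation of Python, which compiles source code into bytecode and then executes that bytecode on its own virtual machine. The sketch below uses the built-in compile and exec functions together with the standard dis module; the exact instruction names it lists differ between Python versions, so none are shown here.

```python
import dis

source = "result = 7 + 4 - 2"

# Step 1: compile the source text into a code object that holds bytecode.
code = compile(source, "<example>", "exec")
raw_bytecode = code.co_code  # the bytecode itself, as raw bytes
opnames = [instruction.opname for instruction in dis.get_instructions(code)]

# Step 2: let the virtual machine execute the compiled bytecode.
namespace = {}
exec(code, namespace)

print(len(raw_bytecode), "bytes of bytecode")
print(opnames)              # instruction names vary by Python version
print(namespace["result"])  # 9
```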

The bytecode is universal and can be transferred in its compiled state to other devices (with all of the advantages of compiled code). The virtual machine on each device then converts it into the specific machine code for that device. In other words, you can compile the source code once and run it everywhere, provided the device has this extra layer for converting the bytecode into machine code.

The most well-known virtual machine for bytecode interpretation is the Java Virtual Machine (JVM), which is so common that several languages have implementations built to run on it.


When a program in a bytecode language is first run, there is a delay while the source code is compiled into bytecode, but execution is significantly faster than in standard interpreted languages (since the bytecode is optimized for the interpreter).

One of the biggest advantages of bytecode languages is their platform independence, which used to be typical only for interpreted languages, while their programs still execute much faster than those written in regular interpreted languages.

Another thing worthy of mention here is just-in-time (JIT) compilation. As opposed to ahead-of-time (AOT) compilation, the code is compiled while it's running. This improves execution speed compared to plain interpretation, combining the performance benefits of compilation with the flexibility of interpretation.

Then again, dynamic compilation doesn't always have to be better/faster than static compilation - it mostly depends on which kind of project you're working on.

The flagship languages that are compiled into bytecode are Java and C#, alongside JVM languages such as Clojure, Groovy, Kotlin, and Scala.

Advantages and Disadvantages: Compiled vs Interpreted

Performance

Since a compiler translates the entire source code into executable machine code for the CPU, analyzing the source takes a large amount of time, but once the analysis and compilation are finished, the overall execution is much faster.

On the other hand, the interpreter translates the source code line by line, each one being executed as it gets translated, which leads to faster analysis of the source code, but the execution is significantly slower.

Debugging

Debugging is much easier in interpreted programming languages because the code is translated until the error is met, so we know exactly where it is, and it is easier to fix.

Contrarily, debugging in a compiled language is much more tedious. If a program is written in a compiled language, it has to be compiled before it can run, which is an additional step.

Moreover, the compiler generates its error messages after it has scanned the source code as a whole, so the error could be anywhere in the program. Even if the line of an error is specified, after changing the source code and fixing it, we need to recompile it and only then can the improved version be executed. This may not seem like an issue - and it isn't with small programs.

Please keep in mind that massive projects can take tens of minutes and some even hours to compile. Fortunately, many errors can be noticed before compilation with the help of IDEs, but not all of them.

Source Code vs Object Code

For interpreted programming languages, the source code is necessary for the execution. This means that the source code of the application is exposed to the user - like JavaScript is exposed in the browser.

Allowing users to fully read the source code may allow malicious users to manipulate and find loopholes in the logic. This can, to an extent, be limited using code obfuscation, but it's still a lot more easily accessible than compiled code.

On the other hand, once a program written in a compiled programming language is compiled into object code, it can be executed any number of times, and the source code is not needed anymore.

This is why, when passing the program to a user, it is enough to just send them the object code, and not the source code, usually in the form of an .exe file on Windows.

Interpreted code can also be more susceptible to code injection attacks, and since many interpreted languages are dynamically typed, type errors that a compiler would catch ahead of time only show up as exceptions at run time.
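To illustrate the kind of run-time error meant here, consider the short Python sketch below. The describe function is hypothetical and contains a deliberate bug: it works for small inputs, and the type error only appears once the faulty line is actually executed.

```python
def describe(count):
    """Return a label for a count; contains a deliberate type bug."""
    if count > 100:
        # Bug: joining a string and an integer; only fails when this line runs.
        return "many: " + count
    return "few"


print(describe(5))  # few

try:
    describe(500)
except TypeError as error:
    print("failed at run time:", error)
```

A statically type-checked compiled language would typically reject the faulty line during compilation, before the program ever runs.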

Conclusion

There is no "better" way of translating source code, and both compiled and interpreted programming languages have their advantages and disadvantages, as mentioned above.

In a lot of cases, the line between "compiled" and "interpreted" isn't clearly defined when it comes to more modern programming languages. Really, there's nothing stopping you from writing a compiler for an interpreted language, for example.

About Mila Lukić
Serbia