Summary
- Syntax and semantics: the two sides of language specification
- Expressions and statements: the two main components of a program
- UB: Underspecified behavior, and why we want to avoid it in our specifications
Syntax and semantics
When we think about how to describe and implement programming languages, it’s useful to make a distinction between the rules about how code looks versus what it actually means. The former is called “syntax”, and the latter is “semantics”.
(Note that this is kind of different from the way we use the word “semantics” in everyday language, to mean something like trivial differences. In the context of programming languages, semantics means the meaning behind what the code is saying.)
To start thinking about this distinction, consider the following code:
List<Integer> x;
You know what this code does, right? It declares x as a variable of
type List<Integer>: the generic container class List, specialized so
that each element of the list is an Integer instance.
But I didn’t tell you what language we were using! In Java it works like we just said, but what about Python? Well, Python would interpret this as a chained sequence of two comparisons, returning the same thing as if we wrote
(List < Integer) and (Integer > x)
If List, Integer, and x all happened to be variables previously
declared with numeric values, this would work perfectly fine in Python
and produce a True or False value!
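To see this for yourself, here is a small sketch in Python. The numeric values given to List, Integer, and x are made up purely for illustration. The first part uses Python's standard ast module to show how Python parses the text, and the second part checks that the chained comparison agrees with the expanded form.

```python
import ast

# Ask Python how it parses the exact text of the Java-looking declaration.
tree = ast.parse("List<Integer> x;")
node = tree.body[0].value
print(type(node).__name__)                     # Compare
print([type(op).__name__ for op in node.ops])  # ['Lt', 'Gt']

# Hypothetical numeric values, chosen only so the comparison can run.
List, Integer, x = 1, 5, 3

chained = List<Integer> x                      # same tokens as the Java line
expanded = (List < Integer) and (Integer > x)
print(chained, expanded)                       # True True
```

One small difference between the two forms: in the chained version, the middle operand (Integer) is only evaluated once.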
The point is, this one code snippet looks exactly the same in both languages, but it has very different semantics in each. Being able to approach a piece of code like this, forget what you think the programmer intends, and focus on what it means according to the rules of a particular language, is something you will get better at this semester.
Expressions and Statements
As we start to think about how languages are constructed, it’s useful to think of code as a series of (often nested) expressions and statements.
The basic concept should be familiar to you from all the way back in SI210:
- An expression is a piece of code that evaluates to a single value of some type. Expressions can usually be assigned to variables.
- A statement is a piece of code that tells the program to “do something”. It may have some side effects like writing to the screen, but the statement itself doesn’t have a type and doesn’t produce a single value.
Here are some examples of some expressions and statements in Java:
- Expressions
  - 1234 (integer literal)
  - x < y || x == z (comparison and boolean operations, results in a boolean value)
  - x.foo(y.bar, "something") (function call, returns whatever the foo() method in x’s class returns)
- Statements
  - String fuzzy = "wombat"; (variable declaration - there is a type involved, but this entire chunk of code has no value or type; you couldn’t assign this whole thing to another variable)
  - if (sky.equals("blue")) go("outside"); (conditional statement - it does something but doesn’t evaluate to a single value)
  - class Taco { String protein; String[] toppings; boolean birria_dipped; } (The entire class declaration is a single compound statement.)
This is Java, but every programming language you have seen can similarly be thought of in terms of syntax and semantics, expressions and statements, at a high level.
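For instance, Python draws the same line between expressions and statements. A rough way to see it: the built-in compile() function in "eval" mode only accepts a single expression, so we can use it as a quick test. The helper name is_expression is made up for this sketch, and the sample strings are loose Python translations of the Java examples above.

```python
def is_expression(source):
    """Return True if Python can compile `source` as a single expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False

# Expressions: each of these would evaluate to a single value.
print(is_expression("1234"))                             # True
print(is_expression("x < y or x == z"))                  # True
print(is_expression('x.foo(y.bar, "something")'))        # True

# Statements: these do something, but have no value of their own.
print(is_expression('fuzzy = "wombat"'))                 # False
print(is_expression('if sky == "blue": go("outside")'))  # False
```

Note that compile() only checks syntax here. Nothing is actually run, so it doesn't matter that names like x or sky are never defined.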
Crucially, when you are defining a new programming language, it’s helpful to break things up in these terms. What kind of expressions can your language have? What does each kind of expression look like (syntax), and what does it mean (semantics)? Similarly for statements - which often contain sub-expressions or sometimes even sub-statements.
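To make that concrete, here is a minimal sketch of a made-up toy language, written in Python. Everything in it (the class names Num, Var, Less, Assign, Print and the functions evaluate and execute) is invented for illustration. The classes describe what each construct looks like, and the two functions say what each one means.

```python
from dataclasses import dataclass

# --- Expressions: each kind evaluates to a single value ---
@dataclass
class Num:          # integer literal, e.g. 3
    value: int

@dataclass
class Var:          # variable reference, e.g. x
    name: str

@dataclass
class Less:         # comparison, e.g. x < 5 (contains two sub-expressions)
    left: object
    right: object

# --- Statements: each kind "does something" but has no value ---
@dataclass
class Assign:       # e.g. x = 3 (contains a sub-expression)
    name: str
    expr: object

@dataclass
class Print:        # e.g. print x < 5 (contains a sub-expression)
    expr: object

def evaluate(expr, env):
    """Semantics of expressions: return the value of `expr` given variables `env`."""
    if isinstance(expr, Num):
        return expr.value
    if isinstance(expr, Var):
        return env[expr.name]
    if isinstance(expr, Less):
        return evaluate(expr.left, env) < evaluate(expr.right, env)
    raise TypeError(f"unknown expression: {expr!r}")

def execute(stmt, env, output):
    """Semantics of statements: update `env` or append a line to `output`."""
    if isinstance(stmt, Assign):
        env[stmt.name] = evaluate(stmt.expr, env)
    elif isinstance(stmt, Print):
        output.append(str(evaluate(stmt.expr, env)))
    else:
        raise TypeError(f"unknown statement: {stmt!r}")

# A two-statement program:  x = 3;  print x < 5
program = [Assign("x", Num(3)), Print(Less(Var("x"), Num(5)))]
env, output = {}, []
for stmt in program:
    execute(stmt, env, output)
print(output)  # ['True']
```

Notice that evaluate returns a value while execute returns nothing and only has side effects, which is exactly the expression versus statement split.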
UB
In thinking about programming language definitions, sometimes there are program fragments whose semantics are unclear. Any time the language specification tells us that a piece of code is valid, but doesn’t say exactly what should happen when executing it, we will call that underspecified behavior, or UB.
(Note: In the context of C and C++ especially, there are three closely related concepts: undefined behavior, unspecified behavior, and implementation-defined behavior. In this class we will use UB to refer to all three, and I will say “underspecified behavior”.)
Most modern programming languages do their best to not have any UB. That is, the language spec tries hard to make sure there is only one way for any program to work — and if it’s not supposed to work, that’s specified too.
Here’s a classic and very simple example in C:
int x[2] = {10, 20};
printf("%d\n", x[2]);
Hopefully you see the programmer’s mistake - index 2 is out of bounds on the second line. We all agree, the programmer made an error here.
But the question for us is, what is supposed to happen when we run this program? The answer is simple: we don’t know! If you run this, you might see a zero printed, or maybe some other random number, or (very very unlikely) you might even get a seg fault or some other error.
That difference in behavior when the program is executed was an intentional decision by the C designers. Basically they said, we don’t want to waste a clock cycle checking array bounds, so if you go out of bounds in memory access, literally anything could happen.
Of course, you could make the same mistake in Python:
x = [10, 20]
print(x[2])
The difference is that, in Python (as with Java and most modern
languages), the language specification requires that any correct
interpreter throws an exception (specifically an IndexError) when this
happens.
So notice, it’s still a programmer error, but it’s not up to chance or the interpreter design what is allowed to happen.
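Because the behavior is specified, a program can even rely on it. Here is a short sketch that catches the exception. The variable name outcome is just for illustration.

```python
x = [10, 20]

try:
    print(x[2])              # index 2 is out of bounds
    outcome = "no error"
except IndexError as err:    # the error the text says must be raised
    outcome = f"IndexError: {err}"

print(outcome)  # IndexError: list index out of range
```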
Here’s another more subtle one, also in C:
int x = 10;
int y = ++x - ++x;
Remember that the ++ operator adds one to the variable before
producing the updated value. From the looks of it, you might assume we
would get y == -1 at the end. But not with every compiler! Some C
compilers evaluate the right-hand operand of the subtraction before the
left-hand operand, so you might get 1. It’s even possible that both
increments happen before either value is read, so you would get 0. The
C language doesn’t mandate which part gets evaluated first (formally,
modifying x twice in one expression like this is undefined behavior),
so it’s up to each C compiler to decide how it should happen.
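The three outcomes can be modeled in a few lines of Python. This is only a sketch: the function names left_first, right_first, and both_first are made up, and each one just imitates one choice a C compiler might make. Python has no ++ operator, so the last part uses a hypothetical pre_increment() helper instead. Python does specify that operands are evaluated left to right, so there the answer is always the same.

```python
def left_first(x):
    """Model: evaluate the left ++x, then the right ++x."""
    x += 1
    left = x
    x += 1
    right = x
    return left - right

def right_first(x):
    """Model: evaluate the right ++x, then the left ++x."""
    x += 1
    right = x
    x += 1
    left = x
    return left - right

def both_first(x):
    """Model: apply both increments, then read x for both operands."""
    x += 2
    return x - x

print(left_first(10), right_first(10), both_first(10))  # -1 1 0

# In Python itself, evaluation order is specified as left to right.
counter = {"x": 10}

def pre_increment():
    """Hypothetical stand-in for ++x: add one, then return the new value."""
    counter["x"] += 1
    return counter["x"]

y = pre_increment() - pre_increment()
print(y)  # -1
```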
There were reasons to have UB in the C language spec: computers were very limited and they needed both the compilers and the compiled code to run very quickly. But for us, UB is always a bad thing. It leads to ambiguity and programmer errors, and it is an indication of a poorly thought-out language spec.