Summary
- How to approach the first step of reading a source code file (without using special-purpose libraries)
- Reading character by character using Reader.read()
- Reading by regex using Scanner.findInLine() or Scanner.findWithinHorizon()
Reading source code
The first thing any interpreter or compiler has to do is read in the raw source code and recognize the syntax; this process is called syntax analysis.
We will study syntax analysis formally in a few classes. For now, we will focus on its very first step, called scanning, in which the source code is broken down into individual pieces such as characters or “tokens”.
The task
We will work together to write a Java program that does one
seemingly-simple task: read in a .java source code file, and identify
all of the string literals in it.
Two sample solutions are below, but the really important part is the journey to get there, which we took together during class.
First approach: character by character
We can use the .read() method in java.io.Reader
to get a single character from the input stream.
This method actually returns an int rather than a char, so that it can
return -1 to indicate end-of-file. We have to check for -1 first, before
casting the result to a char.
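To see why the check has to come before the cast, here is a small sketch. It reads from a StringReader instead of a file so that it runs on its own; the class name ReadDemo and the helper readAll are made up for this illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadDemo {
    // Reads every character from the Reader, stopping when read() returns -1
    static String readAll(Reader source) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int raw = source.read(); raw != -1; raw = source.read()) {
            // safe to cast here: we already know raw is not -1
            sb.append((char) raw);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(new StringReader("hi!"))); // prints hi!

        // If we cast first, the EOF signal is lost: (char)-1 is the
        // character '\uffff', which is no longer equal to -1
        char bad = (char) -1;
        System.out.println((int) bad); // prints 65535
    }
}
```

If the loop compared a char against -1, the comparison would never succeed and the loop would never notice the end of the file.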
The idea is to just read until we see a " character, and then take the
literal contents as whatever is read from that point until the next "
character.
Here is the code:
import java.io.Reader;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;
import java.util.ArrayList;

public class LiteralFinder1 {
    static List<String> getLiterals(String fname) throws IOException {
        List<String> literals = new ArrayList<>();
        // try-with-resources closes the file when we are done
        try (Reader source = new FileReader(fname)) {
            // note, -1 is returned at EOF
            for (int gotRaw = source.read(); gotRaw != -1; gotRaw = source.read()) {
                char got = (char)gotRaw;
                if (got == '"') {
                    // found one
                    StringBuilder sb = new StringBuilder();
                    // read inner characters until the matching "
                    while (true) {
                        int innerRaw = source.read();
                        // an assert is skipped unless java is run with -ea, so
                        // check for EOF explicitly to avoid looping forever
                        if (innerRaw == -1) {
                            throw new IOException("unterminated string literal in " + fname);
                        }
                        char inner = (char)innerRaw;
                        if (inner == '"') break;
                        sb.append(inner);
                    }
                    literals.add(sb.toString());
                }
            }
        }
        return literals;
    }

    public static void main(String[] args) throws IOException {
        String fname = args[0];
        System.out.format("literals in %s:\n", fname);
        for (String literal : getLiterals(fname)) {
            System.out.format(" |%s|\n", literal);
        }
    }
}
Second approach: using regular expressions
You may have realized that there is actually a perfect tool for this - a regex! You learned about these in your CS Theory class (SI342), but might not have had lots of practical use for them yet.
We want to construct a regular expression for a single string literal. A first attempt might be:
".*"
That is, a double-quote character, followed by any number of characters
(.*), followed by another double-quote.
But this doesn’t quite work. For example, if we have a line of code in Java like
String x = "one" + "two";
then that regex will capture BOTH strings together "one" + "two" as a
single string literal, instead of two. No good!
The best way to fix this is to use a character class instead of the
universal matching dot . symbol. The character class [^"] in a regex
means any single character except a quote.
Putting this together, here is our regex for a string literal:
"[^"]*"
To incorporate this into a Java program, we just need to make
two tweaks. First, because this will actually go in a literal string in
our own Java program, we have to escape every double-quote as \".
Second, we add a capturing group () around the inside of this, so that
we can extract just the actual characters in the string literal without
the quote signs.
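As a quick check of the reasoning so far, the sketch below tries all three versions of the regex on the example line, using java.util.regex directly. The class name RegexDemo and the helper matches are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    // Returns the text of the given capture group for every match of regex in text.
    // Group 0 always means the whole match.
    static List<String> matches(String regex, int group, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group(group));
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "String x = \"one\" + \"two\";";

        // ".*" : the greedy .* runs all the way to the LAST quote on the line
        System.out.println(matches("\".*\"", 0, line));       // ["one" + "two"]

        // "[^"]*" : stops at the very next quote, so we get two matches
        System.out.println(matches("\"[^\"]*\"", 0, line));   // ["one", "two"]

        // "([^"]*)" : group 1 is just the contents, without the quotes
        System.out.println(matches("\"([^\"]*)\"", 1, line)); // [one, two]
    }
}
```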
Finally, we can incorporate this into a full Java program to identify
literal strings in Java source code. Now instead of using read() for
single-character input, we will use the
findInLine() method of java.util.Scanner.
Here is the complete code:
import java.io.File;
import java.io.IOException;
import java.util.Scanner;
import java.util.List;
import java.util.ArrayList;

public class LiteralFinder2 {
    static List<String> getLiterals(String fname) throws IOException {
        List<String> literals = new ArrayList<>();
        // try-with-resources closes the file when we are done
        try (Scanner scan = new Scanner(new File(fname))) {
            while (true) {
                String found = scan.findInLine("\"([^\"]*)\"");
                if (found != null) {
                    // we found a literal!
                    // use .group(1) to get just the part without the quotes
                    literals.add(scan.match().group(1));
                }
                else if (scan.hasNextLine()) {
                    // no more literals on this line, so move to next line
                    scan.nextLine();
                }
                else {
                    // end of file
                    break;
                }
            }
        }
        return literals;
    }

    public static void main(String[] args) throws IOException {
        String fname = args[0];
        System.out.format("literals in %s:\n", fname);
        for (String literal : getLiterals(fname)) {
            System.out.format(" |%s|\n", literal);
        }
    }
}
Notice that the code is considerably simpler than before, because our sophisticated regex is doing a lot of the “heavy lifting” for us.
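The summary also mentions Scanner.findWithinHorizon(), which searches ahead without stopping at line boundaries. Here is a sketch of how the same loop might look with that method. It is a variation for comparison, not the program we wrote in class, and it scans a String instead of a file so that it runs on its own; the class name HorizonDemo is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class HorizonDemo {
    static List<String> getLiterals(Scanner scan) {
        List<String> literals = new ArrayList<>();
        // A horizon of 0 means "no limit": keep searching through all of
        // the remaining input, across line breaks, until there is no match
        while (scan.findWithinHorizon("\"([^\"]*)\"", 0) != null) {
            literals.add(scan.match().group(1));
        }
        return literals;
    }

    public static void main(String[] args) {
        String source = "String x = \"one\" + \"two\";\nString y = \"three\";\n";
        System.out.println(getLiterals(new Scanner(source))); // [one, two, three]
    }
}
```

The loop no longer has to advance line by line, but there is a trade-off: [^"] also matches a newline, so with an unlimited horizon a stray quote on one line can pair up with a quote on a later line. findInLine() never looks past the end of the current line.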
Even better?
If you consider even a simple Java program like this one:
class Simple {
    void foo() {
        String x = "some string";
        x = "another" + "yet another";
        // "inside a single-line comment"
        x = /* "multi-line comment" */ "outside comment";
        x = "let's try an \"escape\" inside the literal";
        /*
            "multi-line comment on multiple lines"
        */
    }
}
you will find out that our programs above aren’t quite perfect yet. See if you can make them even more robust by:
- Accounting for escape sequences, in particular an escaped double-quote character within the string
- Ignoring quotes that show up in single-line or multi-line comments
Try it yourself! If you want to see how I did it:
- FullLiteralFinder1.java is a complete working solution that reads character-by-character
- FullLiteralFinder2.java accounts for everything using a big fancy regex
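As a hint for the first improvement only, here is one possible way to let the regex accept escape sequences. This is a sketch and not necessarily what FullLiteralFinder2.java does; it does nothing about comments, and the class name EscapeDemo and the helper literalsIn are made up for this illustration. The idea is that each character inside the literal is either an ordinary character (not a quote and not a backslash), or a backslash followed by any one character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeDemo {
    // The regex, before escaping it for Java:  "((?:[^"\\]|\\.)*)"
    //   [^"\\]  any character other than a quote or a backslash
    //   \\.     a backslash followed by any one character (an escape sequence)
    //   (?: )   groups the two choices without creating a capture group
    static final Pattern LITERAL = Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Returns the contents of every string literal found in the text
    static List<String> literalsIn(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = LITERAL.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // the source line:  x = "let's try an \"escape\" inside the literal";
        String line = "x = \"let's try an \\\"escape\\\" inside the literal\";";
        // prints [let's try an \"escape\" inside the literal]
        System.out.println(literalsIn(line));
    }
}
```

The captured text still contains the backslashes exactly as they appear in the source file. Turning \" back into a plain quote would be a separate step.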