Summary

How to approach the first step of reading a source code file (without using special-purpose libraries)
Reading character by character using Reader.read()
Reading by regex using Scanner.findInLine() or Scanner.findWithinHorizon()

Reading source code

The first thing any interpreter or compiler has to do is read in the raw source code and recognize the syntax; this process is called syntax analysis.

We will study syntax analysis formally in a few classes. For now we will focus on its very first step, called scanning, in which the source code itself is broken down into individual pieces such as characters or “tokens”.

The task

We will work together to write a Java program that does one seemingly simple task: read in a .java source code file, and identify all of the string literals in it.

Two sample solutions are below, but the really important part is the journey to get there, which we worked through together during class.

First approach: character by character

We can use the .read() method in java.io.Reader to get a single character from the input stream.

This method actually returns an int rather than a char, so that it can return -1 to indicate end-of-file. We have to check for -1 first, and only then cast the result to a char.
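As a small illustration of why the check has to come before the cast, here is a sketch that reads from a java.io.StringReader instead of a file. The class name ReadDemo and the helper readAll are made up for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReadDemo {
    // Reads every character from the reader, checking for -1 before casting.
    static String readAll(Reader source) throws IOException {
        StringBuilder sb = new StringBuilder();
        int raw;
        while ((raw = source.read()) != -1) {
            sb.append((char) raw);  // safe: raw holds a real character here
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(new StringReader("hi!")));  // prints hi!
        // casting the EOF marker without checking gives a bogus character
        System.out.println((int) (char) -1);                   // prints 65535
    }
}
```

If we cast first, the -1 turns into the character with code 65535, and the end-of-file signal is lost.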

The idea is to just read until we see a " character, and then take the literal contents as whatever is read from that point until the next " character.

Here is the code:

import java.io.Reader;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;
import java.util.ArrayList;

public class LiteralFinder1 {
    static List<String> getLiterals(String fname) throws IOException {
        List<String> literals = new ArrayList<>();

        // try-with-resources closes the file even if an exception is thrown
        try (Reader source = new FileReader(fname)) {
            // note, -1 is returned at EOF
            for (int gotRaw = source.read(); gotRaw != -1; gotRaw = source.read()) {
                char got = (char)gotRaw;
                if (got == '"') {
                    // found one
                    StringBuilder sb = new StringBuilder();
                    // read inner characters until the matching "
                    while (true) {
                        int innerRaw = source.read();
                        // explicit check: assert is disabled by default, and
                        // without it an unmatched quote would loop forever
                        if (innerRaw == -1) {
                            throw new IOException("EOF inside a string literal");
                        }
                        char inner = (char)innerRaw;
                        if (inner == '"') break;
                        sb.append(inner);
                    }
                    literals.add(sb.toString());
                }
            }
        }

        return literals;
    }


    public static void main(String[] args) throws IOException {
        String fname = args[0];
        System.out.format("literals in %s:\n", fname);
        for (String literal : getLiterals(fname)) {
            System.out.format("  |%s|\n", literal);
        }
    }
}
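To experiment with this idea without creating a file, one option is to run the same kind of loop over a java.io.StringReader. This is only a sketch: the class name LiteralFinderDemo and the helper literalsIn are made up here, and unlike the program above it simply keeps whatever it has collected if the input ends inside a literal.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class LiteralFinderDemo {
    // Quote-to-quote scanning over any Reader (hypothetical helper).
    static List<String> literalsIn(Reader source) throws IOException {
        List<String> literals = new ArrayList<>();
        int raw;
        while ((raw = source.read()) != -1) {
            if ((char) raw != '"') continue;  // skip until an opening quote
            StringBuilder sb = new StringBuilder();
            // collect characters up to the closing quote (or EOF, if unterminated)
            while ((raw = source.read()) != -1 && (char) raw != '"') {
                sb.append((char) raw);
            }
            literals.add(sb.toString());
        }
        return literals;
    }

    public static void main(String[] args) throws IOException {
        String code = "String x = \"one\" + \"two\";";
        System.out.println(literalsIn(new StringReader(code)));  // prints [one, two]
    }
}
```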

Second approach: using regular expressions

You may have realized that there is actually a perfect tool for this: a regex! You learned about these in your CS Theory class (SI342), but might not have had much practical use for them yet.

We want to construct a regular expression for a single string literal. A first attempt might be:

".*"

That is, a double-quote character, followed by any number of characters (.*), followed by another double-quote.

But this doesn’t quite work. For example, if we have a line of code in Java like

String x = "one" + "two";

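then that regex will capture BOTH strings together. The * operator is greedy, meaning it matches as much as it can, so the match runs from the first quote on the line all the way to the last one, and "one" + "two" comes back as a single string literal instead of two. No good!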
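We can see this for ourselves with java.util.regex. This is just a quick sketch; the class name GreedyDemo and the helper firstNaiveMatch are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns the first match of the naive regex ".*" in the line, or null.
    static String firstNaiveMatch(String line) {
        Matcher m = Pattern.compile("\".*\"").matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // prints "one" + "two" (a single match spanning both literals)
        System.out.println(firstNaiveMatch("String x = \"one\" + \"two\";"));
    }
}
```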

The best way to fix this is to use a character class instead of the dot (.), which matches any character at all. The character class [^"] in a regex means any single character except a double-quote.

Putting this together, here is our regex for a string literal:

"[^"]*"

To incorporate this into a Java program, we just need to make two tweaks. First, because this will actually go in a literal string in our own Java program, we have to escape every double-quote as \". Second, we add a capturing group () around the inside of this, so that we can extract just the actual characters in the string literal without the quote signs.
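Here is a sketch of what those two tweaks look like on a single line of text, using java.util.regex directly. The class name GroupDemo and the helper literalsIn are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // The regex "([^"]*)" written as a Java string: each " becomes \"
    static final Pattern LITERAL = Pattern.compile("\"([^\"]*)\"");

    static List<String> literalsIn(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = LITERAL.matcher(line);
        while (m.find()) {
            found.add(m.group(1));  // group(1) is the part between the quotes
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [one, two]
        System.out.println(literalsIn("String x = \"one\" + \"two\";"));
    }
}
```

Calling m.group() instead of m.group(1) would return the whole match, quotes included.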

Finally, we can incorporate this into a full Java program to identify literal strings in Java source code. Now instead of using read() for single-character input, we will use the findInLine() method of java.util.Scanner.

Here is the complete code:

import java.io.File;
import java.io.IOException;
import java.util.Scanner;
import java.util.List;
import java.util.ArrayList;
import java.util.regex.Pattern;

public class LiteralFinder2 {
    // a quote, then any non-quote characters (captured as group 1), then a quote
    static final Pattern LITERAL = Pattern.compile("\"([^\"]*)\"");

    static List<String> getLiterals(String fname) throws IOException {
        List<String> literals = new ArrayList<>();

        // try-with-resources closes the file when we are done
        try (Scanner scan = new Scanner(new File(fname))) {
            while (true) {
                String found = scan.findInLine(LITERAL);
                if (found != null) {
                    // we found a literal!
                    // use .group(1) to get just the part without the quotes
                    literals.add(scan.match().group(1));
                }
                else if (scan.hasNextLine()) {
                    // no more literals on this line, so move to next line
                    scan.nextLine();
                }
                else {
                    // end of file
                    break;
                }
            }
        }

        return literals;
    }

    public static void main(String[] args) throws IOException {
        String fname = args[0];
        System.out.format("literals in %s:\n", fname);
        for (String literal : getLiterals(fname)) {
            System.out.format("  |%s|\n", literal);
        }
    }
}

Notice that the matching logic is simpler than before: there is no inner loop over characters, because our regex is doing the “heavy lifting” of finding where each literal starts and ends.
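The summary also mentions Scanner.findWithinHorizon(). As a sketch of how that variant might look, the loop below searches the whole input at once instead of going line by line. To keep it self-contained it scans a String rather than a file; the class name HorizonDemo and the helper literalsIn are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class HorizonDemo {
    static final Pattern LITERAL = Pattern.compile("\"([^\"]*)\"");

    static List<String> literalsIn(String source) {
        List<String> literals = new ArrayList<>();
        try (Scanner scan = new Scanner(source)) {
            // a horizon of 0 means "search all remaining input", ignoring line breaks
            while (scan.findWithinHorizon(LITERAL, 0) != null) {
                literals.add(scan.match().group(1));
            }
        }
        return literals;
    }

    public static void main(String[] args) {
        String code = "x = \"one\";\ny = \"two\" + \"three\";\n";
        System.out.println(literalsIn(code));  // prints [one, two, three]
    }
}
```

One trade-off to keep in mind: because [^"] also matches a newline, this version lets a match run across several lines, which findInLine() never does.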

Even better?

If you consider even a simple Java program like this one:

class Simple {
    void foo() {
        String x = "some string";
        x = "another" + "yet another";
        // "inside a single-line comment"
        x = /* "multi-line comment" */ "outside comment";
        x = "let's try an \"escape\" inside the literal";
        /*
         "multi-line comment on multiple lines"
         */
    }
}

you will find out that our programs above aren’t quite perfect yet. See if you can make them even more robust by:

Accounting for escape sequences, in particular an escaped double-quote character within the string
Ignoring quotes that show up in single-line or multi-line comments

Try it yourself! If you want to see how I did it:

FullLiteralFinder1.java is a complete working solution that reads character-by-character
FullLiteralFinder2.java accounts for everything using a big fancy regex
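If you get stuck on the escape sequences, here is one possible starting point. It is only a sketch, not the regex from FullLiteralFinder2.java: it treats a backslash followed by any one character as part of the literal, and it does nothing about comments or about a character literal such as '"'. The class name EscapeDemo and the helper literalsIn are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {
    // Between the quotes, allow either an ordinary character (not " or \)
    // or a backslash followed by any one character.
    // As a plain regex:  "((?:[^"\\]|\\.)*)"
    static final Pattern LITERAL =
        Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"");

    static List<String> literalsIn(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = LITERAL.matcher(line);
        while (m.find()) {
            found.add(m.group(1));  // escapes are kept exactly as written in the source
        }
        return found;
    }

    public static void main(String[] args) {
        // the source line:  x = "an \"escape\" inside";
        String line = "x = \"an \\\"escape\\\" inside\";";
        for (String literal : literalsIn(line)) {
            System.out.format("  |%s|\n", literal);
        }
    }
}
```

On the sample line this finds one literal rather than splitting it at the escaped quotes.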