Highlighting Haskell

In March 2012, I discovered the Codea app, which allows coding on an iPad, and the Lua programming language on which it is built. That was the unlikely seed for my discovering the functional programming language Haskell in late 2013: a Codea discussion mentioned Project Euler and Haskell solutions to its problems, in particular a one-line solution to Problem 9 (the Pythagorean triplet that sums to 1,000).

I was attracted by Haskell’s apparently efficient, mathematical syntax. Later, I realised that there were parallels with Excel, the functional ‘programming language’ that I use every day (informed by the paper Improving the world’s most popular functional language: user-defined functions in Excel by Simon Peyton Jones, Margaret Burnett and Alan Blackwell).

Syntax highlighting

I wanted a (free) syntax highlighter plugin for WordPress version 4.9.4 that would work well with Haskell code. Ultimately, I settled on Crayon Syntax Highlighter by Aram Kocharyan.

I ruled out Code Prettify by Kaspars Dambis because it is based on the Google code-prettify library and I understood that library did not handle Haskell well.

Configuring the theme

My favourite code editor is Visual Studio (VS) Code and I use Justus Adam’s Haskell Syntax Highlighting extension with the default dark theme. I wanted the Crayon plugin to highlight Haskell code in the same way as the extension, to the extent that was possible. Syntax highlighting is the product of a language grammar, which assigns scope names to pieces of text, and a theme, which associates styles with those names. VS Code uses TextMate language grammars and themes. Crayon’s approach is simpler.

The first step was to analyse the VS Code language grammar for Haskell, found in the extension’s file haskell.tmLanguage. This is in the form of an XML-format property list, and I used the plist package to identify the names that it contained. The plist package makes use of the hxt package (the Haskell XML Toolbox).
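The name-gathering step can be sketched as follows. This is an illustration only: it uses Python's standard plistlib module in place of the Haskell plist package, the helper collect_scope_names is a hypothetical name, and the embedded sample is a made-up fragment rather than an extract from haskell.tmLanguage.

```python
import plistlib

# A made-up, minimal TextMate-style grammar (NOT taken from haskell.tmLanguage).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>scopeName</key><string>source.haskell</string>
  <key>patterns</key>
  <array>
    <dict>
      <key>match</key><string>--.*$</string>
      <key>name</key><string>comment.line.double-dash.haskell</string>
    </dict>
    <dict>
      <key>begin</key><string>{-</string>
      <key>end</key><string>-}</string>
      <key>name</key><string>comment.block.haskell</string>
      <key>beginCaptures</key>
      <dict>
        <key>0</key>
        <dict>
          <key>name</key><string>punctuation.definition.comment.haskell</string>
        </dict>
      </dict>
    </dict>
  </array>
</dict>
</plist>
"""

def collect_scope_names(node, found=None):
    """Return a sorted list of every 'name'/'contentName' string in the grammar."""
    top_level = found is None
    if top_level:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key in ("name", "contentName") and isinstance(value, str):
                found.add(value)
            else:
                collect_scope_names(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_scope_names(item, found)
    return sorted(found) if top_level else found

for scope in collect_scope_names(plistlib.loads(SAMPLE)):
    print(scope)
```

Walking the whole structure, rather than only the top-level patterns, matters because TextMate grammars nest rules inside captures and repositories.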

Initially, hxt could not parse the haskell.tmLanguage file. The XML specification forbids literal < characters in the text content of elements, but the file had three instances that had not been converted into &lt; entity references. It appears that VS Code is tolerant of some mis-specified XML files.
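This kind of failure is easy to reproduce with a strict XML parser. The sketch below uses Python's xml.etree.ElementTree purely for illustration (the parser discussed here is hxt), and the two one-line documents are made up.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """True if a strict parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

RAW = "<string>a < b</string>"         # literal '<' in character data
ESCAPED = "<string>a &lt; b</string>"  # the same content, correctly escaped

print(parses(RAW))                  # False
print(parses(ESCAPED))              # True
print(ET.fromstring(ESCAPED).text)  # a < b
```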

The default dark colour scheme used by VS Code is established in .json files dark_defaults, dark_vs and dark_plus. The following scheme colours are used with the Haskell Syntax Highlighting extension:

Colour | Code | Use | Plugin element
Black | #1E1E1E | Background | Not applicable
Green | #608B4E | Comments | COMMENT
Brown | #CE9178 | String literals, character literals | STRING
Tan | #D7BA7D | Escaped character literals | Not replicated
Light green | #B5CEA8 | Numerical literals | CONSTANT
Pink | #C586C0 | Control flow keywords (do, mdo, if, then, else, case and of) | STATEMENT
Blue | #569CD6 | Compiler pragmas, other keywords, types, ::, -> | PREPROCESSOR, RESERVED, TYPE
Light blue | #9CDCFE | Type variables in data or newtype declarations | Not replicated
Blue green | #4EC9B0 | Classes in deriving declarations | Not replicated
Yellow | #DCDCAA | Variable names in type signatures, exports and imports | ENTITY
White | #D4D4D4 | Default text | IDENTIFIER, OPERATOR, SYMBOL

Configuring the language grammar

The Crayon plugin’s grammar files for a user-defined language are located in that language’s subfolder of wp-content/uploads/crayon-syntax-highlighter/langs. If a user-defined language has the same folder name as a built-in language folder, the user-defined one takes precedence.

The plugin’s language grammar is a list of unique elements, each associated with a regular expression (regex). An element is either built in to the plugin theme or user-defined, and a user-defined element can be associated with a built-in one. The regex language is PHP’s Perl-compatible regex (PCRE).
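The division of labour between grammar and theme can be sketched in a few lines. This is a toy model, not Crayon's implementation: the four patterns are simplified assumptions, Python's re module stands in for PCRE, and the function name colourise is hypothetical. The colours are those in the table above.

```python
import re

# Toy grammar: element name -> regex (illustrative patterns, not Crayon's).
GRAMMAR = [
    ("COMMENT", r"--.*"),
    ("STRING", r'"(?:\\.|[^"\\])*"'),
    ("STATEMENT", r"\b(?:do|if|then|else|case|of)\b"),
    ("CONSTANT", r"\b\d+\b"),
]

# Toy theme: element name -> colour, using the scheme colours tabulated above.
THEME = {
    "COMMENT": "#608B4E",
    "STRING": "#CE9178",
    "STATEMENT": "#C586C0",
    "CONSTANT": "#B5CEA8",
}
DEFAULT = "#D4D4D4"

# One alternation with a named group per element; earlier elements win ties.
TOKEN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in GRAMMAR))

def colourise(line):
    """Split a line into (colour, text) pairs that cover it completely."""
    out, pos = [], 0
    for m in TOKEN.finditer(line):
        if m.start() > pos:                      # unmatched text keeps the default colour
            out.append((DEFAULT, line[pos:m.start()]))
        out.append((THEME[m.lastgroup], m.group()))
        pos = m.end()
    if pos < len(line):
        out.append((DEFAULT, line[pos:]))
    return out

for colour, text in colourise("if x then 1 else 2 -- pick"):
    print(colour, repr(text))
```

Changing how code looks then means editing only THEME, while changing what counts as a comment or keyword means editing only GRAMMAR.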

The grammar specification includes (undocumented) ‘modes’ CASE_INSENSITIVE, MULTI_LINE and SINGLE_LINE, which are all set by default. The modes can be set (ON, YES or 1) or unset (OFF, NO or 0) with lines such as the following (at least one space is required between the mode name and the =):

The grammar for Haskell provided with Crayon was as follows (for those elements which reference the default language grammar, I have added the default as a following comment):

reserved.txt listed keywords but also certain functions from Haskell’s Prelude module. type.txt listed certain types from the Prelude. The default list of operators in operator.txt was =&, <<<, >>>, <<, >>, <<=, =>>, !==, !=, ^=, *=, &=, %=, |=, /=, +=, -=, ===, ==, <>, ->, <=, >=, ++, --, &&, ||, ::, # (escaped), +, -, *, /, %, =, &, |, ^, ~, !, <, > and :. The SYMBOL element matched either HTML character entities or XML entity references (not relevant to Haskell) or the default list of characters in symbol.txt (equivalent to the character class [~`!@#$%()_{}[\]|\\:;,.?]), which was partly duplicative of the default operators.

I considered this supplied grammar to be lacking in certain respects, so I replaced it with the following:

reserved.txt lists only keywords (other than those in statement.txt). I did not want to treat functions in the Prelude differently from other functions. I used a new file, statement.txt, to list the control flow keywords (including \case) and the STATEMENT element to give them a distinct colour, so that:

became:

I distinguished Haskell compiler pragmas from other nested comments, so that:

became:

The ability of regex to detect context is very limited, so it is not possible to nest ‘nested’ comments, but I wanted the detection of the end of a nested comment to be more accurate, so that:

became:
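As a separate illustration of the nesting limitation (a sketch in Python's re module, standing in for PCRE; neither the pattern nor the hypothetical helper nested_comment_end comes from the plugin's grammar): a non-recursive pattern can only stop at the first -}, whereas finding the true end of a nested comment needs a depth counter.

```python
import re

# Non-greedy match from "{-" to the FIRST "-}" (DOTALL lets "." cross newlines).
BLOCK = re.compile(r"\{-.*?-\}", re.DOTALL)

def nested_comment_end(text, start=0):
    """Index just past the '-}' closing the comment that opens at `start`, or -1."""
    depth, i = 0, start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{-":
            depth, i = depth + 1, i + 2
        elif pair == "-}":
            depth, i = depth - 1, i + 2
            if depth == 0:
                return i
        else:
            i += 1
    return -1

SRC = "{- outer {- inner -} still outer -} x = 1"

print(BLOCK.match(SRC).group())       # {- outer {- inner -}
print(SRC[:nested_comment_end(SRC)])  # {- outer {- inner -} still outer -}
```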

I wanted (non-nested) comments to be more accurate, so that:

became:

The supplied grammar did not recognise that Haskell identifiers can include ', a non-word character in PCRE. This meant taking a different approach to character literals, so that:

became:
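The interaction between primes in identifiers and character literals can be illustrated as follows. This is a sketch under assumptions: Python's re module stands in for PCRE, the three patterns are simplified and are not the lines used in the grammar, and the sample strings are made up.

```python
import re

# Naive character literal: a quote, one character or an escape, a quote.
NAIVE_CHAR = re.compile(r"'(?:\\.[^']*|[^\\'])'")

# Same, but the opening quote must not follow a word character or a prime,
# so a prime inside an identifier cannot start a "literal".
CHAR = re.compile(r"(?<![\w'])'(?:\\.[^']*|[^\\'])'")

# Identifier that may contain primes; \b alone would stop before the prime.
IDENT = re.compile(r"\b[a-z_][\w']*")

LINE = "a'b' = 'c'"   # a'b' is a single (valid) Haskell identifier

print(NAIVE_CHAR.findall(LINE))           # ["'b'", "'c'"]
print(CHAR.findall(LINE))                 # ["'c'"]
print(IDENT.findall("foldl' f z' xs"))    # ["foldl'", 'f', "z'", 'xs']
```

The negative lookbehind is a single fixed-width character class, so it stays within the lookbehind restrictions that apply to PCRE.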

The limited ability to detect context meant that I was unable to implement a different colour for escaped character literals. Regex’s lookbehind ((?<= ) or (?<! )) and lookahead ((?= ) or (?! )) facilities provide a basis for detecting some contexts, but the lookbehind facility in PCRE requires the pattern that is looked back over to match a fixed, known number of characters.
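The fixed-width restriction can be seen directly in Python's re module, which imposes a similar (if anything slightly stricter) rule than PCRE. The sketch below is only an analogy to PCRE's behaviour, and the helper name compiles is hypothetical.

```python
import re

def compiles(pattern):
    """True if the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed width (seven characters): accepted.
print(compiles(r"(?<=import )\w+"))     # True

# Variable width (one or more spaces): rejected.
print(compiles(r"(?<=import\s+)\w+"))   # False

# Lookahead has no such restriction.
print(compiles(r"\w+(?=\s*::)"))        # True
```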

Certain keywords are, or can be, followed by the name of a module (module, import, safe, qualified and as). I used regex’s lookbehind to identify such keywords, before matching the module name. In the case of imports, the name of the module may be qualified by the package name. So:

became:
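As an independent sketch of the lookbehind technique (in Python's re module rather than PCRE, and not the regex used in the grammar): one fixed-width lookbehind per keyword, each assuming exactly one space after the keyword, followed by a dotted module name. Package-qualified imports are not handled here.

```python
import re

# One fixed-width lookbehind per keyword, then a (possibly dotted) module name.
MODULE = re.compile(
    r"(?:(?<=\bmodule )|(?<=\bimport )|(?<=\bsafe )|(?<=\bqualified )|(?<=\bas ))"
    r"[A-Z][\w']*(?:\.[A-Z][\w']*)*"
)

print(MODULE.findall("import qualified Data.Map as M"))  # ['Data.Map', 'M']
print(MODULE.findall("module Main where"))               # ['Main']
print(MODULE.findall("x = Just 1"))                      # []
```

Separate lookbehinds are used because the keywords differ in length, and each lookbehind on its own is fixed-width.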

I did not want to treat types in the Prelude differently from other types. I did not use type.txt but defined a regex for the form of a type. The context-detection limitation meant that data constructors and classes are formatted in the same way as types and I was unable to implement a distinct colour for type variables (formatted like other variables) or classes (formatted like types). For the same reason, -> is formatted in case ... of expressions the same way as in type signatures.

The ENTITY element is used for the variable in a type signature. Such a variable is the first thing on a line.
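One way such a rule could be expressed is with a line anchor plus a lookahead for ::. The sketch below is an assumption about the general shape of the rule, written in Python's re module, and not the grammar's actual ENTITY line; it ignores operator signatures such as (<+>) :: ... and comma-separated names.

```python
import re

# An identifier at the very start of a line, provided "::" follows it.
SIGNATURE_NAME = re.compile(r"^[a-z_][\w']*(?=\s*::)", re.MULTILINE)

SRC = "area :: Double -> Double\narea r = pi * r * r\n"

print(SIGNATURE_NAME.findall(SRC))  # ['area']
```

The second line also starts with area, but it is not followed by ::, so only the signature is picked up.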

I wanted to treat qualified identifiers the same way as others, so that:

became:

I wanted to implement GHC’s magic hash, so that:

became:
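For context, GHC's MagicHash extension allows # as a postfix modifier on identifiers (for example I# or x#). As a sketch only (Python's re module, simplified patterns that are not the grammar's own), allowing trailing hashes in the identifier pattern keeps such names in one piece:

```python
import re

PLAIN_IDENT = re.compile(r"\b[A-Za-z_][\w']*")
MAGIC_IDENT = re.compile(r"\b[A-Za-z_][\w']*#*")   # trailing hashes allowed

LINE = "I# (x# +# y#)"

print(PLAIN_IDENT.findall(LINE))  # ['I', 'x', 'y']
print(MAGIC_IDENT.findall(LINE))  # ['I#', 'x#', 'y#']
```

The operator +# is deliberately left to whatever pattern handles operators.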

Rather than use the plugin’s default lists of operators and symbols, I defined regexes that reflected the Haskell 2010 Language Report, including infix operators and data constructors, so that:

became:

I wanted the CONSTANT element to match numeric literals more precisely, including hexadecimal floating point literals, so that:
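As a separate illustration of what matching numeric literals more precisely can involve (a sketch in Python's re module, not the grammar's CONSTANT line): the alternatives below follow the Haskell 2010 literal forms plus GHC's hexadecimal floating point notation, with the longer forms tried first.

```python
import re

NUMBER = re.compile(r"""
    0[xX][0-9a-fA-F]+ (?: \.[0-9a-fA-F]+ (?:[pP][+-]?[0-9]+)? | [pP][+-]?[0-9]+ )  # hex float
  | 0[xX][0-9a-fA-F]+                                                              # hex integer
  | 0[oO][0-7]+                                                                    # octal integer
  | [0-9]+ (?: \.[0-9]+ (?:[eE][+-]?[0-9]+)? | [eE][+-]?[0-9]+ )                   # decimal float
  | [0-9]+                                                                         # decimal integer
""", re.VERBOSE)

for literal in ["0x1.8p3", "0xFF", "0o17", "1.5e-3", "2e10", "42"]:
    print(literal, bool(NUMBER.fullmatch(literal)))   # all True

print(NUMBER.findall("x = 0x1.8p3 + 42"))  # ['0x1.8p3', '42']
```

Order matters: if the hex integer alternative came first, 0x1.8p3 would be split after 0x1.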

A note on the regex for the SYMBOL element: when the Crayon plugin processes a regex, it replaces every ( that is not followed by ? with (?:, except for an escaped ( (strictly, any \( sequence). As a consequence, a ( inside a regex character class must be escaped, even though PCRE itself does not require that.
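The effect can be reproduced with a one-line substitution. This is a model of the behaviour described above, written in Python, and not Crayon's actual PHP code; the function name crayon_style_rewrite is hypothetical.

```python
import re

def crayon_style_rewrite(pattern):
    """Turn every '(' not preceded by '\\' and not followed by '?' into '(?:'."""
    return re.sub(r"(?<!\\)\((?!\?)", "(?:", pattern)

# Ordinary groups become non-capturing; escaped parens and (?...) are untouched.
print(crayon_style_rewrite(r"(a|b)\(c\)(?=d)"))   # (?:a|b)\(c\)(?=d)

# An unescaped '(' in a character class is rewritten too, changing the class.
print(crayon_style_rewrite("[()]"))               # [(?:)]

# Escaping it inside the class avoids the problem.
print(crayon_style_rewrite(r"[\()]"))             # [\()]
```

After the rewrite, the class [(?:)] also matches ? and :, which is why the escape is needed.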