22 Dec 2014

F# 3.1 for C# 6.0 developers: Creating escaped concat/split functions

by Marc Sigrist | posted in: Language Features | 0

The built-in .NET String.Join and String.Split methods do not provide a way for escaping the separator. If the separator is already contained in the input strings before joining, the strings cannot be reproduced by splitting:

open System
let inputStrings = [| "Hello, world."; "How are you?" |]
let separator = ", "
let joinedResult = String.Join(separator, inputStrings)
let reproducedStrings = joinedResult.Split([| separator |], StringSplitOptions.None)
printfn "%A" inputStrings
printfn "%A" reproducedStrings

The output of the above code is:

[|"Hello, world."; "How are you?"|]   // An array with 2 elements
[|"Hello"; "world."; "How are you?"|] // An array with 3 elements

To solve this problem, we are going to implement a pair of functions called concatEscape and splitUnescape. Besides roundtrip concatenation, the functions will provide some additional benefits.

For C# Developers...

In the example, I will explain a total of about 30 different F# 3.1 concepts that do not exist in C# 6.0. In case F# is completely new to you, here are some fundamentals:

F# uses indentation instead of curly braces { } to define scoped blocks of code.
F# (almost never) uses the return keyword. It is an expression-based language, and so the last executed line of a body of code implicitly specifies the return value.
It is not necessary to separate function parameters or arguments by commas and wrap them in parentheses. When you see something like let myFunction x y = x + Y - 1, myFunction is the name of the function, and x and y are the names of parameters. Everything after the = is the body of the function.
You will see almost no type declarations in the source code. Nevertheless, F# is a statically typed language—in some ways, even more "strongly" typed than C#. The difference to C# is that the F# compiler automatically infers the types for you, and also automatically makes them generic where appropriate. Again, this greatly simplifies the code, so that you can keep your focus on the business logic.

Step 1: Define the Requirements

For concatenation,

process the input strings as a stream (do not load all elements into memory at once)
use only a minimum number of escapes, if any. E.g., if the escape character pre-exists in an input string, it should itself be escaped by duplication only in cases where it precedes a real separator or an escaped pre-existing separator.
if the separator is empty, raise an ArgumentException
in other respects, behave like the built-int .NET Join method:
- If the separator is null, raise an ArgumentNullException.
- If the input sequence is null, raise an ArgumentNullException.
- If the input sequence is empty, treat it like one with a single empty-string element.
- If the input sequence contains null elements, treat them like empty strings.

For splitting,

return the output strings as a stream (do not load all elements into memory at once)
accurately reproduce the original input strings when given the same escape character and separator, except for the following cases (which work the same way as the built-in .NET Split method):
- if the original input sequence was empty, reproduce it as sequence with a single empty-string element
- if the original input sequence contained null strings, reproduce them as empty strings

Step 2: Define a Namespace

1  namespace MyCompany.Foundation.Core
2  
3  open System
4  open System.Text

Step 3: Implement Supporting Functionality

We are going to need some supporting functions to validate the concat/split function parameters:

 5  [<RequireQualifiedAccess>]
 6  module Assert = 
 7      let notNull argName arg = if arg = null then nullArg argName
 8      
 9      let notNullOrEmpty argName arg = 
10          notNull argName arg
11          if Seq.isEmpty arg then invalidArg argName "Value cannot be empty."

The above code uses the following features not present in C# 6.0:

Lines 5–6: Enforcement of qualified access to a module. In C# 6.0, it is now possible to open a static class with the using keyword. In F#, modules are similar to static classes, but allow for much simplified syntax. In addition, the access to modules can be declared with [<AutoOpen>] (open by default) or [<RequireQualifiedAccess>] (non-openable).
Line 9: Automatic type inference and generalization. argName is compiled as a string parameter, because it is used to call the notNull function, which itself calls the nullArg function, which requires a string argument. arg is compiled as a parameter of generic type, which must implement the sequence interface, because it is used to call Seq.isEmpty (Seq<> is the same as IEnumerable<>). Furthermore, arg is compiled as a type who can be null, because it is validated via the notNull function (types declared in F# can never be null, but types declared outside F#—such as System.String or System.Collections.Generic.IEnumerable<T>—can still be null).

Step 4: Declare the Token Type

To simplify the parsing, we will convert the input strings to token streams and then analyze/decompose the tokens streams via F#'s built-in pattern matching facilities.

12  type private Token = 
13      | Esc of Count:int
14      | Sep
15      | Val of Content:string

The above code uses the following features not present in C# 6.0:

Line 12: Private accessibility within namespace declaration groups. The Token type is only accessible by code who has been declared inside the namespace MyCompany.Foundation.Core in the same library.
Lines 12–15: Discriminated union type. An instance of Token symbolizes either a consecutive number of escapes (Esc), a separator (Sep), or anything else (Val). Later, when a Token instance is analyzed via pattern matching, the F# compiler will emit a warning if a case has been forgotten by the developer.

Step 5: Implement Tokenization

We define a Tokenizer module with just one function named create. The function takes the escape character and separator as parameters, validates them, and returns another function. The returned function has the escape character and separator already built-in (in the closure). It takes only a single argument, which is the string to be tokenized.

16  [<RequireQualifiedAccess>]
17  module private Tokenizer =
18    /// Returns a function who can convert a given source string to a token stream.
19    let create esc sep = 
20      let sepName = "sep"
21      let sepLen = String.length sep
22      
23      // Validate parameters
24      Assert.notNullOrEmpty sepName sep
25      if sep.[0] = esc then invalidArg sepName "Separator cannot start with escape char."
26      if sepLen > 1 then
27          let iMax = sepLen - 1
28          for i in 0 .. iMax / 2 do
29              if sep.[0 .. i] = sep.[sepLen - i - 1 .. iMax] then
30                  invalidArg sepName "Separator cannot have same beginning and ending."
31        
32      // Return the tokenizer function
33      fun source -> 
34          match String.length source - 1 with
35          | -1 -> Val String.Empty |> Seq.singleton
36          | iMax -> 
37            let (|Esc|_|) = 
38                let rec aux acc i = 
39                    if i <= iMax && source.[i] = esc then aux (acc + 1) (i + 1) else acc
40                aux 0 >> function 0 -> None | count -> Some count
41            
42            let (|Sep|_|) i = 
43                if i <= iMax - sepLen + 1 
44                   && String.CompareOrdinal(source, i, sep, 0, sepLen) = 0 then Some()
45                else None
46            
47            let rec read valLen i = 
48              seq { let wrapVal() = 
49                        if valLen > 0 
50                        then source.Substring(i - valLen, valLen) |> Val |> Seq.singleton
51                        else Seq.empty
52                    if i <= iMax then 
53                        match i with
54                        | Esc count -> 
55                            yield! wrapVal(); yield Esc count; yield! read 0 (i + count)
56                        | Sep -> yield! wrapVal(); yield Sep; yield! read 0 (i + sepLen)
57                        | _ -> yield! read (valLen + 1) (i + 1)
58                    else yield! wrapVal() }
59            read 0 0

The above code uses the following features not present in C# 6.0:

Line 18: Simplified XML documentation comments. Comment lines starting with triple slashes /// are converted to what looks in C# like ///<summary>...</summary>. Inside the F# comment, angular brackets can be used deliberately, such as MyType<T, SomeOtherType<U>>.
Line 29: Slice expressions allow retrieving a range of values from a collection in a succinct way.
Line 34: Syntactic pattern matching. match ... with ... analyzes and decomposes syntactic patterns, who can come in many different shapes with arbitrary nesting and complexity.
Line 35: Type functions and automatic inference of type parameters. Seq.singleton is inferred as Seq.singleton<Token>, which is a call to an F#-defined type function.
Lines 37, 42: Active patterns. Functions named (| ... |) can define various kinds of active patterns (single-case or optional patterns with or without arguments, multi-case patterns), which provides even more flexibility in pattern matching than is available in other ML-derived languages. The patterns are matched in lines 54 and 56, with completeness checking at compile time.
Line 37: Value declaration with arbitrary initialization body. We have pointed out above that let (|Esc|_|) = defines an active pattern function. More precisely, (|Esc|_|) is just an ordinary value who happens to be a lambda function. The value is initialized with the result of the last line of the body below (a new lambda function created by composing the left and right sides of the >> operator). In C#, we can use the var keyword only with very simple initializers (and not at all with delegates); in F#, we can use the let keyword with an arbitrary body of code containing nested scopes of arbitrary depth. Ironically, this feature of the "functional-first" language F# makes it possible to implement the OO principle of encapsulation in an even more fine-grained way than in the "OO-first" language C#.
Lines 38, 47: Tail call optimization. rec aux defines a recursive function named aux (aux is a common naming convention for an inner auxiliary function within an outer recursive function). The recursive call to aux, aux (acc + 1) (i + 1), is in tail position, which is compiled to a construct that will never cause a stack overflow regardless of recursive depth. The same is true for the rec read function.
Line 40: Partial function application. aux 0 calls the aux function with only one parameter, which gets a new, partial version of aux where the first parameter is already built-in, but the second parameter still has to be passed.
Line 40: Function composition. The partial version of aux is composed ("sticked together") with the following function using the >> operator.
Line 40: Condensed pattern matching with the function keyword. function 0 ... is an abbreviation for fun x -> match x with 0 .... It defines a lambda function with a parameter whose name we do not care about.
Line 40: option types. None and Some ... are the two possible cases of a discriminated union named option<'T>, which we are pattern matching against.
Lines 48–58: Computation expressions with unified syntax and custom builders. seq { ... } defines a sequence (compiled as IEnumerable<Token>). This is a kind of computation expression. Instead of seq { ... }, we could also have used [ ... ] to get a list or [| ... |] to get an array of tokens. The same kind of syntactic principle can also be used with async { ... } for defining asynchronous algorithms or query {...} for defining queries. Furthermore, it is possible in F# to implement custom computation expression builders (e.g., to simplify continuation passing with cont { ... } or to simplify checking of failure conditions with maybe { ... }). Iterators, async methods and query expressions are also available in C#, but they are hard-wired into the compiler; C# does not have a unified, extensible mechanism for defining monadic constructs.
Line 50: Function pipelining. The |> operator forwards the argument from the left side to the function or constructor on the right side. This syntax immediately reveals the "path" that an argument takes through various transformations until it becomes the end result. It is much easier and intuitive to follow than in C#, where you would usually apply nested function calls or needlessly distracting extra variable declarations.
Lines 55–58: yield! keyword for appending multiple elements. C# 6.0 only allows appending single elements using yield. F#'s yield! lets you program recursive sequences in a simple and intuitive way (e.g., for traversing directory trees). At the same time, stack overflows are avoided thanks to tail call optimization.

Step 6: Implement Concatenation and Splitting

Now that we have a custom tokenizer, we can use it in our concat/split functions. Instead of doing the parsing character by character, we can analyze groups of tokens via syntactic pattern matching, with the added benefit of automatic case completeness checking at compile time.

 60  [<RequireQualifiedAccess>]
 61  module String = 
 62    /// Returns a new string by connecting the given strings with the given separator.
 63    let concatEscape esc sep (strings:seq<_>) = 
 64      Assert.notNull "strings" strings
 65      let sb = StringBuilder()
 66      
 67      let appendTokens areLast ts = 
 68          let appendEsc count = sb.Append(esc, count) |> ignore
 69          let appendVal (v: string) = sb.Append v |> ignore
 70          let appendSep() = appendVal sep
 71          
 72          let rec aux = function
 73              | [] -> ()
 74              | Esc count :: [] -> appendEsc <| if areLast then count else count * 2
 75              | Esc count :: (Sep :: _ as ts) -> appendEsc (count * 2); aux ts 
 76              | Esc count :: ts -> appendEsc count; aux ts
 77              | Sep :: ts -> appendEsc 1; appendSep(); aux ts
 78              | Val v :: ts -> appendVal v; aux ts
 79          
 80          aux ts
 81          if not areLast then appendSep()
 82      
 83      strings
 84      |> Seq.map (Tokenizer.create esc sep >> List.ofSeq)
 85      |> Seq.fold (fun ts1 ts2 -> Option.iter (appendTokens false) ts1; Some ts2) None
 86      |> Option.iter (appendTokens true)
 87      
 88      sb.ToString()
 89      
 90    /// Reproduces the original substrings from a string created with concatEscape.
 91    let splitUnescape esc sep string = 
 92        Assert.notNull "string" string
 93        let emptyVal = Val String.Empty
 94        let sepVal = Val sep
 95        let flipAppend x1 x2 = Seq.append x2 x1
 96        
 97        // Produce token stream
 98        string
 99        |> Tokenizer.create esc sep 
100        
101        // Convert token stream to StringBuilder stream
102        |> flipAppend [emptyVal]
103        |> Seq.scan 
104          (fun (sb:StringBuilder, t1) t2 ->
105              match t1, t2 with
106              | Esc count, Sep when count % 2 = 1 -> sb.Append(esc, count / 2), sepVal
107              | Esc count, Sep -> sb.Append(esc, count / 2), Sep
108              | Esc count, _ -> sb.Append(esc, count), t2
109              | Sep, _ -> StringBuilder(), t2
110              | Val v, _ -> sb.Append v, t2)
111          (StringBuilder(), emptyVal)
112        |> Seq.map fst
113        
114        // Of each series of repeated StringBuilder references, keep only the last
115        // reference (which points to the StringBuilder's completed state).
116        // Convert the remaining StringBuilder references to strings.
117        |> flipAppend [null]
118        |> Seq.pairwise
119        |> Seq.filter (fun (sb1, sb2) -> sb1 <> sb2)
120        |> Seq.map (fst >> sprintf "%O")

The above code uses the following features not present in C# 6.0:

Line 61: Module merging. We give our module the name String. However, there is already a .NET class named System.String and an F# library module named FSharp.Core.String. Now, when someone opens our MyCompany.Foundation.Core namespace and types String., Visual Studio IntelliSense will automatically list all static members and functions together. This is a straightforward way of extending pre-existing static library functionality without having to go through a formal extension member definition mechanism. (Such a mechanism exists in F#, too, but it is used less often than in C#. Nonetheless, it allows you to define not only instance extension methods, but also static extension methods, extension properties, and even extension events.)
Line 63: Wildcards. By writing seq<_> using the wildcard symbol _, we force the compiler to declare the strings parameter as a generic sequence (IEnumerable<...>), but let it still infer the type parameter automatically. This is sometimes callled "mumble typing". Because of the way the strings parameter is used in the function, it is ultimately inferred as seq<string>.
Lines 73–78: Syntactic pattern matching on union cases and list patterns. [] represents an empty list, and Esc count :: [] represents a list with a single element at the head, who itself represents the Esc case of the Token union type. The variable count is automatically initialized at runtime (if the overall pattern matches) with the number of consecutive escape characters found in the string. Further patterns are used in the remaining lines, and the F# compiler emits a warning if we forget a match.
Line 91: Only few reserved keywords exist in F#. In line 89, we use string as the name of a parameter. This is possible because string is not a keyword in F#; it's a type abbreviation that can be replaced with a custom definition without a problem. The same is true for all other names of built-in types, as well as most built-in operators (who are just predefined library functions).
Lines 105--110, 120: Syntactic tuple types with pattern matching support. t1, t2 defines a tuple, which is matched against in the lines 105--110, with case completeness checking at compile time. In line 120, fst is a built-in F# library function who returns the first item of a tuple.
Line 120: Statically typed format strings. sprintf "%O" defines a function who takes an object and returns the object's ToString() result. In our example, it gets the string from each StringBuilder instance. (Other available format specifiers would be %i for integer numbers, %s for strings, %b for booleans, and more.)

Summary

I have used string concatenation and splitting as a didactical example to demonstrate many F# 3.1 features not known in C# 6.0. Using other examples, I could continue with still more features: inlining, pattern matching with records, automatic structural equality and comparison, aggressive optimization of closures and generic calls, units of measure, type providers, custom operators as functions or members, quotations, ...

There is one feature I did not point out explicitly, because it is omnipresent: Immutability. All values in the example are read-only (which is the default in F#); I did not have to declare even a single value as mutable. Immutability simplifies understanding the program flow; its advantages are particularly striking in parallel programming. By contrast, in C#, all values are mutable by default (that's why they are called "variables"), and local immutability is not even available, except for constants.

In the next blog post, I will use F# to define a comprehensive set of unit tests against the example.