F# 3.1 for C# 6.0 developers: Creating escaped concat/split functions
The built-in .NET String.Join
and String.Split
methods do not provide a way for escaping the separator. If the separator is already contained in the input strings before joining, the strings cannot be reproduced by splitting:
open System
let inputStrings = [| "Hello, world."; "How are you?" |]
let separator = ", "
let joinedResult = String.Join(separator, inputStrings)
let reproducedStrings = joinedResult.Split([| separator |], StringSplitOptions.None)
printfn "%A" inputStrings
printfn "%A" reproducedStrings
The output of the above code is:
[|"Hello, world."; "How are you?"|] // An array with 2 elements
[|"Hello"; "world."; "How are you?"|] // An array with 3 elements
To solve this problem, we are going to implement a pair of functions called concatEscape
and splitUnescape
. Besides roundtrip concatenation, the functions will provide some additional benefits.
In the example, I will explain a total of about 30 different F# 3.1 concepts that do not exist in C# 6.0. In case F# is completely new to you, here are some fundamentals:
- F# uses indentation instead of curly braces
{ }
to define scoped blocks of code. - F# (almost never) uses the
return
keyword. It is an expression-based language, and so the last executed line of a body of code implicitly specifies the return value. - It is not necessary to separate function parameters or arguments by commas and wrap them in parentheses. When you see something like
let myFunction x y = x + Y - 1
,myFunction
is the name of the function, andx
andy
are the names of parameters. Everything after the=
is the body of the function. - You will see almost no type declarations in the source code. Nevertheless, F# is a statically typed language—in some ways, even more "strongly" typed than C#. The difference to C# is that the F# compiler automatically infers the types for you, and also automatically makes them generic where appropriate. Again, this greatly simplifies the code, so that you can keep your focus on the business logic.
For concatenation,
- process the input strings as a stream (do not load all elements into memory at once)
- use only a minimum number of escapes, if any. E.g., if the escape character pre-exists in an input string, it should itself be escaped by duplication only in cases where it precedes a real separator or an escaped pre-existing separator.
- if the separator is empty, raise an
ArgumentException
- in other respects, behave like the built-int .NET Join method:
- If the separator is
null
, raise anArgumentNullException
. - If the input sequence is
null
, raise anArgumentNullException
. - If the input sequence is empty, treat it like one with a single empty-string element.
- If the input sequence contains
null
elements, treat them like empty strings.
- If the separator is
For splitting,
- return the output strings as a stream (do not load all elements into memory at once)
- accurately reproduce the original input strings when given the same escape character and separator, except for the following cases (which work the same way as the built-in .NET Split method):
- if the original input sequence was empty, reproduce it as sequence with a single empty-string element
- if the original input sequence contained
null
strings, reproduce them as empty strings
1 namespace MyCompany.Foundation.Core
2
3 open System
4 open System.Text
We are going to need some supporting functions to validate the concat/split function parameters:
5 [<RequireQualifiedAccess>]
6 module Assert =
7 let notNull argName arg = if arg = null then nullArg argName
8
9 let notNullOrEmpty argName arg =
10 notNull argName arg
11 if Seq.isEmpty arg then invalidArg argName "Value cannot be empty."
The above code uses the following features not present in C# 6.0:
- Lines 5–6: Enforcement of qualified access to a module. In C# 6.0, it is now possible to open a static class with the
using
keyword. In F#, modules are similar to static classes, but allow for much simplified syntax. In addition, the access to modules can be declared with[<AutoOpen>]
(open by default) or[<RequireQualifiedAccess>]
(non-openable). - Line 9: Automatic type inference and generalization.
argName
is compiled as a string parameter, because it is used to call thenotNull
function, which itself calls thenullArg
function, which requires astring
argument.arg
is compiled as a parameter of generic type, which must implement the sequence interface, because it is used to callSeq.isEmpty
(Seq<>
is the same asIEnumerable<>
). Furthermore,arg
is compiled as a type who can benull
, because it is validated via thenotNull
function (types declared in F# can never benull
, but types declared outside F#—such asSystem.String
orSystem.Collections.Generic.IEnumerable<T>
—can still benull
).
To simplify the parsing, we will convert the input strings to token streams and then analyze/decompose the tokens streams via F#'s built-in pattern matching facilities.
12 type private Token =
13 | Esc of Count:int
14 | Sep
15 | Val of Content:string
The above code uses the following features not present in C# 6.0:
- Line 12: Private accessibility within namespace declaration groups. The
Token
type is only accessible by code who has been declared inside the namespaceMyCompany.Foundation.Core
in the same library. - Lines 12–15: Discriminated union type. An instance of
Token
symbolizes either a consecutive number of escapes (Esc
), a separator (Sep
), or anything else (Val
). Later, when aToken
instance is analyzed via pattern matching, the F# compiler will emit a warning if a case has been forgotten by the developer.
We define a Tokenizer
module with just one function named create. The function takes the escape character and separator as parameters, validates them, and returns another function. The returned function has the escape character and separator already built-in (in the closure). It takes only a single argument, which is the string to be tokenized.
16 [<RequireQualifiedAccess>]
17 module private Tokenizer =
18 /// Returns a function who can convert a given source string to a token stream.
19 let create esc sep =
20 let sepName = "sep"
21 let sepLen = String.length sep
22
23 // Validate parameters
24 Assert.notNullOrEmpty sepName sep
25 if sep.[0] = esc then invalidArg sepName "Separator cannot start with escape char."
26 if sepLen > 1 then
27 let iMax = sepLen - 1
28 for i in 0 .. iMax / 2 do
29 if sep.[0 .. i] = sep.[sepLen - i - 1 .. iMax] then
30 invalidArg sepName "Separator cannot have same beginning and ending."
31
32 // Return the tokenizer function
33 fun source ->
34 match String.length source - 1 with
35 | -1 -> Val String.Empty |> Seq.singleton
36 | iMax ->
37 let (|Esc|_|) =
38 let rec aux acc i =
39 if i <= iMax && source.[i] = esc then aux (acc + 1) (i + 1) else acc
40 aux 0 >> function 0 -> None | count -> Some count
41
42 let (|Sep|_|) i =
43 if i <= iMax - sepLen + 1
44 && String.CompareOrdinal(source, i, sep, 0, sepLen) = 0 then Some()
45 else None
46
47 let rec read valLen i =
48 seq { let wrapVal() =
49 if valLen > 0
50 then source.Substring(i - valLen, valLen) |> Val |> Seq.singleton
51 else Seq.empty
52 if i <= iMax then
53 match i with
54 | Esc count ->
55 yield! wrapVal(); yield Esc count; yield! read 0 (i + count)
56 | Sep -> yield! wrapVal(); yield Sep; yield! read 0 (i + sepLen)
57 | _ -> yield! read (valLen + 1) (i + 1)
58 else yield! wrapVal() }
59 read 0 0
The above code uses the following features not present in C# 6.0:
- Line 18: Simplified XML documentation comments. Comment lines starting with triple slashes
///
are converted to what looks in C# like///<summary>...</summary>
. Inside the F# comment, angular brackets can be used deliberately, such asMyType<T, SomeOtherType<U>>
. - Line 29: Slice expressions allow retrieving a range of values from a collection in a succinct way.
- Line 34: Syntactic pattern matching.
match ... with ...
analyzes and decomposes syntactic patterns, who can come in many different shapes with arbitrary nesting and complexity. - Line 35: Type functions and automatic inference of type parameters.
Seq.singleton
is inferred asSeq.singleton<Token>
, which is a call to an F#-defined type function. - Lines 37, 42: Active patterns. Functions named
(| ... |)
can define various kinds of active patterns (single-case or optional patterns with or without arguments, multi-case patterns), which provides even more flexibility in pattern matching than is available in other ML-derived languages. The patterns are matched in lines 54 and 56, with completeness checking at compile time. - Line 37: Value declaration with arbitrary initialization body. We have pointed out above that
let (|Esc|_|) =
defines an active pattern function. More precisely,(|Esc|_|)
is just an ordinary value who happens to be a lambda function. The value is initialized with the result of the last line of the body below (a new lambda function created by composing the left and right sides of the>>
operator). In C#, we can use thevar
keyword only with very simple initializers (and not at all with delegates); in F#, we can use thelet
keyword with an arbitrary body of code containing nested scopes of arbitrary depth. Ironically, this feature of the "functional-first" language F# makes it possible to implement the OO principle of encapsulation in an even more fine-grained way than in the "OO-first" language C#. - Lines 38, 47: Tail call optimization.
rec aux
defines a recursive function namedaux
(aux is a common naming convention for an inner auxiliary function within an outer recursive function). The recursive call toaux
,aux (acc + 1) (i + 1)
, is in tail position, which is compiled to a construct that will never cause a stack overflow regardless of recursive depth. The same is true for therec read
function. - Line 40: Partial function application.
aux 0
calls theaux
function with only one parameter, which gets a new, partial version ofaux
where the first parameter is already built-in, but the second parameter still has to be passed. - Line 40: Function composition. The partial version of
aux
is composed ("sticked together") with the following function using the>>
operator. - Line 40: Condensed pattern matching with the
function
keyword.function 0 ...
is an abbreviation forfun x -> match x with 0 ...
. It defines a lambda function with a parameter whose name we do not care about. - Line 40: option types.
None
andSome ...
are the two possible cases of a discriminated union namedoption<'T>
, which we are pattern matching against. - Lines 48–58: Computation expressions with unified syntax and custom builders.
seq { ... }
defines a sequence (compiled asIEnumerable<Token>
). This is a kind of computation expression. Instead ofseq { ... }
, we could also have used[ ... ]
to get a list or[| ... |]
to get an array of tokens. The same kind of syntactic principle can also be used withasync { ... }
for defining asynchronous algorithms orquery {...}
for defining queries. Furthermore, it is possible in F# to implement custom computation expression builders (e.g., to simplify continuation passing withcont { ... }
or to simplify checking of failure conditions withmaybe { ... }
). Iterators, async methods and query expressions are also available in C#, but they are hard-wired into the compiler; C# does not have a unified, extensible mechanism for defining monadic constructs. - Line 50: Function pipelining. The
|>
operator forwards the argument from the left side to the function or constructor on the right side. This syntax immediately reveals the "path" that an argument takes through various transformations until it becomes the end result. It is much easier and intuitive to follow than in C#, where you would usually apply nested function calls or needlessly distracting extra variable declarations. - Lines 55–58:
yield!
keyword for appending multiple elements. C# 6.0 only allows appending single elements usingyield
. F#'syield!
lets you program recursive sequences in a simple and intuitive way (e.g., for traversing directory trees). At the same time, stack overflows are avoided thanks to tail call optimization.
Now that we have a custom tokenizer, we can use it in our concat/split functions. Instead of doing the parsing character by character, we can analyze groups of tokens via syntactic pattern matching, with the added benefit of automatic case completeness checking at compile time.
60 [<RequireQualifiedAccess>]
61 module String =
62 /// Returns a new string by connecting the given strings with the given separator.
63 let concatEscape esc sep (strings:seq<_>) =
64 Assert.notNull "strings" strings
65 let sb = StringBuilder()
66
67 let appendTokens areLast ts =
68 let appendEsc count = sb.Append(esc, count) |> ignore
69 let appendVal (v: string) = sb.Append v |> ignore
70 let appendSep() = appendVal sep
71
72 let rec aux = function
73 | [] -> ()
74 | Esc count :: [] -> appendEsc <| if areLast then count else count * 2
75 | Esc count :: (Sep :: _ as ts) -> appendEsc (count * 2); aux ts
76 | Esc count :: ts -> appendEsc count; aux ts
77 | Sep :: ts -> appendEsc 1; appendSep(); aux ts
78 | Val v :: ts -> appendVal v; aux ts
79
80 aux ts
81 if not areLast then appendSep()
82
83 strings
84 |> Seq.map (Tokenizer.create esc sep >> List.ofSeq)
85 |> Seq.fold (fun ts1 ts2 -> Option.iter (appendTokens false) ts1; Some ts2) None
86 |> Option.iter (appendTokens true)
87
88 sb.ToString()
89
90 /// Reproduces the original substrings from a string created with concatEscape.
91 let splitUnescape esc sep string =
92 Assert.notNull "string" string
93 let emptyVal = Val String.Empty
94 let sepVal = Val sep
95 let flipAppend x1 x2 = Seq.append x2 x1
96
97 // Produce token stream
98 string
99 |> Tokenizer.create esc sep
100
101 // Convert token stream to StringBuilder stream
102 |> flipAppend [emptyVal]
103 |> Seq.scan
104 (fun (sb:StringBuilder, t1) t2 ->
105 match t1, t2 with
106 | Esc count, Sep when count % 2 = 1 -> sb.Append(esc, count / 2), sepVal
107 | Esc count, Sep -> sb.Append(esc, count / 2), Sep
108 | Esc count, _ -> sb.Append(esc, count), t2
109 | Sep, _ -> StringBuilder(), t2
110 | Val v, _ -> sb.Append v, t2)
111 (StringBuilder(), emptyVal)
112 |> Seq.map fst
113
114 // Of each series of repeated StringBuilder references, keep only the last
115 // reference (which points to the StringBuilder's completed state).
116 // Convert the remaining StringBuilder references to strings.
117 |> flipAppend [null]
118 |> Seq.pairwise
119 |> Seq.filter (fun (sb1, sb2) -> sb1 <> sb2)
120 |> Seq.map (fst >> sprintf "%O")
The above code uses the following features not present in C# 6.0:
- Line 61: Module merging. We give our module the name String. However, there is already a .NET class named
System.String
and an F# library module namedFSharp.Core.String
. Now, when someone opens ourMyCompany.Foundation.Core
namespace and typesString.
, Visual Studio IntelliSense will automatically list all static members and functions together. This is a straightforward way of extending pre-existing static library functionality without having to go through a formal extension member definition mechanism. (Such a mechanism exists in F#, too, but it is used less often than in C#. Nonetheless, it allows you to define not only instance extension methods, but also static extension methods, extension properties, and even extension events.) - Line 63: Wildcards. By writing
seq<_>
using the wildcard symbol_
, we force the compiler to declare the strings parameter as a generic sequence (IEnumerable<...>
), but let it still infer the type parameter automatically. This is sometimes callled "mumble typing". Because of the way the strings parameter is used in the function, it is ultimately inferred asseq<string>
. - Lines 73–78: Syntactic pattern matching on union cases and list patterns.
[]
represents an empty list, andEsc count :: []
represents a list with a single element at the head, who itself represents theEsc
case of theToken
union type. The variablecount
is automatically initialized at runtime (if the overall pattern matches) with the number of consecutive escape characters found in the string. Further patterns are used in the remaining lines, and the F# compiler emits a warning if we forget a match. - Line 91: Only few reserved keywords exist in F#. In line 89, we use string as the name of a parameter. This is possible because string is not a keyword in F#; it's a type abbreviation that can be replaced with a custom definition without a problem. The same is true for all other names of built-in types, as well as most built-in operators (who are just predefined library functions).
- Lines 105--110, 120: Syntactic tuple types with pattern matching support.
t1, t2
defines a tuple, which is matched against in the lines 105--110, with case completeness checking at compile time. In line 120,fst
is a built-in F# library function who returns the first item of a tuple. - Line 120: Statically typed format strings.
sprintf "%O"
defines a function who takes an object and returns the object'sToString()
result. In our example, it gets thestring
from eachStringBuilder
instance. (Other available format specifiers would be%i
for integer numbers,%s
for strings,%b
for booleans, and more.)
I have used string concatenation and splitting as a didactical example to demonstrate many F# 3.1 features not known in C# 6.0. Using other examples, I could continue with still more features: inlining, pattern matching with records, automatic structural equality and comparison, aggressive optimization of closures and generic calls, units of measure, type providers, custom operators as functions or members, quotations, ...
There is one feature I did not point out explicitly, because it is omnipresent: Immutability. All values in the example are read-only (which is the default in F#); I did not have to declare even a single value as mutable. Immutability simplifies understanding the program flow; its advantages are particularly striking in parallel programming. By contrast, in C#, all values are mutable by default (that's why they are called "variables"), and local immutability is not even available, except for constants.
In the next blog post, I will use F# to define a comprehensive set of unit tests against the example.