The built-in .NET String.Join and String.Split methods do not provide a way for escaping the separator. If the separator is already contained in the input strings before joining, the strings cannot be reproduced by splitting:
open System
let inputStrings = [| "Hello, world."; "How are you?" |]
let separator = ", "
let joinedResult = String.Join(separator, inputStrings)
let reproducedStrings = joinedResult.Split([| separator |], StringSplitOptions.None)
printfn "%A" inputStrings
printfn "%A" reproducedStringsThe output of the above code is:
[|"Hello, world."; "How are you?"|] // An array with 2 elements
[|"Hello"; "world."; "How are you?"|] // An array with 3 elementsTo solve this problem, we are going to implement a pair of functions called concatEscape and splitUnescape. Besides roundtrip concatenation, the functions will provide some additional benefits.
In the example, I will explain a total of about 30 different F# 3.1 concepts that do not exist in C# 6.0. In case F# is completely new to you, here are some fundamentals:
- F# uses indentation instead of curly braces
{ }to define scoped blocks of code. - F# (almost never) uses the
returnkeyword. It is an expression-based language, and so the last executed line of a body of code implicitly specifies the return value. - It is not necessary to separate function parameters or arguments by commas and wrap them in parentheses. When you see something like
let myFunction x y = x + Y - 1,myFunctionis the name of the function, andxandyare the names of parameters. Everything after the=is the body of the function. - You will see almost no type declarations in the source code. Nevertheless, F# is a statically typed language—in some ways, even more "strongly" typed than C#. The difference to C# is that the F# compiler automatically infers the types for you, and also automatically makes them generic where appropriate. Again, this greatly simplifies the code, so that you can keep your focus on the business logic.
For concatenation,
- process the input strings as a stream (do not load all elements into memory at once)
- use only a minimum number of escapes, if any. E.g., if the escape character pre-exists in an input string, it should itself be escaped by duplication only in cases where it precedes a real separator or an escaped pre-existing separator.
- if the separator is empty, raise an
ArgumentException - in other respects, behave like the built-int .NET Join method:
- If the separator is
null, raise anArgumentNullException. - If the input sequence is
null, raise anArgumentNullException. - If the input sequence is empty, treat it like one with a single empty-string element.
- If the input sequence contains
nullelements, treat them like empty strings.
- If the separator is
For splitting,
- return the output strings as a stream (do not load all elements into memory at once)
- accurately reproduce the original input strings when given the same escape character and separator, except for the following cases (which work the same way as the built-in .NET Split method):
- if the original input sequence was empty, reproduce it as sequence with a single empty-string element
- if the original input sequence contained
nullstrings, reproduce them as empty strings
1 namespace MyCompany.Foundation.Core
2
3 open System
4 open System.TextWe are going to need some supporting functions to validate the concat/split function parameters:
5 [<RequireQualifiedAccess>]
6 module Assert =
7 let notNull argName arg = if arg = null then nullArg argName
8
9 let notNullOrEmpty argName arg =
10 notNull argName arg
11 if Seq.isEmpty arg then invalidArg argName "Value cannot be empty."The above code uses the following features not present in C# 6.0:
- Lines 5–6: Enforcement of qualified access to a module. In C# 6.0, it is now possible to open a static class with the
usingkeyword. In F#, modules are similar to static classes, but allow for much simplified syntax. In addition, the access to modules can be declared with[<AutoOpen>](open by default) or[<RequireQualifiedAccess>](non-openable). - Line 9: Automatic type inference and generalization.
argNameis compiled as a string parameter, because it is used to call thenotNullfunction, which itself calls thenullArgfunction, which requires astringargument.argis compiled as a parameter of generic type, which must implement the sequence interface, because it is used to callSeq.isEmpty(Seq<>is the same asIEnumerable<>). Furthermore,argis compiled as a type who can benull, because it is validated via thenotNullfunction (types declared in F# can never benull, but types declared outside F#—such asSystem.StringorSystem.Collections.Generic.IEnumerable<T>—can still benull).
To simplify the parsing, we will convert the input strings to token streams and then analyze/decompose the tokens streams via F#'s built-in pattern matching facilities.
12 type private Token =
13 | Esc of Count:int
14 | Sep
15 | Val of Content:stringThe above code uses the following features not present in C# 6.0:
- Line 12: Private accessibility within namespace declaration groups. The
Tokentype is only accessible by code who has been declared inside the namespaceMyCompany.Foundation.Corein the same library. - Lines 12–15: Discriminated union type. An instance of
Tokensymbolizes either a consecutive number of escapes (Esc), a separator (Sep), or anything else (Val). Later, when aTokeninstance is analyzed via pattern matching, the F# compiler will emit a warning if a case has been forgotten by the developer.
We define a Tokenizer module with just one function named create. The function takes the escape character and separator as parameters, validates them, and returns another function. The returned function has the escape character and separator already built-in (in the closure). It takes only a single argument, which is the string to be tokenized.
16 [<RequireQualifiedAccess>]
17 module private Tokenizer =
18 /// Returns a function who can convert a given source string to a token stream.
19 let create esc sep =
20 let sepName = "sep"
21 let sepLen = String.length sep
22
23 // Validate parameters
24 Assert.notNullOrEmpty sepName sep
25 if sep.[0] = esc then invalidArg sepName "Separator cannot start with escape char."
26 if sepLen > 1 then
27 let iMax = sepLen - 1
28 for i in 0 .. iMax / 2 do
29 if sep.[0 .. i] = sep.[sepLen - i - 1 .. iMax] then
30 invalidArg sepName "Separator cannot have same beginning and ending."
31
32 // Return the tokenizer function
33 fun source ->
34 match String.length source - 1 with
35 | -1 -> Val String.Empty |> Seq.singleton
36 | iMax ->
37 let (|Esc|_|) =
38 let rec aux acc i =
39 if i <= iMax && source.[i] = esc then aux (acc + 1) (i + 1) else acc
40 aux 0 >> function 0 -> None | count -> Some count
41
42 let (|Sep|_|) i =
43 if i <= iMax - sepLen + 1
44 && String.CompareOrdinal(source, i, sep, 0, sepLen) = 0 then Some()
45 else None
46
47 let rec read valLen i =
48 seq { let wrapVal() =
49 if valLen > 0
50 then source.Substring(i - valLen, valLen) |> Val |> Seq.singleton
51 else Seq.empty
52 if i <= iMax then
53 match i with
54 | Esc count ->
55 yield! wrapVal(); yield Esc count; yield! read 0 (i + count)
56 | Sep -> yield! wrapVal(); yield Sep; yield! read 0 (i + sepLen)
57 | _ -> yield! read (valLen + 1) (i + 1)
58 else yield! wrapVal() }
59 read 0 0The above code uses the following features not present in C# 6.0:
- Line 18: Simplified XML documentation comments. Comment lines starting with triple slashes
///are converted to what looks in C# like///<summary>...</summary>. Inside the F# comment, angular brackets can be used deliberately, such asMyType<T, SomeOtherType<U>>. - Line 29: Slice expressions allow retrieving a range of values from a collection in a succinct way.
- Line 34: Syntactic pattern matching.
match ... with ...analyzes and decomposes syntactic patterns, who can come in many different shapes with arbitrary nesting and complexity. - Line 35: Type functions and automatic inference of type parameters.
Seq.singletonis inferred asSeq.singleton<Token>, which is a call to an F#-defined type function. - Lines 37, 42: Active patterns. Functions named
(| ... |)can define various kinds of active patterns (single-case or optional patterns with or without arguments, multi-case patterns), which provides even more flexibility in pattern matching than is available in other ML-derived languages. The patterns are matched in lines 54 and 56, with completeness checking at compile time. - Line 37: Value declaration with arbitrary initialization body. We have pointed out above that
let (|Esc|_|) =defines an active pattern function. More precisely,(|Esc|_|)is just an ordinary value who happens to be a lambda function. The value is initialized with the result of the last line of the body below (a new lambda function created by composing the left and right sides of the>>operator). In C#, we can use thevarkeyword only with very simple initializers (and not at all with delegates); in F#, we can use theletkeyword with an arbitrary body of code containing nested scopes of arbitrary depth. Ironically, this feature of the "functional-first" language F# makes it possible to implement the OO principle of encapsulation in an even more fine-grained way than in the "OO-first" language C#. - Lines 38, 47: Tail call optimization.
rec auxdefines a recursive function namedaux(aux is a common naming convention for an inner auxiliary function within an outer recursive function). The recursive call toaux,aux (acc + 1) (i + 1), is in tail position, which is compiled to a construct that will never cause a stack overflow regardless of recursive depth. The same is true for therec readfunction. - Line 40: Partial function application.
aux 0calls theauxfunction with only one parameter, which gets a new, partial version ofauxwhere the first parameter is already built-in, but the second parameter still has to be passed. - Line 40: Function composition. The partial version of
auxis composed ("sticked together") with the following function using the>>operator. - Line 40: Condensed pattern matching with the
functionkeyword.function 0 ...is an abbreviation forfun x -> match x with 0 .... It defines a lambda function with a parameter whose name we do not care about. - Line 40: option types.
NoneandSome ...are the two possible cases of a discriminated union namedoption<'T>, which we are pattern matching against. - Lines 48–58: Computation expressions with unified syntax and custom builders.
seq { ... }defines a sequence (compiled asIEnumerable<Token>). This is a kind of computation expression. Instead ofseq { ... }, we could also have used[ ... ]to get a list or[| ... |]to get an array of tokens. The same kind of syntactic principle can also be used withasync { ... }for defining asynchronous algorithms orquery {...}for defining queries. Furthermore, it is possible in F# to implement custom computation expression builders (e.g., to simplify continuation passing withcont { ... }or to simplify checking of failure conditions withmaybe { ... }). Iterators, async methods and query expressions are also available in C#, but they are hard-wired into the compiler; C# does not have a unified, extensible mechanism for defining monadic constructs. - Line 50: Function pipelining. The
|>operator forwards the argument from the left side to the function or constructor on the right side. This syntax immediately reveals the "path" that an argument takes through various transformations until it becomes the end result. It is much easier and intuitive to follow than in C#, where you would usually apply nested function calls or needlessly distracting extra variable declarations. - Lines 55–58:
yield!keyword for appending multiple elements. C# 6.0 only allows appending single elements usingyield. F#'syield!lets you program recursive sequences in a simple and intuitive way (e.g., for traversing directory trees). At the same time, stack overflows are avoided thanks to tail call optimization.
Now that we have a custom tokenizer, we can use it in our concat/split functions. Instead of doing the parsing character by character, we can analyze groups of tokens via syntactic pattern matching, with the added benefit of automatic case completeness checking at compile time.
60 [<RequireQualifiedAccess>]
61 module String =
62 /// Returns a new string by connecting the given strings with the given separator.
63 let concatEscape esc sep (strings:seq<_>) =
64 Assert.notNull "strings" strings
65 let sb = StringBuilder()
66
67 let appendTokens areLast ts =
68 let appendEsc count = sb.Append(esc, count) |> ignore
69 let appendVal (v: string) = sb.Append v |> ignore
70 let appendSep() = appendVal sep
71
72 let rec aux = function
73 | [] -> ()
74 | Esc count :: [] -> appendEsc <| if areLast then count else count * 2
75 | Esc count :: (Sep :: _ as ts) -> appendEsc (count * 2); aux ts
76 | Esc count :: ts -> appendEsc count; aux ts
77 | Sep :: ts -> appendEsc 1; appendSep(); aux ts
78 | Val v :: ts -> appendVal v; aux ts
79
80 aux ts
81 if not areLast then appendSep()
82
83 strings
84 |> Seq.map (Tokenizer.create esc sep >> List.ofSeq)
85 |> Seq.fold (fun ts1 ts2 -> Option.iter (appendTokens false) ts1; Some ts2) None
86 |> Option.iter (appendTokens true)
87
88 sb.ToString()
89
90 /// Reproduces the original substrings from a string created with concatEscape.
91 let splitUnescape esc sep string =
92 Assert.notNull "string" string
93 let emptyVal = Val String.Empty
94 let sepVal = Val sep
95 let flipAppend x1 x2 = Seq.append x2 x1
96
97 // Produce token stream
98 string
99 |> Tokenizer.create esc sep
100
101 // Convert token stream to StringBuilder stream
102 |> flipAppend [emptyVal]
103 |> Seq.scan
104 (fun (sb:StringBuilder, t1) t2 ->
105 match t1, t2 with
106 | Esc count, Sep when count % 2 = 1 -> sb.Append(esc, count / 2), sepVal
107 | Esc count, Sep -> sb.Append(esc, count / 2), Sep
108 | Esc count, _ -> sb.Append(esc, count), t2
109 | Sep, _ -> StringBuilder(), t2
110 | Val v, _ -> sb.Append v, t2)
111 (StringBuilder(), emptyVal)
112 |> Seq.map fst
113
114 // Of each series of repeated StringBuilder references, keep only the last
115 // reference (which points to the StringBuilder's completed state).
116 // Convert the remaining StringBuilder references to strings.
117 |> flipAppend [null]
118 |> Seq.pairwise
119 |> Seq.filter (fun (sb1, sb2) -> sb1 <> sb2)
120 |> Seq.map (fst >> sprintf "%O")
The above code uses the following features not present in C# 6.0:
- Line 61: Module merging. We give our module the name String. However, there is already a .NET class named
System.Stringand an F# library module namedFSharp.Core.String. Now, when someone opens ourMyCompany.Foundation.Corenamespace and typesString., Visual Studio IntelliSense will automatically list all static members and functions together. This is a straightforward way of extending pre-existing static library functionality without having to go through a formal extension member definition mechanism. (Such a mechanism exists in F#, too, but it is used less often than in C#. Nonetheless, it allows you to define not only instance extension methods, but also static extension methods, extension properties, and even extension events.) - Line 63: Wildcards. By writing
seq<_>using the wildcard symbol_, we force the compiler to declare the strings parameter as a generic sequence (IEnumerable<...>), but let it still infer the type parameter automatically. This is sometimes callled "mumble typing". Because of the way the strings parameter is used in the function, it is ultimately inferred asseq<string>. - Lines 73–78: Syntactic pattern matching on union cases and list patterns.
[]represents an empty list, andEsc count :: []represents a list with a single element at the head, who itself represents theEsccase of theTokenunion type. The variablecountis automatically initialized at runtime (if the overall pattern matches) with the number of consecutive escape characters found in the string. Further patterns are used in the remaining lines, and the F# compiler emits a warning if we forget a match. - Line 91: Only few reserved keywords exist in F#. In line 89, we use string as the name of a parameter. This is possible because string is not a keyword in F#; it's a type abbreviation that can be replaced with a custom definition without a problem. The same is true for all other names of built-in types, as well as most built-in operators (who are just predefined library functions).
- Lines 105--110, 120: Syntactic tuple types with pattern matching support.
t1, t2defines a tuple, which is matched against in the lines 105--110, with case completeness checking at compile time. In line 120,fstis a built-in F# library function who returns the first item of a tuple. - Line 120: Statically typed format strings.
sprintf "%O"defines a function who takes an object and returns the object'sToString()result. In our example, it gets thestringfrom eachStringBuilderinstance. (Other available format specifiers would be%ifor integer numbers,%sfor strings,%bfor booleans, and more.)
I have used string concatenation and splitting as a didactical example to demonstrate many F# 3.1 features not known in C# 6.0. Using other examples, I could continue with still more features: inlining, pattern matching with records, automatic structural equality and comparison, aggressive optimization of closures and generic calls, units of measure, type providers, custom operators as functions or members, quotations, ...
There is one feature I did not point out explicitly, because it is omnipresent: Immutability. All values in the example are read-only (which is the default in F#); I did not have to declare even a single value as mutable. Immutability simplifies understanding the program flow; its advantages are particularly striking in parallel programming. By contrast, in C#, all values are mutable by default (that's why they are called "variables"), and local immutability is not even available, except for constants.
In the next blog post, I will use F# to define a comprehensive set of unit tests against the example.