Your Web News in One Place

Help Webnuz

Referal links:

Sign up for GreenGeeks web hosting
April 16, 2024 04:00 pm GMT

C Regular Expression Benchmarks How To Avoid My Mistakes!

Recently I published an article about the various performance characteristics of using regex in C#. Specifically, there are a handful of ways that we can use regular expressions in C# to do seemingly the same thing. I was curious about the differences, so I set out to collect some C# regular expression benchmarks but I goofed.

In this article, Ill explain my error so you can try to avoid this for yourself and of course, Ill provide new benchmarks and an update to the benchmarking code using BenchmarkDotNet.

The Invalid CSharp Regular Expression Benchmarks

In the previous article I wrote about C# regular expression benchmarks, I had results that illustrated enormous performance gains if you were incorrectly using some of the available C# regex methods. The benchmark results clearly indicated that the cost of creating a new regex object in C# was expensive, and even more expensive if you had the compile flag. So if you were doing this each time you wanted to match, you were paying an unfortunately high penalty.

You can watch the video walking through these benchmarks here to get a better idea, keeping in mind that the results were about to discuss are the correct ones:

There was a great YouTube comment that came up pretty shortly after the video was released, and heres what the viewer had to say:

C# Regular Expression Performance - YouTube Comment

That meant I had to go back to the drawing board to investigate!

Dev Leader Weekly | Substack

My weekly newsletter that simplifies software engineering for you, along with C# code examples. Join thousands of software engineers from companies like Microsoft and Amazon who are already reading! Click to read Dev Leader Weekly, a Substack publication with thousands of subscribers.

favicon weekly.devleader.ca

Being Lazy Like CSharp Iterators

As the viewer pointed out it looks like MatchCollection that comes back from doing a regex match is lazily evaluated. That might be confusing if you havent heard of this concept before, but this is all much like using iterators in C#, which I have talked about extensively. If you want a bit of a crash course on the idea behind iterators in C#, you can check out this video for a walkthrough of my experiences:

Long story short: It looks like a line of code executes a method, but youre simply just assigning a function pointer to a variable. The result is that instead of a method taking some amount of time to run, you instead have it execute nearly instantaneously.

But is the MatchCollection actually a C# iterator, or is it something else?

Revisiting The CSharp Regex Benchmark Code

Lets consider the following code, which is one of the simple benchmarks from the previous article:

Regex regex = new(RegexPattern!, RegexOptions.Compiled);return regex.Matches(_sourceText!);

What were returning in that benchmark method is actually whats called a MatchCollection. We could instead write that code as follows:

Regex regex = new(RegexPattern!, RegexOptions.Compiled);MatchCollection matches = regex.Matches(_sourceText!);return matches;

The line in the middle of the code example directly above looks like its evaluating all of the matches, but indeed it returns nearly instantly. Thats definitely not what we wanted to measure! Instead, I was just measuring the cost of creating a MatchCollection instance vs creating an entire C# Regex object instance plus a MatchCollection instance. This explains why of course the two benchmarks that created instances were absolutely guaranteed to be slower they did the same thing when it came to creating a MatchCollection instance, but they had their own associated overhead for regular expression creation.

Whats Happening With MatchCollection?

Well, it turns out that MatchCollection is not quite an iterator, but it certainly is lazily evaluated. If you keep jumping to the definition and go deeper into the code for MatchCollection, youll see the following:

private void EnsureInitialized(){    if (!_done)    {        GetMatch(int.MaxValue);    }}

This is a method called in many spots within the MatchCollection class, and basically any time someone wants to access information about the matches, we eventually need to get into the GetMatch method. That method looks as follows:

private Match? GetMatch(int i){    Debug.Assert(i >= 0, "i cannot be negative.");    if (_matches.Count > i)    {        return _matches[i];    }    if (_done)    {        return null;    }    Match match;    do    {        match = _regex.RunSingleMatch(            RegexRunnerMode.FullMatchRequired,            _prevlen,            _input,            0,            _input.Length,            _startat)!;        if (!match.Success)        {            _done = true;            return null;        }        _matches.Add(match);        _prevlen = match.Length;        _startat = match._textpos;    } while (_matches.Count <= i);    return match;}

If you step through this code, youll notice that theres state kept for how far along in the matches a caller is looking for. That means if this instance has not yet been evaluated that far along, it will then go attempt to evaluate that far based on the requested index. Looks pretty lazy to me! But if that wasnt a giveaway Look how the constructor doesnt even take a collection of matches at the time of creation!

internal MatchCollection(Regex regex, string input, int startat){    if ((uint)startat > (uint)input.Length)    {        ThrowHelper.ThrowArgumentOutOfRangeException(            ExceptionArgument.startat,            ExceptionResource.BeginIndexNotNegative);    }    _regex = regex;    _input = input;    _startat = startat;    _prevlen = -1;    _matches = new List<Match>();    _done = false;}

Please keep in mind that this is what Visual Studio is showing me, and if you try checking the source at a later point it may be different in the dotnet code.

You can follow along with this video here:

The New Code for CSharp Regular Expression Benchmarks

Lets jump right into the new BenchmarkDotNet code for these regular expression scenarios, which will look VERY similar to before:

using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System.Reflection;using System.Text.RegularExpressions;BenchmarkRunner.Run(    Assembly.GetExecutingAssembly(),     args: args);[MemoryDiagnoser][MediumRunJob]public partial class RegexBenchmarks{    private const string RegexPattern = @"\b\w*(ing|ed)\b";    private string? _sourceText;    private Regex? _regex;    private Regex? _regexCompiled;    private Regex? _generatedRegex;    private Regex? _generatedRegexCompiled;    [GeneratedRegex(RegexPattern, RegexOptions.None, "en-US")]    private static partial Regex GetGeneratedRegex();    [GeneratedRegex(RegexPattern, RegexOptions.Compiled, "en-US")]    private static partial Regex GetGeneratedRegexCompiled();    [Params("pg73346.txt")]    public string? SourceFileName { get; set; }    [GlobalSetup]    public void Setup()    {        _sourceText = File.ReadAllText(SourceFileName!);        _regex = new(RegexPattern);        _regexCompiled = new(RegexPattern, RegexOptions.Compiled);        _generatedRegex = GetGeneratedRegex();        _generatedRegexCompiled = GetGeneratedRegexCompiled();    }    [Benchmark(Baseline = true)]    public int Static()    {        return Regex.Matches(_sourceText!, RegexPattern!).Count;    }    [Benchmark]    public int New()    {        Regex regex = new(RegexPattern!);        return regex.Matches(_sourceText!).Count;    }    [Benchmark]    public int New_Compiled()    {        Regex regex = new(RegexPattern!, RegexOptions.Compiled);        return regex.Matches(_sourceText!).Count;    }    [Benchmark]    public int Cached()    {        return _regex!.Matches(_sourceText!).Count;    }    [Benchmark]    public int Cached_Compiled()    {        return _regexCompiled!.Matches(_sourceText!).Count;    }    [Benchmark]    public int Generated()    {        return GetGeneratedRegex().Matches(_sourceText!).Count;    }    [Benchmark]    public int Generated_Cached()    {        return _generatedRegex!.Matches(_sourceText!).Count;    }    [Benchmark]    public int Generated_Compiled()    {        return GetGeneratedRegexCompiled().Matches(_sourceText!).Count;    }    [Benchmark]    public int Generated_Cached_Compiled()    {        return _generatedRegexCompiled!.Matches(_sourceText!).Count;    }}

If you compare these benchmark methods to the previous article, youll notice they are very close with a very subtle difference: Im returning an integer count instead of the MatchCollection. If you consider this and what I pointed out in the code earlier, Count is one of the properties on MatchCollection that will eventually try to evaluate the matches under the hood as in, it will force the lazy behavior to get evaluated.

We can see that by factoring in this code snippet below:

/// <summary>/// Returns the number of captures./// </summary>public int Count{    get    {        EnsureInitialized();        return _matches.Count;    }}

The New Results of CSharp Regular Expression Benchmarks

Now that Ive come clean and admitted that I totally screwed up on my benchmarks AND provided you the updated code, lets redo this analysis. Here are the newly run benchmarks:

C# Regular Expression Benchmarks - Corrected Results

Now here are some interesting things to make note of:

  • The static method call is indeed the worst of all the options. While its nearly identical, creating a new regex is only a tiny bit faster as these benchmark results show. Under the hood, the static method call does try to get or add the regex to a cache, which seems to be a static collection of regular expressions that have been created. MORE ON THIS SHORTLY

  • It looks like now in these benchmarks the overhead of compiling the regex is now WORTH doing, even if youre going to do it each time you go to find matches. About a 56% performance boost! We need to keep in mind the pattern Im using, the contents of data thats being tested, and the overall length of that data as well.

  • Simply caching the regex in a field shows very minimal performance benefits, if not just a rounding error, for this setup. Again, this could be largely due to the size of the data and the pattern being used, and perhaps a very short set of data to scan would make caching proportionally more advantageous.

  • Now we get into the magic territory where we can see caching a compiled regex gives us performance that is 310% of the baseline. Not having to recompile is very helpful here!

  • As predicted, like in the original article, all of the generated regexes are on par with each other, and in this test setup perform exactly on par with the compiled cached regex.

But Wait! Theres Caching!

I wanted to be even more careful before publishing this follow-up article on C# regular expression benchmarks because of the first slip up I had. I recalled that the regex class has a RegexCache internal to the dotnet code that is running and since were using the same pattern over and over Are we caching things behind the scenes?

It turns out, we can change the cache size to zero by doing the following:

Regex.CacheSize = 0;

If I add this line of code to the beginning of the benchmark runner, I can be assured that the caching is turned off. But just to be sure, I added this debug watch in Visual Studio after creating a regex just to sanity check:

typeof(RegexCache)    .GetField("s_cacheList", BindingFlags.Static | BindingFlags.NonPublic)    .GetValue(null)

The result of this was seeing an empty cache, which is exactly what I was hoping for. A quick note is that RegexCache is internal, and I wasnt sure the best way to get reflection to work in code to access this (i.e. not sure which assembly to go look at), but this does resolve properly in the debug watch list! But now youre probably interested in the results:

C# Regular Expression Benchmarks - Corrected Results No Cache

Interestingly theres really no difference here. I suspect this is largely because the text were scanning through is large enough that the time it takes to evaluate regex matches mostly dwarfs the time it takes to create the regex objects. This is exactly what was not shown in the original article where the time to evaluate was basically zero because we never evaluated the matches.

Wrapping Up CSharp Regular Expression Benchmarks

Okay, now its time for some lessons learned

  1. Its okay to make mistakes. Were all human.

  2. Mistakes are a great way to take ownership and show others that we can grow from messing up!

  3. Failing is one of the best ways to learn. I had absolutely no idea MatchCollection was lazy after probably 10+ years of using Regex in C#.

At the end of the day, these new C# regular expression benchmarks are still very helpful in understanding just how big of an impact compiling your regex can have! Ideally, you dont want to be recreating new regex instances, especially with the compile flag but the performance of compiled regexes is awesome.

If you found this useful and youre looking for more learning opportunities, consider subscribing to my free weekly software engineering newsletter and check out my free videos on YouTube! Meet other like-minded software engineers and join my Discord community!

Dev Leader Weekly | Substack

My weekly newsletter that simplifies software engineering for you, along with C# code examples. Join thousands of software engineers from companies like Microsoft and Amazon who are already reading! Click to read Dev Leader Weekly, a Substack publication with thousands of subscribers.

favicon weekly.devleader.ca

Want More Dev Leader Content?

  • Follow along on this platform if you havent already!
  • Subscribe to my free weekly software engineering and dotnet-focused newsletter. I include exclusive articles and early access to videos:SUBSCRIBE FOR FREE
  • Looking for courses? Check out my offerings:VIEW COURSES
  • E-Books & other resources:VIEW RESOURCES
  • Watch hundreds of full-length videos on my YouTube channel:VISIT CHANNEL
  • Visit my website for hundreds of articles on various software engineering topics (including code snippets):VISIT WEBSITE
  • Check out the repository with many code examples from my articles and videos on GitHub:VIEW REPOSITORY

Original Link: https://dev.to/devleader/c-regular-expression-benchmarks-how-to-avoid-my-mistakes-2cef

Share this article:    Share on Facebook
View Full Article

Dev To

An online community for sharing and discovering great ideas, having debates, and making friends

More About this Source Visit Dev To