Concurrent memo: mutexes vs. channels

Background

Recently I saw an implementation of a concurrent cache that didn’t look quite right, and I was tempted to write one from scratch.

Go is a natural choice for implementation and testing, given its clear focus on concurrency, and its language primitives for exploiting parallelism and implementing synchronization.

I’m going to walk through my implementation of a concurrent memo using two different approaches for synchronizing concurrent access:

  • mutex-based
  • channel-based

Mutexes

I wanted to develop iteratively, using Go’s race detector to catch mistakes and guide development.

The script I used for informal functional testing was

package main

import (
	"fmt"
	"log"
	"sync"
	"time"

	"github.com/yangmillstheory/go-cache/cache"
)

var f = func(key string) (interface{}, error) {
	log.Printf("Computing value for key %s\n", key)
	time.Sleep(10 * time.Millisecond)
	return fmt.Sprintf("value for %s", key), nil
}

func main() {
	var wg sync.WaitGroup

	c := cache.New(f)
	n := 5
	keys := []string{"key"}

	getKey := func(key string, wg *sync.WaitGroup) {
		defer wg.Done()
		c.Get(key)
	}

	start := time.Now()
	for i := 0; i < n; i++ {
		for _, k := range keys {
			wg.Add(1)
			go getKey(k, &wg)
		}
	}

	wg.Wait()
	log.Printf("Elapsed: %s\n", time.Since(start))
}

This script tests concurrent access to the memo, simulating an expensive function to be memoized, and records the total time taken.

Testing followed the process outlined below1:

  • No duplicate computations for the same key: check stdout
  • Parallelism: total elapsed time should be around 10 ms
  • Race conditions: run with the -race flag and check stdout

I also wrote a benchmark

package cache_test

import (
	"fmt"
	"testing"
	"time"

	"github.com/yangmillstheory/go-cache/cache"
)

var f = func(key string) (interface{}, error) {
	time.Sleep(10 * time.Millisecond)
	return fmt.Sprintf("value for %s", key), nil
}

func BenchmarkGetSingleKey(b *testing.B) {
	c := cache.New(f)
	key := "mykey"
	c.Get(key) // warm the cache so the loop below measures lookup speed
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			c.Get(key)
		}
	})
}

to test the access speed.

Attempt 1 - Single-threaded mode

My first approach was as naive as possible, with no synchronization whatsoever.

type Func func(key string) (interface{}, error)

type FuncResult struct {
	val interface{}
	err error
}

type Cache struct {
	memo map[string]*FuncResult
	f    Func
}

func New(f Func) *Cache {
	return &Cache{f: f, memo: make(map[string]*FuncResult)}
}

func (c *Cache) Get(key string) (*FuncResult, bool) {
	r, ok := c.memo[key]
	if ok {
		return r, true
	}

	val, err := c.f(key)
	fr := &FuncResult{val, err}
	c.memo[key] = fr

	return fr, ok
}

This had data races and duplication of work

❗ ~/c/g/s/g/y/go-cache [49bda8f] (49bda8f)
(i) go run -race main.go
2018/05/08 20:12:51 Computing value for key key
2018/05/08 20:12:51 Computing value for key key
2018/05/08 20:12:51 Computing value for key key
2018/05/08 20:12:51 Computing value for key key
2018/05/08 20:12:51 Computing value for key key
==================
WARNING: DATA RACE
Write at 0x00c420096180 by goroutine 7:
  runtime.mapassign_faststr()
      /usr/local/go/src/runtime/hashmap_fast.go:694 +0x0
  github.com/yangmillstheory/go-cache/cache.(*Cache).Get()
      /Users/yangmillstheory/code/go-examples/src/github.com/yangmillstheory/go-cache/cache/cache.go:31 +0x1e3
  main.main.func1()
      /Users/yangmillstheory/code/go-examples/src/github.com/yangmillstheory/go-cache/main.go:31 +0x76

Previous read at 0x00c420096180 by goroutine 8:
  runtime.mapaccess2_faststr()
      /usr/local/go/src/runtime/hashmap_fast.go:261 +0x0
  github.com/yangmillstheory/go-cache/cache.(*Cache).Get()
      /Users/yangmillstheory/code/go-examples/src/github.com/yangmillstheory/go-cache/cache/cache.go:24 +0x7f
  main.main.func1()
      /Users/yangmillstheory/code/go-examples/src/github.com/yangmillstheory/go-cache/main.go:31 +0x76

Goroutine 7 (running) created at:
  main.main()
      /Users/yangmillstheory/code/go-examples/src/github.com/yangmillstheory/go-cache/main.go:29 +0x1d1

Goroutine 8 (running) created at:
  main.main()
      /Users/yangmillstheory/code/go-examples/src/github.com/yangmillstheory/go-cache/main.go:29 +0x1d1
==================
# ...bunch of output...
2018/05/08 20:12:51 Elapsed: 12.938069 ms
Found 8 data race(s)
exit status 66

but it did have parallelism, based on the elapsed time and the number of goroutines spawned.

The benchmark actually crashed early due to concurrent map writes, so I couldn’t get any useful data from it.

πŸ‘Œ ~/c/go-examples
(i) go test -bench=. github.com/yangmillstheory/go-cache/cache
goos: darwin
goarch: amd64
pkg: github.com/yangmillstheory/go-cache/cache
BenchmarkGetSingleKey-4         fatal error: concurrent map writes

Attempt 2 - Exclusive lock

To synchronize the concurrent access to memo, I used an exclusive lock.

type Cache struct {
    // mu guards memo
    mu   sync.Mutex
    memo map[string]*FuncResult
    f    Func
}
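
Here’s roughly what the locked Get looked like at this stage (a sketch: Attempt 1’s Get with the exclusive lock held for the whole method):

func (c *Cache) Get(key string) (*FuncResult, bool) {
	// The lock is held for the entire method, including the
	// potentially slow call to c.f, so all access is serialized.
	c.mu.Lock()
	defer c.mu.Unlock()

	r, ok := c.memo[key]
	if ok {
		return r, true
	}

	val, err := c.f(key)
	fr := &FuncResult{val, err}
	c.memo[key] = fr

	return fr, ok
}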

This approach eliminated the race condition2 and duplication of work. Total elapsed time was around 10 ms for 5 goroutines, so parallelism existed2.

Using an exclusive lock and treating the entire Get method as the critical section, though clear, struck me as a bit brutal, particularly since it’s typically a best practice to lock only those lines of code that aren’t safe to be run concurrently.

So I benchmarked the code and noted the result: 97.6 ns/op.

❗ ~/c/go-examples
(i) go test -bench=. github.com/yangmillstheory/go-cache/cache
goos: darwin
goarch: amd64
pkg: github.com/yangmillstheory/go-cache/cache
BenchmarkGetSingleKey-4         20000000                97.6 ns/op
PASS
ok      github.com/yangmillstheory/go-cache/cache       3.344s

Attempt 3 - Shared and exclusive locks

To try and improve performance, I used a shared lock when performing the initial check. The lack of a shared lock was actually the deficiency I saw in someone else’s implementation (mentioned above) that inspired this work.

type Cache struct {
	// mu guards memo
	mu   sync.RWMutex
	memo map[string]*FuncResult
	f    Func
}

func (c *Cache) Get(key string) (*FuncResult, bool) {
	c.mu.RLock()
	r, ok := c.memo[key]
	c.mu.RUnlock()
	if ok {
		return r, true
	}

	c.mu.Lock()
	val, err := c.f(key)
	fr := &FuncResult{val, err}
	c.memo[key] = fr
	c.mu.Unlock()

	return fr, ok
}

There was a roughly 10% performance improvement

πŸ‘Œ ~/c/go-examples
(i) go test -bench=. github.com/yangmillstheory/go-cache/cache
goos: darwin
goarch: amd64
pkg: github.com/yangmillstheory/go-cache/cache
BenchmarkGetSingleKey-4         20000000                87.8 ns/op
PASS
ok      github.com/yangmillstheory/go-cache/cache       1.891s

since with a shared lock, readers no longer block one another when the computation is already memoized.

The race detector didn’t complain either.

πŸ‘Œ ~/c/go-examples
(i) go test -race -bench=. github.com/yangmillstheory/go-cache/cache
goos: darwin
goarch: amd64
pkg: github.com/yangmillstheory/go-cache/cache
BenchmarkGetSingleKey-4         50000000                32.3 ns/op
PASS
ok      github.com/yangmillstheory/go-cache/cache       2.672s

However, occasionally I’d see duplicate computations:

πŸ‘Œ ~/c/g/s/g/y/go-cache [e5b6775] (blog-post⚑)
(i) go run main.go
2018/05/08 19:16:51 Computing value for key key
2018/05/08 19:16:51 Computing value for key key
2018/05/08 19:16:51 Elapsed: 22.589435 ms

Like a true race condition, this only happened rarely and in an irreproducible way. Note also that in this case, performance degrades, since parallelism is lost due to the exclusive locking around the computation.

Attempt 4 - Shared and exclusive locks with channels

In the previous attempt, how were computations duplicated?

It’s actually quite easy to spot: if several goroutines perform the initial check under the shared lock and all see a miss, they’ll all proceed to compute the result. The exclusive lock around the computation doesn’t prevent this; it merely serializes the contending writers, each of which still runs the computation.
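
Concretely, with two goroutines A and B and a cold cache, the interleaving looks like this:

// A: RLock -> miss -> RUnlock
// B: RLock -> miss -> RUnlock      (both observed a miss)
// A: Lock  -> f(key) -> memo[key] = fr -> Unlock
// B: Lock  -> f(key) again -> memo[key] overwritten -> Unlock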

So I opted to add a done map whose values are channels used for synchronization, which fixed the duplicate computations while preserving performance.

type Cache struct {
	// mu guards memo and done
	mu   sync.RWMutex
	done map[string]chan bool
	memo map[string]*FuncResult
	f    Func
}

func New(f Func) *Cache {
	return &Cache{
		f:    f,
		done: make(map[string]chan bool),
		memo: make(map[string]*FuncResult),
	}
}

func (c *Cache) Get(key string) (*FuncResult, bool) {
	c.mu.RLock()
	d, ok := c.done[key]
	c.mu.RUnlock()
	if ok {
		<-d
		return c.memo[key], true
	}

	c.mu.Lock()
	d, ok = c.done[key]
	if ok {
		c.mu.Unlock()
	} else {
		d = make(chan bool)
		c.done[key] = d
		c.mu.Unlock()

		val, err := c.f(key)
		fr := &FuncResult{val, err}
		c.memo[key] = fr
		close(d)
	}
	<-d
	return c.memo[key], ok
}

Note that this guarantees deduplication, since

  1. close(d) happens before <-d3 (see the illustration after this list)
  2. any read of c.memo[key] happens after <-d
  3. only one goroutine is allowed in the else, due to the exclusive lock and the second check against c.done[key]
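
The first point relies on channel semantics: a receive from a closed channel never blocks, so closing d releases every waiter at once, like a broadcast. A minimal, self-contained illustration:

package main

import (
	"fmt"
	"sync"
)

func main() {
	done := make(chan bool)
	var wg sync.WaitGroup

	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			<-done // blocks until done is closed, then returns immediately
			fmt.Printf("waiter %d released\n", i)
		}(i)
	}

	close(done) // observed by every waiter at once
	wg.Wait()
}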

However, at this point I realized I wasn’t testing access and parallelism across different keys. After adding just one more key, key2, to the test script,

func main() {
	var wg sync.WaitGroup

	c := cache.New(f)
	n := 5
	keys := []string{"key", "key2"}

	getKey := func(key string, wg *sync.WaitGroup) {
		defer wg.Done()
		c.Get(key)
	}

	start := time.Now()
	for i := 0; i < n; i++ {
		for _, k := range keys {
			wg.Add(1)
			go getKey(k, &wg)
		}
	}

	wg.Wait()
	log.Printf("Elapsed: %s\n", time.Since(start))
}

and running it, the race detector yelled at me.

πŸ‘Œ ~/c/g/s/g/y/go-cache [d2a9fae] (d2a9fae)
(i) go run -race main.go
2018/05/08 20:56:47 Computing value for key key
2018/05/08 20:56:47 Computing value for key key2
==================
WARNING: DATA RACE
Write at 0x00c4200881b0 by goroutine 7:
  runtime.mapassign_faststr()
      /usr/local/go/src/runtime/hashmap_fast.go:694 +0x0
  github.com/yangmillstheory/go-cache/cache.(*Cache).Get()
      /Users/yangmillstheory/code/go-examples/src/github.com/yangmillstheory/go-cache/cache/cache.go:47 +0x44c
  main.main.func1()
      /Users/yangmillstheory/code/go-examples/src/github.com/yangmillstheory/go-cache/main.go:27 +0x89

Previous write at 0x00c4200881b0 by goroutine 6:
  runtime.mapassign_faststr()
      /usr/local/go/src/runtime/hashmap_fast.go:694 +0x0
  github.com/yangmillstheory/go-cache/cache.(*Cache).Get()
      /Users/yangmillstheory/code/go-examples/src/github.com/yangmillstheory/go-cache/cache/cache.go:47 +0x44c
  main.main.func1()
      /Users/yangmillstheory/code/go-examples/src/github.com/yangmillstheory/go-cache/main.go:27 +0x89

Goroutine 7 (running) created at:
  main.main()
      /Users/yangmillstheory/code/go-examples/src/github.com/yangmillstheory/go-cache/main.go:34 +0x2ea

Goroutine 6 (finished) created at:
  main.main()
      /Users/yangmillstheory/code/go-examples/src/github.com/yangmillstheory/go-cache/main.go:34 +0x2ea
==================
2018/05/08 20:56:47 Elapsed: 12.319763 ms
Found 1 data race(s)
exit status 66

Note that parallelism still existed (otherwise we’d see an elapsed time of at least 20 ms), but the complaint was specifically about the write c.memo[key] = fr, which happens outside the exclusive lock and so races between goroutines computing different keys.

Attempt 5 - Shared and exclusive locks with channels (Take 2)

With slight modifications

func (c *Cache) Get(key string) (*FuncResult, bool) {
	c.mu.RLock()
	d, ok := c.done[key]
	fr, _ := c.memo[key]
	c.mu.RUnlock()
	if ok {
		<-d
		return fr, true
	}

	c.mu.Lock()
	d, ok = c.done[key]
	fr, _ = c.memo[key]
	if ok {
		c.mu.Unlock()
	} else {
		d = make(chan bool)
		fr = &FuncResult{}
		c.done[key] = d
		c.memo[key] = fr
		c.mu.Unlock()

		fr.val, fr.err = c.f(key)
		close(d)
	}
	<-d

	return fr, ok
}

the race conditions above were fixed with no change in performance

πŸ‘Œ ~/c/go-examples
(i) go test -bench=. github.com/yangmillstheory/go-cache/cache
goos: darwin
goarch: amd64
pkg: github.com/yangmillstheory/go-cache/cache
BenchmarkGetSingleKey-4         30000000                38.7 ns/op
PASS
ok      github.com/yangmillstheory/go-cache/cache       1.259s

or parallelism.

πŸ‘Œ ~/c/g/s/g/y/go-cache [d2a9fae] (d2a9fae⚑)
(i) go run -race main.go
2018/05/08 22:48:51 Computing value for key key
2018/05/08 22:48:51 Computing value for key key2
2018/05/08 22:48:51 Elapsed: 11.30273ms

All I did was move every access to shared data inside the locks, so this is really just a cleanup of the previous implementation.

Attempt 6 - Shared and exclusive locks with channels (Take 3)

At this point I realized I didn’t want to maintain another internal data structure (the done map) that I’d have to synchronize access to.

Remembering an implementation I saw elsewhere, I chose instead to have a struct composed of a done channel (describing the state of an outstanding computation) and a FuncResult, which is the actual result of the computation.

This is more elegant than carrying around two maps.

type result struct {
	fr   *FuncResult
	done chan bool
}

type Cache struct {
	mu   sync.RWMutex
	memo map[string]*result
	f    Func
}

func (c *Cache) Get(key string) (*FuncResult, bool) {
	c.mu.RLock()
	r, ok := c.memo[key]
	c.mu.RUnlock()
	if ok {
		<-r.done
		return r.fr, true
	}

	c.mu.Lock()
	r, ok = c.memo[key]
	if ok {
		c.mu.Unlock()
	} else {
		d := make(chan bool)
		r = &result{&FuncResult{}, d}
		c.memo[key] = r
		c.mu.Unlock()

		r.fr.val, r.fr.err = c.f(key)
		close(d)
	}
	<-r.done

	return r.fr, ok
}

This works because all accesses of memo are protected, and we get the same good performance

πŸ‘Œ ~/c/go-examples
(i) go test -bench=. github.com/yangmillstheory/go-cache/cache
goos: darwin
goarch: amd64
pkg: github.com/yangmillstheory/go-cache/cache
BenchmarkGetSingleKey-4         50000000                32.5 ns/op
PASS
ok      github.com/yangmillstheory/go-cache/cache       2.705s

and parallelism across different keys.

πŸ‘Œ ~/c/g/s/g/y/go-cache [1925332] (blog-post)
(i) go run -race main.go
2018/05/08 21:29:24 Computing value for key key
2018/05/08 21:29:24 Computing value for key key2
2018/05/08 21:29:24 Elapsed: 12.534349 ms

Note that the above is output for 5 goroutines per key, with a computation taking 10 ms.

Channels

I have to admit that the above implementation with mutexes wasn’t hard (at least in retrospect), but it wasn’t obvious either, and reasoning about concurrent access and locking is always tricky.

Go encourages sharing memory by communicating. A typical approach for protecting shared state that would otherwise be guarded by traditional mutexes is a “monitor goroutine”: a single goroutine that owns the data structure, with all reads and writes synchronized via channels.
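
As a minimal, self-contained illustration of the pattern (independent of the cache; the counter and channel names here are just for this sketch):

package main

import "fmt"

// count is owned by the monitor goroutine; no other goroutine touches it.
func monitor(incr chan struct{}, read chan chan int) {
	count := 0
	for {
		select {
		case <-incr:
			count++
		case out := <-read:
			out <- count
		}
	}
}

func main() {
	incr := make(chan struct{})
	read := make(chan chan int)
	go monitor(incr, read)

	for i := 0; i < 3; i++ {
		incr <- struct{}{}
	}

	out := make(chan int)
	read <- out
	fmt.Println(<-out) // prints 3
}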

Rather than outline the step-by-step process I took to arrive here (it definitely took more than one step), I’ll just show the result.

type Cache interface {
	Get(key string) *FuncResult
}

type request struct {
	key string
	out chan *FuncResult
}

type result struct {
	fr   *FuncResult
	done chan bool
}

type cache struct {
	f    Func
	memo map[string]*result
	reqs chan request
}

func (c *cache) Get(key string) *FuncResult {
	req := request{key, make(chan *FuncResult)}
	c.reqs <- req
	return <-req.out
}

func (c *cache) monitor() {
	for req := range c.reqs {
		key := req.key
		r, ok := c.memo[key]
		if !ok {
			r = &result{nil, make(chan bool)}
			c.memo[key] = r
			go r.compute(c.f, key)
		}
		go r.reply(req)
	}
}

func (r *result) compute(f Func, key string) {
	v, err := f(key)
	r.fr = &FuncResult{v, err}
	close(r.done)
}

func (r *result) reply(req request) {
	<-r.done
	req.out <- r.fr
}
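
One piece not shown above is the constructor: it has to start the monitor goroutine, otherwise every Get would block forever sending on c.reqs. A sketch, assuming the same Func type as the mutex-based version:

func New(f Func) Cache {
	c := &cache{
		f:    f,
		memo: make(map[string]*result),
		reqs: make(chan request),
	}
	// The monitor goroutine owns memo; all access is funneled through c.reqs.
	go c.monitor()
	return c
}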

Note that memo no longer has to be protected by mutexes, since it’s only ever accessed by the monitor goroutine. There’s also a new struct called request, which represents a request for a given key. Without this struct, it’s harder not to leak access to memo across other goroutines (try it).

As in the mutex-based approach, each result is stored in memo, but here they gain two methods

  • compute, which sets the result’s FuncResult and closes its done channel, signaling that the result is ready to read
  • reply, which waits on <-done, then sends the FuncResult to a request’s out channel

Finally, compute and reply have to run in goroutines separate from the monitor loop; otherwise requests for different, not-yet-computed keys would be serialized, and parallelism would be lost. To see why, consider the broken variant sketched below.
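
A sketch of the broken variant (monitorSerialized is a hypothetical name, shown only for contrast):

// Broken: replying inline blocks the monitor loop on <-r.done,
// so one slow computation stalls requests for every other key.
func (c *cache) monitorSerialized() {
	for req := range c.reqs {
		r, ok := c.memo[req.key]
		if !ok {
			r = &result{nil, make(chan bool)}
			c.memo[req.key] = r
			go r.compute(c.f, req.key)
		}
		r.reply(req) // blocks until this key's computation finishes
	}
}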

This approach is much clearer in my opinion, but the performance is roughly 96% worse (853 ns/op vs. 32.5 ns/op, about a 26x slowdown).

πŸ‘Œ ~/c/go-examples
(i) go test -bench=. github.com/yangmillstheory/go-cache/cache
goos: darwin
goarch: amd64
pkg: github.com/yangmillstheory/go-cache/cache
BenchmarkGetSingleKey-4          2000000               853 ns/op
PASS
ok      github.com/yangmillstheory/go-cache/cache       2.605s

This isn’t too surprising, since channels use locks internally, and the difference has been noted elsewhere. In practice, it likely won’t be the bottleneck, and in most cases I’d happily pay a few hundred extra nanoseconds per access in exchange for clarity and maintainability.

Conclusion

Concurrency is easy; synchronization is hard.

Luckily, Go offers some great built-in language primitives (not libraries) like channels that ease the burden of implementation. And if a channel-based approach ever feels awkward, or performance becomes an issue, you can always go back to plain old mutexes.


  1. Note that the script needs to test across multiple keys; my initial script was flawed as a test of race conditions and parallelism.
  2. At least for a single key; remember that the testing script was still only exercising one key at this point.
  3. This happens-before guarantee is why channels work as synchronization mechanisms in Go.