Monitoring Ethereum infra

A.k.a. "How do I debug Geth?!"


10 October 2019

Devcon 5, Osaka



Péter Szilágyi

Go Ethereum Lead

Logging – Glorified `printf`

...either too silent... ...or too verbose...

Operationality is a spectrum

[Figure: spectrum of operational states: Perfect → Works → Kind of → Barely → Fails → Crashes]

Software is never perfect... and rarely doesn't work...

Operating an infrastructure is a game of trade-offs

Monitoring and metrics

Not a new concept, just a new dimension

Abstract goals of monitoring 😕

Practically, you need to have questions

Novice – How healthy is a system and at what cost?

Every metric tells a story... but we need to go deeper

Intermediate – What is the system doing with its resources? (Net)

Transaction propagation at 1.1MB/s ⇒ New propagation algo
Synchronization data serving at 1.3MB/s ⇒ New seed node mechanism

Intermediate – What is the system doing with its resources? (CPU)

CPU load comes up in curious places

Intermediate – What is the system doing with its resources? (Disk)

This doesn't look like much...

Accurate disk I/O measurement is notoriously hard

We added our disk monitoring *into* LevelDB itself! Can't split by data type though.
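
A minimal sketch of measuring at the storage layer rather than at the OS level. This is not Geth's actual LevelDB instrumentation: `kvStore`, `memStore` and `meteredStore` are hypothetical names, and the toy backend is an in-memory map. It only counts logical bytes crossing the `Put`/`Get` boundary; real disk traffic (compaction, for instance) would need counters inside the database itself.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// kvStore is a hypothetical minimal key-value interface standing in for a
// database such as LevelDB.
type kvStore interface {
	Put(key, value []byte)
	Get(key []byte) []byte
}

// memStore is a toy in-memory backend used only for this illustration.
type memStore map[string][]byte

func (m memStore) Put(key, value []byte) { m[string(key)] = value }
func (m memStore) Get(key []byte) []byte { return m[string(key)] }

// meteredStore wraps a kvStore and counts the bytes written and read at the
// database boundary, instead of inferring them from OS-level disk statistics.
type meteredStore struct {
	inner        kvStore
	bytesWritten atomic.Int64
	bytesRead    atomic.Int64
}

// Put records the size of the key and value, then forwards the write.
func (s *meteredStore) Put(key, value []byte) {
	s.bytesWritten.Add(int64(len(key) + len(value)))
	s.inner.Put(key, value)
}

// Get forwards the read and records the size of the returned value.
func (s *meteredStore) Get(key []byte) []byte {
	value := s.inner.Get(key)
	s.bytesRead.Add(int64(len(value)))
	return value
}

func main() {
	db := &meteredStore{inner: memStore{}}
	db.Put([]byte("head"), []byte("0xabc"))
	db.Get([]byte("head"))
	fmt.Println("written:", db.bytesWritten.Load(), "read:", db.bytesRead.Load())
}
```

Because the wrapper only sees opaque keys and values, it shares the limitation from the slide: the counters cannot be split by data type without extra knowledge of the key layout.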

Advanced – Why is the system doing something in particular?

Maybe ~1/3rd of the light server charts...
Sisyphean task: measure the tiniest intricacies of the current algorithms... which might become stale with the first update...

All Star – How does a change influence the system? (Seek PR – Good)

Green was master, yellow was the proposed change

All Star – How does a change influence the system? (File size PR – Bad)

Blue was master, purple was the proposed change
* Dashboard is messy as it was flattened into a single screen for PR posterity

All Star – How does a change influence the system? (Trie – Tradeoff)

Most performance decisions are hard

Sometimes degrading certain things is necessary 😞

Close-up of the Shanghai DoS attacks
Zoomed-out view of the same code across a full import

Monitoring infrastructure

Enterprise QoS or self-hosted OSS? ¹

Time series database for Grafana? ¹

¹ Geth supports all combos! ExpVar and Prometheus scraping (--metrics) + InfluxDB pushing (--metrics.influxdb)
² Our own current Grafana & InfluxDB dashboards: Single Geth | Multi Geth | Dual Geth
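
To illustrate the "scraping" model in general terms, here is a small sketch using Go's standard `expvar` package. It is not Geth's metrics code: the counter name `example_imported_blocks` and the helper `scrapeInt` are assumptions made up for this example, and a throwaway `httptest` server stands in for a real metrics endpoint.

```go
package main

import (
	"encoding/json"
	"expvar"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// importedBlocks is a hypothetical counter; the name is not a Geth metric.
var importedBlocks = expvar.NewInt("example_imported_blocks")

// scrapeInt serves the standard expvar handler on a temporary local server,
// fetches the JSON document a scraper would see, and extracts one integer.
func scrapeInt(name string) (int64, error) {
	srv := httptest.NewServer(expvar.Handler())
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	// expvar publishes a single JSON object keyed by variable name.
	var vars map[string]json.RawMessage
	if err := json.NewDecoder(resp.Body).Decode(&vars); err != nil {
		return 0, err
	}
	raw, ok := vars[name]
	if !ok {
		return 0, fmt.Errorf("metric %q not published", name)
	}
	var value int64
	err = json.Unmarshal(raw, &value)
	return value, err
}

func main() {
	importedBlocks.Add(3) // the application updates the counter as it works
	value, err := scrapeInt("example_imported_blocks")
	if err != nil {
		panic(err)
	}
	fmt.Println("scraped example_imported_blocks =", value)
}
```

With scraping, the process only exposes its current values and the collector decides when to read them. A push-based setup inverts that: the process sends samples to the database on its own schedule.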

Lessons learned

Measure as low as you can

Measure your worst-case numbers

Measure everything that you can afford to

Thank you

