Write-up published
Resolved
On October 1 at 11:16 PM UTC (7:16 PM ET) we executed a DDL statement on a production Clickhouse cluster that caused all writes to the altered table to fail. The statement changed the type of a column from String to LowCardinality(String). Writes failed because our Clickhouse driver serializes row data to a binary format, and to do so it must declare the exact type of each field. We did not realize that the Clickhouse server would reject inserts that declare a String field header for a LowCardinality(String) column.
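The mismatch described above can be caught before any write is attempted. The following is a minimal sketch, not our production code: the function name and the column names are hypothetical, and it assumes the writer's declared types and the table's current types are both available as name-to-type mappings (the table side could, for example, be read from Clickhouse's `system.columns` table).

```python
from typing import Dict, List, Optional, Tuple


def find_type_mismatches(
    writer_types: Dict[str, str],
    table_types: Dict[str, str],
) -> List[Tuple[str, str, Optional[str]]]:
    """Return (column, writer_type, table_type) for every column whose
    type as declared by the writer differs from the table's type.

    A column missing from the table is reported with table_type None.
    """
    mismatches = []
    for column, writer_type in writer_types.items():
        table_type = table_types.get(column)
        if table_type != writer_type:
            mismatches.append((column, writer_type, table_type))
    return mismatches


# Hypothetical column names, for illustration only.
writer_types = {"workflow_id": "String", "level": "String"}
table_types = {"workflow_id": "String", "level": "LowCardinality(String)"}

mismatches = find_type_mismatches(writer_types, table_types)
print(mismatches)
```

A check like this compares type strings exactly, which matches the failure mode in this incident: String and LowCardinality(String) are treated as different types even though they hold the same values.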
We were notified immediately and identified the problem within a few minutes. We reverted the DDL statement 13 minutes after it was first executed, and log inserts started succeeding again.
On October 3 at 3:04 PM UTC (11:04 AM ET) we completed a backfill of the missing 13 minutes of data from a Kinesis stream.
The root cause of the incident was that an engineer ran a DDL statement in production without adequately testing it in their local dev or staging environments. While it is sometimes advantageous to run DDL statements directly in prod to tune indexes, this optimization should have flowed through our regular DB migration code + CI process.
During incident remediation we implemented a framework and runbook for replaying our Clickhouse log consumers from Kinesis. Future replay operations will be much quicker.
Update internal Clickhouse runbook with guidance on migrations, add pitfalls section about column types
Investigate changing our driver’s row serialization format so we can easily migrate type T to LowCardinality(T) and avoid this specific issue in the future
Resolved
We've resolved the underlying issue and logs are flowing. We're restoring the missing data and will post an update when backfill is complete.
Identified
We've identified an issue where workflow run logs generated between October 1 11:16 PM UTC (7:16 PM ET) and 11:29 PM UTC (7:29 PM ET) are unavailable in the dashboard. We are working to restore the log data. This issue did not affect workflow execution or message delivery.