Shared Infrastructure

All times in local timezone

All systems operational

We're not aware of any issues affecting our systems.

(Viewing historical data)

Uptime (Jul 2023 - Dec 2023)

99.98%

How uptime is calculated

Uptime = (Total time - Contractual downtime) / Total time
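As a rough illustration of the formula above, the sketch below computes uptime for the Jul 2023 - Dec 2023 window. The function name `uptime_percent` is hypothetical. The example treats the four major outages listed in the incident history (12, 16, 4 and 16 minutes) as the only contractual downtime, which is an assumption made for illustration and not a statement of how the published figure was derived.

```python
from datetime import datetime, timedelta


def uptime_percent(start: datetime, end: datetime,
                   contractual_downtime: timedelta) -> float:
    """Uptime = (Total time - Contractual downtime) / Total time, as a percentage."""
    total = end - start
    if total <= timedelta(0):
        raise ValueError("end must be after start")
    # Dividing one timedelta by another yields a plain float ratio.
    return (total - contractual_downtime) / total * 100.0


# Assumption for illustration: only the four listed major outages count.
downtime = timedelta(minutes=12 + 16 + 4 + 16)
window_start = datetime(2023, 7, 1)
window_end = datetime(2024, 1, 1)  # exclusive end of the Jul-Dec 2023 window

print(round(uptime_percent(window_start, window_end, downtime), 2))  # 99.98
```

Degraded-performance periods and scheduled maintenance are left out of `contractual_downtime` here, in line with the definition given on this page.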

"Contractual downtime" is the period during which Fabriq is unavailable as commonly defined in our commercial contracts. It includes any periods of partial or full outages, excluding when Fabriq's liability is waived (e.g. unavailability due to natural disasters or third parties outside Fabriq's control). It does not include periods of degraded performance or scheduled maintenance.

Daily status (Jul 2023 - Dec 2023)

All days from 2023-07-01 to 2023-12-31 were operational, except:

degraded_performance: 2023-07-18, 2023-07-29, 2023-07-30, 2023-07-31, 2023-08-21, 2023-08-22, 2023-08-23
full_outage: 2023-09-15, 2023-10-11, 2023-11-08

Incident & Maintenance History (Jul 2023 - Dec 2023)

App is unavailable

Incident Resolved 12m
Major outage
Resolved

The root cause of the incident is still unknown. An investigation is underway to understand the underlying factors that led to it. A manual intervention procedure has been established should the incident recur.

Identified

The API is not responding to requests.

App is unavailable

Incident Resolved 16m
Major outage
Resolved

After restarting our servers, the API is back up. We are continuing our investigation to find the root cause of the incident.

Identified

The API is not responding to requests.

Servers lacked resources

Incident Resolved 4m
Major outage
Resolved

Following the assignment of an organizational role to a user, our messaging system triggered numerous simultaneous API requests, intended to notify users in the relevant environment about the change. The volume of these requests overloaded our instances. The incident resolved itself without further intervention. We are refining certain technical processes to manage data flow more effectively, so that the service stays stable even during peak activity.

Identified

The API is not responding. The incident was identified and reported by our monitoring tool.

App is unavailable

Incident Resolved 16m
Major outage
Resolved

Healthy instances were collected during the garbage collection of instances of failed deployments, leaving no healthy instances remaining. The problem was resolved by provisioning new instances and re-routing the traffic to them.

Identified

The API returns status 503, which makes both the API and the application unusable.
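One way to guard against the failure described in the resolution note above is to have the cleanup step refuse any plan that would leave no healthy instance. The sketch below is purely illustrative: the `Instance` type, its fields and `select_for_collection` are hypothetical names, not Fabriq's actual deployment tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    id: str
    deployment_failed: bool  # True if the instance belongs to a failed deployment
    healthy: bool            # True if the instance currently passes health checks


def select_for_collection(instances: list[Instance],
                          min_healthy: int = 1) -> list[Instance]:
    """Return the instances that are safe to collect.

    Only instances from failed deployments are candidates, and the whole
    collection is aborted if fewer than `min_healthy` healthy instances
    would survive it.
    """
    candidates = [i for i in instances if i.deployment_failed]
    survivors_healthy = sum(
        1 for i in instances if i.healthy and not i.deployment_failed
    )
    if survivors_healthy < min_healthy:
        return []  # abort: collecting would leave too few healthy instances
    return candidates
```

With this check, a run in which every instance is flagged for collection returns an empty list instead of removing all serving capacity.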

Degraded performance on asynchronous tasks

Incident Resolved 38h 52m
Degraded performance
Resolved

Performance on asynchronous tasks is back to normal. The incident was caused by messaging events congesting the asynchronous task queues. We took immediate action and created separate resources for messaging tasks and for other asynchronous tasks. This way, any future messaging issues will not affect other asynchronous tasks.

Identified

The root cause of the incident causing blockages in the processing of asynchronous tasks has been found and resolved. The asynchronous tasks that have been queued since the start of the incident are now being processed. Performance on asynchronous tasks will continue to be degraded until the queues are fully processed.

Identified

Developers detected the problem causing the degraded performance and are working on resolving it.

Identified

High volume of messaging events caused degraded performance on asynchronous tasks. This degraded performance affected email notifications, datapoint imports, webhooks and the messaging system.
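The resolution note for this incident mentions separating the resources that handle messaging tasks from those that handle other asynchronous tasks. A minimal sketch of that idea, using in-memory queues, is shown below. The `TaskRouter` class, the queue names and the task kinds are all hypothetical and stand in for whatever queueing service is actually used.

```python
from collections import deque

# Hypothetical set of task kinds that belong to the messaging system.
MESSAGING_KINDS = {"messaging"}


class TaskRouter:
    """Route tasks to one of two independent queues."""

    def __init__(self) -> None:
        self.queues = {"messaging": deque(), "default": deque()}

    def enqueue(self, kind: str, payload) -> str:
        """Append the task to its queue and return the queue name."""
        name = "messaging" if kind in MESSAGING_KINDS else "default"
        self.queues[name].append((kind, payload))
        return name

    def next_task(self, queue_name: str):
        """Pop the oldest task from the named queue, or None if it is empty."""
        queue = self.queues[queue_name]
        return queue.popleft() if queue else None
```

Because each queue is drained by its own worker, a backlog of messaging events no longer delays email notifications, datapoint imports or webhooks waiting in the other queue.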

Ticket creation impossible

Incident Resolved 42h 35m
Degraded performance
Resolved

Incident resolved. Asynchronous tasks are operational again and availability of the API is fully restored. The incident was caused by congestion of the service that stores asynchronous tasks.

On July 28, from around 22:00 CEST, our API started to receive an inordinate number of calls from a specific customer. This enqueued asynchronous tasks at a higher throughput than our system can handle.

On July 29, the excessive use of the API caused short outages (31 seconds at 16:02 CEST and 21 seconds at 17:03 CEST). Our on-call engineers reacted by blocking API use for that specific customer at 17:09 CEST. This restored the availability of the API. We did not realize that there was still partial unavailability on operations that depend on asynchronous tasks, such as ticket creation and datapoint uploads. Because the incident occurred on a weekend, there was no feedback from users or from our team.

On July 31 at 7:28 CEST, we noticed the lingering effects of the incident. At 9:00 CEST, we purged the service storing asynchronous tasks, which restored both the API and the asynchronous tasks system, at the cost of a loss of data. 787 asynchronous tasks were deleted, affecting mostly 5 customers (who are being contacted directly).

We plan on taking the following actions:
- Perform a complete postmortem in the next 15 days.
- Revise our incident response process so that it includes more tests before declaring an incident resolved.
- Introduce more stringent rate limiting for API users.

Identified

Services related to the asynchronous tasks system are degraded. Some operations that depend on asynchronous tasks (such as ticket creation and datapoint imports) are unavailable.
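The follow-up actions for this incident include more stringent rate limiting for API users. The page does not say which scheme is planned. One common option is a token bucket, sketched below under that assumption. The `TokenBucket` class and its parameters are hypothetical and do not describe Fabriq's implementation.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Return True and consume one token if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Keeping one bucket per customer (for example in a dictionary keyed by customer ID) would throttle a single heavy caller without affecting the others.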

Cloudflare experiencing issues

Incident Resolved 1h 1m
Degraded performance
Resolved

Cloudflare is still experiencing issues, but we have implemented a way to circumvent the use of Workers KV when it is unavailable.

Identified

Our provider, Cloudflare, is experiencing issues with its Workers KV product, which is critical for Fabriq. See their status page: https://www.cloudflarestatus.com/incidents/g728y3w79kw8