Fabriq status page

App is unavailable

Incident Resolved 12m Nov 8, 2023, 5:23 PM UTC

Major outage

Resolved

The root cause of the incident is still unknown, and an investigation is underway. A manual intervention procedure has been established in case the incident recurs.

Nov 8, 2023, 5:35 PM UTC

Identified

The API is not responding to requests.

Nov 8, 2023, 5:23 PM UTC

App is unavailable

Incident Resolved 16m Nov 8, 2023, 4:04 PM UTC

Major outage

Resolved

After restarting our servers, the API is back up. We are continuing our investigation to find the root cause of the incident.

Nov 8, 2023, 4:20 PM UTC

Identified

The API is not responding to requests.

Nov 8, 2023, 4:04 PM UTC

Servers lacked resources

Incident Resolved 4m Oct 11, 2023, 8:41 AM UTC

Major outage

Resolved

After an organizational role was assigned to a user, our messaging system triggered numerous simultaneous API requests to notify users in the relevant environment about the change. The volume of these requests exhausted resources across our instances. The incident resolved itself without further intervention. We are refining certain technical processes to manage data flow more effectively, so that the service stays stable during peak load.

Oct 11, 2023, 8:45 AM UTC
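
One common way to keep a notification fan-out from overwhelming an API is to cap how many requests are in flight at once. The following is a minimal sketch of that idea only, not a description of the change Fabriq made; the function name `notify_all`, the `send` callback and the `max_concurrent` parameter are all hypothetical.

```python
import asyncio


async def notify_all(user_ids, send, max_concurrent=10):
    """Send one notification per user, with a cap on requests in flight.

    `send` is a hypothetical coroutine function that performs one API call.
    """
    limit = asyncio.Semaphore(max_concurrent)

    async def notify_one(user_id):
        # The semaphore blocks here once max_concurrent calls are running.
        async with limit:
            await send(user_id)

    await asyncio.gather(*(notify_one(uid) for uid in user_ids))
```

With a cap in place, a role change that notifies thousands of users produces a steady stream of calls rather than one simultaneous burst.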

Identified

The API is not responding. The incident was identified and reported by our monitoring tool.

Oct 11, 2023, 8:41 AM UTC

App is unavailable

Incident Resolved 16m Sep 15, 2023, 2:07 PM UTC

Major outage

Resolved

While garbage-collecting instances from failed deployments, the cleanup process also removed healthy instances, leaving none in service. The problem was resolved by provisioning new instances and re-routing traffic to them.

Sep 15, 2023, 2:23 PM UTC
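
A cleanup pass of this kind can be guarded so that it never selects a healthy instance and does nothing at all when too few healthy instances exist. The sketch below illustrates such a guard under assumed data shapes; the field names `deployment_failed` and `healthy`, the function name and the `min_healthy` threshold are hypothetical and do not reflect Fabriq's actual tooling.

```python
def select_instances_to_collect(instances, min_healthy=1):
    """Return the ids of instances that are safe to delete.

    `instances` maps an instance id to a dict with two hypothetical
    boolean fields: "deployment_failed" and "healthy".
    """
    healthy_count = sum(1 for info in instances.values() if info["healthy"])
    # Refuse to delete anything when the fleet is already below the floor.
    if healthy_count < min_healthy:
        return []
    # Only unhealthy instances from failed deployments are candidates.
    return [
        instance_id
        for instance_id, info in instances.items()
        if info["deployment_failed"] and not info["healthy"]
    ]
```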

Identified

The API returns status 503, which makes both the API and the application unusable.

Sep 15, 2023, 2:07 PM UTC

Degraded performance on asynchronous tasks

Incident Resolved 38h 52m Aug 21, 2023, 10:32 PM UTC

Degraded performance

Resolved

Performance on asynchronous tasks is back to normal. The incident was caused by messaging events congesting the asynchronous task queues. We immediately created separate resources for messaging tasks and for other asynchronous tasks, so that future messaging issues will not affect other asynchronous tasks.

Aug 23, 2023, 1:24 PM UTC
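
The isolation described in this update can be pictured as routing messaging tasks to their own queue so that a backlog there cannot delay anything else. This is an illustrative sketch only; the class `TaskRouter`, the queue names and the task kinds are hypothetical, and the real system may separate resources at a different layer.

```python
import queue

# Hypothetical set of task kinds that belong on the messaging queue.
MESSAGING_KINDS = {"messaging"}


class TaskRouter:
    """Route tasks to one of two independent in-memory queues."""

    def __init__(self):
        self.queues = {"messaging": queue.Queue(), "default": queue.Queue()}

    def enqueue(self, kind, payload):
        # Messaging tasks get their own queue; everything else shares "default".
        name = "messaging" if kind in MESSAGING_KINDS else "default"
        self.queues[name].put((kind, payload))
        return name
```

Each queue would then be drained by its own workers, so a flood of messaging events only lengthens the messaging queue.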

Identified

The root cause of the blockages in the processing of asynchronous tasks has been found and resolved. The asynchronous tasks queued since the start of the incident are now being processed. Performance on asynchronous tasks will remain degraded until the queues are fully processed.

Aug 22, 2023, 7:35 AM UTC

Identified

Developers detected the problem causing the degraded performance and are working on resolving it.

Aug 22, 2023, 7:05 AM UTC

Identified

High volume of messaging events caused degraded performance on asynchronous tasks. This degraded performance affected email notifications, datapoint imports, webhooks and the messaging system.

Aug 21, 2023, 10:32 PM UTC

Ticket creation impossible

Incident Resolved 42h 35m Jul 29, 2023, 12:25 PM UTC

Degraded performance

Resolved

Incident resolved. Asynchronous tasks are operational again and availability of the API is fully restored. The incident was caused by congestion of the service that stores asynchronous tasks.
On July 28, from around 22:00 CEST, our API started to receive an inordinate number of calls from a specific customer. This enqueued asynchronous tasks at a higher throughput than our system can handle.
On July 29, this excessive use of the API caused short outages (31 seconds at 16:02 CEST and 21 seconds at 17:03 CEST). Our on-call engineers reacted by blocking API use for that specific customer at 17:09 CEST, which restored the availability of the API. We did not realize that operations that depend on asynchronous tasks, such as ticket creation and datapoint uploads, were still partially unavailable. We received no feedback from users or from our team, as the incident occurred on a weekend.
On July 31 at 7:28 CEST, we noticed the lingering effects of the incident. At 9:00 CEST, we purged the service storing asynchronous tasks, which restored both the API and the asynchronous task system, at the cost of a loss of data. 787 asynchronous tasks were deleted, mostly affecting 5 customers (who are being contacted directly).
We plan on taking the following actions:
- Perform a complete postmortem in the next 15 days.
- Revise our incident response process so that it includes more tests before declaring an incident resolved.
- Introduce more stringent rate limiting for API users.

Jul 31, 2023, 7:00 AM UTC
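
The rate limiting mentioned in the planned actions is often implemented as a token bucket kept per customer. The sketch below shows that general technique; it is an assumption about one possible approach, not Fabriq's implementation, and the class name, the `rate` and `capacity` parameters and the injectable `clock` are hypothetical.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A request for which `allow()` returns False would typically be rejected with HTTP status 429 instead of enqueuing more asynchronous work.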

Identified

Services related to the asynchronous task system are degraded. Some operations that depend on asynchronous tasks (such as ticket creation and datapoint imports) are unavailable.

Jul 29, 2023, 12:25 PM UTC

Cloudflare experiencing issues

Incident Resolved 1h 1m Jul 18, 2023, 5:49 AM UTC

Degraded performance

Resolved

Cloudflare is still experiencing issues, but we have implemented a way to circumvent the use of Workers KV when it is unavailable.

Jul 18, 2023, 6:50 AM UTC
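
Circumventing an unavailable key-value store usually means trying it first and falling back to another source when the read fails. The sketch below shows that pattern in generic form; it does not use the Workers KV API, and `read_with_fallback`, `primary` and `fallback` are hypothetical names standing in for whatever Fabriq actually deployed.

```python
def read_with_fallback(key, primary, fallback):
    """Read `key` from `primary`, using `fallback` if the primary read fails.

    Both `primary` and `fallback` are hypothetical callables taking a key.
    """
    try:
        return primary(key)
    except Exception:
        # Any failure of the primary store is treated as "unavailable" here;
        # a real system would likely catch a narrower set of errors.
        return fallback(key)
```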

Identified

Our provider, Cloudflare, is experiencing issues with its Workers KV product, which is critical for Fabriq. See their status page: https://www.cloudflarestatus.com/incidents/g728y3w79kw8

Jul 18, 2023, 5:49 AM UTC

Shared Infrastructure

Uptime (Jul 2023 - Dec 2023)