Without monitoring, your app is running blindly in production.
27/02/2026
Deployment is just the beginning. If your app isn't monitored, your users will discover every problem before you do. What you need isn't pretty dashboards or alerts for everything. You need to know what to watch, what to ignore, and when to take action.
Your team has been working for weeks on a new feature. You've tested it, it's gone through the CI/CD pipeline, and it's been in production since this morning. Everything seems to be going well. Until a user writes to support saying the app is slow. Without monitoring, your only alert system is the users themselves.
Monitoring your app in production isn't a luxury you can postpone until the project grows. It's what separates a team that merely reacts from one that knows about problems before its users do. Without monitoring, your problem-finding strategy is "wait for someone to complain." And by the time someone complains, the problem has been there for hours or days.
You don't need a control room with giant screens. You need the right signals, the right thresholds, and a system that alerts you when something goes wrong.
What monitoring means (without the enterprise noise)
Monitoring a production application means observing its behavior continuously and automatically. The goal: to detect problems before your users notice them. It's not Big Data, it's not NASA-level observability, and it doesn't require a dedicated SRE team.
In practice, monitoring boils down to three questions your system should be able to answer at any time: Is the app running? Is it working well? Has anything changed since yesterday?
If you can answer those three questions without connecting to a server or manually opening logs, your monitoring is functional. If you can't, you have a blind spot that will eventually cost you users, money, or both.
The four metrics that really matter
There are dozens of metrics you can monitor: CPU usage, memory, disk, network, database connections, queries per second, queue size, DNS latency. The list is endless, and that's the problem: if you monitor everything, you monitor nothing.
For a team managing a production project with limited resources, there are four metrics that cover 90% of what you need to know.
Availability (uptime). Is your app responding? Not just whether it responds quickly, but simply whether it responds at all. A check every minute makes an HTTP request to your main endpoint and verifies that it returns a 200 status code. If it stops responding, you'll find out in 60 seconds, not 60 minutes. For a Django backend deployed on AWS or a VPS, this is as simple as a healthcheck endpoint that verifies the app and database are up and running.
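As a sketch of that idea, here is the core of a health check, written framework-agnostically so the logic is testable on its own (the `check_db` callable is an assumption, e.g. a function running `SELECT 1` through Django's `connection.cursor()`; in Django you'd wrap this in a view returning `JsonResponse`):

```python
import json

def healthcheck(check_db):
    """Minimal /health logic: 200 if the app and DB respond, 503 otherwise.

    `check_db` is any zero-argument callable that raises on failure,
    e.g. a function running `SELECT 1` against the database.
    """
    try:
        check_db()
    except Exception as exc:
        return 503, json.dumps({"status": "error", "detail": str(exc)})
    return 200, json.dumps({"status": "ok"})
```

Point your uptime checker at this endpoint rather than the homepage: a cached homepage can keep returning 200 long after the database has gone away.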
Response time (latency). Does your app respond in a reasonable time? The average response time is useful but misleading. If 95% of requests take 200ms and 5% take 8 seconds, the average seems acceptable. But that 5% of users has a terrible experience. Monitor the 95th percentile (p95): the time within which 95% of requests complete. If p95 spikes, something is wrong even if the average remains the same.
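To make the difference concrete, here is a minimal nearest-rank percentile over a list of latencies (illustrative only; in practice these numbers come pre-aggregated from your metrics backend):

```python
from statistics import mean

def percentile(samples, pct):
    """Nearest-rank percentile: the value below which ~pct% of samples fall."""
    ranked = sorted(samples)
    k = max(0, round(pct / 100 * len(ranked)) - 1)
    return ranked[k]

# 90 fast requests and 10 slow ones: the mean hides the tail, p95 exposes it
latencies_ms = [200] * 90 + [8000] * 10
print(mean(latencies_ms))            # 980.0 ms — looks tolerable
print(percentile(latencies_ms, 95))  # 8000 ms — what the slowest users feel
```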
Error rate. What percentage of requests return an error? A 0.1% error rate (requests returning a 500) on a normal day might be acceptable. A 2% error rate is a clear sign that something is broken. The important thing isn't the absolute number but the change: if yesterday you had 0.1% and today you have 1.5%, something has happened in between. Probably the last deployment.
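A sketch of that "alert on the change, not the absolute number" rule (the `factor` and `floor` values are illustrative and should be tuned per service):

```python
def error_rate_jumped(yesterday, today, factor=5.0, floor=0.001):
    """Flag a relative jump in error rate rather than an absolute level.

    `floor` avoids alerting on noise when both rates are near zero;
    `factor` is how many times worse today must be before alerting.
    """
    return today >= floor and today >= factor * max(yesterday, floor / factor)
```

With the defaults, going from 0.1% to 1.5% overnight fires the alert, while drifting from 0.10% to 0.12% does not.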
Saturation. Are your resources nearing their limits? CPU at 90%, memory at 95%, disk full, database connections exhausted. These metrics don't tell you something has failed; they tell you something is about to fail. They're the only ones that give you room to act before the incident, not after.
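As a tiny stdlib sketch of one such check (the 90% threshold is an example; in practice an agent like Prometheus node_exporter or CloudWatch collects this for you):

```python
import shutil

def disk_usage_pct(path="."):
    """Percentage of disk space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100 * (usage.total - usage.free) / usage.total

pct = disk_usage_pct()
if pct >= 90:  # example threshold: act *before* the disk fills up
    print(f"disk nearly full: {pct:.1f}% used")
```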
These four metrics are not our invention. They closely track the Four Golden Signals (latency, traffic, errors, saturation) that Google formalized in its SRE book, which have become the de facto standard for service monitoring.
Logs, metrics, and traces: what each one is
If you're just starting out with monitoring, these three concepts can easily get mixed up. But they are different tools for different problems.
Logs are records of individual events. Examples include: "User X attempted to log in and failed," "The request to /api/orders took 3.2 seconds," and "Error connecting to the payment gateway." They are the most detailed and useful level of data for diagnosing a specific problem. Django generates logs by default, but configuring them properly (structured format, appropriate levels, file rotation) makes all the difference between useful logs and noise.
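As a starting point, here is a minimal `dictConfig` with a consistent format and sane levels; the same dictionary goes under `LOGGING` in a Django `settings.py` (handler and logger names are choices, not requirements):

```python
import logging
import logging.config

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        # One consistent, grep-friendly line per event
        "structured": {
            "format": "%(asctime)s %(levelname)s %(name)s %(message)s",
        },
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "structured",
        },
        # For rotation, add a handler with
        # "class": "logging.handlers.RotatingFileHandler",
        # "maxBytes": 10_000_000, "backupCount": 5
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}

logging.config.dictConfig(LOGGING)
logging.getLogger("orders").info("request to /api/orders took %d ms", 3200)
```

Console output is deliberate: under Docker or systemd, stdout is what gets shipped to the centralized service.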
Metrics are numerical values aggregated over time. "Requests per second," "memory usage," "errors per minute." They don't tell you exactly what happened, but they tell you when something changed. They're the alarm system: metrics wake you up at 3 a.m., logs tell you why.
Traces follow a request from beginning to end through your system. The user clicks "buy." The request passes through the web server, reaches Django, queries the database, and calls the payment gateway. A trace shows you where the process got stuck. For a Django monolith, traces are less critical than for a microservices architecture. But if your app interacts with external services (payments, email, third-party APIs), knowing where time is being wasted is valuable.
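A homemade sketch of the idea (real tracing would use something like OpenTelemetry; this just times named steps of one request so the slowest one stands out):

```python
import time
from contextlib import contextmanager

@contextmanager
def span(name, sink):
    """Time a named step and append (name, milliseconds) to `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink.append((name, (time.perf_counter() - start) * 1000))

# One request, two steps: the slowest span is where the request got stuck
timings = []
with span("db_query", timings):
    time.sleep(0.01)   # stand-in for the ORM query
with span("payment_gateway", timings):
    time.sleep(0.05)   # stand-in for the external call
print(max(timings, key=lambda t: t[1])[0])  # → payment_gateway
```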
To begin with, well-configured logs and four basic metrics are sufficient. Traces can come later, when the complexity of your system warrants it.
Tools without overkill
There are monitoring tools for every budget and level of complexity. The most common mistake is choosing an enterprise tool when your project needs something pragmatic.
For uptime and basic alerts: UptimeRobot or Better Stack (formerly Better Uptime) offer free plans that monitor HTTP endpoints and notify you via email, Slack, or SMS if your app stops responding. They can be set up in five minutes and cover the most basic need: knowing your app is up and running.
For server and application metrics: If your infrastructure is on AWS, CloudWatch is included and provides CPU, memory, and disk metrics without any installation. If you're using a VPS (Hetzner or similar), Prometheus with Grafana is the go-to open-source combination. It requires more initial setup, but it's free and scales well. For something in between, Datadog and New Relic offer limited free plans that are sufficient for small projects.
For logging: In a Django project, Python's built-in logging setup already provides a solid foundation. Send logs to a centralized service (Papertrail, Logtail, or CloudWatch Logs) so you can search and filter them without connecting to the server. Reviewing logs via SSH works for one server. When you have two or three, you need a central location.
For application errors: Sentry is the standard for exception catching. It integrates with Django and Flutter in minutes and shows you every error with its stack trace, the affected user, and the request context. The free plan is more than enough for a growing project. If an error occurs in production, you'll see it in Sentry before the user even contacts support.
The rule: start with the bare minimum that gives you visibility. A health check, Sentry for errors, and basic server metrics. You can iterate later.
Alerts: the art of not getting overwhelmed
Monitoring without alerts is like a dashboard no one looks at. But poorly configured alerts are worse than no alerts at all: they lead to fatigue, the team starts ignoring them, and when a real alert arrives, no one reacts.
Alert only when action is required. If an alert is triggered and the team's response is "we'll look into it tomorrow," that alert shouldn't exist. Every alert should have a clear action associated with it: restarting a service, investigating a recent deployment, escalating resources.
Distinguish between urgent and informational. "The app isn't responding" is urgent: Slack, SMS, whatever it takes to get someone to see it in minutes. "CPU usage is up 15% compared to yesterday" is informational: an email or a message in a monitoring channel that someone will check during work hours. Mixing both levels in the same channel is the fastest way for the team to ignore everything.
Use thresholds with context. A response time of 500ms might be normal for a request generating a PDF report but disastrous for an endpoint returning a 10-field JSON object. Thresholds should reflect what's normal for each service, not an arbitrary number applied to everything.
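One lightweight way to encode that context is a per-endpoint latency budget with a default for everything else (paths and numbers below are illustrative):

```python
# Per-endpoint latency budgets in ms; anything unlisted gets the default
LATENCY_BUDGET_MS = {
    "/api/orders": 500,    # small JSON response: should be fast
    "/reports/pdf": 8000,  # PDF generation: slow is normal here
}
DEFAULT_BUDGET_MS = 1000

def is_too_slow(endpoint, latency_ms):
    """Compare a measured latency against that endpoint's own budget."""
    return latency_ms > LATENCY_BUDGET_MS.get(endpoint, DEFAULT_BUDGET_MS)
```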
Less is more. A small team should have between five and ten active alerts, not fifty. Each new alert you add dilutes the attention given to existing ones. According to the DORA report, high-performing teams have fewer but more relevant alerts. They respond faster precisely because they aren't overwhelmed by noise.
What to monitor depending on the type of project
Not all projects require the same level of monitoring. What you monitor depends on what could go wrong and the potential impact of that failure.
E-commerce or app with payments. The purchase flow is paramount. Monitor transaction success rates, payment gateway latency, and any checkout errors. A 1% payment failure rate can mean thousands of euros lost per month. Here, an end-to-end test that simulates a purchase every five minutes is invaluable.
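A sketch of such a synthetic check (the endpoints and the `get` callable are hypothetical; in a real script `get` would wrap `urllib.request` or similar, and a scheduler or your uptime tool would run this every few minutes):

```python
def synthetic_purchase(get):
    """Walk the critical purchase path in order.

    `get` is any (url) -> status_code callable, so the walk logic
    stays independent of the HTTP client used to run it.
    """
    steps = ["/api/products", "/api/cart/add", "/api/checkout/test"]
    for url in steps:
        if get(url) != 200:
            return False, url  # report exactly which step broke
    return True, None
```

Returning the failing step matters: "checkout is broken" is actionable at 3 a.m.; "something in the purchase flow failed" is not.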
App with authentication and user data. Monitor the login flow (success rate, latency), permission errors, and any anomalies in access to sensitive data. A spike in 403 errors could be a bug or something worse.
An API or backend that serves a frontend or mobile app. Monitor latency per endpoint, error rate per client version, and backward compatibility. If you deploy a new version of the backend and the production Flutter app starts crashing, you need to know about it before the user does.
Content Management System (CMS) or platform. Monitor public page load times, admin panel performance, and the status of scheduled tasks (imports, email sending, content generation). A CMS with Wagtail might work perfectly for editors but have slow public pages due to a poorly optimized query that only appears with a high volume of content.
The mistake of monitoring afterwards
Most teams only implement monitoring after the first serious incident. One Friday night the app crashes, nobody notices until Monday, users complain, and someone says, "We need to implement monitoring." It's a pattern as common as it is avoidable.
Monitoring should begin immediately after the first automated deployment. If you already have a CI/CD pipeline deploying to production, the next logical step is to determine if what you've deployed is working. Not next week. Not when the project is larger. Now.
Setting up basic monitoring takes less than half a day: a health check with UptimeRobot, Sentry for errors, and the server metrics your provider already offers (CloudWatch on AWS, Hetzner's native metrics, or those from your VPS dashboard). With that, you'll be the first to know the next time something goes wrong.
And that's the difference between a professional team and one that relies on luck: not that nothing goes wrong, but that when it does, you know about it before anyone else.