Service Protection
Li Wei
Overview
During remote calls in a micro‑service architecture, several issues need to be addressed.
Business robustness:
In a micro‑service architecture, the system is split into multiple independent services, each responsible for a different business function. If one or more services fail or become unavailable, the overall system operation may be affected.
Cascading failures:
Services often depend on each other; a failure in one service can affect others, leading to cascading failures, also known as an avalanche.
Ensuring that services run robustly and preventing avalanche‑type cascading failures is what we call micro‑service protection. There are many protection strategies, such as:
- Rate‑limiting: Limit request frequency or concurrency to protect the system from overload.
- Service degradation: When a service is unavailable or too slow, provide a simplified version of the functionality or return a default value to keep the system usable.
- Circuit‑breaker: When a service’s error rate reaches a certain threshold, the circuit‑breaker temporarily stops calls to that service, preventing repeated failures from causing an avalanche.
- Isolation (service or thread level): Keep different functions or modules apart so that a failure in one does not propagate to the others, for example by running them as independent services or by giving each dependency its own thread pool.
- Fault‑tolerance: Apply strategies such as retries, fallbacks, etc., to handle network latency, timeouts, and other anomalies.
- Asynchronous communication: Use async calls to dependent services to avoid blocking caused by synchronous calls.
- Monitoring and alerting: Continuously monitor service health and performance metrics, and trigger alerts and remedial actions when anomalies appear.
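As a small illustration of the fault-tolerance and degradation strategies in the list above, the following plain-Java sketch retries a call a fixed number of times and then returns a default value. The class and method names (RetryWithFallback, call) are hypothetical and belong to no library discussed in this article.

```java
import java.util.function.Supplier;

// Hypothetical helper: retry a call, then degrade to a default value.
class RetryWithFallback {
    static <T> T call(Supplier<T> primary, int maxAttempts, Supplier<T> fallback) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return primary.get();      // success: return immediately
            } catch (RuntimeException e) {
                // ignore and retry; a real system would log and back off here
            }
        }
        return fallback.get();             // every attempt failed: degrade
    }
}
```

A caller would wrap the remote call as the first argument and pass a cheap default (cached data, an empty result) as the fallback.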
Hystrix (Overview)
Basic Introduction
Hystrix is an open‑source library for handling latency and fault tolerance in distributed systems. In distributed environments, many dependencies can fail—timeouts, exceptions, etc. Hystrix ensures that a problem in one dependency does not cause the whole service to fail, preventing cascading failures and improving system resilience.
A circuit‑breaker acts like a switch: when a service unit fails, the breaker’s fault monitor (similar to a fuse) returns an expected, handleable fallback response instead of making the caller wait indefinitely or throwing an unmanageable exception. This keeps the caller’s thread from being held up and prevents the fault from spreading through the distributed system.
- Service degradation (fallback): When the system is unavailable, provide a fallback solution or alternative response that the caller can handle.
- Service circuit-break (break): After reaching the maximum number of service accesses, reject further requests outright.
- Service rate-limit (flow limit): Under high concurrency, prevent a flood of requests; allow only N requests per second and queue them in order.
Official documentation:
Service Degradation
Producer module:
Add the Maven dependency:
- Main starter class: enable Feign
- Controller:
- Service:
- Use JMeter to load-test the two endpoints; the endpoint paymentInfo_Ok also becomes sluggish.
Consumer module:
Service interface:
- Controller:
- Test: using Feign as the client, which by default times out after 1 s without a response; perform concurrent load testing.
Solutions:
- Timeout causing server slowdown (spinning): stop waiting after a timeout.
- Error (crash or runtime exception): provide a fallback.
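The two solutions above (stop waiting after a timeout, and fall back on errors) can be sketched without any framework. This is only an illustration of the idea, not how Hystrix implements it; TimeoutFallback and its call method are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper, not the Hystrix API: bound the wait and fall back.
class TimeoutFallback {
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);               // stuck calls must not keep the JVM alive
        return t;
    });

    static <T> T call(Callable<T> primary, long timeoutMs, Supplier<T> fallback) {
        Future<T> future = POOL.submit(primary);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);         // stop waiting and interrupt the slow call
            return fallback.get();
        } catch (ExecutionException e) { // the call itself threw an exception
            return fallback.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback.get();
        }
    }
}
```

The caller's thread is released after at most timeoutMs, which is the property that keeps one slow dependency from tying up every request thread.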
Degradation can be applied on both producer and consumer sides by using the @HystrixCommand annotation to specify the fallback method.
Producer side: add the new annotation @EnableCircuitBreaker to the main starter class, and modify the Service methods as follows:
With this approach the degradation methods are mixed with business logic, which results in high coupling and requires a separate fallback method for each business method. A global default fallback reduces the duplication:
- Add the @DefaultProperties(defaultFallback = "method_name") annotation on the Controller class.
- Annotate methods that need degradation with @HystrixCommand; if @HystrixCommand does not specify fallbackMethod, the fallback defined in @DefaultProperties is used by default.
When the client calls the server and the server crashes or is shut down, defining a fallback implementation class for the Feign client interface decouples the two sides.
- application.yml: enable Hystrix in the configuration file.
- Service: centralize exception handling for interface methods; use PaymentFallbackService for unified degradation handling.
- PaymentFallbackService:
Service Circuit‑Break
Circuit-break types: The circuit-breaker is a micro-service link protection mechanism that mitigates avalanche effects. When a downstream service becomes unavailable or responds too slowly, the call is degraded and the circuit for that node is opened, so callers quickly receive an error response.
Hystrix monitors inter-service calls; when failures cross the configured threshold (by default, at least 20 requests within 10 seconds with 50 % or more of them failing), the circuit-breaker trips. Once the node's responses return to normal (detected by a trial request), the circuit automatically closes.
- Circuit open: Requests no longer invoke the primary service; they go straight to the fallback, which fails fast and reduces response latency. Internally, a timer (usually set to the MTTR, Mean Time To Repair) determines how long the circuit stays open before entering the half-open state.
- Circuit closed: No breaking; the service is called normally.
- Circuit half-open: Some requests are allowed through; if they succeed, the circuit closes; otherwise it opens again.
Circuit-break operation involves four key parameters: snapshot window, request volume threshold, sleep window, and error-percentage threshold.
- circuitBreaker.enabled: Whether to enable the circuit-breaker.
- metrics.rollingStats.timeInMilliseconds: Snapshot time window. Hystrix gathers request and error data over this period to decide whether to open the circuit. Default: the last 10 seconds.
- circuitBreaker.requestVolumeThreshold: Request volume threshold, the minimum number of requests in the snapshot window (default 20) required to consider tripping the circuit. If fewer requests occur, even total failure will not open the circuit.
- circuitBreaker.sleepWindowInMilliseconds: Sleep window, the time the circuit stays open before attempting to transition to half-open. Default: 5 seconds.
- circuitBreaker.errorThresholdPercentage: Error-percentage threshold, the failure rate that triggers the circuit to open. Default: 50 %.
Conditions for opening and closing the circuit:
- Open: When the request volume threshold is met (default: at least 20 requests in 10 seconds) and the failure rate reaches the error-percentage threshold (default: 50 % within the same 10 seconds).
- Close: After the sleep window (default 5 seconds), the circuit goes half-open; a single successful request closes it, otherwise it opens again for another sleep window.
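The parameters and state transitions described above can be sketched as a small state machine. This is a simplified illustration, not Hystrix's implementation: it uses a single fixed window instead of bucketed rolling statistics, and the class name, constructor arguments, and injected clock are assumptions made for the example.

```java
import java.util.function.LongSupplier;

// Simplified sketch of the closed / open / half-open cycle; not Hystrix code.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int requestVolumeThreshold;   // e.g. 20
    private final int errorThresholdPercentage; // e.g. 50
    private final long windowMs;                // e.g. 10_000
    private final long sleepWindowMs;           // e.g. 5_000
    private final LongSupplier clock;           // injected so tests control time

    private State state = State.CLOSED;
    private long windowStart;
    private long openedAt;
    private int total;
    private int failures;

    SimpleCircuitBreaker(int requestVolumeThreshold, int errorThresholdPercentage,
                         long windowMs, long sleepWindowMs, LongSupplier clock) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
        this.windowMs = windowMs;
        this.sleepWindowMs = sleepWindowMs;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    // Should the caller attempt the real call right now?
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= sleepWindowMs) {
            state = State.HALF_OPEN;            // let exactly one trial request through
            return true;
        }
        return state == State.CLOSED;
    }

    // Report the outcome of a call that was allowed through.
    synchronized void record(boolean success) {
        long now = clock.getAsLong();
        if (state == State.HALF_OPEN) {         // outcome of the trial request
            if (success) { reset(now); state = State.CLOSED; } else { trip(now); }
            return;
        }
        if (state == State.OPEN) return;        // late results while open are ignored
        if (now - windowStart >= windowMs) reset(now);   // start a fresh window
        total++;
        if (!success) failures++;
        if (total >= requestVolumeThreshold
                && failures * 100 >= errorThresholdPercentage * total) {
            trip(now);
        }
    }

    synchronized State state() { return state; }

    private void reset(long now) { windowStart = now; total = 0; failures = 0; }
    private void trip(long now) { state = State.OPEN; openedAt = now; }
}
```

Callers check allowRequest() before the real call, go straight to the fallback when it returns false, and report each real outcome through record().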
Workflow
Official docs: https://github.com/Netflix/Hystrix/wiki/How-it-Works
Specific workflow:
1. Create a HystrixCommand (used when the dependent service returns a single result) or a HystrixObservableCommand (used when the dependent service returns multiple results) object.
2. Execute the command. execute() and queue() are available only on HystrixCommand; observe() and toObservable() are the two styles that HystrixObservableCommand provides.
   - execute(): Synchronous execution; returns a single result object from the dependency, or throws an exception on error.
   - queue(): Asynchronous execution; returns a Future that contains the single result once the service finishes.
   - observe(): Returns an Observable representing multiple results. It is a hot Observable: the command starts executing as soon as observe() is called, whether or not a subscriber exists yet.
   - toObservable(): Also returns an Observable, but a cold one: nothing is executed until there is a subscriber, so the subscriber sees the whole sequence from the start.
3. If request caching is enabled and the command hits the cache, the cached result is returned immediately as an Observable object.
4. Check whether the circuit is open; if open, Hystrix skips command execution and jumps to fallback (step 8). If closed, check resource availability (step 5).
5. If the thread pool / request queue / semaphore is saturated, Hystrix also skips execution and goes to fallback (step 8).
6. Hystrix decides how to invoke the dependent service based on the method we implement:
   - HystrixCommand.run(): Return a single result or throw an exception.
   - HystrixObservableCommand.construct(): Return an Observable that emits multiple results or signals an error via onError.
7. Hystrix reports "success", "failure", "rejection", "timeout", etc., to the circuit-breaker, which maintains counters and decides whether to open the circuit for that dependency.
8. When a command fails, Hystrix attempts a fallback (often called service degradation). Situations that trigger fallback:
   - Step 4: Command is in "circuit-break/short-circuit" state (circuit open).
   - Step 5: Thread pool, request queue, or semaphore is exhausted.
   - Step 6: HystrixObservableCommand.construct() or HystrixCommand.run() throws an exception or times out.
9. After successful execution, Hystrix returns the result directly or as an Observable.

Note: If no fallback is provided or the fallback itself throws, Hystrix still returns an Observable, but it emits no data and immediately terminates the request via the onError() method, propagating the underlying exception to the caller.
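The order of the checks in the workflow (cache, circuit state, resource saturation, the real call, fallback) can be sketched in plain Java. Everything here is a stand-in chosen for illustration: CommandFlow is a hypothetical class, a Semaphore represents the thread pool / queue / semaphore check, a BooleanSupplier represents the circuit-breaker, and the health reporting of step 7 is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Sketch of the decision order only; these are not Hystrix classes.
class CommandFlow<T> {
    private final Map<String, T> cache = new ConcurrentHashMap<>();
    private final Semaphore permits;            // stands in for pool/queue/semaphore
    private final BooleanSupplier circuitOpen;  // stands in for the circuit-breaker

    CommandFlow(int maxConcurrent, BooleanSupplier circuitOpen) {
        this.permits = new Semaphore(maxConcurrent);
        this.circuitOpen = circuitOpen;
    }

    T execute(String cacheKey, Supplier<T> run, Supplier<T> fallback) {
        T cached = cache.get(cacheKey);
        if (cached != null) return cached;                      // step 3: cache hit
        if (circuitOpen.getAsBoolean()) return fallback.get();  // step 4: short-circuit
        if (!permits.tryAcquire()) return fallback.get();       // step 5: saturated
        try {
            T result = run.get();                               // step 6: real call
            if (result != null) cache.put(cacheKey, result);
            return result;                                      // step 9: success
        } catch (RuntimeException e) {
            return fallback.get();                              // step 8: fallback
        } finally {
            permits.release();
        }
    }
}
```

The point of the ordering is that the cheap checks come first: a cached, short-circuited, or rejected request never reaches the dependency at all.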
Service Monitoring
Hystrix offers near‑real‑time monitoring via the Hystrix Dashboard. It continuously records execution details of all Hystrix‑wrapped requests and displays them as charts and tables (requests per second, successes, failures, etc.). Netflix provides the hystrix-metrics-event-stream project for metric streaming; Spring Cloud integrates this into a visual dashboard.
Add Maven dependency:
- application.yaml: only the port is needed.
- Main starter class:
- All producer services (ports 8001/8002/8003) must include the monitoring configuration.
- Test by visiting http://localhost:9001/hystrix.
Newer Hystrix versions require the monitoring endpoint to be declared in the producer’s main starter class; otherwise an error occurs.
Metric explanations:
Sentinel (Alibaba)
Basic Introduction
Sentinel is an open‑source traffic‑governance component from Alibaba, designed for distributed, multi‑language, heterogeneous service architectures.
Official site: https://sentinelguard.io/zh-cn/
Download: https://github.com/alibaba/Sentinel/releases
Sentinel consists of two parts:
- Core library (JAR): No framework dependencies; runs on Java 8+ and integrates well with Dubbo, Spring Cloud, etc. Adding the dependency enables rate‑limiting, isolation, circuit‑breaking, and more.
- Dashboard: Manages rule publishing, monitoring, and machine information.
Start the dashboard with the following command (-Dserver.port=8080 sets the dashboard port to 8080).
Visit http://localhost:8080/; username and password are both sentinel.
Additional startup parameters are documented in the official dashboard guide.
Basic Usage
Create a demo project:
Add Maven dependency:
- application.yaml: the sentinel.transport.port setting starts an HTTP server on the application's host, which communicates with the Sentinel dashboard. For example, when the dashboard adds a rate-limit rule, the rule is pushed to this server, which registers it with Sentinel.
- Main starter class:
- Traffic‑control controller:
- Sentinel uses lazy loading; you must first access http://localhost:8401/testA before the dashboard can display the rule.
Click the Cluster Link menu to see the following page:
The cluster link is the chain of Sentinel-monitored resources that a single request passes through; each resource on it is a cluster node. By default, Sentinel monitors every Spring MVC endpoint, so the path /carts appears as a cluster node that can be protected with rate-limiting, circuit-breaking, isolation, etc.
By default Sentinel uses the URL path as the resource name, which cannot distinguish between the same path with different HTTP methods (GET, POST, DELETE, etc.). To differentiate, enable the request‑method prefix so that method + path becomes the resource name.
Add the following to service/application.yml:
# enable method prefix
spring:
  cloud:
    sentinel:
      transport:
        dashboard: localhost:8080
      filter:
        enabled: true
      http-method-specify: true # resource name becomes method + path
Restart the service; accessing its endpoints will now show distinct cluster nodes in the Sentinel dashboard.
Flow Control Rules
Flow‑control rule FlowRule: A single resource can have multiple rate‑limit rules.
- Resource name **resource**: The target of the rule; in the demo it is testA.
- Limit origin **limitApp**: Limits based on the caller; the default value default means callers are not distinguished.
- Threshold type **grade**: QPS or thread-count mode.
- Single-machine threshold **count**: The actual limit value.
- Control strategy **strategy**: Relationship-based limiting.
  - Direct: Limit the resource itself when it reaches the threshold.
  - Associated: Limit the resource when a related resource hits its threshold.
  - Chain: Limit only traffic that comes from a specified upstream resource.
- Effect **controlBehavior**:
  - Fast fail: Immediately reject and throw an exception.
  - Warm-up: Start with an allowed QPS of count / coldFactor (coldFactor defaults to 3) and gradually raise it to count, giving the system a warm-up period. For example, with count = 10 the initial threshold is about 3.
  - Queueing: Let requests through at a steady rate and queue the rest; requires QPS mode, otherwise ineffective.
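To make the queueing effect more concrete, here is a simplified sketch of uniform-rate limiting: requests are spaced 1000 / count milliseconds apart, and a request that would have to wait longer than a maximum queueing time is rejected. This is an illustration of the idea rather than Sentinel's own code; UniformRateQueue, its constructor parameters, and the injected clock are hypothetical, and instead of sleeping the method simply returns the wait time.

```java
import java.util.function.LongSupplier;

// Simplified sketch of "queueing" (uniform-rate) flow control; not Sentinel's code.
class UniformRateQueue {
    private final long intervalMs;       // spacing between requests = 1000 / count
    private final long maxQueueingMs;    // longest a request may wait in the queue
    private final LongSupplier clock;    // injected so the example is deterministic
    private long latestPassedTime = -1;  // when the latest request passes (or will pass)

    UniformRateQueue(int count, long maxQueueingMs, LongSupplier clock) {
        this.intervalMs = 1000L / count;
        this.maxQueueingMs = maxQueueingMs;
        this.clock = clock;
    }

    // Returns how long the caller should wait in ms, or -1 if the request is rejected.
    synchronized long tryPass() {
        long now = clock.getAsLong();
        long expected = latestPassedTime < 0 ? now : latestPassedTime + intervalMs;
        if (expected <= now) {           // the slot is already free: pass at once
            latestPassedTime = now;
            return 0;
        }
        long wait = expected - now;
        if (wait > maxQueueingMs) {      // queue too long: reject
            return -1;
        }
        latestPassedTime = expected;     // reserve the next slot
        return wait;
    }
}
```

With count = 10, a burst of requests arriving at the same instant is released one every 100 ms, which smooths spikes instead of rejecting everything above the threshold.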
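Define a rule programmatically via the FlowRuleManager.loadRules() method:
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;
import java.util.Collections;
FlowRule rule = new FlowRule();
rule.setResource("testA");                                      // resource name to protect
rule.setGrade(RuleConstant.FLOW_GRADE_QPS);                     // threshold type: QPS
rule.setCount(20);                                              // single-machine threshold
rule.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_DEFAULT); // effect: fast fail
FlowRuleManager.loadRules(Collections.singletonList(rule));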
Detailed documentation: https://sentinelguard.io/zh-cn/docs/flow-control.html
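As an illustration of what a QPS threshold with the fast-fail effect enforces, the sketch below counts requests in a one-second window and rejects everything above the limit. It is deliberately simplified: Sentinel keeps sliding-window statistics, whereas this hypothetical QpsFastFail class uses a single fixed window and an injected clock.

```java
import java.util.function.LongSupplier;

// Simplified fixed-window QPS limiter with fast fail; not Sentinel's implementation.
class QpsFastFail {
    private final int count;           // allowed requests per second
    private final LongSupplier clock;  // injected so tests control time
    private long windowStart;
    private int passed;

    QpsFastFail(int count, LongSupplier clock) {
        this.count = count;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        if (now - windowStart >= 1000) {   // a new one-second window begins
            windowStart = now;
            passed = 0;
        }
        if (passed >= count) {
            return false;                  // over the threshold: reject immediately
        }
        passed++;
        return true;
    }
}
```

A caller would return an error or a default response whenever tryAcquire() returns false, rather than queueing the request.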
Feign Integration with Sentinel
Service protection is needed not only for front‑end calls but also for inter‑service communication.
Add Maven dependency:
- application.yml: enable Sentinel support for Feign.
Note: By default, a Spring Boot Tomcat has a maximum thread count of **200** and maximum connections of **8192**, making it hard to saturate in a single-machine test. Adjust application.yml to lower the Tomcat thread and connection limits.
- Main starter class: add the @EnableFeignClients annotation to enable OpenFeign.
- Business class:
Degradation & Circuit‑Break
Degradation logic: Sentinel's circuit-break degradation restricts calls to an unstable resource in the call chain so that requests fail fast and the error does not cascade to other resources. Once a resource is degraded, all calls to it within the degradation window are automatically short-circuited (the default behavior is to throw **DegradeException**).
Requests that trigger rate‑limit or circuit‑break don’t have to return an error; they can return default data or friendly messages for a better user experience.
Two ways to write fallback logic for a Feign client:
1. FallbackClass: cannot handle remote-call exceptions.
2. FallbackFactory: can handle remote-call exceptions; this is the preferred approach.
Example using method 2:
- Define a fallback class that implements the FallbackFactory interface:
@Component
public class PaymentServiceFallback implements FallbackFactory<PaymentService> {
    @Override
    public PaymentService create(Throwable cause) {
        // cause is the exception raised by the failed remote call
        return new PaymentService() {
            @Override
            public Result pay(Long id) {
                return Result.fail("Service unavailable, please try again later.");
            }
        };
    }
}
- Register the fallback factory with the Feign client:
@FeignClient(name = "payment-service", fallbackFactory = PaymentServiceFallback.class)
public interface PaymentService {
    @GetMapping("/pay/{id}")
    Result pay(@PathVariable("id") Long id);
}
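To show why the factory style can react to the remote-call exception while a plain fallback object cannot, here is a library-free sketch. PayClient and FallbackStyles are local stand-ins invented for this example; they are not the OpenFeign types, and the try/catch plays the role that the framework normally performs.

```java
import java.util.function.Function;

// Local stand-in for a remote client interface; not an OpenFeign type.
interface PayClient {
    String pay(long id);
}

class FallbackStyles {
    // Style 1: a fixed fallback object; it never learns why the call failed.
    static final PayClient PLAIN_FALLBACK = id -> "unavailable";

    // Style 2: a factory that builds the fallback from the failure cause.
    static final Function<Throwable, PayClient> FALLBACK_FACTORY =
            cause -> id -> "unavailable: " + cause.getMessage();

    static String callWithPlain(PayClient remote, long id) {
        try {
            return remote.pay(id);
        } catch (RuntimeException e) {
            return PLAIN_FALLBACK.pay(id);              // the exception is dropped
        }
    }

    static String callWithFactory(PayClient remote, long id) {
        try {
            return remote.pay(id);
        } catch (RuntimeException e) {
            return FALLBACK_FACTORY.apply(e).pay(id);   // the exception is passed along
        }
    }
}
```

Because the factory receives the Throwable, the fallback can log it or choose a different response per failure type, which is the practical reason the factory form is usually preferred.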
Originally written by Li Wei (李唯_) and published in Chinese on 后端技术栈全书 (Full-Stack Backend Engineering). Translated and adapted for DriftSeas with permission.