gRPC in Production: From Protocol Buffers to Load Balancing and Observability

Introduction

gRPC has become the de facto standard for inter-service communication in modern microservices architectures. Benchmarks commonly report on the order of 20-30% lower bandwidth costs and 15-25% lower compute costs compared to JSON-over-REST APIs, though the savings depend on payload shape and traffic patterns. However, deploying gRPC in production introduces unique challenges around load balancing, observability, and infrastructure integration.

This post explores production-grade patterns for deploying gRPC services, covering Protocol Buffers schema design, load balancing strategies, service mesh integration, observability, and common pitfalls. Whether you’re migrating from REST or building new gRPC services, these patterns will help you avoid common mistakes and build robust, scalable systems.

Protocol Buffers: Schema Design Best Practices

Basic Message Definition

Protocol Buffers provide efficient binary serialization with field-tagged schemas:

syntax = "proto3";

package ecommerce.v1;

import "google/protobuf/timestamp.proto";

option go_package = "github.com/example/ecommerce/api/v1";
option java_package = "com.example.ecommerce.api.v1";
option java_multiple_files = true;

// User represents a user in the system
message User {
  string id = 1;
  string email = 2;
  string name = 3;
  google.protobuf.Timestamp created_at = 4;
  UserRole role = 5;
}

enum UserRole {
  USER_ROLE_UNSPECIFIED = 0;
  USER_ROLE_CUSTOMER = 1;
  USER_ROLE_ADMIN = 2;
}

// Order represents a customer order
message Order {
  string id = 1;
  string user_id = 2;
  repeated OrderItem items = 3;
  double total_amount = 4;
  OrderStatus status = 5;
  google.protobuf.Timestamp created_at = 6;
}

message OrderItem {
  string product_id = 1;
  int32 quantity = 2;
  double price = 3;
}

enum OrderStatus {
  ORDER_STATUS_UNSPECIFIED = 0;
  ORDER_STATUS_PENDING = 1;
  ORDER_STATUS_CONFIRMED = 2;
  ORDER_STATUS_SHIPPED = 3;
  ORDER_STATUS_DELIVERED = 4;
  ORDER_STATUS_CANCELLED = 5;
}
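
On the wire, each field above is identified only by its number and a wire type, which is why renumbering or reusing field numbers is unsafe. A stdlib-only Go sketch (independent of any generated code) of how the tag that prefixes every encoded field is formed:

```go
package main

import "fmt"

// Wire types from the Protocol Buffers encoding spec.
const (
	wireVarint          = 0 // int32, int64, bool, enum
	wireFixed64         = 1 // fixed64, double
	wireLengthDelimited = 2 // string, bytes, embedded messages
)

// tag computes the key that precedes every encoded field:
// (field_number << 3) | wire_type.
func tag(fieldNumber, wireType int) int {
	return fieldNumber<<3 | wireType
}

func main() {
	// User.id = 1 is a string, so it is length-delimited: (1<<3)|2 = 0x0a.
	fmt.Printf("User.id tag: 0x%02x\n", tag(1, wireLengthDelimited))
	// User.role = 5 is an enum, encoded as a varint: (5<<3)|0 = 0x28.
	fmt.Printf("User.role tag: 0x%02x\n", tag(5, wireVarint))
}
```

0x0a is the familiar first byte of many serialized protobuf messages whose field 1 is a string or message; change the field number and every existing decoder reads the wrong field.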

Service Definition with Multiple RPC Patterns

gRPC supports four RPC patterns: unary, server streaming, client streaming, and bidirectional streaming:

import "google/protobuf/timestamp.proto";

service OrderService {
  // Unary RPC: single request, single response
  rpc CreateOrder(CreateOrderRequest) returns (Order);

  // Unary RPC: get order by ID
  rpc GetOrder(GetOrderRequest) returns (Order);

  // Server streaming: stream order updates in real-time
  rpc WatchOrder(WatchOrderRequest) returns (stream OrderUpdate);

  // Server streaming: list orders with pagination
  rpc ListOrders(ListOrdersRequest) returns (stream Order);

  // Client streaming: batch order creation
  rpc BatchCreateOrders(stream CreateOrderRequest) returns (BatchCreateOrdersResponse);

  // Bidirectional streaming: real-time order processing
  rpc ProcessOrders(stream ProcessOrderRequest) returns (stream ProcessOrderResponse);
}

message CreateOrderRequest {
  string user_id = 1;
  repeated OrderItem items = 2;
}

message GetOrderRequest {
  string order_id = 1;
}

message WatchOrderRequest {
  string order_id = 1;
}

message OrderUpdate {
  Order order = 1;
  google.protobuf.Timestamp updated_at = 2;
}

message ListOrdersRequest {
  string user_id = 1;
  int32 page_size = 2;
  string page_token = 3;
}

message BatchCreateOrdersResponse {
  repeated Order orders = 1;
  int32 success_count = 2;
  int32 failure_count = 3;
}

message ProcessOrderRequest {
  string order_id = 1;
  OrderStatus new_status = 2;
}

message ProcessOrderResponse {
  string order_id = 1;
  bool success = 2;
  string error_message = 3;
}

Schema Evolution and Versioning

Protocol Buffers support backward and forward compatibility through careful field management:

// v1/user.proto - Original version
message User {
  string id = 1;
  string email = 2;
  string name = 3;
}

// v2/user.proto - Evolved version
message User {
  string id = 1;
  string email = 2;
  string name = 3;

  // New optional field - backward compatible
  string phone_number = 4;

  // New field with default - backward compatible
  bool email_verified = 5;

  // NEVER change field numbers or types
  // NEVER reuse field numbers from deleted fields

  // Deprecated field - mark for removal in future version
  string legacy_field = 6 [deprecated = true];

  // Reserved field numbers prevent accidental reuse
  reserved 7, 8;
  reserved "old_field_name";
}

Pagination Pattern

Implement cursor-based pagination for efficient data retrieval:

message ListUsersRequest {
  int32 page_size = 1;  // Max 100
  string page_token = 2;  // Opaque token from previous response
  string filter = 3;  // Optional filter expression
}

message ListUsersResponse {
  repeated User users = 1;
  string next_page_token = 2;  // Token for next page, empty if last page
  int32 total_size = 3;  // Total number of items (optional, can be expensive)
}
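
The `page_token` should be opaque to clients. One common implementation (a sketch, not tied to any particular framework; the `cursor` fields here are illustrative) base64-encodes a small cursor struct on the server:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// cursor is the server-side state hidden inside the opaque page_token.
// Real services often include a sort key or snapshot timestamp as well.
type cursor struct {
	LastID string `json:"last_id"` // last user ID returned on the previous page
}

func encodeToken(c cursor) string {
	b, _ := json.Marshal(c)
	return base64.RawURLEncoding.EncodeToString(b)
}

func decodeToken(token string) (cursor, error) {
	var c cursor
	b, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return c, err
	}
	return c, json.Unmarshal(b, &c)
}

func main() {
	token := encodeToken(cursor{LastID: "user-42"})
	c, err := decodeToken(token)
	fmt.Println(token, c.LastID, err)
}
```

Because the token encodes a position rather than an offset, pagination stays stable when rows are inserted or deleted between pages.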

Server Implementation Patterns

Go Server with Graceful Shutdown

package main

import (
    "context"
    "log"
    "net"
    "os"
    "os/signal"
    "syscall"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
    "google.golang.org/grpc/health"
    "google.golang.org/grpc/health/grpc_health_v1"

    pb "github.com/example/ecommerce/api/v1"
)

type orderServer struct {
    pb.UnimplementedOrderServiceServer
    db OrderDatabase
}

func (s *orderServer) CreateOrder(
    ctx context.Context,
    req *pb.CreateOrderRequest,
) (*pb.Order, error) {
    // Validate request
    if req.UserId == "" {
        return nil, status.Error(codes.InvalidArgument, "user_id is required")
    }

    if len(req.Items) == 0 {
        return nil, status.Error(codes.InvalidArgument, "at least one item required")
    }

    // Create order
    order, err := s.db.CreateOrder(ctx, req)
    if err != nil {
        log.Printf("Failed to create order: %v", err)
        return nil, status.Error(codes.Internal, "failed to create order")
    }

    return order, nil
}

func (s *orderServer) WatchOrder(
    req *pb.WatchOrderRequest,
    stream pb.OrderService_WatchOrderServer,
) error {
    ctx := stream.Context()

    // Subscribe to order updates
    updates := s.db.SubscribeToOrderUpdates(ctx, req.OrderId)

    for {
        select {
        case <-ctx.Done():
            // Client disconnected or deadline exceeded
            return ctx.Err()

        case update, ok := <-updates:
            if !ok {
                // Channel closed
                return nil
            }

            // Send update to client
            if err := stream.Send(update); err != nil {
                log.Printf("Failed to send update: %v", err)
                return err
            }
        }
    }
}

func main() {
    // Create listener
    lis, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatalf("Failed to listen: %v", err)
    }

    // Create gRPC server with interceptors
    server := grpc.NewServer(
        grpc.ChainUnaryInterceptor(
            loggingInterceptor,
            authInterceptor,
            validationInterceptor,
        ),
        grpc.ChainStreamInterceptor(
            loggingStreamInterceptor,
            authStreamInterceptor,
        ),
        grpc.MaxRecvMsgSize(10 * 1024 * 1024), // 10MB max message size
        grpc.MaxSendMsgSize(10 * 1024 * 1024),
        grpc.ConnectionTimeout(120 * time.Second),
    )

    // Register services
    orderSvc := &orderServer{db: NewOrderDatabase()}
    pb.RegisterOrderServiceServer(server, orderSvc)

    // Register health check service
    healthServer := health.NewServer()
    grpc_health_v1.RegisterHealthServer(server, healthServer)
    healthServer.SetServingStatus("", grpc_health_v1.HealthCheckResponse_SERVING)

    // Start server in goroutine
    go func() {
        log.Printf("Starting gRPC server on :50051")
        if err := server.Serve(lis); err != nil {
            log.Fatalf("Failed to serve: %v", err)
        }
    }()

    // Graceful shutdown
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
    <-quit

    log.Println("Shutting down gRPC server...")

    // Mark as not serving
    healthServer.SetServingStatus("", grpc_health_v1.HealthCheckResponse_NOT_SERVING)

    // Graceful stop with timeout
    stopped := make(chan struct{})
    go func() {
        server.GracefulStop()
        close(stopped)
    }()

    select {
    case <-stopped:
        log.Println("Server stopped gracefully")
    case <-time.After(30 * time.Second):
        log.Println("Timeout exceeded, forcing shutdown")
        server.Stop()
    }
}

Interceptors for Cross-Cutting Concerns

package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/metadata"
    "google.golang.org/grpc/status"
)

// Logging interceptor
func loggingInterceptor(
    ctx context.Context,
    req interface{},
    info *grpc.UnaryServerInfo,
    handler grpc.UnaryHandler,
) (interface{}, error) {
    start := time.Now()

    // Call the handler
    resp, err := handler(ctx, req)

    // Log request details
    duration := time.Since(start)
    code := codes.OK
    if err != nil {
        code = status.Code(err)
    }

    log.Printf(
        "method=%s duration=%s code=%s",
        info.FullMethod,
        duration,
        code,
    )

    return resp, err
}

// Authentication interceptor
func authInterceptor(
    ctx context.Context,
    req interface{},
    info *grpc.UnaryServerInfo,
    handler grpc.UnaryHandler,
) (interface{}, error) {
    // Skip auth for health checks
    if info.FullMethod == "/grpc.health.v1.Health/Check" {
        return handler(ctx, req)
    }

    // Extract metadata
    md, ok := metadata.FromIncomingContext(ctx)
    if !ok {
        return nil, status.Error(codes.Unauthenticated, "missing metadata")
    }

    // Validate auth token
    tokens := md.Get("authorization")
    if len(tokens) == 0 {
        return nil, status.Error(codes.Unauthenticated, "missing authorization token")
    }

    userID, err := validateToken(tokens[0])
    if err != nil {
        return nil, status.Error(codes.Unauthenticated, "invalid token")
    }

    // Add user ID to context. A bare string key is used here for brevity;
    // production code should use an unexported typed key, since string
    // keys can collide across packages and are flagged by `go vet`.
    ctx = context.WithValue(ctx, "user_id", userID)

    return handler(ctx, req)
}

// Validation interceptor
func validationInterceptor(
    ctx context.Context,
    req interface{},
    info *grpc.UnaryServerInfo,
    handler grpc.UnaryHandler,
) (interface{}, error) {
    // Check if request implements Validator interface
    if v, ok := req.(interface{ Validate() error }); ok {
        if err := v.Validate(); err != nil {
            return nil, status.Error(codes.InvalidArgument, err.Error())
        }
    }

    return handler(ctx, req)
}

Java Spring Boot Server

package com.example.ecommerce.grpc;

import io.grpc.Status;
import io.grpc.stub.StreamObserver;
import net.devh.boot.grpc.server.service.GrpcService;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.example.ecommerce.api.v1.*;

@GrpcService
public class OrderServiceImpl extends OrderServiceGrpc.OrderServiceImplBase {

    private static final Logger logger = LoggerFactory.getLogger(OrderServiceImpl.class);

    private final OrderRepository orderRepository;
    private final OrderEventPublisher eventPublisher;

    public OrderServiceImpl(
        OrderRepository orderRepository,
        OrderEventPublisher eventPublisher
    ) {
        this.orderRepository = orderRepository;
        this.eventPublisher = eventPublisher;
    }

    @Override
    public void createOrder(
        CreateOrderRequest request,
        StreamObserver<Order> responseObserver
    ) {
        try {
            // Validate request
            if (request.getUserId().isEmpty()) {
                responseObserver.onError(
                    Status.INVALID_ARGUMENT
                        .withDescription("user_id is required")
                        .asRuntimeException()
                );
                return;
            }

            // Create order
            Order order = orderRepository.create(request);

            // Publish event
            eventPublisher.publishOrderCreated(order);

            // Send response
            responseObserver.onNext(order);
            responseObserver.onCompleted();

        } catch (IllegalArgumentException e) {
            logger.error("Invalid request: {}", e.getMessage());
            responseObserver.onError(
                Status.INVALID_ARGUMENT
                    .withDescription(e.getMessage())
                    .asRuntimeException()
            );

        } catch (Exception e) {
            logger.error("Failed to create order", e);
            responseObserver.onError(
                Status.INTERNAL
                    .withDescription("Failed to create order")
                    .asRuntimeException()
            );
        }
    }

    @Override
    public void watchOrder(
        WatchOrderRequest request,
        StreamObserver<OrderUpdate> responseObserver
    ) {
        String orderId = request.getOrderId();

        try {
            // Subscribe to order updates
            orderRepository.subscribeToUpdates(orderId, update -> {
                try {
                    responseObserver.onNext(update);
                } catch (Exception e) {
                    logger.error("Failed to send update", e);
                    responseObserver.onError(e);
                }
            });

        } catch (Exception e) {
            logger.error("Failed to watch order", e);
            responseObserver.onError(
                Status.INTERNAL
                    .withDescription("Failed to watch order")
                    .asRuntimeException()
            );
        }
    }
}

Client Implementation and Connection Management

Go Client with Keepalive and Connection Management

package client

import (
    "context"
    "fmt"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    "google.golang.org/grpc/keepalive"

    pb "github.com/example/ecommerce/api/v1"
)

type OrderClient struct {
    conn   *grpc.ClientConn
    client pb.OrderServiceClient
}

func NewOrderClient(addr string) (*OrderClient, error) {
    // Connection parameters
    kacp := keepalive.ClientParameters{
        Time:                10 * time.Second, // Send pings every 10 seconds
        Timeout:             5 * time.Second,  // Wait 5 seconds for ping ack
        PermitWithoutStream: true,             // Send pings even without active streams
    }

    // Dial with options; a bounded context replaces the long-deprecated
    // grpc.WithTimeout option
    dialCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    conn, err := grpc.DialContext(
        dialCtx,
        addr,
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithKeepaliveParams(kacp),
        grpc.WithDefaultServiceConfig(`{
            "loadBalancingPolicy": "round_robin",
            "healthCheckConfig": {
                "serviceName": ""
            }
        }`),
        grpc.WithBlock(), // block until connected so the dial timeout applies
    )
    if err != nil {
        return nil, fmt.Errorf("failed to connect: %w", err)
    }

    return &OrderClient{
        conn:   conn,
        client: pb.NewOrderServiceClient(conn),
    }, nil
}

func (c *OrderClient) CreateOrder(
    ctx context.Context,
    userID string,
    items []*pb.OrderItem,
) (*pb.Order, error) {
    // Set timeout
    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()

    req := &pb.CreateOrderRequest{
        UserId: userID,
        Items:  items,
    }

    order, err := c.client.CreateOrder(ctx, req)
    if err != nil {
        return nil, fmt.Errorf("failed to create order: %w", err)
    }

    return order, nil
}

func (c *OrderClient) WatchOrder(
    ctx context.Context,
    orderID string,
) (<-chan *pb.OrderUpdate, <-chan error) {
    updates := make(chan *pb.OrderUpdate)
    errors := make(chan error, 1)

    go func() {
        defer close(updates)
        defer close(errors)

        stream, err := c.client.WatchOrder(ctx, &pb.WatchOrderRequest{
            OrderId: orderID,
        })
        if err != nil {
            errors <- err
            return
        }

        for {
            update, err := stream.Recv()
            if err != nil {
                errors <- err
                return
            }

            select {
            case updates <- update:
            case <-ctx.Done():
                errors <- ctx.Err()
                return
            }
        }
    }()

    return updates, errors
}

func (c *OrderClient) Close() error {
    return c.conn.Close()
}

Python Async Client

import asyncio
import grpc
from typing import AsyncIterator

from ecommerce.api.v1 import order_service_pb2, order_service_pb2_grpc

class OrderClient:
    def __init__(self, address: str):
        self.address = address
        self.channel = None
        self.stub = None

    async def connect(self):
        """Establish connection to gRPC server"""
        self.channel = grpc.aio.insecure_channel(
            self.address,
            options=[
                ('grpc.keepalive_time_ms', 10000),
                ('grpc.keepalive_timeout_ms', 5000),
                ('grpc.keepalive_permit_without_calls', True),
                ('grpc.http2.max_pings_without_data', 0),
                ('grpc.enable_retries', True),
                ('grpc.service_config', '''{
                    "methodConfig": [{
                        "name": [{}],
                        "retryPolicy": {
                            "maxAttempts": 5,
                            "initialBackoff": "0.1s",
                            "maxBackoff": "10s",
                            "backoffMultiplier": 2,
                            "retryableStatusCodes": ["UNAVAILABLE"]
                        }
                    }]
                }''')
            ]
        )

        self.stub = order_service_pb2_grpc.OrderServiceStub(self.channel)

        # Wait for channel to be ready
        await self.channel.channel_ready()

    async def create_order(
        self,
        user_id: str,
        items: list
    ) -> order_service_pb2.Order:
        """Create an order"""
        request = order_service_pb2.CreateOrderRequest(
            user_id=user_id,
            items=items
        )

        try:
            order = await self.stub.CreateOrder(
                request,
                timeout=5.0
            )
            return order

        except grpc.RpcError as e:
            print(f"RPC failed: {e.code()}: {e.details()}")
            raise

    async def watch_order(
        self,
        order_id: str
    ) -> AsyncIterator[order_service_pb2.OrderUpdate]:
        """Stream order updates"""
        request = order_service_pb2.WatchOrderRequest(order_id=order_id)

        try:
            async for update in self.stub.WatchOrder(request):
                yield update

        except grpc.RpcError as e:
            if e.code() != grpc.StatusCode.CANCELLED:
                print(f"Stream failed: {e.code()}: {e.details()}")
            raise

    async def close(self):
        """Close connection"""
        if self.channel:
            await self.channel.close()

# Usage
async def main():
    client = OrderClient('localhost:50051')
    await client.connect()

    try:
        # Create order
        order = await client.create_order(
            user_id='user123',
            items=[...]
        )
        print(f"Created order: {order.id}")

        # Watch for updates
        async for update in client.watch_order(order.id):
            print(f"Order update: {update.order.status}")

    finally:
        await client.close()

if __name__ == '__main__':
    asyncio.run(main())

Load Balancing: The HTTP/2 Challenge

Understanding the Problem

gRPC uses HTTP/2 with persistent connections. A single HTTP/2 connection multiplexes many requests, creating a challenge for traditional L4 load balancers that distribute connections rather than requests. This results in uneven traffic distribution where one backend might handle 90% of requests while others sit idle.
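
A back-of-the-envelope simulation makes the imbalance concrete: an L4 balancer picks a backend once per connection, so every request multiplexed onto a long-lived HTTP/2 connection lands on the same backend, while an L7 balancer picks per request:

```go
package main

import "fmt"

// distribute simulates how many requests each backend receives.
// perConnection=true models an L4 balancer that routed the single
// long-lived HTTP/2 connection to backend 0 at dial time;
// perConnection=false models an L7 proxy balancing each request.
func distribute(requests, backends int, perConnection bool) []int {
	counts := make([]int, backends)
	for i := 0; i < requests; i++ {
		if perConnection {
			counts[0]++ // every multiplexed request rides the same connection
		} else {
			counts[i%backends]++ // round-robin per request
		}
	}
	return counts
}

func main() {
	fmt.Println("L4 per-connection:", distribute(900, 3, true))  // [900 0 0]
	fmt.Println("L7 per-request:  ", distribute(900, 3, false)) // [300 300 300]
}
```

The solutions below all restore per-request distribution in one way or another: in the client, in a proxy, or in a sidecar mesh.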

Solution 1: Client-Side Load Balancing

package main

import (
    "fmt"

    "google.golang.org/grpc"
    "google.golang.org/grpc/balancer/roundrobin"
    "google.golang.org/grpc/credentials/insecure"
    "google.golang.org/grpc/resolver"

    pb "github.com/example/ecommerce/api/v1"
)

// Custom resolver for service discovery
type staticResolver struct {
    target     string
    cc         resolver.ClientConn
    addresses  []string
}

func (r *staticResolver) ResolveNow(resolver.ResolveNowOptions) {
    addrs := make([]resolver.Address, len(r.addresses))
    for i, addr := range r.addresses {
        addrs[i] = resolver.Address{Addr: addr}
    }

    r.cc.UpdateState(resolver.State{Addresses: addrs})
}

func (r *staticResolver) Close() {}

// Register custom resolver
func init() {
    resolver.Register(&staticResolverBuilder{})
}

type staticResolverBuilder struct{}

func (*staticResolverBuilder) Build(
    target resolver.Target,
    cc resolver.ClientConn,
    opts resolver.BuildOptions,
) (resolver.Resolver, error) {
    r := &staticResolver{
        target: target.Endpoint(),
        cc:     cc,
        addresses: []string{
            "backend1:50051",
            "backend2:50051",
            "backend3:50051",
        },
    }
    r.ResolveNow(resolver.ResolveNowOptions{})
    return r, nil
}

func (*staticResolverBuilder) Scheme() string { return "static" }

func main() {
    // Connect with client-side load balancing
    conn, err := grpc.Dial(
        "static:///order-service",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithDefaultServiceConfig(fmt.Sprintf(`{
            "loadBalancingPolicy": "%s",
            "healthCheckConfig": {
                "serviceName": ""
            }
        }`, roundrobin.Name)),
    )
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Requests will be distributed across backends
    client := pb.NewOrderServiceClient(conn)
    // ... use client
}

Solution 2: Envoy Proxy for L7 Load Balancing

Envoy configuration for gRPC load balancing:

# envoy.yaml
static_resources:
  listeners:
  - name: grpc_listener
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 9090

    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: grpc
          codec_type: AUTO
          route_config:
            name: grpc_route
            virtual_hosts:
            - name: grpc_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                  grpc: {}
                route:
                  cluster: order_service_cluster
                  timeout: 30s

          http_filters:
          - name: envoy.filters.http.grpc_stats
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.grpc_stats.v3.FilterConfig
              emit_filter_state: true

          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
  - name: order_service_cluster
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    http2_protocol_options: {}  # Enable HTTP/2

    load_assignment:
      cluster_name: order_service_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: order-backend-1
                port_value: 50051
        - endpoint:
            address:
              socket_address:
                address: order-backend-2
                port_value: 50051
        - endpoint:
            address:
              socket_address:
                address: order-backend-3
                port_value: 50051

    health_checks:
    - timeout: 1s
      interval: 10s
      unhealthy_threshold: 2
      healthy_threshold: 2
      grpc_health_check:
        service_name: ""

Solution 3: Kubernetes Service Mesh (Istio)

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: order-service
  labels:
    app: order-service
spec:
  ports:
  - port: 50051
    name: grpc
    protocol: TCP
  selector:
    app: order-service
  type: ClusterIP

---
# destination-rule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpHeaderName: user-id  # Session affinity by user

    connectionPool:
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 100

    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

  # Subsets targeted by the VirtualService's v1/v2 routes
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2

---
# virtual-service.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
  - order-service
  http:
  - match:
    - headers:
        x-version:
          exact: "v2"
    route:
    - destination:
        host: order-service
        subset: v2
  - route:
    - destination:
        host: order-service
        subset: v1

Observability and Monitoring

OpenTelemetry Integration

package main

import (
    "context"
    "log"

    "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func initTracer() (*trace.TracerProvider, error) {
    ctx := context.Background()

    // Create OTLP exporter
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    // Create resource
    res, err := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceName("order-service"),
            semconv.ServiceVersion("1.0.0"),
        ),
    )
    if err != nil {
        return nil, err
    }

    // Create tracer provider
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(res),
        trace.WithSampler(trace.AlwaysSample()),
    )

    otel.SetTracerProvider(tp)

    return tp, nil
}

func main() {
    // Initialize tracing
    tp, err := initTracer()
    if err != nil {
        log.Fatalf("Failed to initialize tracer: %v", err)
    }
    defer tp.Shutdown(context.Background())

    // Create server with OpenTelemetry instrumentation
    server := grpc.NewServer(
        grpc.StatsHandler(otelgrpc.NewServerHandler()),
    )
    _ = server // register services and call Serve as in earlier examples

    // Create client with OpenTelemetry instrumentation
    conn, err := grpc.Dial(
        "localhost:50051",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
    )
    if err != nil {
        log.Fatalf("Failed to dial: %v", err)
    }
    defer conn.Close()
}

Prometheus Metrics

package metrics

import (
    "context"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "google.golang.org/grpc"
    "google.golang.org/grpc/status"
)

var (
    grpcRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "grpc_requests_total",
            Help: "Total number of gRPC requests",
        },
        []string{"method", "code"},
    )

    grpcRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "grpc_request_duration_seconds",
            Help:    "gRPC request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method"},
    )

    grpcActiveRequests = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "grpc_active_requests",
            Help: "Number of active gRPC requests",
        },
        []string{"method"},
    )
)

func MetricsInterceptor(
    ctx context.Context,
    req interface{},
    info *grpc.UnaryServerInfo,
    handler grpc.UnaryHandler,
) (interface{}, error) {
    start := time.Now()

    // Track active requests
    grpcActiveRequests.WithLabelValues(info.FullMethod).Inc()
    defer grpcActiveRequests.WithLabelValues(info.FullMethod).Dec()

    // Call handler
    resp, err := handler(ctx, req)

    // Record metrics
    duration := time.Since(start).Seconds()
    code := status.Code(err).String()

    grpcRequestsTotal.WithLabelValues(info.FullMethod, code).Inc()
    grpcRequestDuration.WithLabelValues(info.FullMethod).Observe(duration)

    return resp, err
}

Error Handling and Retry Logic

Rich Error Details

package main

import (
    "google.golang.org/genproto/googleapis/rpc/errdetails"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"

    pb "github.com/example/ecommerce/api/v1"
)

func validateCreateOrderRequest(req *pb.CreateOrderRequest) error {
    st := status.New(codes.InvalidArgument, "invalid request")

    var violations []*errdetails.BadRequest_FieldViolation

    if req.UserId == "" {
        violations = append(violations, &errdetails.BadRequest_FieldViolation{
            Field:       "user_id",
            Description: "user_id is required",
        })
    }

    if len(req.Items) == 0 {
        violations = append(violations, &errdetails.BadRequest_FieldViolation{
            Field:       "items",
            Description: "at least one item is required",
        })
    }

    if len(violations) > 0 {
        br := &errdetails.BadRequest{}
        br.FieldViolations = violations

        st, _ = st.WithDetails(br)
        return st.Err()
    }

    return nil
}

Client Retry Logic

import asyncio
import random

import grpc
from grpc import StatusCode

async def create_order_with_retry(
    stub,
    request,
    max_attempts: int = 3,
    base_delay: float = 1.0
):
    """Create order with exponential backoff retry"""
    retryable_codes = {
        StatusCode.UNAVAILABLE,
        StatusCode.DEADLINE_EXCEEDED,
        StatusCode.RESOURCE_EXHAUSTED,
    }

    for attempt in range(max_attempts):
        try:
            return await stub.CreateOrder(request)

        except grpc.RpcError as e:
            if e.code() not in retryable_codes or attempt == max_attempts - 1:
                raise

            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt)
            jitter = delay * 0.1 * (random.random() - 0.5)
            await asyncio.sleep(delay + jitter)

            print(f"Retry attempt {attempt + 1} after {delay:.2f}s")

Conclusion

Deploying gRPC in production requires careful attention to Protocol Buffers schema design, load balancing, observability, and error handling. Key takeaways:

  1. Design schemas for evolution: Use field numbers wisely, never reuse them, and leverage optional fields for backward compatibility.

  2. Understand HTTP/2 load balancing challenges: Use client-side load balancing, L7 proxies like Envoy, or service meshes for proper request distribution.

  3. Implement comprehensive observability: Integrate OpenTelemetry for distributed tracing and Prometheus for metrics from day one.

  4. Handle errors gracefully: Use rich error details, implement retry logic with exponential backoff, and consider circuit breakers.

  5. Leverage streaming: Use server streaming for real-time updates, client streaming for batch operations, and bidirectional streaming for interactive protocols.

  6. Monitor performance: gRPC can offer on the order of 20-30% bandwidth savings and 15-25% lower compute costs, but realizing those gains requires proper infrastructure setup.

  7. Use health checks: Implement gRPC health checking protocol for proper load balancer integration and graceful shutdowns.

gRPC’s performance benefits make it ideal for microservices communication, but production deployment requires understanding its unique characteristics around HTTP/2, load balancing, and observability. With these patterns, you’ll build robust, scalable gRPC services.


This post is licensed under CC BY 4.0 by the author.