Reliable, Scalable, and Maintainable Applications
- Standard building blocks of data-intensive applications: databases, caches, search indexes, stream processing, batch processing
- The above sounds painfully obvious precisely because these data systems are such successful abstractions: the API abstracts away the data system behind it
- Focus on 3 concerns in software systems
- Reliability
- The system should continue to work correctly even in the face of adversity
- Tolerate user mistakes
- Perform well enough under the expected load and data volume
- Prevent unauthorized access and abuse
- “Continuing to work correctly even when things go wrong”
- Fault-tolerant / resilient
- Things that can go wrong are called faults
- Systems that anticipate faults and can cope with them are called fault-tolerant
- Difference between fault and failure
- Fault: a single component deviating from its spec
- Failure: the system as a whole stops providing the required service
- Multiple faults can accumulate and lead to a failure of the service
- Chaos Monkey (Netflix)
- Intentionally inducing faults to see the fault-tolerance machinery in action
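As a toy illustration of the idea (hypothetical names and rates, not Netflix's actual tooling), a wrapper that makes a call fail at a configurable rate so the retry path can be exercised deliberately:

```python
import random

# Hypothetical fault injector: wraps a function so it fails at a
# configurable rate, simulating the kind of faults a Chaos-Monkey-style
# tool would induce.
def inject_faults(fn, fault_rate=0.05):
    def wrapper(*args, **kwargs):
        if random.random() < fault_rate:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)
    return wrapper

@inject_faults
def fetch_user(user_id):
    return {"id": user_id}  # stands in for a real downstream call

# A fault-tolerant caller retries instead of failing on the first error.
def fetch_user_with_retry(user_id, attempts=3):
    for attempt in range(attempts):
        try:
            return fetch_user(user_id)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up only after the last attempt
```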
- Types of Faults
- Hardware Faults
- Software Faults
- Allowing processes to crash and restart
- Careful thinking about assumptions and interactions
- Measuring, monitoring and analyzing
- Human Errors
- How to make systems reliable in spite of unreliable humans
- Allow quick and easy recovery from human errors
- Quick rollback of configuration changes
- Roll out code slowly, so unexpected bugs affect only a small subset of users (a gradual-rollout check is sketched below)
- Clear monitoring and performance metrics
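A minimal sketch of a gradual-rollout check (the hashing scheme and function names are assumptions for illustration): users are bucketed deterministically, so a new code path reaches a small, stable fraction of them first and can be ramped up or rolled back by changing one number:

```python
import hashlib

# Hypothetical gradual-rollout check: bucket users deterministically so
# a new code path reaches only a stable fraction of them.
def in_rollout(user_id: str, rollout_percent: int) -> bool:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

# Start at 1%, watch the monitoring, then ramp up (or set back to 0).
if in_rollout("user-42", rollout_percent=1):
    pass  # new code path goes here
```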
- Scalability
- If the system grows in a particular way, what are our options for coping with the growth?
- How can we add computing resources to handle the additional load?
- Load Parameters (the best choice depends on the architecture; see the sketch after these examples)
- Requests per second
- Ratio of reads to writes in a database
- No. of simultaneously active users in a chatroom
- The hit rate on a cache
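For concreteness, a minimal sketch of deriving two of these load parameters from a request log; the log format and operation names are assumptions for illustration:

```python
from collections import Counter

# Assumed log format: (timestamp in seconds, operation name).
log = [
    (0.1, "read"), (0.4, "read"), (0.9, "write"),
    (1.2, "read"), (1.7, "read"), (1.8, "write"),
]

duration = log[-1][0] - log[0][0]
requests_per_second = len(log) / duration
ops = Counter(op for _, op in log)
read_write_ratio = ops["read"] / ops["write"]

print(f"{requests_per_second:.1f} req/s, read:write = {read_write_ratio:.1f}:1")
```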
- Describing load -> Describing Performance
- When you increase a load parameter and keep the system resources unchanged, how is the performance of the system affected?
- When you increase a load parameter, how much do you need to increase the resources to keep performance unchanged?
- Batch Processing System
- We care about throughput: the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size
- Online Systems
- We care about response time
- Response time: what the client sees; besides the actual time to process the request, it includes network delays and queueing delays
- Latency: the duration that a request is waiting to be handled during which it is latent, awaiting service
- Response Percentiles
- Median: If median response time is 200 ms then half your requests return in less than 200 ms, and half your requests take longer than that
- A good metric for how long users typically have to wait (percentile computation is sketched below)
- 95th, 99th, and 99.9th percentiles
- if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more
- Tail Latencies
- Directly affect user experience
- Amazon describes response-time requirements for internal services in terms of the 99.9th percentile, even though it only affects 1 in 1,000 requests
- Customers with the slowest requests are often those with the most data on their accounts because they have made many purchases; that is, they’re the most valuable customers
- Optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive and to not yield enough benefit for Amazon’s purposes
- Head of Line Blocking
- It only takes a small number of slow requests to hold up the processing of subsequent requests. Even if those subsequent requests are fast to process on the server, the client will see a slow overall response time due to the time spent waiting for the prior requests to complete; that is why it is important to measure response times on the client side
- Tail Latency Amplification
- If serving an end-user request requires multiple backend calls, a single slow call can make the entire end-user request slow (see the sketch below)
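A minimal sketch of both points, using simulated response times (the latency distribution, slow-call rate, and fan-out are assumptions): nearest-rank percentiles over a sorted sample, plus the fan-out arithmetic behind tail latency amplification:

```python
import random

random.seed(0)
# Simulated response times with a long tail (mean around 200 ms).
response_times_ms = sorted(random.expovariate(1 / 200) for _ in range(10_000))

def percentile(sorted_values, p):
    # Nearest-rank percentile: the value below which ~p% of samples fall.
    index = min(int(len(sorted_values) * p / 100), len(sorted_values) - 1)
    return sorted_values[index]

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(response_times_ms, p):.0f} ms")

# Tail latency amplification: a request fanning out to n backend calls
# is slow whenever at least one of those calls is slow.
p_slow, n = 0.01, 100
print(f"slow end-user requests: {1 - (1 - p_slow) ** n:.0%}")  # ~63%
```

Even with only 1 in 100 backend calls being slow, at a fan-out of 100 roughly 63% of end-user requests hit at least one slow call.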
- Approaches For Dealing With Load
- Easy: distributing stateless services across multiple machines
- Hard: taking stateful data systems from a single node to a distributed setup
- In an early-stage startup or an unproven product it’s usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load
- An architecture is built around assumptions about which load parameters matter; if those assumptions turn out to be wrong, the scaling effort is wasted
- Maintainability
- Operability
- Make it easy for operations teams to keep the system running smoothly.
- Good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations
- Simplicity (Managing complexity)
- Make it easy for new engineers to understand the system
- When the system is harder for developers to understand and reason about, it breeds:
- hidden assumptions
- unintended consequences
- unexpected interactions
- Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity
- Accidental Complexity
- Complexity that is not inherent in the problem the software solves (as seen by the users) but arises only from the implementation
- Removing Complexity
- Good abstraction
- Hide a great deal of implementation detail behind a clean, simple-to-understand façade (a toy sketch follows below)
- finding good abstractions is very hard
- In distributed systems there are many good algorithms, but it is much less clear how to package them into abstractions that help keep the complexity of the system at a manageable level
- reusable components
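As a toy sketch of such a façade (the interface and names are invented for this note, not from the book): callers depend only on get/set, so the storage implementation can be swapped without rippling through the rest of the system:

```python
from abc import ABC, abstractmethod

class KeyValueStore(ABC):
    """The façade: callers see only get/set, never the storage details."""
    @abstractmethod
    def get(self, key: str) -> str | None: ...
    @abstractmethod
    def set(self, key: str, value: str) -> None: ...

class InMemoryStore(KeyValueStore):
    # One concrete implementation; a disk- or network-backed store could
    # replace it without changing any calling code.
    def __init__(self) -> None:
        self._data: dict[str, str] = {}
    def get(self, key: str) -> str | None:
        return self._data.get(key)
    def set(self, key: str, value: str) -> None:
        self._data[key] = value

def remember_login(store: KeyValueStore, user: str) -> None:
    # Application code depends only on the abstraction.
    store.set(f"last_login:{user}", "2024-01-01T00:00:00Z")
```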
- Evolvability
- Make it easy for engineers to make changes to the system in the future
- agility on a data system level: evolvability
This post is licensed under CC BY 4.0 by the author.