A list of non-functional requirements for recall
After reading books and articles and watching videos, I tried to summarize the non-functional requirements that matter most for line-of-business (LOB) apps. Main sources:
- George Fairbanks: Just Enough Software Architecture
- Memi Lavi: The Complete Guide to Becoming a Software Architect
- Gregor Hohpe, Bobby Woolf: Enterprise Integration Patterns
- Ivaylo Kenov: Architecture series
A list of non-functional requirements for LOB apps
Data volume
- How much data will the system accumulate over time?
- What data is required on Day 1?
- How does the data grow each month/year?
- Will this influence which database to use?
- Will this influence the design of your queries?
- Will this influence your storage capacity and network planning?
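The growth questions above can be made concrete with a month-by-month projection. Below is a minimal Python sketch. The growth model is an assumption for illustration and not a sizing method from the sources above: a fixed monthly intake that may itself grow by a percentage. The function name `project_storage_gb` is hypothetical.

```python
from typing import List


def project_storage_gb(
    day_one_gb: float,
    monthly_added_gb: float,
    months: int,
    monthly_growth_rate: float = 0.0,
) -> List[float]:
    """Project the total data volume at the end of each month.

    Assumed model: each month adds `monthly_added_gb`, and that monthly intake
    itself grows by `monthly_growth_rate` (e.g. 0.05 = 5% more new data each month).
    """
    sizes = []
    total, added = day_one_gb, monthly_added_gb
    for _ in range(months):
        total += added
        sizes.append(total)
        added *= 1 + monthly_growth_rate  # intake for the next month
    return sizes


if __name__ == "__main__":
    # Hypothetical numbers: 500 GB on Day 1, 20 GB/month, intake growing 5% per month.
    for month, size in enumerate(project_storage_gb(500, 20, 12, 0.05), start=1):
        print(f"month {month:2d}: {size:7.1f} GB")
```

With no intake growth, `project_storage_gb(500, 20, 12)[-1]` gives 740 GB after a year. Real inputs should come from measured data, not guesses.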
Latency [in milliseconds]
- How long does it take to get/insert data through the API? This is the latency, in milliseconds.
- How long does it take to read/write a file?
- What is the required response time?
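One way to answer these questions is to measure them. The minimal Python sketch below times an operation repeatedly and reports the median and a nearest-rank 95th percentile in milliseconds. Response-time requirements are often stated as percentiles rather than averages. `time.sleep` is a stand-in for the real API call or file read. The helper names are hypothetical.

```python
import math
import statistics
import time
from typing import Callable, List


def measure_latency_ms(operation: Callable[[], object], runs: int = 50) -> List[float]:
    """Call `operation` `runs` times and return each call's latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile (pct between 0 and 100) of the samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # time.sleep stands in for the API call or file read being measured.
    samples = measure_latency_ms(lambda: time.sleep(0.005), runs=20)
    print(f"median={statistics.median(samples):.1f} ms, p95={percentile(samples, 95):.1f} ms")
```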
Throughput [X requests/second]
- How many tasks can be performed in a given unit of time?
- How many files can be read in a second?
- How many users can save data in the database simultaneously?
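Throughput is completed work divided by elapsed time. The minimal Python sketch below measures it for a task run on a thread pool. It assumes an I/O-bound task, simulated with `time.sleep`, so that more workers actually raise throughput. `measure_throughput` is a hypothetical helper.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def measure_throughput(task: Callable[[int], object], inputs: Iterable[int], workers: int = 4) -> float:
    """Run `task` over `inputs` on `workers` threads and return completed tasks per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        completed = sum(1 for _ in pool.map(task, inputs))
    elapsed = time.perf_counter() - start
    return completed / elapsed


if __name__ == "__main__":
    # time.sleep(0.01) simulates an I/O-bound task such as reading a file.
    for workers in (1, 4):
        rate = measure_throughput(lambda _: time.sleep(0.01), range(100), workers)
        print(f"{workers} worker(s): {rate:.0f} tasks/second")
```

CPU-bound work in CPython would not scale the same way with threads, so the number you get depends on the kind of task being measured.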
Load [e.g. 600 simultaneous requests without crashing]
- Amount of simultaneous work the system can handle before crashing
- How many concurrent requests can be processed before the system starts crashing?
- How many users will use the system simultaneously?
Concurrent users => including “dead time” (idle time between a user's requests)
Load => excluding “dead time” (actual requests/second)
Rule of thumb => concurrent users = load × 10
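The rule of thumb above is simple arithmetic, but writing it down helps when translating a business figure ("how many users?") into a test target ("how many requests/second?"). This is a minimal Python sketch. The factor of 10 is the rough estimate from the list, not a measured constant. Replace it with your own ratio once you have real traffic data.

```python
# Rule of thumb from the list: concurrent users ≈ load × 10, where load is the
# actual requests/second and concurrent users include "dead time".
RULE_OF_THUMB_FACTOR = 10.0


def load_from_concurrent_users(concurrent_users: float, factor: float = RULE_OF_THUMB_FACTOR) -> float:
    """Estimate actual requests/second from the number of concurrent users."""
    return concurrent_users / factor


def concurrent_users_from_load(load_rps: float, factor: float = RULE_OF_THUMB_FACTOR) -> float:
    """Estimate how many concurrent users a given load (requests/second) corresponds to."""
    return load_rps * factor


if __name__ == "__main__":
    print(concurrent_users_from_load(600))   # 6000.0 concurrent users for 600 requests/second
    print(load_from_concurrent_users(6000))  # 600.0 requests/second for 6000 concurrent users
```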
Availability
- On app level
- On platform level
- What is the required uptime?
- Manage client expectations: 99.99% uptime (under an hour of downtime per year) is not realistic
- How do you detect SLA failure? What is the reporting mechanism?
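An uptime percentage is easier to discuss with clients as a downtime budget. The minimal Python sketch below does that conversion. It assumes a 365-day year and ignores leap years and planned maintenance windows.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def allowed_downtime_minutes(uptime_percent: float, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget implied by an uptime target over the given period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% uptime -> {allowed_downtime_minutes(target):.1f} minutes of downtime per year")
```

99.9% leaves about 8.8 hours per year. 99.99% leaves only about 52.6 minutes, which includes deployments, incidents and detection time.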
Scalability
- How much compute/storage can be added without interruption?
- Keep components stateless, or add a stateful sidecar
- How many redundant instances of each service type should there be?
Resilience
- Redundancy (to remain resilient during internal system failures)
  - DB + replication
  - Web backend
  - Message broker
- Patterns for reliable message processing
  - Persistent queues – to avoid overloading and to provide backpressure
  - Retry strategies, timeouts and fallbacks – to make the system predictable
  - Idempotency – to allow retrying requests without worrying about duplicate records (idempotency keys)
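Idempotency keys can be sketched in Python as follows. The client sends a key with each request. The server runs the handler at most once per key and replays the stored result on retries. The class and handler names are hypothetical. The in-memory dict and single lock are simplifying assumptions. A real system would persist keys, for example in the same database transaction as the write.

```python
import threading
from typing import Callable, Dict, Hashable


class IdempotentProcessor:
    """Runs `handler` at most once per idempotency key; retries get the stored result.

    In-memory store and a single global lock, for illustration only.
    """

    def __init__(self, handler: Callable[[dict], dict]) -> None:
        self._handler = handler
        self._results: Dict[Hashable, dict] = {}
        self._lock = threading.Lock()

    def process(self, idempotency_key: Hashable, payload: dict) -> dict:
        with self._lock:
            if idempotency_key in self._results:
                # Retry of a request that already succeeded: no second record.
                return self._results[idempotency_key]
            # If the handler raises, nothing is stored and a retry runs it again.
            result = self._handler(payload)
            self._results[idempotency_key] = result
            return result


if __name__ == "__main__":
    orders = []

    def create_order(payload: dict) -> dict:
        orders.append(payload)
        return {"order_id": len(orders)}

    processor = IdempotentProcessor(create_order)
    first = processor.process("key-123", {"item": "book"})
    retry = processor.process("key-123", {"item": "book"})  # e.g. client retried after a timeout
    print(first, retry, len(orders))  # same result, only one order stored
```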
Error handling
- System failures
- Poison messages
- Functional defects
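Retry strategies and poison messages are closely related. A transient system failure usually succeeds on retry, while a poison message fails every time and must not block the queue. The minimal Python sketch below retries each message with exponential backoff and moves messages that exhaust their attempts to a dead-letter list. The function name and parameters are hypothetical, and timeouts and fallbacks are left out for brevity.

```python
import time
from typing import Callable, List, Tuple


def process_with_retries(
    messages: List[str],
    handler: Callable[[str], None],
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
) -> Tuple[List[str], List[str]]:
    """Process messages, retrying failures with exponential backoff.

    Messages that still fail after `max_attempts` are treated as poison messages
    and dead-lettered instead of blocking the queue. Returns (processed, dead_lettered).
    """
    processed, dead_letter = [], []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
                processed.append(message)
                break
            except Exception:
                if attempt == max_attempts:
                    dead_letter.append(message)  # park it for inspection
                else:
                    time.sleep(base_delay_s * 2 ** (attempt - 1))  # 0.1 s, 0.2 s, ...
    return processed, dead_letter
```

Retrying is only safe when the handler is idempotent. Otherwise a retry after a partial failure can create duplicate records.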
Observability
- To know what is going on in the system and be informed in time to take action
- Monitoring agents on platform and app level
- What do we use for observability and high-cardinality logs?
- What do we do about logging? Platform, app or component level? Logs, metrics, traces?
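Structured logs make high-cardinality fields such as a trace id searchable. The minimal Python sketch below uses only the standard `logging` module and writes one JSON object per line. The field names and the `trace_id` attribute, passed via `extra=`, are assumptions. Dedicated observability tooling would normally take over from here.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Formats each record as one JSON object so log tools can index the fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # High-cardinality field supplied per call via `extra=`.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Logger that writes JSON lines to `stream` and does not propagate to the root logger."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("orders")
    log.info("order %s created", "42", extra={"trace_id": "hypothetical-trace-1"})
```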
Extensibility
- Extend functionality without modifying the existing code base or causing downtime
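A common way to add functionality without touching existing code is a registry that new extensions plug into. The minimal Python sketch below shows this with a decorator-based registry of export formats. The export domain and all names are hypothetical. The sketch only covers the "without modifying the code base" part; zero-downtime rollout is a deployment concern.

```python
from typing import Callable, Dict

# Registry of export formats (hypothetical example domain).
EXPORTERS: Dict[str, Callable[[dict], str]] = {}


def exporter(fmt: str):
    """Decorator that registers a function as the exporter for `fmt`."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        EXPORTERS[fmt] = func
        return func
    return register


@exporter("csv")
def to_csv(record: dict) -> str:
    return ",".join(str(v) for v in record.values())


def export(record: dict, fmt: str) -> str:
    # Existing dispatch code: unchanged when a new format is registered.
    try:
        return EXPORTERS[fmt](record)
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None


# Extension added later, e.g. in a separate module that is imported at startup.
@exporter("kv")
def to_key_value(record: dict) -> str:
    return ";".join(f"{k}={v}" for k, v in record.items())
```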
Backup
- What needs to be backed up?
- How much data can we afford to lose?
- Deltas or full backups?
- How do we plan for ransomware (encryption malware)?
Disaster recovery
- What recovery time do we need/aim for (RTO)?
- Amount of data lost on recovery / snapshot frequency (RPO)
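The choice between deltas and full backups shows up at restore time. With incremental deltas, you need the latest full backup plus every delta after it. The minimal Python sketch below selects that restore chain for a target time. It assumes "delta" means changes since the previous backup (incremental, not differential). The types and names are hypothetical. Anything written after the last backup in the chain is lost, and that gap is bounded by the snapshot frequency.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    taken_at: datetime
    kind: str  # "full", or "delta" = changes since the previous backup (assumed incremental)


def restore_chain(backups: List[Backup], target: datetime) -> List[Backup]:
    """Backups to apply, in order, to restore the state as of `target`.

    Latest full backup taken at or before `target`, followed by every later
    delta up to `target`.
    """
    candidates = sorted((b for b in backups if b.taken_at <= target), key=lambda b: b.taken_at)
    full_positions = [i for i, b in enumerate(candidates) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup taken before the target time")
    return candidates[full_positions[-1]:]
```

Longer delta chains save storage but lengthen recovery time and add more pieces that must all be intact. This trade-off links the backup questions to RTO and RPO.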
Configuration
- Dynamic change of configuration
- Security of configuration
- Process to change configuration
- Capability to roll back configuration
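Rolling back configuration is straightforward when every applied version is kept. The minimal Python sketch below shows an in-memory versioned configuration with apply and rollback. The class name and keys are hypothetical. Security and the change process (who may apply what, auditing) are out of scope. A real setup would store versions in a secured, durable configuration service.

```python
import copy
from typing import Any, Dict, List


class VersionedConfig:
    """Keeps every applied configuration version so a change can be rolled back."""

    def __init__(self, initial: Dict[str, Any]) -> None:
        self._history: List[Dict[str, Any]] = [copy.deepcopy(initial)]

    @property
    def current(self) -> Dict[str, Any]:
        return copy.deepcopy(self._history[-1])  # copy so callers cannot mutate history

    @property
    def version(self) -> int:
        return len(self._history)

    def apply(self, changes: Dict[str, Any]) -> int:
        """Apply changes on top of the current version; return the new version number."""
        new = copy.deepcopy(self._history[-1])
        new.update(changes)
        self._history.append(new)
        return self.version

    def rollback(self) -> int:
        """Revert to the previous version; return the version now active."""
        if len(self._history) == 1:
            raise RuntimeError("nothing to roll back")
        self._history.pop()
        return self.version
```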
Deployment model and topology
- Tech stack for CI/CD
- Time to roll out / roll back
- Ability to do A/B or canary deployments
- How do we roll out database model changes?
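Canary deployments need a stable way to send a small share of users to the new version. The minimal Python sketch below buckets users by hashing their id, so each user stays on the same version across requests. The function name and the percentage-based routing are assumptions. Load balancers and service meshes usually provide this for you.

```python
import hashlib


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically route about `canary_percent`% of users to the canary release."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99 per user
    return bucket < canary_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    share = sum(in_canary(u, 10) for u in users) / len(users)
    print(f"{share:.1%} of users routed to the canary")
```

Raising `canary_percent` step by step only adds users to the canary, because a user's bucket never changes. Rolling back means setting it to 0.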
Maintainability
- Mean time to diagnose and fix a problem (MTTR)
- What are the applicable KPIs and alerts?
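Mean time to repair is a simple KPI once incidents carry detection and resolution timestamps. The minimal Python sketch below computes it. The (detected, resolved) input shape is an assumption about what your incident tracker exports.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_repair(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) over incidents given as (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("no incidents")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    incidents = [
        (start, start + timedelta(minutes=30)),
        (start, start + timedelta(minutes=90)),
    ]
    print(mean_time_to_repair(incidents))  # 1:00:00
```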
Testability
- What do we test?
- How do we test?
- How do we keep tests current and avoid bad practices?
- CI/CD integration of the tests
Modularity
- Capability to change a component while abiding by its interface
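A component can be swapped safely when callers depend only on its interface. The minimal Python sketch below expresses such an interface with `typing.Protocol`, provides two interchangeable implementations, and adds one caller that works with either. The notification domain and all names are hypothetical.

```python
from typing import Protocol


class NotificationSender(Protocol):
    """The interface the rest of the app depends on."""

    def send(self, recipient: str, message: str) -> bool: ...


class EmailSender:
    def send(self, recipient: str, message: str) -> bool:
        print(f"email to {recipient}: {message}")
        return True


class SmsSender:
    def send(self, recipient: str, message: str) -> bool:
        print(f"sms to {recipient}: {message}")
        return True


def notify_overdue_invoice(sender: NotificationSender, customer: str) -> bool:
    # Depends only on the interface, so the sender can be replaced without changes here.
    return sender.send(customer, "Your invoice is overdue.")


if __name__ == "__main__":
    notify_overdue_invoice(EmailSender(), "customer@example.com")
    notify_overdue_invoice(SmsSender(), "+10000000000")
```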
License compliance
- How do we scan dependencies for prohibited licenses?
- How do we stop prohibited licenses from proliferating through the pipelines?
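For a Python code base, a first-pass check can read the license metadata of installed packages through the standard `importlib.metadata` module and compare it with a deny list. The minimal sketch below assumes a hypothetical `PROHIBITED` policy and naive exact matching. Declared license fields are often missing or free text, so dedicated license scanners do considerably more. In a pipeline step, a non-empty result from `find_violations` would fail the build, so the license cannot spread further.

```python
from importlib import metadata
from typing import Dict, Iterable, List

# Hypothetical policy: replace with your organisation's list of prohibited licenses.
PROHIBITED = {"GPL-3.0", "AGPL-3.0"}


def find_violations(licenses: Dict[str, str], prohibited: Iterable[str] = PROHIBITED) -> List[str]:
    """Return the packages whose declared license is on the prohibited list (exact match)."""
    banned = {p.strip().lower() for p in prohibited}
    return sorted(pkg for pkg, lic in licenses.items() if lic.strip().lower() in banned)


def installed_licenses() -> Dict[str, str]:
    """Declared license of each installed Python distribution ("UNKNOWN" if not declared)."""
    result = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:
            # Newer metadata uses License-Expression; older packages use License.
            result[name] = meta.get("License-Expression") or meta.get("License") or "UNKNOWN"
    return result
```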
Interoperability
- Interface contract management for external system integrations
- OpenAPI specs
- The next worst thing after naming conventions 😀
Caching and performance
- Where do we cache data?
- How do you test performance?
- Are you adding caching to the system prematurely?
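When measurements do justify a cache, a cache-aside pattern with a time-to-live is a common starting point. The minimal Python sketch below keeps entries in memory and reloads them from the source once they are older than `ttl_s`. The class name is hypothetical. The injectable `clock` exists only to make expiry testable. A single-process dict is an assumption. Shared caches, eviction and invalidation across instances are separate decisions.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TtlCache:
    """Cache-aside helper: return a cached value if younger than `ttl_s`, else reload it."""

    def __init__(self, loader: Callable[[Hashable], Any], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]  # cache hit
        value = self._loader(key)  # miss or expired: go to the source of truth
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop a key, e.g. after the underlying data was updated."""
        self._entries.pop(key, None)
```

Measuring hit rate and latency with and without the cache is one way to answer whether it was added prematurely.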
Not an exhaustive list, but when consulted it works well as a memory aid.