Notes from the Distributed System Design by Udi Dahan

Posted on September 27, 2010 @ 09:00 school


  • idempotency
  • gigabit ethernet => 128 megabytes ethernet => subtract tcp/http overhead => 50 megabytes data
  • multiple networks => prioritize commands over queries. (CustomerBecamePreferred vs GetAllCustomers
    • network admins can prioritize data on separate networks.
    • move time critical data to separate networks.
  • properties suck! accessing one property from another is not detministically determined. (time to access)
  • everytime you create an assocation their is cost down the road involved.

    8 fallacies of distributed computing..

  • 1 - the network is reliable
  • 2 - latency isn’t a problem
  • 3 - bandwith isn’t a problem
  • 4 - the network is secure - security
  • 5 - the topology wont change
  • 6 - the admin will know what to do.
  • 7 - transport cost isnt’ a problem.
  • 8 - the network is homogeneous
  • 9 - the system is atomic
    • centralized dba committee to sign off on schema changes.
    • solution: internal loose coupling, modularize, design for scale out in advance, design for interaction with other software.
  • 10 - the system is finished.
    • maintenance costs more than development - design for maintenance
    • the system is never finished. - design for upgrades.
    • how will you upgrade the system.
  • 11 - business logic can and should be centrailized
    • re-use can be bad. context matters.
    • single generic abstraction can make things more difficult, and can cause performance problems.
    • more classes but more maintainable. the number of lines and coupling is reduced.
    • generic abstractions can cause a lot of problems down the line. performance, maintainability. (small code base but much more complex to jump in)
    • rules that change often can be segrated from rules that don’t change often.
    • we are taught that re-use is one of the greatest values in software development, in reality, it doesn’t really help as much as we think it should.


    • accept that business logic and validation will be distributed. plan for it.

    big idea! (logical centralization and physical centralization is one to one.) at design time package rules enforcement together. - development time artifact - 12.sln that only has files that relate to rule “12” - ie. js file, cs file, and sql file all related to business rule 12… - business says we want to change rule # 12, then open up 12.sln and make those changes.??? – COULD THIS WORK??? “-“ any new sql files would have to have the next ordinal number to run migration scripts in order. “+” order migrations by timestamp would solve this problem.

  • best practices have yet to catch up to “best thinking”
  • tech cannot solve all problems
  • adding hardware doesn’t necessarily help lunch
  • coupling
    • afferent : depends on you
    • efferent : you depend on
  • attempt to minimize afferent and efferent coupling.
  • zero coupling is not possible.
  • types of coupling for systems -platform -temporal -spatial

shared database is one of the worst forms of coupling. - e.g. one system writes to database, and another reads from it.

  • make coupling visible. no more shared database, it’s to hard to figure out the coupling when you do that.
  • 1 danger is not being aware of the coupling by not seeing it.

Platform coupling

  • aka “interoperability”
  • using protocols only available on one platform. e.g. .NET [A] ——–> JAVA [B]


  • use standards based transfer protocol like http, tcp, udp, smtp, snmp

Temporal coupling

  • coupling to time.
  • e.g. service [A] sends message to service [B] and is waiting for a response. —> TIMEOUT
  • stop trying to solve this type of coupling with multithreaded code.
  • most people should not be writing multi threaded code.
  • fowler says that only 4 people in the world know how to write proper multi threaded code.
  • publish/subscribe vs request/response

Spatial coupling

  • where in space are we depending on.
  • how closely tied am i in space to the physical machine where this is running.


  • loose coupling is more than just a slogan.
  • coupling is a function of 5 diff dimensions.
    • platform
    • spatial
    • temporal
    • afferent
    • efferent

mitigate temporal and spatial coupling by hosting in the same process. [A[B]] or [B[A]]

to do:

  • add zip file to google docs.
  • look up 8 fallacies of distributed computing.


  • reduces coupling
  • RPC crashes when load increases, messaging does not.
  • messaging: data can sit and wait to be processed in a queue
  • rpc: data cannot just sit, threads are blocked. as load increases, more threads are needed.
  • throttling: tell clients to go away. all threads are tied up, and dont intend to activate more.
    • get to max throughput then stay there.
  • messaging: fire and forget
    • we can set the max # of threads to process items off of a queue.
    • store items in persistent storage (queue) which is cheap.
    • this works well for commands.

    • how do you deal with queries? the result needs to be returned in almost real-time.

“represent methods as messages”

  • Authorization : IHandles --> runs against every message for authorization.

DAY 2 — 2010.09.28

role playing… kinky?

  • previous assumptions are incorrect
  • cross functional teams


  • same infrastructure
  • slightly different architecture
  • increase performance

cost of messaging: - learning curve

10:15 am

  • need “correlation id” to tell us which response belongs to which request.
  • eg this is response for request 123
  • now you can have multiple responses for a single request.

browser -> send request to server -> poll server for response with correlation id <- return responses for corellation id


  • events are…


  • topic hierarchies

1:08 pm


  • a little bit of orchestration
  • a little bit of transformation


  • all about connecting event sources to event sinks events are core abstraction of bus arch style.
  • supports federated mode?
  • doesn’t break service autonomey
    • harder to design

Service Orientation

4 tenets of service orientation

  • services are autonomous
  • services have explicit boundaries
  • services share contract & schema, not class or type
  • service interaction is controlled by policy

what is a service?

Service: technical authority for a specific business capability. Service: all data and business rules reside within the service.

what a service is not:

  • has only functionality
  • only has data is a database
    • like [create, read, update, delete] entity

service examples

service deployments

  • many services can be deployed to the same box
  • many services can be deployed in the same app
  • many services can cooperate in a workflow
  • many services can be mashed up in the same page.


  • if subscriber goes down, messages are buffered to disk
    • this can require a significant amount of disk space for an outage.
    • this can take the publisher down.
  • schema: defines the logical message types
  • contract: provides additional tech info

The IT/Operations Service

  • responsible for keeping info flowing in the enterprise
  • owns all hosting technologies.
  • hosts all the business services
    • configures it’s own handlers to run first
      • authentication, authorization - LDAP/AD access
  • HR - employee hired, employee fired.
    • provisions/de-provisions machines, accounts, etc.


  • watch big bang theory
  • nerdtree vim plugin
  • tcommment vim plugin

DAY 3 = 2009.09.29

Review: (Hotel management system exercise) define the services

  • page composition really simplifies things.
  • avoid request/response between services.
  • avoid data duplication between service boundaries.
  • defining the services exercise helps identify the key business capabilities and fosters communication about the business.
  • Services

  • Availability/Booking:
    • booking[bookingid, customerid, daterange, roomclassification]
    • tells how many of what room classifications are available at what times.
  • Facilities/Housekeeping
    • tells what rooms are physically available
  • Customer Care
    • tracks customers information like first name and last name.
  • Billing
    • has price and bills customers for their bookings.


Service Decomposition

  • large scale business capability that a service provides can be futher broken down.

Business Components - BC

  • [yes] multiple databases schemas, no fk relationships between schemas
    • this mitigates deadlocks, and perf issues.
    • referential integrity argument: delete => deletions usually mean the data will not be shown.
    • duplicate data on different islands of data is a huge problem, this is why we do not duplicate data across service boundaries.
    • data guy says do not delete data ever.
    • we effectively scale our our databases by having multiple databases.

be aware of solutions masquerading as requirements - udi dahan

Services and transactions

“autonomous components” : AC


- responsible for one of more message types
- thing that performs a unit of work.
- is independently deployable, has it's own endpoint.
- common code that is needed to handle a specific message type not a single piece of code.
- running part of a service.


BC - multiple AC’s - single DB

  • usually when users report the system is slow, they are usually talking about a specific use case.
    • using the typical monolithic architecture it is difficult to scale.
    • when the business components are separated it is much easier to scale specific business capabilities.
  • from a runtime perspective there are more moving parts, and it can be difficult to monitor.

SOA building blocks summary

you see the AC’s running.

  • autonomous components are the unit of deployment in SOA.
  • ac’s takes responsibility for a specific set of messages types in the service.
  • ac uses the bus to communicate with other ac’s
  • ac’s can communicate using a message bus instead of a service bus. same technological bus, but used differently.
  • not all ac’s require the same infrastructure, within a bc or across all ac’s.
    • eg. one ac my use an orm, another might write straight sql.

Service Structure

  • single user domain / multi-user collaborative domain.
  • model multi user data explicitly.

Queries in a collaborative multi user domain.

is this the right screen to be built for the purpose of collaboration?

  • tell users how stale the data is. “data correct as of 10 minutes ago.”
  • eg bank statement without date.
  • decision support. all query screens are for decision support.
  • which decision(s) are you looking to support for each screen.
  • create separate screens for different levels of decision support.
  • persistent view model. what if we had a table for each screen that stores everything for that screen.
  • how do you keep the data in sync between the persistent view model tables and the source data.
  • no coupling between screens in the ui.
  • no fk relationships in the persisten view model tables.
  • avoid calculations when doing queries.

Deployment and security

Role based security

  • different screens for different roles go to different tables. select permissions per role.
  • use the persistent view model to run some validation before issuing the command.


  • validation: is the input potentially good? structured correctly? ranges, lengths, etc?
    • this is not a service, this is a function.
  • rules: should we do this?
    • based on current system date.
    • what the user saw is irrelevant.

model user intent - udi dahan we want to implement a fair system. if no one can see that a system is unfair, then it is fair enough.

HFT - high frequency trading.

  • good commands
    • “thank you. your confirmation email will arrive shortly.”
    • inherently asynchronous
  • it’s easier to validate a command, because
    • we have context,
    • less data
    • more specific
  • in most cases it’s difficult to justify many to many relationships within a bc.
  • document databases are great for persistent view models.




  • internal schema
    • message types are commands
      • ChangeCustomerAddress
    • faults
      • customer…. missed it.
  • external schema
    • primarily based on events
      • CustomerBecamePreferred
      • OrderCancelled
    • past tense
    • something that has already occurred
    • stay away from db thinking = no CRUD
      • think about business status changes
  • faults
    • an enum value
      • CustomerNotFound
      • OrderLimitExceededForCustomer
    • not an exception
      • we expect these things to occur
      • exceptions don’t really work in async programming.

ultimately, we are refactoring a bus into a plane while it’s driving. - udi dahan

Day 4 – 2009.09.30

Review: Order cancelled

  • don’t process the refund until the products have been returned.
    • the cancel order command has no reason to fail now.
      • you can now cancel an order at anytime, for the customer to receive the refund they must return the product.
    • the ship order command has no reason to fail now.
      • we do not need to ship the order if it is cancelled, but even if they cancel the order after it has been shipped that’s ok.
  • if the new requirements don’t fit the rules then
    • your service boundaries might be incorrect.
    • there is something missing in the requirements or process.
  • svn => single user domain
  • git => multi user domain


Long Running Processes

  • mortgage lending process
  • time is important. you don’t see time, but it’s important to model.

SAGA - handles long lived transaction

  • triggers ares messages
  • similar to message handlers
    • can handle a number of different message types.
  • different from message handlers
    • have state, message handlers don’t
  • sql server isolation level repeatable read is like “select for update”
    • i am going to execute this query again in this transaction and i would like to get back the same result set.
  • within a single business component


  • test it as a black box.
  • make sure you test the saga’s because they are core, and they will change, and it’s important to maintain the original business behaviour as it grows and changes.

workflow foundation WWF

  • transaction management is manual do it yourself
  • hard to test
  • no timeout mechanism


  • can’t actually represent this kind of functionality in a drag and drop orchestration.
  • can be useful when doing something procedural and synchronous.
  • latency tends to be slower.
  • haven’t modeled time.
  • biztalk rules engine (bre)

the hard part

  • the easy part is using the building blocks.
  • hard part is getting them to tell you what the process needs to be.
  • interacting with legacy systems, each response becomes a message which triggers an activity.
    • legacy systems are usually internal to a service.


whenever you hear about workflows, orchestration etc then saga’s are likely a candidate. if we see a saga handling 50 messages, that’s usually a smell.



domain model

  • if you have complicated and ever changing business rules
  • if you have simple not-null checks and a couple of sums to calculate, a transaction script is a better bet.
  • independent of all concerns
  • poco - plain old c# objects
  • testing a connection between objects does not test any sort of behaviour
  • a unit is something that has a boundary.
  • you have been testing the innards of a unit
  • can be deployed multi tier
  • it’s not about persistence, it’s about behaviour.

service layer

  • manages persistence
    • e.g. uses orm to persist domain model.

concurrency models

  • at least with eventual consistency we will effectively get true consistency.
  • with the current way we develop we do not have consistency.
  • the current domain models you’ve built are great for single user model, but not multi user model.

    realistic concurrency

  • happy face
    • you change the customers address
    • i update the customers credit history
  • sad face
    • you cancel an order
    • i try to ship the order
  • only get one domain object.
    • ask it to update itself
    • domain object runs business rules
    • eg. – customer care using() { var customer = session.Get(id); customer.MakePreferred(); }
  • violation
    • crosses 3 service boundaries
      • shipping, billing, customer care public void MakePreferred(){ foreach( var order in this.UnshippedOrders) foreach(var orderLine in order.OrderLines) orderLine.Discount(10.Percent()); }

DAY 5 - 2009.10.01


out of order events

Building a saga for shipping

public class ShippingSagaData : IContainSagaData
	public virtual Guid Id{get;set;}
	public virtual string Originator {get;set;}
	public virtual string OriginalMessageId {get;set;}

	public virtual bool Billed {get;set;}
	public virtual bool Accepted {get;set;}
	public virtual Guid OrderId {get;set;}

public class ShippingSaga : Saga<ShippingSagaData>,
	public override void ConfigureHowToFindSaga()
		ConfigureMapping<OrderAccepted>(s =>s.OrderId, m =>m.OrderId);
		ConfigureMapping<OrderBilled>(s =>s.OrderId, m =>m.OrderId);

	public void Handle(OrderAccepted message)
		Data.OrderId = message.OrderId;
		Data.Accepted = true;

			RequestTimeout(TimeSpan.FromDays(7), "bill");

	public void Handle(OrderBilled message)
		Data.OrderId = message.OrderId;
		Data.Billed = true;

public class OrderAccepted : IMessage
	public Guid OrderId{get;set;}


the tests…

public class ShippingTests
	public void WhenBillingArrivesAfterAcceptedSagaShouldComplete()
		.When(s =>s.Handle(new OrderAccepted())
		.When(s =>s.Handle(new OrderBilled())

	public void WhenBillingArrivesBeforeAcceptedSagaShouldComplete()
		.When(s =>s.Handle(new OrderBilled())
		.When(s =>s.Handle(new OrderAccepted())

	public void SagaRequestsBillingTimeout()
			.ExpectSend<TimeoutMessage>(m => true)
			.When(s =>s.Handle(new OrderAccepted())
			.When(s =>s.Timeout(null))


  • synchronous user login
  • caching
    • keeping cache up to date across farms is challenging.
    • cache invalidation
    • track hit rate (number of times that item was in cache.)
      • hard to do
      • google, facebook, try to get a hit rate above 95%.
  • start using a cdn (content delivery network.)
    • akamai
  • in the db, reads interfere with writes - hurts perf.

    the number one reason why people are having scaling their web applications, is because they are ignoring the web. - udi dahan

  • 90 % of page does not need to be rendered server side for every request.
  • different interface for search engine than for users.
    • meta tags for search engine.
    • ui for users.
  • persistent view model browser side using cookies.
  • ho

smart clients

  • use synchronization domains for thread synchronization.
  • provide information radiators for your knowledge workers.
  • client side domain model.
  • use property grid to display the status of objects without needing to attach a debugger.
  • cloning-proxies for views
    • create a proxy for your views so that any data handed to the view can be cloned and bound to the view.

map display

  • see your wells on a map
  • constraints
    • may receive 100’s of updates per second at peak