28 November 2018

Deep Dive on Amazon Neptune (DAT403)

by mo


What can you do with Apache TinkerPop and Gremlin or RDF and SPARQL? How does Neptune provide multi-Availability Zone high availability? Learn about the features and details of Amazon’s fully managed graph database service.

Graph database.

  • useful for traversing relationships
  • kinds of data
  • rigid highly connected data
  • heterogeneous schema and formats
  • highly connected data (hr system, recommendation, social graph)
  • value is from thinking about the relationships
  • social networking
  • recommendations
  • knowledge graphs
  • fraud detection
  • life sciences
  • network and IT operations

examples: whom might i know? what product should i buy?

we can see shared interests. we can identify new edges that create new triangles in the graph.
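The triangle idea above can be sketched in a few lines of plain Python. This is only an illustration: the toy graph, the names, and the `suggest` helper are made up for these notes and have nothing to do with how Neptune works internally. A candidate edge is scored by how many triangles it would close, i.e. how many friends the two people already share.

```python
from collections import Counter

# Hypothetical toy social graph: person -> set of friends (undirected).
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat"},
}


def suggest(person):
    """Rank non-friends by how many triangles a new edge to them would close."""
    counts = Counter()
    for friend in friends[person]:
        for friend_of_friend in friends[friend]:
            if friend_of_friend != person and friend_of_friend not in friends[person]:
                # One shared friend means one new triangle if the edge is added.
                counts[friend_of_friend] += 1
    return counts.most_common()


print(suggest("ann"))
```

Here ann and dan are not connected but share two friends, so an ann-dan edge would close two triangles and dan is the top "whom might i know?" suggestion.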

drawbacks of using a relational database (e.g. RDS)

  • query patterns are difficult: they require lots of joins, get complex, and are hard to make efficient.
  • join queries are slow and need indexes
  • indexes slow down write performance
  • what is the write performance for this graph db?
  • is replication/sharding a thing? HA?
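To make the join problem concrete, here is a rough sketch using Python's built-in `sqlite3` module. The `knows` table and its rows are invented for illustration. Each extra hop through the relationships costs another self-join on the same table.

```python
import sqlite3

# Hypothetical edge table: one row per "src knows dst" relationship.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knows (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO knows VALUES (?, ?)",
    [("ann", "bob"), ("bob", "cat"), ("cat", "dan")],
)

# Two hops out from ann: one self-join.
two_hops = conn.execute(
    """
    SELECT DISTINCT k2.dst
    FROM knows AS k1
    JOIN knows AS k2 ON k2.src = k1.dst
    WHERE k1.src = ?
    """,
    ("ann",),
).fetchall()

# Three hops out: a second self-join, and so on for every extra hop.
three_hops = conn.execute(
    """
    SELECT DISTINCT k3.dst
    FROM knows AS k1
    JOIN knows AS k2 ON k2.src = k1.dst
    JOIN knows AS k3 ON k3.src = k2.dst
    WHERE k1.src = ?
    """,
    ("ann",),
).fetchall()

print(two_hops, three_hops)
```

On a real table, each of those joins wants an index on `src`, and that index is exactly the write-performance cost noted in the list.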

relationships are first class objects. this provides power for querying. graph dbs are optimized for highly connected data.

  • storage and retrieval

2 graph models/frameworks

  • property graph (Apache TinkerPop)
    • gremlin query language
  • resource description framework (RDF)
    • SPARQL query language from W3C

nodes, edges, and properties instead of rows and tables

  • gremlin is an imperative language.

RDF is used to describe resources on the web.
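A small sketch of how the same fact looks in each model, using plain Python structures. All ids, labels, and helper names here are invented for illustration and are not TinkerPop or RDF library APIs.

```python
# Property graph: vertices and edges both carry key/value properties.
vertices = {
    "v1": {"label": "person", "name": "ann"},
    "v2": {"label": "university", "name": "state u"},
}
edges = [
    # The edge itself holds a property ("since"), not just its endpoints.
    {"label": "attended", "out": "v1", "in": "v2", "since": 2014},
]

# RDF: everything is a (subject, predicate, object) triple.
triples = [
    (":ann", "rdf:type", ":Person"),
    (":stateU", "rdf:type", ":University"),
    (":ann", ":attended", ":stateU"),
]


def schools_pg(name):
    """Follow 'attended' edges out of the vertex with the given name."""
    return [
        vertices[e["in"]]["name"]
        for e in edges
        if e["label"] == "attended" and vertices[e["out"]]["name"] == name
    ]


def schools_rdf(subject):
    """Match triples of the form (subject, :attended, ?school)."""
    return [o for s, p, o in triples if s == subject and p == ":attended"]


print(schools_pg("ann"), schools_rdf(":ann"))
```

One practical difference: the property graph edge can hold `since` directly, while a plain triple has no slot for it, so in RDF that extra fact needs additional triples.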

example graph query:

  • find all grad students who received an undergrad degree from the same university

the gremlin example is kind of gross so i won't write it. java-like builder syntax.

nicer looking syntax (SPARQL):

select ?student where {
  ?student rdf:type ub:GraduateStudent .
  ?univ rdf:type ub:University .
  ?dept rdf:type ub:Department .
  ?student ub:memberOf ?dept .
  ?dept ub:subOrganizationOf ?univ .
  ?student ub:undergraduateDegreeFrom ?univ .
}
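To see what a query like this does, here is a toy triple-pattern matcher in plain Python. It is a sketch only: the data is made up, the predicate names (`ub:memberOf`, `ub:subOrganizationOf`, `ub:undergraduateDegreeFrom`) are assumptions borrowed from the univ-bench style vocabulary, and a real SPARQL engine plans and indexes this very differently. Terms starting with `?` are variables, and every pattern has to agree on what each variable is bound to.

```python
def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)


# Hypothetical data: amy stayed at the same university, bo did not.
triples = [
    ("amy", "rdf:type", "ub:GraduateStudent"),
    ("bo", "rdf:type", "ub:GraduateStudent"),
    ("uni0", "rdf:type", "ub:University"),
    ("dept0", "rdf:type", "ub:Department"),
    ("dept0", "ub:subOrganizationOf", "uni0"),
    ("amy", "ub:memberOf", "dept0"),
    ("bo", "ub:memberOf", "dept0"),
    ("amy", "ub:undergraduateDegreeFrom", "uni0"),
    ("bo", "ub:undergraduateDegreeFrom", "uni9"),
]

patterns = [
    ("?student", "rdf:type", "ub:GraduateStudent"),
    ("?univ", "rdf:type", "ub:University"),
    ("?dept", "rdf:type", "ub:Department"),
    ("?student", "ub:memberOf", "?dept"),
    ("?dept", "ub:subOrganizationOf", "?univ"),
    ("?student", "ub:undergraduateDegreeFrom", "?univ"),
]

students = [b["?student"] for b in match(patterns, triples)]
print(students)
```

The declarative part is the `patterns` list: it says what shape to find, and the matcher (or the database) decides how to walk the data.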

challenges

  • difficult to scale (high ops workload)
  • difficult to maintain HA (high ops workload)
  • too expensive
  • limited support for open standards (needs enterprise support and licensing)

neptune: high throughput/low latency

  • fast
  • reliable: 6 replicas across 3 AZs
  • easy
  • open

query patterns

  • OLTP queries
  • OLAP queries -> 100 per second per server (high latency) -> depends on the shape of the data

durable and ACID. supports both TinkerPop and RDF/SPARQL. bulk load import from S3, with JSON documents via a REST interface.

  • multi-AZ HA
  • read replicas
  • encryption at rest
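A rough sketch of what kicking off the S3 bulk load over REST might look like, using only the Python standard library. The endpoint host, bucket, and role ARN are placeholders, `build_load_request` is a made-up helper, and the field names follow my understanding of the Neptune loader API, so check them against the current docs before relying on this. The request is built but not sent.

```python
import json
import urllib.request


def build_load_request(endpoint, s3_uri, role_arn, region, fmt="csv"):
    """Build (but do not send) a POST to the cluster's /loader endpoint."""
    body = {
        "source": s3_uri,        # S3 prefix holding the data files
        "format": fmt,           # e.g. csv for gremlin, ntriples/turtle for RDF
        "iamRoleArn": role_arn,  # role the cluster assumes to read from S3
        "region": region,
        "failOnError": "FALSE",
    }
    return urllib.request.Request(
        url=f"https://{endpoint}:8182/loader",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# All values below are placeholders.
req = build_load_request(
    endpoint="my-cluster.example.com",
    s3_uri="s3://my-bucket/graph-data/",
    role_arn="arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    region="us-east-1",
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would submit the job; that needs network access to the cluster.
```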
