Schema Design Workshop

Gary J. Murakami, Ph.D.
10gen (the MongoDB company)
@GaryMurakami

# Agenda 1. Basic schema design principles for MongoDB 2. Schema design over an application's lifetime 3. Common design patterns 4. Sharding
# Goals * Learn the schema design process in MongoDB * Practice applying common principles via exercises * Understand the implications of sharding
## What is a schema and why is it important?
# Schema * Map concepts and relationships to data * Set expectations for the data * Minimize overhead of iterative modifications * Ensure compatibility
## Normalization
users
  • username
  • first_name
  • last_name
books
  • title
  • isbn
  • language
  • created_by
  • author
authors
  • first_name
  • last_name
## Denormalization
users
  • username
  • first_name
  • last_name
books
  • title
  • isbn
  • language
  • created_by
  • author
    • first_name
    • last_name
## What is schema design like in MongoDB? * Schema is defined at the application-level * Design is part of each phase in its lifetime * There is no magic formula
## MongoDB Documents Storage in BSON → BSONSpec.org * Scalars * Doubles * Integers (32 or 64-bit) * UTF-8 strings * UTC Date, timestamp * Binary, regex, code * Object ID * `null` * Rich types * Objects * Arrays
## Terminology
{
    "mongodb"    : "relational db",
    "database"   : "database",
    "collection" : "table",
    "document"   : "row",
    "field"      : "column",
    "index"      : "index",
    "sharding" : {
        "shard"     : "partition",
        "shard key" : "partition key"
    }
}
## Three Considerations in MongoDB Schema Design 1. The data your application needs 2. Your application's read usage of the data 3. Your application's write usage of the data
# Case Study ## Library Web Application Different schemas are possible
## Author Schema
{
    "_id": int,
    "first_name": string,
    "last_name": string
}
## User Schema
{
    "_id": int,
    "username": string,
    "password": string
}
## Book Schema
{
    "_id": int,
    "title": string,
    "slug": string,
    "author": int,
    "available": boolean,
    "isbn": string,
    "pages": int,
    "publisher": {
        "city": string,
        "date": date,
        "name": string
    },
    "subjects": [ string, string ],
    "language": string,
    "reviews": [
       { "user": int, "text": string },
       { "user": int, "text": string }
    ],
}
# Example Documents
## Author Document
> db.authors.findOne()
{
    _id: 1,
    first_name: "F. Scott",
    last_name: "Fitzgerald"
}
## User Document
> db.users.findOne()
{
    _id: 1,
    username: "emily@10gen.com",
    password: "slsjfk4odk84k209dlkdj90009283d"
}
## Book Document
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    slug: "9781857150193-the-great-gatsby",
    author: 1,
    available: true,
    isbn: "9781857150193",
    pages: 176,
    publisher: {
        name: "Everyman's Library",
        date: ISODate("1991-09-19T00:00:00Z"),
        city: "London"
    },
    subjects: ["Love stories", "1920s", "Jazz Age"],
    language: "English",
    reviews: [
       { user: 1, text: "One of the best…" },
       { user: 2, text: "It's hard to…" }
    ]
}
# Embedded Objects #### AKA embedded documents or sub-documents What advantages do they have? When should they be used?
## Embedded Objects
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    slug: "9781857150193-the-great-gatsby",
    author: 1,
    available: true,
    isbn: "9781857150193",
    pages: 176,
    publisher: {
        name: "Everyman's Library",
        date: ISODate("1991-09-19T00:00:00Z"),
        city: "London"
    },
    subjects: ["Love stories", "1920s", "Jazz Age"],
    language: "English",
    reviews: [
       { user: 1, text: "One of the best…" },
       { user: 2, text: "It's hard to…" }
    ]
}
## Embedded Objects * Great for read performance * One seek to load the entire document * One round trip to the database * Writes can be slow if constantly adding to objects
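As a quick sketch against the book document above (the new review values are illustrative), one read returns everything, while the slow pattern is a write that keeps growing the same document:
// One seek, one round trip: book, publisher, and all reviews come back together
book = db.books.findOne({ _id: 1 });

// The potentially slow write pattern: every new review grows the same document
db.books.update(
    { _id: 1 },
    { $push: { reviews: { user: 3, text: "Couldn't put it down…" }}}
);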
# Linked Documents What advantages does this approach have? When should it be used?
## Linked Documents
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    slug: "9781857150193-the-great-gatsby",
    author: 1,
    available: true,
    isbn: "9781857150193",
    pages: 176,
    publisher: {
        publisher_name: "Everyman's Library",
        date: ISODate("1991-09-19T00:00:00Z"),
        publisher_city: "London"
    },
    subjects: ["Love stories", "1920s", "Jazz Age"],
    language: "English",
    reviews: [
       { user: 1, text: "One of the best…" },
       { user: 2, text: "It's hard to…" }
    ]
}
## Linked Documents * More, smaller documents * Can make queries by ID very simple * Accessing linked document data requires extra read * What effect does this have on the system?
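A minimal sketch of that extra read, following the `author` link from the book document above:
// First read: the book stores only the author's _id
book = db.books.findOne({ _id: 1 });

// Second read: follow the link to load the author document
author = db.authors.findOne({ _id: book.author });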
# Data, RAM and Disk
(Diagrams: memory-1.png through memory-4.png)
# Arrays When should they be used?
## Array of Scalars
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    slug: "9781857150193-the-great-gatsby",
    author: 1,
    available: true,
    isbn: "9781857150193",
    pages: 176,
    publisher: {
        name: "Everyman's Library",
        date: ISODate("1991-09-19T00:00:00Z"),
        city: "London"
    },
    subjects: ["Love stories", "1920s", "Jazz Age"],
    language: "English",
    reviews: [
       { user: 1, text: "One of the best…" },
       { user: 2, text: "It's hard to…" }
    ]
}
## Array of Objects
> db.books.findOne()
{   _id: 1,
    title: "The Great Gatsby",
    slug: "9781857150193-the-great-gatsby",
    author: 1,
    available: true,
    isbn: "9781857150193",
    pages: 176,
    publisher: {
        name: "Everyman's Library",
        date: ISODate("1991-09-19T00:00:00Z"),
        city: "London"
    },
    subjects: ["Love stories", "1920s", "Jazz Age"],
    language: "English",
    reviews: [
       { user: 1, text: "One of the best…" },
       { user: 2, text: "It's hard to…" }
    ],
}
## Exercise #1 Design a schema for users and their book reviews * Users * username (string) * email (string) * Reviews * text (string) * rating (integer) * created_at (date) *Usernames are immutable*
### Exercise #1: Solution A Reviews may be queried by user or book
// db.users (one document per user)
{   _id: ObjectId("…"),
    username: "bob",
    email: "bob@example.com"
}
// db.reviews (one document per review)
{   _id: ObjectId("…"),
    user: ObjectId("…"),
    book: ObjectId("…"),
    rating: 5,
    text: "This book is excellent!",
    created_at: ISODate("2012-10-10T21:14:07.096Z")
}
### Exercise #1: Solution B Optimized to retrieve reviews by user
// db.users (one document per user with all reviews)
{   _id: ObjectId("…"),
    username: "bob",
    email: "bob@example.com",
    reviews: [
        {   book: ObjectId("…"),
            rating: 5,
            text: "This book is excellent!",
            created_at: ISODate("2012-10-10T21:14:07.096Z")
        }
    ]
}
### Exercise #1: Solution C Optimized to retrieve reviews by book
// db.users (one document per user)
{   _id: ObjectId("…"),
    username: "bob",
    email: "bob@example.com"
}
// db.books (one document per book with all reviews)
{   _id: ObjectId("…"),
    // Other book fields…
    reviews: [
        {   user: ObjectId("…"),
            rating: 5,
            text: "This book is excellent!",
            created_at: ISODate("2012-10-10T21:14:07.096Z")
        }
    ]
}
## Schema Design Over an Application's Lifetime * Development * Production * Iterative Modifications
# Development Phase Basic CRUD functionality
# CreateRUD
author = {
    _id: 2,
    first_name: "Arthur",
    last_name: "Miller"
};

db.authors.insert(author);
* The `_id` field is unique and automatically indexed * MongoDB will generate an [ObjectId](http://www.mongodb.org/display/DOCS/Object+IDs#ObjectIDs-BSONObjectIDSpecification) if not provided
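A small sketch of the automatic `_id` (the author and the ObjectId value shown are illustrative):
// Insert without an _id…
db.authors.insert({ first_name: "Harper", last_name: "Lee" });

> db.authors.findOne({ last_name: "Lee" })
{
    _id: ObjectId("507f191e810c19729de860ea"),  // generated; value illustrative
    first_name: "Harper",
    last_name: "Lee"
}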
# CReadUD
> db.authors.find({ "last_name": "Miller" })
{
    _id: 2,
    first_name: "Arthur",
    last_name: "Miller"
}
## Reads and Indexing Examine the query after creating an index.
> db.books.ensureIndex({ "slug": 1 })

> db.books.find({ "slug": "the-great-gatsby" }).explain()
{
    "cursor": "BtreeCursor slug_1",
    "isMultiKey" : false,
    "n" : 1,
    "nscannedObjects" : 1,
    "nscanned" : 1,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    // Other fields follow…
}
## Multi-key Indexes Index all values in an array field.
> db.books.ensureIndex({ "subjects": 1 })
## Indexing Embedded Fields Index an embedded object's field.
> db.books.ensureIndex({ "publisher.name": 1 })
## Query operators * Conditional operators * `$gt`, `$gte`, `$lt`, `$lte`, `$ne`, `$all`, `$in`, `$nin`, `$size`, `$and`, `$or`, `$nor`, `$mod`, `$type`, `$exists` * Regular expressions * Value in an array * `$elemMatch` * Cursor methods and modifiers * `count()`, `limit()`, `skip()`, `snapshot()`, `sort()`, `batchSize()`, `explain()`, `hint()`
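A sketch combining a few of these against the book schema above:
// Books over 150 pages on either subject, newest edition first, 5 at a time
db.books.find({
    pages: { $gt: 150 },
    subjects: { $in: ["Jazz Age", "Love stories"] }
}).sort({ "publisher.date": -1 }).limit(5);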
# CRUpdateD
review = {
    user: 1,
    text: "I did NOT like this book."
};

db.books.update(
    { _id: 1 },
    { $push: { reviews: review }}
);
## Atomic Modifiers Update specific fields within a document * `$set`, `$unset` * `$push`, `$pushAll` * `$addToSet`, `$pop` * `$pull`, `$pullAll` * `$rename` * `$bit`
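A brief sketch combining two modifiers in one atomic update (the "Classics" subject is an illustrative value):
db.books.update(
    { _id: 1 },
    {
        $set: { available: false },         // change one field in place
        $addToSet: { subjects: "Classics" } // add only if not already present
    }
);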
# CRUDelete
> db.books.remove({ _id: 1 })
# Production Phase Evolve schema to meet the application's read and write patterns
## Read Usage Finding books by an author's first name
authors = db.authors.find({ first_name: /^f.*/i }, { _id: 1 });

authorIds = authors.map(function(x) { return x._id; });

db.books.find({author: { $in: authorIds }});
## Read Usage "Cache" the author name in an embedded document
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    author: {
        first_name: "F. Scott",
        last_name: "Fitzgerald"
    }
    // Other fields follow…
}
Queries are now one step
> db.books.find({ "author.first_name": /^f.*/i })
## Write Usage Users can review a book
review = {
    user: 1,
    text: "I thought this book was great!",
    rating: 5
};

> db.books.update(
    { _id: 3 },
    { $push: { reviews: review }}
);
* Document size limit (16MB) * Storage fragmentation after many updates/deletes
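One rough way to keep an eye on the 16MB limit from the shell is to check a document's BSON size:
// Size, in bytes, of the book document including all embedded reviews
Object.bsonsize(db.books.findOne({ _id: 3 }));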
## Exercise #2 * Display the 10 most recent reviews by a user * Make efficient use of memory and disk seeks
### Exercise #2: Solution Store users' reviews in monthly buckets
// db.reviews (one document per user per month)
{   _id: "bob-201210",
    reviews: [
        {   _id: ObjectId("…"),
            rating: 5,
            text: "This book is excellent!",
            created_at: ISODate("2012-10-10T21:14:07.096Z")
        },
        {   _id: ObjectId("…"),
            rating: 2,
            text: "I didn't really enjoy this book.",
            created_at: ISODate("2012-10-11T20:12:50.594Z")
        }
    ]
}
### Exercise #2: Solution Adding a new review to the appropriate bucket
myReview = {
    _id: ObjectId("…"),
    rating: 3,
    text: "An average read.",
    created_at: ISODate("2012-10-13T12:26:11.502Z")
};

> db.reviews.update(
      { _id: "bob-201210" },
      { $push: { reviews: myReview }}
);
### Exercise #2: Solution Display the 10 most recent reviews by a user
// Buckets sort newest-first by _id; $slice: -10 keeps each bucket's newest reviews
cursor = db.reviews.find(
    { _id: /^bob-/ },
    { reviews: { $slice: -10 }}
).sort({ _id: -1 });

num = 0;

while (cursor.hasNext() && num < 10) {
    doc = cursor.next();

    // Reviews are pushed in chronological order, so walk each bucket from the end
    for (var i = doc.reviews.length - 1; i >= 0 && num < 10; --i, ++num) {
        printjson(doc.reviews[i]);
    }
}
### Exercise #2: Solution Deleting a review
db.reviews.update(
    { _id: "bob-201210" },
    { $pull: { reviews: { _id: ObjectId("…") }}}
);
# Iterative Modifications Schema design is evolutionary
## Allow users to browse by book subject
> db.subjects.findOne()
{
    _id: 1,
    name: "American Literature",
    sub_category: {
         name: "1920s",
         sub_category: { name: "Jazz Age" }
   }
}
* How can you search this collection? * Be aware of document size limitations * Benefit from hierarchy being in same document
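As a sketch of the search problem, each level of nesting needs its own dotted path, so the query must know how deep the subject sits:
// Only finds "Jazz Age" if it is exactly two levels down
db.subjects.find({ "sub_category.sub_category.name": "Jazz Age" });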
## Tree Structures
> db.subjects.find()
{   _id: "American Literature" }

{   _id : "1920s",
    ancestors: ["American Literature"],
    parent: "American Literature"
}

{   _id: "Jazz Age",
    ancestors: ["American Literature", "1920s"],
    parent: "1920s"
}

{   _id: "Jazz Age in New York",
    ancestors: ["American Literature", "1920s", "Jazz Age"],
    parent: "Jazz Age"
}
## Tree Structures Find sub-categories of a given subject
> db.subjects.find({ ancestors: "1920s" })
{
    _id: "Jazz Age",
    ancestors: ["American Literature", "1920s"],
    parent: "1920s"
}

{
    _id: "Jazz Age in New York",
    ancestors: ["American Literature", "1920s", "Jazz Age"],
    parent: "Jazz Age"
}
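Two supporting operations for this pattern (the index is an assumption, not part of the original example):
// Multi-key index on ancestors makes sub-tree queries fast (assumed addition)
db.subjects.ensureIndex({ ancestors: 1 });

// Direct children only: match on parent instead of ancestors
db.subjects.find({ parent: "1920s" });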
## Exercise #3 * Allow users to borrow library books * User sends a loan request * Library approves or not * Requests time out after seven days * Approval process is asynchronous * Requests may be prioritized
### Exercise #3: Solution * Need to maintain order and state * Ensure that updates are atomic
// Create a new loan request
db.loans.insert({
    _id: { borrower: "bob", book: ObjectId("…") },
    pending: false,
    approved: false,
    priority: 1,
});

// Find the highest priority request and mark as pending approval
request = db.loans.findAndModify({
    query: { pending: false },
    sort: { priority: -1 },
    update: { $set: { pending: true, started: new ISODate() }},
    new: true
});
### Exercise #3: Solution * Updated and added fields * Modified document was returned
{
    _id: { borrower: "bob", book: ObjectId("…") },
    pending: true,
    approved: false,
    priority: 1,
    started: ISODate("2012-10-11T22:09:42.542Z")
}
### Exercise #3: Solution
// Library approves the loan request
db.loans.update(
    { _id: { borrower: "bob", book: ObjectId("…") }},
    { $set: { pending: false, approved: true }}
);
### Exercise #3: Solution
// Request times out after seven days
limit = new Date();
limit.setDate(limit.getDate() - 7);

db.loans.update(
    { pending: true, started: { $lt: limit }},
    { $set: { pending: false, approved: false }}
);
## Exercise #4 Allow users to recommend books * Users can recommend each book only once * Display a book's current recommendations
### Exercise #4: Solution
// db.recommendations (one document per user per book)
db.recommendations.insert({
    book: ObjectId("…"),
    user: ObjectId("…")
});

// Unique index ensures users can't recommend twice
db.recommendations.ensureIndex(
    { book: 1, user: 1 },
    { unique: true }
);

// Count the number of recommendations for a book
db.recommendations.count({ book: ObjectId("…") });
### Exercise #4: Solution * Indexes in MongoDB do not maintain counts * Counts are computed via index scans * Denormalize totals onto books
db.books.update(
    { _id: ObjectId("…") },
    { $inc: { recommendations: 1 }}
);
# Common Design Patterns
# One-to-one Relationship Let's pretend that authors only write one book.
## Linking Either side, or both, can track the relationship.
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    slug: "9781857150193-the-great-gatsby",
    author: 1,
    // Other fields follow…
}

> db.authors.findOne({ _id: 1 })
{
    _id: 1,
    first_name: "F. Scott",
    last_name: "Fitzgerald"
    book: 1,
}
## Embedded Object
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    slug: "9781857150193-the-great-gatsby",
    author: {
        first_name: "F. Scott",
        last_name: "Fitzgerald"
    }
    // Other fields follow…
}
# One-to-many Relationship In reality, authors may write multiple books.
## Array of IDs The "one" side tracks the relationship.
> db.authors.findOne()
{
    _id: 1,
    first_name: "F. Scott",
    last_name: "Fitzgerald",
    books: [1, 3, 20]
}
* Flexible and space-efficient
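Two operations this shape keeps simple, sketched against the author document above (book 21 is an illustrative id):
// Which author wrote book 3? The array match finds it
db.authors.find({ books: 3 });

// Record a newly published book for the author (id 21 is illustrative)
db.authors.update({ _id: 1 }, { $push: { books: 21 }});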
## Single Field with ID The "many" side tracks the relationship.
> db.books.find({ author: 1 })
{
    _id: 1,
    title: "The Great Gatsby",
    slug: "9781857150193-the-great-gatsby",
    author: 1,
    // Other fields follow…
}

{
    _id: 3,
    title: "This Side of Paradise",
    slug: "9780679447238-this-side-of-paradise",
    author: 1,
    // Other fields follow…
}
## Array of Objects
> db.authors.findOne()
{
    _id: 1,
    first_name: "F. Scott",
    last_name: "Fitzgerald",
    books: [
        { _id: 1, title: "The Great Gatsby" },
        { _id: 3, title: "This Side of Paradise" }
    ]
    // Other fields follow…
}
Use the `$slice` operator to return a subset of books
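For example, returning only the first two embedded books:
db.authors.find({ _id: 1 }, { books: { $slice: 2 }});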
# Many-to-many Relationship Some books may also have co-authors.
## Array of IDs on Both Sides
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    authors: [1, 5]
    // Other fields follow…
}
> db.authors.findOne()
{
    _id: 1,
    first_name: "F. Scott",
    last_name: "Fitzgerald",
    books: [1, 3, 20]
}
## Array of IDs on Both Sides Query for all books by a given author
db.books.find({ authors: 1 });
Query for all authors of a given book
db.authors.find({ books: 1 });
## Array of IDs on One Side
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    authors: [1, 5]
    // Other fields follow…
}
> db.authors.find({ _id: { $in: [1, 5] }})
{
    _id: 1,
    first_name: "F. Scott",
    last_name: "Fitzgerald"
}

{
    _id: 5,
    first_name: "Unknown",
    last_name: "Co-author"
}
## Array of IDs on One Side Query for all books by a given author
db.books.find({ authors: 1 });
Query for all authors of a given book
book = db.books.findOne(
    { title: "The Great Gatsby" },
    { authors: 1 }
);

db.authors.find({ _id: { $in: book.authors }});
## Exercise #5 * Tracking time series data * Graph recommendations per unit of time * Count by: day, hour, minute
### Exercise #5: Solution A
// db.rec_ts (time series buckets, hour and minute sub-docs)
db.rec_ts.insert({
    book: ObjectId("…"),
    day: ISODate("2012-10-11T00:00:00.000Z")
    total: 0,
    hour:   { "0": 0, "1": 0, /* … */ "23": 0 },
    minute: { "0": 0, "1": 0, /* … */ "1439": 0 }
});

// Record a recommendation created one minute before midnight
db.rec_ts.update(
    { book: ObjectId("…"), day: ISODate("2012-10-11T00:00:00.000Z") },
    { $inc: { total: 1, "hour.23": 1, "minute.1439": 1 }}
);
## BSON Storage * Sequence of key/value pairs * Not a hash map * Optimized to scan quickly
minute [0] [1] … [1439]
What is the cost of updating the minute before midnight?
## BSON Storage We can skip sub-documents
hour0 [0] [1] … [59]
hour23 [1380] … [1439]
How could this change the schema?
### Exercise #5: Solution B
// db.rec_ts (time series buckets, each hour a sub-doc)
db.rec_ts.insert({
    book: ObjectId("…"),
    day: ISODate("2012-10-11T00:00:00.000Z")
    total: 148,
    hour: {
        "0": { total: 7, "0": 0, /* … */ "59": 2 },
        "1": { total: 3, "60": 1, /* … */ "119": 0 },
        // Other hours…
        "23": { total: 12, "1380": 0, /* … */ "1439": 3 }
    }
});

// Record a recommendation created one minute before midnight
db.rec_ts.update(
    { book: ObjectId("…"), day: ISODate("2012-10-11T00:00:00.000Z") },
    { $inc: { total: 1, "hour.23.total": 1, "hour.23.1439": 1 }}
);
## Single-collection Inheritance Take advantage of MongoDB's features * Documents need not all have the same fields * [Sparsely index](http://www.mongodb.org/display/DOCS/Indexes#Indexes-sparse%3Atrue) only present fields
## Schema Flexibility
> db.books.findOne()
{
    _id: 47,
    title: "The Wizard Chase",
    type: "series",
    series_title: "The Wizard's Trilogy",
    volume: 2
    // Other fields follow…
}
Find all books that are part of a series
db.books.find({ type: "series" });

db.books.find({ series_title: { $exists: true }});

db.books.find({ volume: { $gt: 0 }});
## Index Only Present Fields Documents without these fields will not be indexed.
db.books.ensureIndex({ series_title: 1 }, { sparse: true });

db.books.ensureIndex({ volume: 1 }, { sparse: true });
## Exercise #6 Users can recommend at most 10 books
### Exercise #6: Solution
// db.user_recs (track user's remaining and given recommendations)
db.user_recs.insert({
    _id: "bob",
    remaining: 8,
    books: [3, 10]
});

// Record a recommendation if possible
db.user_recs.update(
    { _id: "bob", remaining: { $gt: 0 }, books: { $ne: 4 }},
    { $inc: { remaining: -1 }, $push: { books: 4 }}
);
### Exercise #6: Solution * One fewer unassigned recommendation remaining * The newly recommended book is now in the `books` array
> db.user_recs.findOne()
{
    _id: "bob",
    remaining: 7,
    books: [3, 10, 4]
}
## Exercise #7 * Statistic buckets * Each book has a listing page in our application * Record referring website domains for each book * Count each domain independently
### Exercise #7: Solution A
> db.book_refs.findOne()
{   book: 1,
    referrers: [
        { domain: "google.com", count: 4 },
        { domain: "yahoo.com", count: 1 }
    ]
}
db.book_refs.update(
    { book: 1, "referrers.domain": "google.com" },
    { $inc: { "referrers.$.count": 1 }}
);
### Exercise #7: Solution A The positional operator `$` updates the first matched array element.
db.book_refs.update(
    { book: 1, "referrers.domain": "google.com" },
    { $inc: { "referrers.$.count": 1 }}
);
> db.book_refs.findOne()
{   book: 1,
    referrers: [
        { domain: "google.com", count: 5 },
        { domain: "yahoo.com", count: 1 }
    ]
}
What if a new referring website is used?
### Exercise #7: Solution B
> db.book_refs.findOne()
{   book: 1,
    referrers: {
        "google_com": 5,
        "yahoo_com": 1
    }
}
db.book_refs.update(
    { book: 1 },
    { $inc: { "referrers.bing_com": 1 }},
    true
);
* Replace dots with underscores for key names * Increment to add a new referring website * Upsert in case this is the book's first referrer
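A sketch of building such a key in application code (the `domainKey` helper is hypothetical, not part of the original example):
// Hypothetical helper: dots are not allowed in key names, so "bing.com" becomes "bing_com"
function domainKey(domain) {
    return "referrers." + domain.replace(/\./g, "_");
}

inc = {};
inc[domainKey("bing.com")] = 1;

// Upsert (third argument) covers the book's first referrer
db.book_refs.update({ book: 1 }, { $inc: inc }, true);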
# Sharding
## Sharding * Ad-hoc partitioning * Consistent hashing (Amazon DynamoDB) * Range-based partitioning (Google BigTable, Yahoo! PNUTS, MongoDB)
## Sharding in MongoDB * Automated management * Range-based partitioning * Convert to a sharded system with no downtime * Fully consistent
## Sharding a Collection
db.runCommand({ addshard : "shard1.example.com" });

db.runCommand({ enableSharding: "library" });

db.runCommand({
    shardCollection: "library.books",
    key: { _id : 1}
});
* Keys range from −∞ to +∞ * Ranges are stored as chunks
## Sharding Data by Chunks
db.books.save({ _id: 35, title: "Call of the Wild" });
db.books.save({ _id: 40, title: "Tropic of Cancer" });
db.books.save({ _id: 45, title: "The Jungle" });
db.books.save({ _id: 50, title: "Of Mice and Men" });
  • 1 chunk: (−∞, +∞)
  • 2 chunks: (−∞, 40), [40, +∞)
  • 3 chunks: (−∞, 40), [40, 50), [50, +∞)
Ranges are split into chunks as data is inserted
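One way to see the resulting chunks is to query the cluster metadata through `mongos` (a sketch):
// Chunk ranges and their owning shards live in the config database
db.getSiblingDB("config").chunks.find(
    { ns: "library.books" },
    { min: 1, max: 1, shard: 1 }
);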
## Adding New Shards
shard1
  • (−∞, 40)
  • [40, 50)
  • [50, 60)
  • [60, +∞)
## Adding New Shards
db.runCommand({ addshard : "shard2.example.com" });
shard1
  • (−∞, 40)
  • [50, 60)
shard2
  • [40, 50)
  • [60, +∞)
Chunks are migrated to balance shards
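The shell's status helper shows how chunks end up distributed (output shape varies by version):
// Summarizes shards, sharded collections, and chunk distribution
db.printShardingStatus();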
## Adding New Shards
db.runCommand({ addshard : "shard3.example.com" });
shard1
  • (−∞, 40)
shard2
  • [40, 50)
  • [60, +∞)
shard3
  • [50, 60)
(Diagrams: sharding_cache-1.png and sharding_cache-2.png)
## Sharding Components * `mongos` * Config servers * Shards: `mongod` processes, replica sets
## Sharded Writes * Inserts: shard key required, routed * Updates and removes: shard key optional, may be routed or scattered
## Sharded Reads * Queries: routed by shard key, otherwise scatter/gather * Sorted queries: routed in order by shard key, otherwise distributed merge sort
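With `library.books` sharded on `_id` as above, the difference looks like this (a sketch using the documents inserted earlier):
// Routed: the shard key is in the query, so mongos targets a single shard
db.books.find({ _id: 50 });

// Scatter/gather: no shard key in the query, so every shard is asked
db.books.find({ title: "Of Mice and Men" });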
## Exercise #8 Users can upload images for books
images
  • image_id: ???
  • data: binary

The collection will be sharded by `image_id`. What should `image_id` be?
## Exercise #8: Solutions What's the best shard key for our use case? * Auto-increment (ObjectId) * MD5 of data * Time (e.g. month) and MD5
  • shard_key-1.png: Right-balanced Access
  • shard_key-2.png: Random Access
  • shard_key-3.png: Segmented Access
## Over-generalizations ### Normalization/Denormalization 1. Normalization reduces redundancy: the good principle of Don't Repeat Yourself (DRY) 2. Denormalization introduces redundancy: common practice, such as caching for performance
### Consistency techniques * Background or nightly process for eventual consistency * Look-aside cache for immediate consistency, etc.
## Over-generalizations ### The needs of your application * High consistency - yes: link, no: embed * High read performance - yes: embed, no: link * High write performance - yes: link, no: embed * High scaling - yes: link, no: embed
## Over-generalizations ### Relationships * one-to-one - yes: embed, no: ... * one-to-many - yes: embed, no: ... * many-to-many - yes: link, no: ...
## Over-generalizations ### The bad news There is no magic formula (that requires no thought). ### The good news Flexible document schema to match your application.
# Summary * Schema design is different in MongoDB. * Basic data design principles apply. * It's about your application. * It's about your data and how it's used. * It's about the entire lifetime of your application.
# Thanks! ### Questions?