MongoDB 2012 Notes
File and Data Structures
- Key names are stored in each BSON document, so keep key names short.
- Data files are preallocated, doubling in size each time.
- Files are accessed using memory mapped files at the OS level.
- fsync’d every 60 seconds.
- Should use 64-bit systems to support the memory mapped files.
Journaling (introduced in 1.8, default in 2.0+)
- Write-ahead log
- Ops written to journal before memory mapped regions
- Journal flushed every 100ms or 100 MB written
- db.getLastError({j:true}) to force a journal flush
- Stored in the /journal subdirectory
- 1 GB files, rotated (only ops that have not yet been fsync’d are really needed)
- Some slowdown on high-throughput systems; the directory can be symlinked to another drive.
- On by default on 64-bit systems.
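A minimal shell sketch of forcing a journal flush after a write (the collection and document are made up, and the runCommand form is shown for clarity):
    db.orders.insert({ sku: "abc-123", qty: 1 })
    db.runCommand({ getlasterror: 1, j: true })  // returns once the write is in the journal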
When to use
- Single-node deployments
- Replica sets: on at least 1 node
- For large data sets.
Fragmentation
- Files get fragmented over time if document sizes change or documents are deleted.
- Collections that see a lot of resizes get a padding factor to help reduce fragmentation.
- The free list needs improvement:
  - 2.0 reduced scanning to a reasonable amount.
  - 2.2 will change this further.
Compaction
- 2.0+ compact command
  - Only needs 2 GB of extra space.
  - Offline operation (another good reason to run replica sets).
- Safe mode: waits for a round trip using getLastError; with this call you can specify how safe you want the data to be.
- Dropping a collection doesn’t free the data file; dropDatabase does. Sometimes it makes sense to create and drop databases.
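A minimal sketch of the two commands above (the collection name and write-concern values are made up):
    db.runCommand({ compact: "orders" })             // 2.0+: blocks the database, so run offline or on a secondary
    db.runCommand({ getlasterror: 1, w: 2, wtimeout: 5000 })  // safe mode: wait for 2 nodes, up to 5 s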
Index and Query Evaluation
@mschireson
- indexes are lists of values associated with documents
- Stored in a B-tree.
- Required for geo queries and unique constraints.
- Ascending/descending really only matters on compound indexes.
- null == null for unique indexes; you can drop duplicates on create (see the sketch after this list).
- Index creation is blocking unless you pass {background: true}; even so, try to do it off-peak.
- When dropping an index, you need to use the same document you used to create it.
- The $where operator doesn’t use indexes.
- Regexps starting with /^ will use an index.
- Indexes are used for updates and deletes
- Compound indexes can be used to query for the first field and sort on the second
- Only uses one index at a time
- Limited index uses:
  - $ne uses the index, but doesn’t help performance much.
  - $not
  - $where
  - $mod: the index only limits the scan to numbers.
  - Range queries only help somewhat.
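A shell sketch of a few of these points (collection and field names are made up; ensureIndex is the 2012-era API):
    db.users.ensureIndex({ email: 1 }, { unique: true, dropDups: true })      // drop duplicates on create
    db.events.ensureIndex({ user_id: 1, created: -1 }, { background: true })  // non-blocking build
    db.events.find({ user_id: 42 }).sort({ created: -1 })  // compound index: query the first field, sort on the second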
GEO
- Created using "2d" indexes.
- $near
  - Results sorted nearest to farthest.
- $within
  - $within with $polygon
- Can be used in compound queries.
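A sketch of the geo features above (the collection, field, and coordinates are all made up):
    db.places.ensureIndex({ loc: "2d" })
    db.places.find({ loc: { $near: [50, 50] } }).limit(10)  // nearest to farthest
    db.places.find({ loc: { $within: { $polygon: [[0, 0], [0, 10], [10, 10], [10, 0]] } } })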
Sparse Indexes
- Only store entries for documents that have the indexed field; results won’t include documents with null in that field.
- Can be both sparse and unique.
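For example (a minimal sketch, assuming a made-up "twitter" field that only some documents have):
    db.users.ensureIndex({ twitter: 1 }, { sparse: true, unique: true })  // unique only among docs that have the field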
Covering Indexes
- The index contains all fields in both the query and the results, so no document lookup is needed.
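A sketch of a covered query (names are made up; _id must be excluded from the projection since it isn’t in the index):
    db.users.ensureIndex({ email: 1 })
    db.users.find({ email: "a@example.com" }, { email: 1, _id: 0 })  // answered entirely from the index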
Limits and Trade-offs
- Max of 64 indexes per collection.
- Indexes can slow down inserts and updates.
- A compound index can be more valuable, since it can serve multiple queries.
- You can force an index or a full scan.
- Use sort() if you really want sorted data.
- db.c.find(…).explain() => see what’s going on.
- db.setProfilingLevel() to record slow queries.
- Indexes work best when they fit in RAM
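A sketch of the tuning helpers above (collection and field names are made up):
    db.c.find({ x: 10 }).hint({ x: 1 })         // force a specific index
    db.c.find({ x: 10 }).hint({ $natural: 1 })  // force a full collection scan
    db.c.find({ x: 10 }).explain()              // see what’s going on
    db.setProfilingLevel(1, 100)                // profile queries slower than 100 ms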
Replica Set
- One node is always the primary; the others are secondaries.
- The primary is chosen by election.
- Automatic fail-over and recovery.
- Reads can go to the primary or a secondary.
- Writes always go to the primary.
- Replica sets are 2+ nodes; at least 3 is better.
- When a failed node comes back, it recovers by fetching the missed updates, then rejoins as a secondary.
- Setup:
  mongod --replSet <setname>
  cfg = { _id: "<setname>", members: [{ _id: 0, host: "<host:port>" }] }
  use admin
  rs.initiate(cfg)
- The rs object holds the replica set commands; they need to be issued on the current primary.
- rs.status()
- Strong consistency is only available when reading from the primary.
- Reads on the secondaries are eventually consistent.
- Durability options (set by the driver):
  - Fire and forget: you won’t know about failures due to unique constraints, a full disk, or anything else.
  - Wait for error (recommended).
  - Wait for journal sync.
  - Wait for fsync (slow).
  - Wait for replication (really slow).
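From the shell, these levels map roughly onto getLastError options (a sketch; driver APIs differ):
    db.runCommand({ getlasterror: 1 })               // wait for error
    db.runCommand({ getlasterror: 1, j: true })      // wait for journal sync
    db.runCommand({ getlasterror: 1, fsync: true })  // wait for fsync (slow)
    db.runCommand({ getlasterror: 1, w: 2 })         // wait for replication to 2 nodes (really slow)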
- Nodes can be given priorities, which helps ensure a specific machine is primary.
  - Priority 0 nodes will never become primary.
  - When a higher-priority machine comes back online, it will force an election.
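A sketch of adjusting priorities (the member index is made up):
    cfg = rs.conf()
    cfg.members[0].priority = 2  // a priority of 0 would make this node never eligible
    rs.reconfig(cfg)             // may trigger an election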
- Secondaries can have a slave delay.
- Replica set members can be tagged with properties, and you can specify tags when waiting for replication.
- Arbiter members:
  - Don’t hold data.
  - Vote in elections.
  - Used to break ties.
- Hidden members:
  - Not seen by the clients.
- Data for replication is stored in the oplog, a capped collection; all secondaries have an oplog too.
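You can peek at the oplog from the shell (a sketch; oplog.rs is the replica-set oplog collection in the local database):
    use local
    db.oplog.rs.find().sort({ $natural: -1 }).limit(1)  // the most recent replicated op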
Scaling
- Vertical scaling is limited.
- Horizontal scaling is cheaper; you can scale wider rather than higher.
- A vertical setup can be a single point of failure, and can be hard to back up and maintain.
- Replica sets are one type of horizontal scaling:
  - They can scale reads, but not writes; eventual consistency is the biggest downside.
  - Replication can overwhelm the secondaries, reducing performance anyway.
- Why shard?
  - Distribute the write load.
  - Keep the working set in RAM by making multiple machines act like one big virtual machine.
  - Consistent reads.
  - Preserve functionality: with range-based partitioning, most (all?) of the query operators are available.
- Sharding design goals:
  - Scale linearly.
  - Increase capacity with no downtime.
  - Transparent to the application/clients.
  - Low administration to add capacity.
  - No joins or transactions.
  - Inspired by BigTable/PNUTS; read the PNUTS paper.
- Basics:
  - Choose how you partition data.
  - Convert from a single replica set to sharding with no downtime.
  - Full feature set.
  - Fully consistent by default.
  - You pick a shard key, which is used to move ranges of data to a shard.
Architecture
- Shard: each shard is its own replica set, for automated fail-over.
- Config servers: store the metadata about which shard holds each partition of the data.
  - Not a replica set; writes to the config servers are done as a transaction by the mongod/mongos.
- Mongos: uses the config servers to know which shard to use for the data/query.
  - Clients talk to the mongos servers.
  - Chunk = collection, min key, max key, shard.
  - Chunks are logical, not physical.
  - A chunk is 64 MB; once it hits that size a split happens, creating new chunks, and data may be migrated to another shard.
Shard Keys
- Shard keys are immutable.
- Choosing a key:
  - _id? It’s incremental, so all writes go to a single shard.
  - hash? It’s random, which partitions well but isn’t great for queries.
  - user_id? Kinda random and useful for lookups; all data for user_id X will be on one shard.
    - However, you can’t split on this if one user is a really heavy user.
  - user_id + md5(x)? This is the best option.
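A sketch of sharding a collection on a compound key like that last option (the database, collection, and field names are made up):
    db.adminCommand({ enablesharding: "app" })
    db.adminCommand({ shardcollection: "app.events", key: { user_id: 1, hash: 1 } })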
- Other notes:
  - Add capacity well before it’s needed, at latest at 70% of operating capacity, so the data can migrate over time.
  - Understand the working set in RAM.
  - If machines are too small, admin overhead goes up.
  - If machines are too big, sharding doesn’t happen smoothly.
MMS – MongoDB Monitoring Service