mongo_sync
MongoDB real-time synchronizer, similar to py-mongo-sync.
Features
- Supports full and concurrent synchronization at the database and collection level. You can use --collection-concurrent to define how many threads sync a database, and --doc-concurrent to define how many threads sync a collection (see the example after this list).
- Supports oplog-based synchronization, so incremental data can be synchronized in real time.
- Supports daily log rotation via the --log-path option. Otherwise, log information is written to stdout.
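For example, a db_sync invocation that sets both concurrency options and writes logs to a file might look like this (the thread counts, URIs, database name, and log path below are illustrative):
db_sync --src-uri "mongodb://localhost:27017" --oplog-storage-uri "mongodb://localhost:27018" --target-uri "mongodb://localhost:27019" --db test_db --collection-concurrent 4 --doc-concurrent 8 --log-path ./db_sync.log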
Support
MongoDB 3.6+ (because the official MongoDB driver only supports MongoDB 3.6+)
Install
The recommended way to install mongo_sync is with cargo:
cargo +nightly install mongo_sync
You can also download a released binary.
Running tests
To run the integration tests, set SYNCER_TEST_SOURCE to a testing MongoDB URI; otherwise mongodb://localhost:27017 will be used.
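For example, assuming the same nightly toolchain as the install step (the URI is illustrative):
SYNCER_TEST_SOURCE="mongodb://localhost:27017" cargo +nightly test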
Example
To run the synchronizer, you need to start oplog_syncer first to keep a real-time sync of the MongoDB oplog. Then you can run db_sync to sync a database in real time.
oplog_syncer
./target/release/oplog_syncer --src-uri "mongodb://localhost:27017" --oplog-storage-uri "mongodb://localhost:27018/"
db_sync
db_sync --src-uri "mongodb://localhost:27017/?authSource=admin" --oplog-storage-uri "mongodb://localhost:27018/?authSource=admin" --target-uri "mongodb://localhost:27019" --db test_db
Note that the --oplog-storage-uri
in oplog_syncer and db_sync must be the same.
Usage help
oplog_syncer
USAGE:
oplog_syncer [OPTIONS] --src-uri <src-uri> --oplog-storage-uri <oplog-storage-uri>
FLAGS:
-h, --help Prints help information
-V, --version Prints version information
OPTIONS:
--log-path <log-path>
log file path, if not specified, all log information will be output to stdout
-o, --oplog-storage-uri <oplog-storage-uri> target oplog storage uri
-s, --src-uri <src-uri> source database uri, must be a mongodb cluster
db_sync
USAGE:
db_sync [OPTIONS] --src-uri <src-uri> --target-uri <target-uri> --oplog-storage-uri <oplog-storage-uri> --db <db>
FLAGS:
-h, --help Prints help information
-V, --version Prints version information
OPTIONS:
--collection-concurrent <collection-concurrent> how many threads to sync a database
-c, --colls <colls>...
collections to sync, default sync all collections inside a database
-d, --db <db> database to sync
--doc-concurrent <doc-concurrent> how many threads to sync a collection
--log-path <log-path>
log file path, if not specified, all log information will be output to stdout
-o, --oplog-storage-uri <oplog-storage-uri>
mongodb uri which save oplogs, it's saved by `oplog_syncer` binary
-s, --src-uri <src-uri> source mongodb uri
-t, --target-uri <target-uri> target mongodb uri
The basic architecture diagram
┌───────────────┐
│ target db │
└───┬───────────┘
│
xxxx ▼ xxxxxxx
xx ┌───────┐ x
xx │db_sync│
x └───────┘ x
xxxxxx ▲ ▲ x
xx │ │ xxxx
│ │
│ │
Full dump │ │Incr dump
│ │(Real time)
│ │
│ │
│ │
│ │ ┌─────────────────────┐
│ └─────────┤oplog storage db │
│ └──────▲──────────────┘
│ xxxxxxx │ xxxxx
│ x ┌──────┴──────┐x
│ x │Oplog syncer │x Sync oplog from source cluster
│ x └──────▲──────┘x to oplog storage in real time
│ xxxxxxxxx│ xxxxxxx
│ │
│ │
│ │
│ ┌──────┴───────┐
└────────────┤Source cluster│
└──────────────┘
According to the diagram, you can see that there are 2 basic programs provided by mongo_sync:
- oplog syncer: syncs the MongoDB cluster's oplog to the target oplog storage db.
- db sync: syncs data from the source cluster to the target db.
Benchmark
It's not a strict benchmark test; I just tested it manually.
Scenario:
When the source cluster inserts 50,000 records, how long does it take the target_db to synchronize these 50,000 new inserts?
My testing result:
db_sync takes about 50 seconds to sync these updates, while py-mongo-sync takes about 225 seconds. In general, it's about 4.5x faster than py-mongo-sync.
Please note that 50 seconds is not an accurate figure; it highly depends on your database and the performance of the machine you run on.
How does the core work?
- oplog_syncer just uses a tailable cursor to read MongoDB oplogs in real time (see the sketch after this list).
- The db_sync full dump just uses multiple threads with find to read data.
- The db_sync incremental dump uses the oplog to sync incremental updates (CRUD operations and database commands), which is why the source database must be a cluster.
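As an illustration of the first point, here is a minimal sketch of tailing the oplog with a tailable cursor, assuming the official Rust driver's sync API (the mongodb crate with the "sync" feature). The URI and the printed field are illustrative; this is not the actual mongo_sync code:

```rust
use mongodb::{
    bson::{doc, Document},
    options::{CursorType, FindOptions},
    sync::Client,
};

fn main() -> mongodb::error::Result<()> {
    let client = Client::with_uri_str("mongodb://localhost:27017")?;
    // The oplog lives in the `local` database of a replica set,
    // which is why the source must be a cluster.
    let oplog = client.database("local").collection::<Document>("oplog.rs");

    // A tailable "await" cursor stays open and waits for new oplog entries.
    let options = FindOptions::builder()
        .cursor_type(CursorType::TailableAwait)
        .build();

    for entry in oplog.find(doc! {}, options)? {
        let entry = entry?;
        // A real syncer would persist `entry` into the oplog storage db here.
        println!("op: {:?}", entry.get("op"));
    }
    Ok(())
}
```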
Notes
- In the incremental state, for now it only supports syncing the following commands (which is enough for my personal use):
- rename collection
- drop collection
- create collection
- drop indexes
- create indexes
- I haven't tested MongoDB sharding as the target, but it should work.
- While oplog_syncer is running, the oplog storage db will create and use a database named source_oplog, with a collection also named source_oplog. For now this is hardcoded.
- While db_sync is running, the target database will create a new collection named oplog_records, which saves the latest oplog timestamp applied to the database (see the sketch after this list).
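As a small illustration of the last note, this sketch reads back whatever db_sync stored in the target database's oplog_records collection, assuming the same sync driver as above. The database name test_db matches the example earlier, and the document layout is not documented here, so it just prints the raw record:

```rust
use mongodb::{bson::{doc, Document}, sync::Client};

fn main() -> mongodb::error::Result<()> {
    // Target URI from the db_sync example above.
    let client = Client::with_uri_str("mongodb://localhost:27019")?;
    let records = client
        .database("test_db")
        .collection::<Document>("oplog_records");

    // Print the stored record; the latest applied oplog timestamp lives in here.
    if let Some(record) = records.find_one(doc! {}, None)? {
        println!("{:#?}", record);
    }
    Ok(())
}
```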