I'm trying to simulate a production environment inside Docker. While testing and observing the behaviour, I encountered two cases I don't understand:
Case 1:
Two of the three members become "ONLINE" in "performance_schema.replication_group_members" after startup; one fails, probably because of name resolution problems during the first minutes in Docker. The startup scenario is the same as in Docker's mysql/mysql-gr container, with two differences: MySQL version 5.7.17 is used and Multi-Primary mode is enabled.
- The failed member sees only itself in "performance_schema.replication_group_members" and reports itself as "OFFLINE" or "ERROR". I've seen both: "OFFLINE" if the compromised member can't contact the other members, "ERROR" if it was expelled by the other members for network reasons.
-> Since the network was functional the whole time, writing to the offline member is possible and creates a split-brain scenario.
-> START GROUP_REPLICATION succeeds but makes the split brain worse: the former "OFFLINE" member becomes "ONLINE" and creates its own replication group.
Is allowing writes in this state a bug or working as intended? If it is intended, what is the best practice for preventing writes to this member at the database layer while keeping Multi-Primary mode? A stricter blocking mechanism would probably be helpful.
Case 2:
Let's kill one of the remaining two members.
- The last living member detects the failure of the other member and sets that member's state to "UNREACHABLE" in "performance_schema.replication_group_members", but keeps its own state as "ONLINE".
-> Writing to the last living member is not possible (and that's correct).
Why does its own state remain "ONLINE" when it can no longer accept writes?
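To make both cases concrete, here is a minimal sketch of the kind of external check I could imagine running in front of each member (for example as a proxy health check). It is only an illustration under my own assumptions: the function name `may_accept_writes`, the `expected_group_size` parameter and the majority rule are hypothetical and not something Group Replication provides. The input is meant to be the MEMBER_ID and MEMBER_STATE columns read locally from performance_schema.replication_group_members, and `local_id` the server's own @@server_uuid.

```python
from typing import Iterable, Tuple


def may_accept_writes(
    local_id: str,
    members: Iterable[Tuple[str, str]],
    expected_group_size: int,
) -> bool:
    """Decide whether a client should be allowed to write to this member.

    members: (MEMBER_ID, MEMBER_STATE) rows as seen by the local server,
             e.g. from
             SELECT MEMBER_ID, MEMBER_STATE
             FROM performance_schema.replication_group_members;
    expected_group_size: configured number of members (assumption: known
             to whoever runs this check, e.g. 3 in my setup).
    """
    states = dict(members)

    # Case 1: the local member is OFFLINE/ERROR (or not listed at all).
    if states.get(local_id) != "ONLINE":
        return False

    # Case 1 after START GROUP_REPLICATION, and Case 2: the local member is
    # ONLINE, but it does not see a majority of the expected group as ONLINE.
    online = sum(1 for state in states.values() if state == "ONLINE")
    return online > expected_group_size // 2
```

With three expected members, this rejects the isolated member from case 1 (whether it reports "OFFLINE" or has formed its own one-member group) as well as the last living member from case 2, and accepts a member that sees two of three members "ONLINE".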
Thanks and br,
Alex