One of the hurdles that we frequently run into when developing is the question: “did this actually improve the product?” This can lead to some very interesting discussions about where the actual bottleneck is. One of the ways is to have a specific test harness that will perform the same actions in a predictable environment. If the data improves after changes are made, you can demonstrate that it has improved.
So I ended up making a test harness for the network. Some of the questions I had going into this test was:
- Is my P2P2 library an improvement over the old library?
- Does it make sense to make it more efficient if the limiting factor is the network?
- much EPS/TPS can Factom theoretically handle?
Very early on, I discovered that simulating a network on a single machine was just not a realistic scenario. Sending messages through a real network stack is very different and will lead to bottlenecks much faster than just piping it from app to app. I settled for the hardware I have at my disposal:
- Host: A high-end Ryzen 3700x rig used as the machine to create traffic on the network
- NUC1 & NUC2: Two similar NUCs with i3-7100Us
- PI2 & PI4: Two Raspberry PI 4’s with 2GB RAM and 4GB RAM respectively
The five machines were joined together via a simple gigabit switch.
The drawback of this setup is the low number of nodes. I ended up connecting every node to each other with a fanout higher than that, meaning every node will send every packet to every other node in the system. This obviously isn’t a realistic scenario for the Factom Protocol but it will still allow us to make a comparison between libraries in an identical environment.
The idea is to test just the network libraries without factomd slowing it down. This was accomplished by listening to real MainNet traffic of ~4 million messages captured over ~69 hours. It looks something like this:
Next, we separate those by what stays the same all the time, and what changes during load. Every federated node sends one EOM+ACK every minute, with every tenth one being a DBSig instead. Each audit node sends a HeartBeat every minute. These only scale with the size of the authority set but not load. A Factoid transaction also sends an ACK. An entry or a chain is created with one Commit Entry/Chain + ACK, and one Entry Reveal + ACK.
- EPS: Entries per second, where an entry is “a factoid transaction, new chain, or entry”.
- TPS: Transactions per second, where a transaction is defined as “an entry in the processlist”.
- MPS: Messages per second, where a message is defined as a message sent on the network. This also includes “duplicate messages” and P2P library messages.
For example, 1 EPS from entries = 2 TPS = 4 MPS, whereas 1 EPS from factoid transactions = 1 TPS = 2 MPS.
According to the data, 99.1% of EPS comes from entries, 0.7% comes from new chains, and 0.1% from factoid transactions. The average message sizes are also known, with, for example, an entry being around 130 (Commit) + 540 (Reveal) bytes big.
With this data, the app can simulate the messages sent on the Factom Network without running factomd itself. The client logic is rather simple:
- If the message arrives is a duplicate, ignore it
- If it’s a broadcast message (ACK, EOM/DBSig, Heartbeat, Commit, Reveal, Factoid Transaction), send a copy to peers
- If it’s a missing message or dbstate request, send a response to that peer only
That’s it. All messages were randomly generated bytes, using only 1 byte for identifying the message type.
Note: The test app I wrote is open-source, available on GitHub, though this is not meant to be a user-friendly setup. If you want to repeat these tests, please contact me for details.
Note 2: We typically don’t count the TPS from EOMs, even though they’re also entries in the processlist since they are constant and don’t scale with usage. With 25 federated nodes, the TPS from EOMs is roughly 0.83 TPS.
I ran the same test four times, once for the old p2p library, and three times for the new P2P2 library using the protocols V9 (old), V10 (slimmed down), and V11 (protobuf). The test starts with 1 EPS and ramps up by 500 every 30 seconds. Altogether there are about 15 of these segments, going from 1 up to 7,000 EPS in 500 EPS increments.
The full data is available as a spreadsheet with each data set in its own sheet: https://docs.google.com/spreadsheets/d/1V83kmfMZpRpEBqr7HttjBpXmssPAdSaeP2ZamZZRRh0/edit?usp=sharing
EPS by Protocol
The first graph shows the EPS recorded by the Host computer, which has the best hardware in the network:
Up to 1,000 EPS, everything is pretty stable with only minor fluctuations in the old library. Starting at around 1,500 EPS the old library’s fluctuations start getting bigger and bigger, with the P2P2 protocols showing lesser signs of destabilization. At 2,000 EPS, the machine stops being able to keep up with the demand. At 7,000 EPS, the old library suffered an outage due to unknown reasons (though it recovered on its own after ~20 seconds, which is not shown on graph).
The biggest conclusion that can be drawn from this graph is that the P2P2 library is able to handle traffic in a much smoother fashion, rather than having big spikes. The reason for this has to do with the way golang handles channels, and anyone interested in a more technical discussion of this improvement can read the conversation between Luap and myself on GitHub.
Let’s take a look at the same data, recorded by the worst hardware in the network, a Raspberry PI 4 with 2GB of ram:
It starts similar to the Host machine but both libraries start to become unstable, including dips down to zero. The difference in hardware is clearly visible and has an impact.
The EPS graph itself doesn’t tell us why the library wasn’t able to keep up with the data, so let’s take a look at what the machines tried to send. The uprate specified here is in bytes per second. 10,000,000 Bps is ~80 Mbit or 10 MB/s.
One thing to note here is that the old library doesn’t tally the byte rate accurately, it only sums up the payload of the messages but doesn’t account for the overhead. The P2P2 V9 protocol is identical to the old library for network encoded message size. For that reason, the old library has been omitted from this graph.
The first observation is that for the same amount of target EPS, the V9 protocol tries to send out significantly more data. On average, V11 adds ~3.7% overhead, V10 adds ~4.4% overhead, and V9 adds 29.3% overhead. This was the reason I added V10, which is just V9 with only the essential information.
This chart stacks the uprate of V10 together, representing the total update of all machines. As mentioned in the setup, I used a Gigabit switch to connect all 5 machines. At 1,500 EPS when things start derailing, the combined network use is 240 Mbit. After that, the system fails to keep up with EPS and the growth of Mbit slows down dramatically. Near 7,000 EPS the combined update is ~500 Mbit, which likely hits the physical limits of the switch itself.
This points toward the switch itself being the first bottleneck we encounter. To test this, let’s look at:
For this chart, I ran 5 instances of the test app on my host machine, connected to each other via the operating system’s loopback. In this test, all five apps kept up with the specified EPS up to 5,500 EPS. After that, the CPU wasn’t able to keep up with demand anymore.
The message upload rate by protocol also yielded interesting results:
V11 sets itself apart quite clearly, sending almost twice as many messages as its counterparts. If we now graph the relationship between uprate and mps, ie the ratio of how many bytes are sent for each message, we end up with:
Once again, V11 dramatically outperforms the other two, using nearly half the bandwidth to send the same amount of messages. Yet in the graphs above, V11 didn’t scale any better than the other protocols? Why is that? Likely V11’s lower CPU overhead allows the lower end hardware machines to encode more messages, though unfortunately these messages are duplicates of existing ones. Roughly 4/5th of messages are duplicates, V11 may be able to fit more of those into the same bandwidth.
EPS by Machine
So far we mostly looked at the EPS of the host machine but let’s compare the EPS of all machines for the old library, V10, and V11:
The noticeable difference is that both V10 and V11 manage to attain higher EPS on each machine, with V11 having the added effect of not spiking as much on lower-end hardware (green and magenta).
This effect becomes even more apparent if we stack the EPS together:
Both the V10 and V11 tests reached a combined EPS of ~20,000, or 4,000 EPS average per machine, whereas the old library only reached ~15,000, or 3,000 EPS average per machine. This is due to the additional hardware and bandwidth strain of the old library and V9 on the lower end nodes since the Host EPS stayed roughly the same.
This was a lot of data to analyze but there are two results that jump out in every graph: One is that P2P2 performs better on hardware with both higher message throughput and less “spiky” message delivery. V10 and V11 both have demonstrable upsides of reducing the amount of bandwidth nodes will need. The other is that EPS is very much limited by hardware constraints of the network stack, not the P2P code itself.
Only one of the questions I set out hasn’t been answered yet: How much EPS/TPS can Factom theoretically handle? Unfortunately, I still don’t know the answer. Five nodes aren’t enough to simulate the 150+ node Factom network, so this will have to be something explored at a later date.
I also had some ideas to further streamline the network. Even at low target EPS rates (around 10), messages would queue up for delivery. This presents an opportunity to bundle messages together, sending a batch of messages rather than single messages. This will reduce the load on the network layer and more efficiently transmit large amounts of data. At current EPS rates of the Factom Protocol this is not necessary but it will be if we want to scale well into the triple-digit EPS.
At some point, I’d like to refine the test app and improve output and data collection so these graphs can be automated, then test using a 32 node or more large scale environment. However, as a small scale test, this has already yielded useful data to directly compare very complex code.