Improve systests #6494

Open
jellonek opened this issue Nov 26, 2024 · 5 comments


@jellonek
Contributor

Description

At the moment the systests contain a lot of spaghetti code, which seems to get extended from time to time with another piece glued into it. As a result:

  • instead of specifying the number of smesher pods (named in comments as "containers"), it has to be calculated from the cluster size reduced by bootnodes, old nodes, and possibly other pods. When we attempted to use two pods in place of one "smesher" (a networking part and a stateless worker part, as in the node split), the usage of the cluster size (and also the keys table in the cluster context) went wild
  • pod definitions, which are basically configuration, are constructed from code, while they should more probably be constructed from parametrised YAML templates rendered at runtime (that should improve the readability of test cases and e.g. provide an option to have a different configuration per smesher pod, instead of using the same config for all smesher pods)
@fasmat
Member

fasmat commented Nov 26, 2024

Here are my 2 cents to the current problems with system tests:

There should be a single test that asserts all the functional requirements we have for go-spacemesh:

  • nodes that successfully publish ATXs are eligible to participate in Hare and receive rewards; nodes that don't, receive no rewards - currently covered by TestSmeshing
    • nodes that equivocate (e.g. by double publishing or publishing an invalid PoST) lose their eligibility for rewards - currently covered by TestEquivocation and TestPostMalfeasanceProof
    • nodes agree on the state of eligibilities - currently covered by multiple tests (e.g. TestSmeshing)
  • Spawning accounts and sending funds - currently covered by TestSmeshing
  • vesting accounts gain access to their tokens based on the schedule they are supposed to get access to - currently covered by TestSmeshing
  • check if beacon and active set generation works as expected - at the moment that happens via bootstrapping. The test should behave as we do on mainnet and should be updated once we don't need bootstrapping any more
  • adding and removing nodes from the network works, i.e. an added node is able to sync from scratch and becomes eligible for rewards after publishing their first (valid) ATX. And removing multiple (up to 60% of) eligible nodes doesn't cause the network to halt or have other problems - currently covered by TestAddNodes and TestFailedNodes
  • if one (or multiple) poets fail the network still operates as expected, as long as at least one PoET is available - currently covered by TestsPoetsFailures

Non-functional requirements probably each need their own test:

  • Test if the network heals from a state of partition (TestPartition_50_50 and TestPartition_30_70)
  • Test if beacon generation works even in a fallback state (TestFallback)
  • Test that a timeskew between nodes doesn't cause the network to disagree on timings (TestShortTimeskew)

I am unsure if this needs its own test or can be part of TestSmeshing:

  • Checkpoint functionality (currently tested by TestCheckpoint)

The basic idea is that the big test spawns a network at the start and then runs multiple tests (partially at the same time) on the same network with the same nodes, instead of having to tear down and set up a new network multiple times.

Additionally, we should consider covering all supported modes of smeshing during the one big test: some nodes should use supervised smeshing, some remote smeshing, and some multi-smeshing (ideally with at least one of those marrying their identities).

@ivan4th
Contributor

ivan4th commented Nov 26, 2024

This probably should be taken into account:
https://metoro.io/blog/go-production-performance-gotcha-gomaxprocs
TL;DR: GOMAXPROCS may default to an unreasonable value for a containerized workload running on a machine together with a lot of other workloads. Not exactly sure if it is relevant here, but probably worth checking

@poszu
Contributor

poszu commented Nov 27, 2024

@fasmat It makes sense and would speed up the tests significantly 👍, but we need to be careful to avoid creating "subtests" that run in parallel on a single network and depend on each other, as it would be a nightmare to debug if things go south. I think it's possible to have a test structure like this:

func TestNetwork(t *testing.T) {
  // set up a cluster
  t.Run("sending transactions", func(t *testing.T) { /* send transactions and observe account changes */ })
  t.Run("poet failures", func(t *testing.T) { /* kill a poet and observe that ATXs are still produced */ })
  // ... more subtests
}

The problem is that once things stop working because the poet was killed, the other test(s) would most likely start to fail too, and telling why (what the root of the problem is) might be difficult.

@jellonek
Contributor Author

@ivan4th we have that hardcoded to 4 in

corev1.EnvVar().WithName("GOMAXPROCS").WithValue("4"),

@fasmat
Copy link
Member

fasmat commented Nov 28, 2024

Also regarding GOMAXPROCS: I changed it to the suggestion in the blog @ivan4th linked, but didn't have the feeling it changed anything in terms of stability - multiple runs were still throttled, so I changed it back to the hardcoded 4.

Projects
Status: 📋 Backlog

5 participants