Comparing actual outputs against expected ones is the ideal situation, IMHO. My own preference is for property-checking; but hard-coding a few well-chosen values is also fine.
That's made easier when writing (mostly) pure code, since the output is all we have (we're not mutating anything, or triggering other processes, etc. that would need extra checking).
I also think it's important to make sure we're checking the values we actually care about; since those might not be the literal return value of the "function under test". For example, if we're testing that some function correctly populates a table cell, I would avoid comparing the function's result against a hard-coded table, since that's prone to change over time in ways that are irrelevant. Instead, I would compare that cell of the result against a hard-coded value. (Rather than thinking about the individual values, I like to think of such assertions as relating one piece of code to another, e.g. that the "get_total" function is related to the "populate_total" function, in this way...).
The reason I find this important, is that breaking a test requires us to figure out what it's actually trying to test, and hence whether it should have broken or not; i.e. is it a useful signal that requires us to change our approach (the table should look like that!), or is it noise that needs its incidental details updated (all those other bits don't matter!). That can be hard to work out many years after the test was written!
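A minimal sketch of that in Python (the `populate_report` function and its table layout are hypothetical stand-ins for whatever the real code builds):

```python
def populate_report(items):
    # Hypothetical function under test: builds a table as a list of rows,
    # ending with a total row.
    total = sum(price for _name, price in items)
    rows = [["item", "price"]]
    rows += [[name, price] for name, price in items]
    rows.append(["total", total])
    return rows

def test_total_cell():
    report = populate_report([("apples", 3), ("pears", 4)])
    # Assert only on the cell we actually care about, relating the total row
    # to the inputs; layout changes elsewhere won't break this test.
    assert report[-1] == ["total", 7]
```

Comparing the result against a full hard-coded table would instead couple the test to every incidental layout detail.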
Also agree. There are also diminishing returns with test cases, which is why I focus mainly on what I do not want to fail. The goal is not really to prove that my code works (formal verification is the tool for that), but to verify that certain failure cases will not happen. If one does, the code is not merged in.
https://github.com/srid/nixci
Is this the project or is this a completely different Nix based CI/CD tool?
I can't find a Github or anything on the website.
> One dreaded and very common situation is when a failing CI run can be made to pass by simply re-running it. We call this flaky CI.
> Flaky CI is nasty because it means that a CI failure no longer reliably indicates that a mistake was caught. And it is doubly nasty because it is unfixable (in theory); sometimes machines just explode.
> Luckily flakiness can be detected: Whenever a CI run fails, we can re-run it. If it passes the second time, we are sure it was flaky.
One of the specialties that I have (unwillingly!) specialized in at my current company is CI flakes. Nearly all flakes, well over 90% of them, are not "unfixable", nor are they really some bogeyman that can't be understood.
The single biggest change I think we made that helped was having our CI system record the order¹ in which tests are run. Rerunning the tests, in the same order, makes most flakes instantly reproduce locally. Probably the next biggest reproducer is "what was the time the test ran?" and/or running it in UTC.
But once you get from "it's flaky" (failing seemingly at random) to "it fails 100% of the time on my laptop when run this way", it becomes much easier to debug, because you can re-run it, attach a debugger, etc. Database sort issues (SQL result order is not deterministic unless you ORDER BY), issues with database IDs (e.g., a test expects row ID 3, usually gets row ID 3, but some other test has bumped us to row ID 4²), timezones: those are probably the biggest categories of "flakes".
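For the database-sort category, a tiny illustration (sqlite3 here only because it's convenient; the point is the same in any SQL engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("carol",), ("alice",), ("bob",)])

# Without ORDER BY the row order is an implementation detail: it often looks
# stable locally, then changes under a different query plan or page layout,
# and the test flakes.
unordered = [row[0] for row in conn.execute("SELECT name FROM users")]

# Deterministic: a test can safely assert on this.
ordered = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY name")]
```

After the fix, `ordered` is always `['alice', 'bob', 'carol']`, regardless of insertion order or engine internals.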
While I know what people mean when they say "flake", as a word it usually just means "a failure mode I don't understand yet".
(Excluding truly transitory issues like a network failure interfering with a docker image pull, or something.)
(¹there are a lot of reasons people don't have deterministically ordered CI runs. Parallelism, for example. Our order is deterministic, b/c we made a value judgement that random orderings introduce too much chaos. But we still shard our tests across multiple VMs, and that sharding introduces its own changes to the order, as sometimes we rebalance one test to a different shard as devs add or remove tests.)
²this isn't usually because the ID is hardcoded; it is usually because, in the test, someone is unknowingly doing `assert Foo.id == Bar.id`. (The code is usually not straightforward about what the ID is an ID to.) I call this ID type confusion; it's basically weakly-typed IDs in languages where every ID is just some i32 type. FooId and BarId types would be better, and if I had a real type system in my work's lang of choice…
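A sketch of what those `FooId`/`BarId` wrapper types could look like, here in Python for illustration (the names come from the footnote; the approach translates to any language with nominal types):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FooId:
    value: int

@dataclass(frozen=True)
class BarId:
    value: int

# With bare ints, `assert foo.id == bar.id` can pass by coincidence whenever
# the database happens to hand out the same row number. With distinct wrapper
# types the cross-type comparison is always False, so the confusion fails
# loudly, and a type checker can flag it before the test even runs.
assert FooId(3) == FooId(3)
assert FooId(3) != BarId(3)
```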
A fairly large category of the flaky CI jobs I see is "dodgy infrastructure". For instance, one recurring type for our project, which I just saw fail this afternoon, is where a GitLab CI runner tries to clone the git repo from GitLab itself and gets an HTTP 502 error. We've also had issues with "the s390 VM that does CI job running is on an overloaded host, so mostly it's fine but occasionally the VM gets starved of CPU and some of the tests time out".
We do also have some genuinely flaky tests, but it's pretty tempting to hit the big "just retry" button when there's all this flakiness we can't control mixed in there.
> those are probably the biggest categories of "flakes".
Interesting. In my experience, it is always either a concurrency issue in the program under test or property-based tests (PBTs) finding some extreme edge case that was never visited before.
I think this can be generalised into saying that the purpose of tests is to fail. I've seen far too many tests that are written to pass. You need to write tests to fail.
> When it passes, it's just overhead: the same outcome you'd get without CI.
The outcome still isn't the same. CI, even when everything passes, enables other developers to build on top of your partially-built work as it becomes available. This is the real purpose of CI. Test automation is necessary, but only to keep things sane amid you continually throwing in fractionally-complete work.
This is stupidly obvious, but you'd be surprised how many people have the attitude that competent developers should have tested their code manually before making PRs, so you shouldn't need CI.
The biggest problem I've seen with CI isn't the failing part, it's what teams do when it fails. The "just rerun it" culture kills the whole point.
We had a codebase where about 15% of CI runs were flaky. Instead of fixing the root causes (mostly race conditions in tests and one service that would intermittently timeout), the team just added auto-retry. Three attempts before it actually reported failure. So now a genuinely broken build takes 3x longer to tell you it's broken, and the flaky stuff just gets swept under the rug.
The article's right that failure is the point, but only if someone actually investigates the failure instead of clicking retry.
The "just retry" approach is truly bothersome. I think it is at least partly an organizational issue, because it happens far more often when QA is a separate team.
I don't understand this at all. Why not just skip CI altogether if you're not interested in the results?
If a build flakes 10% of the time, auto-retrying means it is only reported when all three attempts fail, the 10%x10%x10% = 0.1% case, so it actually takes 100x longer for the flakiness to surface as a reported failure.
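The arithmetic, spelled out (assuming a 10% flake rate and three independent attempts):

```python
flake_rate = 0.10      # build spuriously fails 10% of the time
attempts = 3           # auto-retry: only report failure after 3 failed runs

# A spurious failure is only reported when every attempt flakes.
reported = flake_rate ** attempts   # ~0.001, i.e. 0.1% of runs

# Expected number of CI runs before the flakiness surfaces as a reported
# failure: 1/0.10 = 10 without retries, 1/0.001 = 1000 with them (100x longer).
runs_without_retry = 1 / flake_rate
runs_with_retry = 1 / reported
```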
Oversimplified clickbait. The purpose never changed from catching bad bugs before they are sent to prod. The goal of CI is to prevent the resulting problems from doing damage and requiring emergency repairs.
I don't really understand the point you're trying to make. I don't see anywhere in the post or the title a claim that the purpose changed, and the title is directly related to the content. In fact, it seems like you are just agreeing with the post.
I think people can get frustrated at CI when it fails, so they're explaining that that's the whole purpose of it and why it's actually a good thing.
I would personally frame it slightly differently than the author. Non-flaky CI errors: your code failed CI. Flaky CI errors: CI failed. To be clear, that's more precise but would never catch on, because people would simplify "your code failed CI" to "CI failed" over time; still, I don't think that stops it from being an interesting way to frame it.
This is of course true as a blanket "gotcha" headline, although I wouldn't call a failed test the CI itself failing. A real failure would be a false positive, a pass where there wasn't coverage, or a failure when there was no breaking change. Covering all of these edge cases can become as tiresome as maintaining the application in the first place (of course this is a generalization).
> a pass where there wasn't coverage
I always feel obliged to point out that we can have 100% coverage without making a single assertion (beware Goodhart's law)
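A contrived Python example of that failure mode (`classify` is made up for illustration):

```python
def classify(n):
    if n < 0:
        return "negative"
    return "non-negative"

def test_classify():
    # Executes both branches, so line coverage reports 100% for classify,
    # yet nothing is asserted: any behaviour change would still "pass".
    classify(-1)
    classify(1)
```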
True, but you can't have complete tests without 100% coverage. It's a necessary, but not a sufficient condition; as long as it doesn't become the sole goal, it's still a useful metric.
100% coverage is an EXPTIME problem.
> Whenever a CI run fails, we can re-run it. If it passes the second time, we are sure it was flaky.
Or you have a concurrency issue in your production code?
Then the test is still flaky. If there's a bug you want the test to consistently fail, not just sometimes.
But also a flaky test is a bug by itself.
The parent is talking about when the implementation is flaky, not the test. When you go to fix the problem under that scenario there is no reason for you to modify the test. The test is fine.
What you're describing is the everyday reality, but what you WANT is that if your implementation has a race condition, then you want a test that 100% of the time detects that there is a race condition (rather than 1% of the time).
If your test can deterministically result in a race condition 100% of the time, is that a race condition? Assuming that we're talking about a unit test here, and not a race condition detector (which are not foolproof).
> Assuming that we're talking about a unit test here
I think the categorisation of tests is sometimes counterproductive and moves the discussion away from what's important: What groups of tests do I need in order to be confident that my code works in the real world?
I want to be confident that my code doesn't have race conditions in it. This isn't easy to do, but it's something I want. If that's the case then your unit test might pass sometimes and fail sometimes, but your CI run should always be red because the race test (however it works) is failing.
This also hints at a limitation of unit tests, and why we shouldn't be over-reliant on them: often unit tests won't show a race. In my experience, it's two independent modules interacting that causes the race. The same can be true of a memory bug caused by a mismatch in the passing of ownership and who should be freeing, or any of the other issues caused by interactions between modules.
> I think the categorisation of tests is sometimes counterproductive
"Unit test" refers to documentation for software-based systems that has automatic verification. Used to differentiate that kind of testing from, say, what you wrote in school with a pencil. It is true that the categorization is technically unnecessary here due to the established context, but counterproductive is a stretch. It would be useful if used in another context, like, say: "We did testing in CS class". "We did unit testing in CS class" would help clarify that you aren't referring to exams.
Yeah, Kent Beck argues that "unit test" needs to bring a bit more nuance: That it is a test that operates in isolation. However, who the hell is purposefully writing tests that are not isolated? In reality, that's a distinction without a difference. It is safe to ignore old man yelling at clouds.
But a race detector isn't rooted in providing verifiable documentation. It only observes. That is what the parent was trying to separate.
> I want to be confident that my code doesn't have race conditions in it.
Then what you really WANT is something like TLA+. Testing is often much more pragmatic, but pragmatism ultimately means giving up what you want.
> often unit tests won't show a race.
That entirely depends on what behaviour your test is trying to document and validate. A test validating properties unrelated to race conditions often won't consistently show a race, but that isn't its intent, so there would be no expectation of it validating something unrelated. A test that is validating that there isn't a race condition will show the race if there is one.
You can use deterministic simulation testing to reproduce a real-world race condition 100% of the time while under test.
But that's not the kind of test that will expose a race condition 1% of the time. The kinds of tests that are inadvertently finding race conditions 1% of the time are focused on other concerns.
So it is still not a case of a flaky test, but maybe a case of a missing test.
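A toy sketch of the deterministic-simulation idea in Python: model each task as a generator that yields at its scheduling points, and derive the whole interleaving from a seed. A seed that produces the lost-update race then reproduces it on every run.

```python
import random

def increment_task(state):
    value = state["counter"]      # read
    yield                         # scheduling point: the other task may run now
    state["counter"] = value + 1  # write back (lost update if interleaved)

def run_with_seed(seed):
    state = {"counter": 0}
    tasks = [increment_task(state), increment_task(state)]
    rng = random.Random(seed)     # the entire schedule is derived from the seed
    while tasks:
        task = rng.choice(tasks)
        try:
            next(task)
        except StopIteration:
            tasks.remove(task)
    # 2 if the increments were serialized, 1 if the race (lost update) occurred;
    # either way, the same seed always produces the same result.
    return state["counter"]
```

In a real system the simulated scheduler would also control time, I/O, and message delivery, but the principle is the same: make the nondeterminism an input.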
I thought that line was kind of funny. When a CI run fails, you don't rerun it and wait for the result; you rerun it and check why the original run failed in the meantime. Is it flaky? Is it a pipeline issue? A connectivity issue? Did some key expire?
If you just rerun and don't go to find out what exactly caused CI to fail, you end up at the author's conclusion:
> (but it could also just have been flaky again).
It's possibly something else nondeterministic, which may be even more subtle from an external look than a race condition. That should be rare, but it’s been known to happen.
I’m one of today’s lucky 10k, because this judo-threw me with how I (didn’t) understand CI/CD. My experience with it has largely been a cumbersome add-on to existing processes that are often incredibly fragile and impossible to amend; turns out, that’s kind of the point. Understanding that it’s the equivalent of doing rocket tests on kit you expect to fail and using that to build better rockets suddenly makes its value far more recognizable, at least to my eyes.
Solid writeup. Definitely keeping in my personal notes.
Some of the other practices of CI are also important, not explicitly mentioned by the article but perhaps implied. CI is a lot more than just running tests on pull requests. It's a whole suite of practices enabling teams to perform and ship better. These include keeping branches short-lived by merging back to main early and often, and keeping code ready for deployment at any time by using strategies like feature switches. This keeps the cost of shipping a feature as low as possible, avoiding issues like spending lots of time rebasing and merging long-lived feature branches.
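A minimal sketch of the feature-switch idea (all names hypothetical):

```python
# A feature switch lets unfinished work merge to main without being reachable.
FLAGS = {"new_checkout": False}   # flipped to True once the feature is ready

def legacy_checkout(cart):
    return sum(cart)

def new_checkout(cart):
    # Partially built replacement: merged early to keep the branch short-lived,
    # but only callable behind the flag.
    raise NotImplementedError

def checkout(cart):
    if FLAGS["new_checkout"]:
        return new_checkout(cart)
    return legacy_checkout(cart)
```

With the flag off, main stays deployable at any time even while `new_checkout` is half-finished.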
The premise of the article has some weight, but the final conclusion with the suggestion to change the icons seems completely crazy.
Green meaning "to the best of our knowledge, everything is good with the software" is well understood.
Using green to mean "we know that this doesn't work at all" is incredibly poor UI (EDITED from "beyond idiotic" due to feedback, my bad).
And whilst flaky tests are the most problematic for a CI system, it's because they often work (in my experience, most flaky tests model situations that don't usually happen in production), so the builds are often still potentially viable for deployment, with a caveat. If anything, builds with known-problematic tests should be marked orange.
Hey, author here: I completely agree, that's why I also haven't used those strange colours for https://nix-ci.com. I just thought they would make for a cool visual representation of the point of the blog post.
Good insights but I'd suggest
"beyond idiotic" -> "misleading | poor UX"
(I agree it's a terrible choice, but civility matters, and strengthens your case.)
Fair point, updated my wording.
I agree. The same can be said of tests too: their main purpose is to find mistakes (with secondary benefits of documenting, etc.). Whenever I see my tests fail, I'm happy that they caught a problem in my understanding (manifested either as a bug in my implementation, or a bug in my test statement).
This ultimately is what shapes my view of what a good test is vs a bad test.
An issue I have with a lot of unit tests is they are too strongly coupled to the implementation. What that means is any change to the implementation ultimately means you have to change tests.
IMO, good tests are relatively immutable. You should be able to have multiple valid implementations. You should add new tests to describe the new functionality of that implementation, however, the old tests should remain relatively untouched.
If it turns out that a single change to an implementation requires you to change and update 20 tests, those are bad tests.
What I want as a dev is to immediately think "I must have broken something" when a test fails, not "I need to go fix 20 tests".
For example, let's say you have a method which sorts data.
A bad test will check "did you call this `swap` function 5 times". A good test will say "I gave the method this unsorted data set; is the data set sorted?". Heck, a good test can even say something like "was this large data set sorted in under x time". That's trickier to do well, but still a better test than "did you call swap the right number of times" or, even worse, "did you invoke this sequence of swap calls".
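Sketched in Python, using the built-in `sorted` as a stand-in implementation; the tests only look at behaviour, so any correct algorithm could be swapped in:

```python
import time

def my_sort(data):
    # Placeholder implementation: the tests below never peek inside it.
    return sorted(data)

def test_sorts_the_data():
    assert my_sort([5, 3, 1, 4, 2]) == [1, 2, 3, 4, 5]

def test_keeps_all_elements():
    data = [3, 1, 2, 1]
    # The output must be a permutation of the input, not a rewrite of it.
    assert sorted(my_sort(data)) == sorted(data)

def test_large_input_is_fast():
    start = time.perf_counter()
    my_sort(list(range(100_000, 0, -1)))
    assert time.perf_counter() - start < 5.0  # generous budget; tune per CI box
```

A swap-counting test would fail the moment the implementation switched algorithms; all three of these keep passing.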
> IMO, good tests are relatively immutable. You should be able to have multiple valid implementations. You should add new tests to describe the new functionality of that implementation, however, the old tests should remain relatively untouched.
Taken to extreme this would mean getting rid of unit tests altogether in favor of functional and/or end-to-end testing. Which is... a strategy. I don't know if it is a good or bad strategy, but I can see it being viable for some projects.
If you can't tell, I actually think functional tests have a lot more value than most unit tests :)
Kent Dodd agrees with me. [1]
This isn't to say I see no value in unit tests, just that they should tend towards describing the function of the code under test, not the implementation.
[1] https://kentcdodds.com/blog/the-testing-trophy-and-testing-c...
The goal of unit tests is to circumvent problems with performance or specificity from functional tests.
If you haven't seen those problems with yours, unit tests would be useless.
> Taken to extreme this would mean getting rid of unit tests all together in favor of functional and/or end-to-end testing.
The dirty little secret in CS is that unit, functional, and end-to-end tests are all the exact same thing. Watch next time someone tries to come up with definitions to separate them and you'll soon notice that they didn't actually find a difference or they invent some kind of imagined way of testing that serves no purpose and nobody would ever do.
Regardless, even if you want to believe there is a difference, the advice above isn't invalidated by any of them. It is only saying test the visible, public interface. In fact, the good testing frameworks out there even enforce that — producing compiler errors if you try to violate it.
Yep, the 'unit' is whatever size one chooses to use. The exact same thing happens when discussing microservices vs. monolith.
Really it all comes down to agreeing to what terms mean within the context of a conversation. Unit, functional, and end-to-end are all weasel words, unless defined concretely, and should raise an eyebrow when someone uses them.
> The dirty little secret in CS is that unit, functional, and end-to-end tests are all the exact same thing.
I agree that the boundaries may be blurred in practice, but I still think that there is distinction.
> visible, public interface
Visible to whom? A class can have public methods available to other classes, a module can have public members available to other modules, a service can have a public API that other services can call over the network, etc.
I think that the difference is the level of abstraction we operate on:
unit -> functional -> integration -> e2e
Unit is the lowest level of abstraction and e2e is the highest.
> Visible to whom?
The user. Your tests are your contract with the user. Any time there is a user, you need to establish the contract with the user so that it is clear to all parties what is provided and what will not randomly change in the future. This is what testing is for.
Yes, that does mean any of classes, network services, graphical user interfaces, etc. All of those things can have users.
> Unit is the lowest level of abstraction and e2e is the highest.
There is only one 'abstraction' that I can see: Feed inputs and evaluate outputs. How does that turn into higher or lower levels?
It took me a bit of time (and two or three different views) to finally get this. That is mostly why I hardcode my values in the tests. Make them simpler. If something fails, either the values are wrong or the algorithm of the implementation is wrong.
Comparing actual outputs against expected ones is the ideal situation, IMHO. My own preference is for property-checking; but hard-coding a few well-chosen values is also fine.
That's made easier when writing (mostly) pure code, since the output is all we have (we're not mutating anything, or triggering other processes, etc. that would need extra checking).
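For a pure function, a property check doesn't even need a framework. A minimal sketch in plain Python, assuming a hypothetical `my_sort` (the stand-in just delegates to `sorted`):

```python
import random

def my_sort(data):
    # Stand-in for the pure function under test.
    return sorted(data)

def check_sort_properties(trials=200):
    rng = random.Random(0)  # fixed seed keeps the check itself deterministic
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 30))]
        out = my_sort(xs)
        # Property 1: the output is ordered.
        assert all(a <= b for a, b in zip(out, out[1:]))
        # Property 2: the output is a permutation of the input.
        assert sorted(xs) == sorted(out)
```

Dedicated property-based testing libraries add shrinking and smarter generators, but the core idea is just this: assert invariants over many generated inputs instead of one hand-picked case.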
I also think it's important to make sure we're checking the values we actually care about; since those might not be the literal return value of the "function under test". For example, if we're testing that some function correctly populates a table cell, I would avoid comparing the function's result against a hard-coded table, since that's prone to change over time in ways that are irrelevant. Instead, I would compare that cell of the result against a hard-coded value. (Rather than thinking about the individual values, I like to think of such assertions as relating one piece of code to another, e.g. that the "get_total" function is related to the "populate_total" function, in this way...).
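A sketch of that targeted assertion, reusing the `get_total`/`populate_total` names from above (the implementations here are hypothetical):

```python
def get_total(items):
    # Hypothetical helper: sums the quantities.
    return sum(qty for _, qty in items)

def populate_total(items):
    # Hypothetical function under test: builds a display row.
    return {"label": "Total", "count": len(items), "total": get_total(items)}

def test_total_cell():
    row = populate_total([("apples", 2), ("pears", 3)])
    # Compare only the cell we care about against a hard-coded value;
    # asserting the whole row would break on irrelevant layout changes
    # (a renamed label, an added column) that this test doesn't care about.
    assert row["total"] == 5
```
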
The reason I find this important, is that breaking a test requires us to figure out what it's actually trying to test, and hence whether it should have broken or not; i.e. is it a useful signal that requires us to change our approach (the table should look like that!), or is it noise that needs its incidental details updated (all those other bits don't matter!). That can be hard to work out many years after the test was written!
The purpose of a car's crumple zone is to crumple.
Also agree. There are also diminishing returns with test cases, which is why I focus mainly on what I do not want to fail. The goal is not really to prove that my code works (formal verification is the tool for that), but to verify that certain failure cases will not happen. If one does, the code is not merged in.
https://github.com/srid/nixci Is this the project or is this a completely different Nix based CI/CD tool? I can't find a Github or anything on the website.
Author here: NixCI (https://nix-ci.com) is not open-source. https://github.com/srid/nixci has been replaced by om ci: https://omnix.page/om/ci.html
> One dreaded and very common situation is when a failing CI run can be made to pass by simply re-running it. We call this flaky CI.
> Flaky CI is nasty because it means that a CI failure no longer reliably indicates that a mistake was caught. And it is doubly nasty because it is unfixable (in theory); sometimes machines just explode.
> Luckily flakiness can be detected: Whenever a CI run fails, we can re-run it. If it passes the second time, we are sure it was flaky.
One of the specialties that I have (unwillingly!) developed at my current company is CI flakes. Nearly all flakes, well over 90% of them, are not "unfixable", nor are they even really some boogeyman unreliable thing that can't be understood.
The single biggest change I think we made that helped was having our CI system record the order¹ in which tests are run. Rerunning the tests, in the same order, makes most flakes instantly reproduce locally. Probably the next biggest reproducer is "what was the time the test ran?" and/or running it in UTC.
But once you get from "it's flakey" (and fails "seeming" "at" "random") to "it fails 100% of the time on my laptop when run this way" then it becomes easier to debug, b/c you can re-run it, attach a debugger, etc. Database sort issues (SQL is not deterministically ordered unless you ORDER BY), issues with database IDs (e.g., test expects row ID 3, usually gets row ID 3, but some other test has bumped us to row ID 4²), timezones — those are probably the biggest categories of "flakes".
While I know what people express with "flake", "flake" as a word is usually "failure mode I don't understand".
(Excluding truly transitory issues like a network failure interfering with a docker image pull, or something.)
(¹there are a lot of reasons people don't have deterministically ordered CI runs. Parallelism, for example. Our order is deterministic, b/c we made a value judgement that random orderings introduce too much chaos. But we still shard our tests across multiple VMs, and that sharding introduces its own changes to the order, as sometimes we rebalance one test to a different shard as devs add or remove tests.)
²this isn't usually because the ID is hardcoded; it is usually b/c, in the test, someone is doing `assert Foo.id == Bar.id` unknowingly. (The code is usually not straightforward about what the ID is an ID to.) I call this ID type confusion, and it's basically weakly-typed IDs in langs where all IDs are just some i32 type. FooId and BarId types would be better, and if I had a real type system in my work's lang of choice…
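The FooId/BarId idea can be sketched even in a dynamic language (the commenter doesn't name theirs, so Python here is an assumption); frozen dataclasses give distinct wrapper types whose equality never crosses types:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FooId:
    value: int

@dataclass(frozen=True)
class BarId:
    value: int

# With bare ints, `foo.id == bar.id` can pass by coincidence (both happen
# to be 3 until some other test bumps a row counter). With wrapper types,
# comparing a FooId to a BarId is never accidentally true:
assert FooId(3) == FooId(3)
assert FooId(3) != BarId(3)  # the ID type confusion now fails loudly
```

In a statically typed language the mix-up would not even compile; here it at least fails deterministically instead of only when row IDs happen to collide.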
A fairly large category of the flaky CI jobs I see is "dodgy infrastructure". For instance one recurring type for our project is one I just saw fail this afternoon, where a gitlab CI runner tries to clone the git repo from gitlab itself and gets an HTTP 502 error. We've also had issues with "the s390 VM that does CI job running is on an overloaded host, so mostly it's fine but occasionally the VM gets starved of CPU and some of the tests time out".
We do also have some genuinely flaky tests, but it's pretty tempting to hit the big "just retry" button when there's all this flakiness we can't control mixed in there.
> those are probably the biggest categories of "flakes".
Interesting. In my experience, it is always either a concurrency issue in the program under test or PBTs finding some extreme edge case that was never visited before.
I think this can be generalised into saying that the purpose of tests is to fail. I've seen far too many tests that are written to pass. You need to write tests to fail.
> When it passes, it's just overhead: the same outcome you'd get without CI.
The outcome still isn't the same. CI, even when everything passes, enables other developers to build on top of your partially-built work as it becomes available. This is the real purpose of CI. Test automation is necessary, but only to keep things sane amid you continually throwing in fractionally-complete work.
It also allows for much better record keeping than just spinning up new versions in production without the pipeline.
This is stupidly obvious but you'd be surprised how many people have the attitude that competent developers should have tested their code manually before making PRs so you shouldn't need CI.