This adds some new optimization strategies. For tests, we use Either(SETA,
SkipUnlessSchedules), thereby giving both mechanisms a chance to skip tasks. On
try, SETA is omitted.
MozReview-Commit-ID: GL4tlwyeBa6
It is not at *all* clear how multiple optimizations for a single task should
interact. No simple logical operation is right in all cases, and in fact in
most imaginable cases the desired behavior turns out to be independent of all
but one of the optimizations. For example, given both `seta` and
`skip-unless-files-changed` optimizations, if SETA says to skip a test, it is
low value and should be skipped regardless of what files have changed. But if
SETA says to run a test, then it has likely been skipped in previous pushes, so
it should be run regardless of what has changed in this push.
This also adds a bit more output about optimization, that may be useful for
anyone wondering why a particular job didn't run.
MozReview-Commit-ID: 3OsvRnWjai4
This adds some new optimization strategies. For tests, we use Either(SETA,
SkipUnlessSchedules), thereby giving both mechanisms a chance to skip tasks. On
try, SETA is omitted.
MozReview-Commit-ID: GL4tlwyeBa6
It is not at *all* clear how multiple optimizations for a single task should
interact. No simple logical operation is right in all cases, and in fact in
most imaginable cases the desired behavior turns out to be independent of all
but one of the optimizations. For example, given both `seta` and
`skip-unless-files-changed` optimizations, if SETA says to skip a test, it is
low value and should be skipped regardless of what files have changed. But if
SETA says to run a test, then it has likely been skipped in previous pushes, so
it should be run regardless of what has changed in this push.
This also adds a bit more output about optimization, that may be useful for
anyone wondering why a particular job didn't run.
MozReview-Commit-ID: 3OsvRnWjai4
We currently vary the cache name for run-task tasks whenever run-task
changes. This allows us to not worry about backwards or forwards
compatibility of caches in run-task tasks.
This strategy doesn't work for out-of-tree Docker images because
the content of run-task cannot be determined at Taskgraph time:
the content of run-task was determined when that Docker image was
built and there is no way to get that content efficiently during
Taskgraph.
So, for out-of-tree Docker images we now vary the cache name by
the Docker image value, which includes its name and a tag or
hash. This means that out-of-tree run-task tasks will get separate
caches for each distinct Docker image.
This isn't ideal. Ideally we would share caches if run-task doesn't
vary between Docker images. But without any way of proving that
at Taskgraph time, we take the safe road and force cache separation.
MozReview-Commit-ID: FMiQBqfvjqW
The image_builder Docker image doesn't set a "command" in its task
definition. The image instead relies on a RUN in its Dockerfile to
control the started command. This command is a shell script which
eventually runs run-task.
This all means that image_builder tasks are executing run-task but
the cache sanitization implemented in bug 1391476 isn't getting
applied to those tasks. This means run-task could barf due to
constraint violations due to improperly configured caches.
The fix for this is to teach the generic task transform that
image_builder tasks use run-task. The effect of this is that
some environment variables get set and the cache name changes
depending on the contents of run-task.
MozReview-Commit-ID: IFqsDxD0eDh
This creates a new "job-from" field that contains the relative filename the job was defined
in. The filename is relative to 'config.path'. If the task came from the 'jobs' key defined
in kind.yml, this field will be set to 'kind.yml'.
MozReview-Commit-ID: 9e1tEb6XuZT
Clean up and standardize Treeherder symbols for Talos and AWSY tasks:
* Stylo disabled groups include `sd`
* Stylo sequential groups include `ss`
MozReview-Commit-ID: 7cl6e0XvXNO
Convert all jobs that were exercising Stylo enabled to Stylo disabled instead.
Stylo enabled is now handled by the default jobs.
In Perfherder, Stylo enabled jobs will be untagged and take over the existing
Gecko series. Stylo disabled jobs will have a new `stylo-disabled` tag and
create a new series.
MozReview-Commit-ID: BMXBRg3A95j
This adds some new optimization strategies. For tests, we use Either(SETA,
SkipUnlessSchedules), thereby giving both mechanisms a chance to skip tasks. On
try, SETA is omitted.
MozReview-Commit-ID: GL4tlwyeBa6
It is not at *all* clear how multiple optimizations for a single task should
interact. No simple logical operation is right in all cases, and in fact in
most imaginable cases the desired behavior turns out to be independent of all
but one of the optimizations. For example, given both `seta` and
`skip-unless-files-changed` optimizations, if SETA says to skip a test, it is
low value and should be skipped regardless of what files have changed. But if
SETA says to run a test, then it has likely been skipped in previous pushes, so
it should be run regardless of what has changed in this push.
This also adds a bit more output about optimization, that may be useful for
anyone wondering why a particular job didn't run.
MozReview-Commit-ID: 3OsvRnWjai4
`run-task` is taught a --sparse-profile argument to be passed down
to `hg robustcheckout` for the main source checkout. It does what
you expect: performs a sparse checkout using the named profile.
The Taskgraph YAML for run-task is taught a "sparse-profile"
property to define the sparse profile. When defined, --sparse-profile
will be passed down to `run-task` and the cache name will be updated
to reflect the use of sparse checkout.
Our cache checking transform is updated to audit for the use of
--sparse-profile without the corresponding "-sparse" cache name
variation.
The reason we need a distinct cache name for sparse is because
clients that aren't sparse aware will be unable to read checkouts
that are sparse. By forcing sparse and non-sparse into different
cache pools, we avoid compatibility issues.
In the ideal world, we probably support sparse profiles on all the
VCS checkouts that `run-task` supports (e.g. --tools-checkout).
Perfect is the enemy of done. All of this is defined in-tree and
it is easy enough to change atomically.
MozReview-Commit-ID: 79k7Vul0hHO
The UID and GID that a task executes under is dynamic. As a result,
caches need to be aware of the UID and GID that owns files otherwise
subsequent tasks could run into permission denied errors. This is
why `run-task --chown-recursive` exists. By recursively changing
ownership of persisted files, we ensure the current task is able
to read and write all existing files.
When you take a step back, you realize that chowning of cached
files is an expensive workaround. Yes, this results in cache hits.
But the cost is you potentially have to perform hundreds of thousands
of I/O system calls to mass chown. The ideal situation is that
UID/GID is consistent across tasks on any given cache and
potentially expensive permissions setting can be avoided. So, that's
what this commit does.
We add the task's UID and GID to run-task's requirements. When we
first see a cache, we record a UID and GID with it and chown the
empty cache directory to that UID and GID. Subsequent tasks using
this cache *must* use the same UID and GID or else run-task will
fail.
Since run-task now guarantees that all cache consumers use the same
UID and GID, we can avoid a potentially expensive recursive chown.
But there is an exception. In untrusted environments (namely Try),
we recursively chown existing caches if there is a uid/gid mismatch.
We do this because Try is a sandbox and any random task could
experiment with a non-standard uid/gid. That populated cache would
"poison" the cache for the next caller. Or vice-versa. It would be
annoying if caches were randomly poisoned due to Try pushes that
didn't realize there was a UID/GID mismatch. We could outlaw "bad"
UID and GIDs. But that makes the barrier to testing things on Try
harder. So, we go with the flow and recursively chown caches in
this scenario.
This change will shine light on all tasks using inconsistent UID
and GID values on the same cache. Bustage is anticipated.
Unfortunately, we can't easily know what will break. So it will be
one of those things where we will have to fix problems as they arise.
Fortunately, because caches are now tied to the content of run-task,
we only need to back out this change and tasks should revert to caches
without UID and GID pinning requirements and everything will work
again.
MozReview-Commit-ID: 2ka4rOnnXIp
We recently introduced support for telling run-task about caches so
it could sanitize them automatically. We also recently taught
docker-worker and docker-engine how to declare volumes.
Building on that work, we now pass a list of paths corresponding
to Docker volumes to run-task.
run-task now verifies volumes behave as expected. Unless the volume
paths correspond to caches, run-task verifies they are empty and chowns
them to an appropriate owner.
Requiring empty volumes is an arbitrary decision. But as the inline
comment says, it keeps things simpler and makes caches and volumes
behave more like each other.
MozReview-Commit-ID: 5lm2uIitrS3
See the inline comment for the rationale here.
This check may not catch all volumes and caches. But after subsequent
commits refactor how permissions for caches and volumes are handled,
this edge case will likely result in permissions errors in the task,
so it isn't worth worrying about.
Several Dockerfile have been updated to add missing VOLUME so the check
passes.
In the case of desktop1604-test, we stopped removing
/home/worker/.cache because you can't remove a mount point, which is
what volumes are inside Docker containers.
MozReview-Commit-ID: GEyNkkX00kN
Docker volumes are host-mounted filesystems. We typically mount
caches at their location. But not always. The reason we define
VOLUME in Dockerfiles is we're guaranteed to get a fast host
filesystem instead of AUFS when a cache isn't mounted.
In this commit, we teach the docker-worker payload builder about
the existence of Docker volumes. Docker volumes can be declared
inline in the YAML. More conveniently, we automatically parse out
VOLUME lines from corresponding in-tree Dockerfile.
We'll do useful things with this data in subsequent commits.
MozReview-Commit-ID: BNxp8EDEYw
Previously, we conditionally added caches to a task if the current
parameters warranted it.
In order to audit that all caches fulfill basic requirements, we need
to have unconditional knowledge of all caches.
This commit introduces an optional key on each cache entry stating
whether it should be skipped in "untrusted" environments. When we
convert a task definition to a worker payload, we filter out these
caches if necessary.
This change uncovered an inconsistency with filtering caches. In
one location we filtered on the source repo name. In others, we
filtered on the SCM level.
Setting the caches in the spidermonkey kind also changed slightly
to ensure we're not overwriting existing caches. I don't think this
has any behavior changes. But the new method is more correct.
MozReview-Commit-ID: 1crpdWHqQ68
run-task just grew features to aid with cache validation.
Attempts by run-task to use caches not under its control will fail.
So, we add a transform that audits for and ensures that certain
caches are only being used with run-task. This will help catch
stragglers attempting to use e.g. the legacy VCS checkouts or
tooltool caches without run-task. Fortunately, there are no
violations for this policy. Yay!
MozReview-Commit-ID: LBCmDUdgcuM
Today, cache names are mostly static and are brittle as a result.
In theory, when a backwards incompatible change is performed on
something that touches a cache, the cache name needs to be changed
to ensure tasks running the old code don't see cached data from the
new task. (Alternatively, all code is forward compatible, but that is
hard to implement in practice.)
For many things, the process works as planned. However, not everyone
knows that cache names need changed. And, it isn't always obvious
that some things require fresh caches. When mistakes are made, tasks
break intermittently due to cache wonkiness.
One area where we get into trouble is with UID and GID mismatch.
Task A will use a Docker image where our standard "worker" user/group
is UID/GID 1000:1000. Then Task B will use UID/GID 500:500. (This is
common when mixing Debian and RedHel based distros.) If they use the
same cache, then Task B needs to chown/chmod all files in the cache
or there could be a permissions problem. This is exactly why
run-task recursively chowns certain paths before dropping root
privileges.
Permissions setting in run-task solves permissions problems. But
it doesn't solve content incompatibility problems. For that, you
need to change cache names, not use caches, or blow away content
when incompatibilities are detected.
This commit starts the process of adding a little bit more coherence
to our caching story.
There are two main features in this commit:
1) Cache names tied to run-task content
2) Cache validation in run-task
Taskgraph now detects when a task is using caches with run-task. When
caches and run-task are both being used, the cache name is adjusted to
contain a hash of run-task's content. When run-task changes, the cache
name changes. So, changing run-task ensures that all caches from that point
forward are "clean." This frees run-task and any functionality related
to run-task (such as maintaining version control checkouts) from
having to maintain backwards or forwards compatibility with any other
version of run-task. This does mean that any changes to run-task
effectively wipe out caches. But changes to run-task tend to be
seldom, so this should be acceptable.
The second part of this change is code in run-task to record per-cache
properties and validate whether a populated cache is appropriate for
use. To enable this, taskgraph passes a list of cache paths via an
environment variable. For each cache path, run-task looks for a
well-defined file containing a list of "requirements." Right now,
that list is simply a version string. But other features will be
worked into it. If the cache is empty, we simply write out a new
requirements file and are done. If the file exists, we compare
requirements and fail fast if there is a mismatch. If the cache
has content but not this special file, then we abort (because this
should never happen).
The "requirements" validation isn't very useful now because the only
entry comes from run-task's source code and modifying run-task will
change the hash and cause a new cache to be used. The implementation
at this point is more demonstrating the concept than doing anything
terribly useful with it.
MozReview-Commit-ID: HtpXIc7OD1k
The upload now uses MOZ_SCM_LEVEL to determine which secret and bucket to
upload to, so it can potentially run at any level.
This also modifies task descriptions to allow {level} in scopes, and updates
try syntax to allow `-j doc-upload` even though run-on-tasks says it doesn't
run on try by default.
MozReview-Commit-ID: Dm27TGPa7IM
The toolchain jobs produce artifacts that are going to be used by other
jobs, but there is no reliable way for the decision task to know the
name of those artifacts. So we make their definition required in the
toolchain job definitions.
MozReview-Commit-ID: 9tPAMBAkvCs
Added config via tests.yml, test-sets.yml
Added remove_installer to config for linux.
Added blank for windows as that will come later.
MozReview-Commit-ID: 9tPAMBAkvCs
Added config via tests.yml, test-sets.yml
Added remove_installer to config for linux.
Added blank for windows as that will come later.