Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Converted inputs to sets + fixed inputs #87

Merged
merged 3 commits into from
Jan 31, 2025
Merged

Converted inputs to sets + fixed inputs #87

merged 3 commits into from
Jan 31, 2025

Conversation

sesquideus
Copy link
Contributor

@sesquideus sesquideus commented Jan 28, 2025

  • reconciled with jb/ifu_rsrf_dev
  • converted inputs to sets to avoid duplicates and facilitate comparisons
  • added a few diagnostic messages
  • added / removed inputs to conform to the SOF files (possibly not to DRL-D though)

@sesquideus sesquideus requested a review from janusbrink January 28, 2025 15:20
@hugobuddel
Copy link
Contributor

I don't think it is wise to convert the input to lists. The order of the input (e.g. in the SOF) file is important. As in that it affects things in the ESO stack.

Also, I would like to (at some point) derive the workflow directly from the recipes, or at least experiment with that. And then the order can be very important.

So I (strongly) oppose to this change.

@hugobuddel
Copy link
Contributor

Plus, can we fix the CI before adding new things?

@sesquideus
Copy link
Contributor Author

So the "set of files" is not actually a set, but a list? 🤔
I am not touching the contents of the framesets, only the Inputs in InputSet.inputs (the order is currently completely arbitrary there and ESO does not care either, this is our internal structure). What I need to prevent is the repetition, but of course that can be done with lists too.

@janusbrink
Copy link
Contributor

Not sure about the ordering issue, but losing this info might get us tangled down the line, compared to what we gain by going to sets? So, we want 'ordered sets'? Something like OrderedSet([1, 2, 3]) ?

@sesquideus
Copy link
Contributor Author

The source of this information is the order in which they are defined in the class hierarchy (by us). So for instance in DarkImageProcessor RAW is always before MASTER_DARK and then other recipe-specific inputs follow. I usually copied the inputs from DRLD, so there might be some logical ordering, but I never cared much -- if we access the inputs by attribute name (self.raw) order should not matter much anyway.

But with sets it is easier for me to prevent developer errors: if I derive from RawImageProcessor.InputSet, it automatically includes inputs |= {self.raw}. If I forget and include it again in the derived class, there will be too many inputs and my tests for all_frames == used_frames will fail.

@hugobuddel
Copy link
Contributor

I think that the input should be order, and that it should be the exact same order as the with_associated_input entries in the EDPS tasks. I changed my mind while writing this comment, go ahead, the input should be a set.

Technically the order should not matter for the current way things are organized. But making the input a set will make it harder to move your beautiful abstraction layer into the direction I would like. For example, I'd like all the information that is in the EDPS tasks to be also present in the recipe itself. To the extend that the EDPS task can be derived from the recipe. (FWIW, this also means I personally like to have 3 or 4 recipes for the dark instead of 1.)

However, I don't think the order of the with_associated_input actually matters in the current implementation of the EDPS. But a big point of my ADASS paper and talk is that I believe that to be wrong: making use of that order in the associated input allows one to propagate information over the dependency graph.

So I would like a way forward in which this ordering is kept.

Hmm, thinking about it more, I do think that it should be possible to infer the ordering in case it would matter. So maybe go ahead and do this.

To summarize a case where the ordering would matter in the EDPS workflow is when you'd want your standard star reduction to use the same flat as is used for the science data reduction. Then you need to first determine the flat for the science data before you can decide how to do the standard star reduction (which is indirectly also input to the science task).

But if such a selection mechanism is indeed needed, then it must be encoded. E.g. through a query like science.standard.flat == science.flat (where a dot follows the chain of input dependencies). It should be possible to encode such a query in a way that includes which of those is leading, e.g. by using another symbol than == or by convention that the right side should be evaluated first.

Also, you are right that this change does not directly affect the SOF files, but indirectly it might. For example if we create documentation from the recipe directly and that people create their SOF based on the documentation. (And you are also right that the order in the SOF files should not matter, but it can.)

So I've reduced my strong opposition to a mild one, and perhaps I now even think that the ordering should not matter. Furthermore, there are quite a lot of hypotheticals in the story above, that might never come true, so you should not worry about.

(Another hypothetical: say you actually want the same input twice? Not as a list-of-two, but as two distinct inputs. Now we can't do that anymore! But try as I might, all scenario's I can come up with would be better served with either a list, or by having different classes for the input data so they are not duplicates anymore.)

Concluding, I think the benefits outweigh the negatives, so good idea!

@hugobuddel
Copy link
Contributor

Maybe a data structure with a deterministic ordering like OrderedSet would indeed be beneficial. Because just this morning I fixed a bug in the ScopeSim tests because find is not deterministic: AstarVienna/ScopeSim#535 . We should prevent such random bugs if possible.

@sesquideus
Copy link
Contributor Author

sesquideus commented Jan 29, 2025

@hugobuddel Thanks for the stream of consciousness! I had a throbbing headache at night and decided not to touch anything, now with a clear head I think I understand. I see no downsides in using an ordered set, but then we need to make sure it matches DRLD / workflows everywhere. On the other hand, I cannot find any ordered set implementation in the standard library, only sortedcollections.

If an input file is required twice for whatever reason, it is perfectly possible to use two separate Inputs pointing to the same file (though right now I have no idea how that could be useful).

@janusbrink
Copy link
Contributor

From Python 3.7 onwards a keys-only dict should mostly work like an ordered set? Depends on the operations we expect on it after it is created, though.

@sesquideus
Copy link
Contributor Author

From Python 3.7 onwards a keys-only dict should mostly work like an ordered set?

Then I am already using that. But I get

>>> {1, 2, 3} == {3, 1, 2}
True

which is exactly what I want, but apparently not ordered.

@hugobuddel
Copy link
Contributor

I like to take a step back and think about what we are trying to solve. You claim that this is necessary for the "used frames", that is #85 (if I understand correctly), because there you say

It requires me to convert inputs from a list to a set

However, I'm not sure that #85 solves the problem as set out in #79, which I think it is aimed to do. So transitively, I'm not sure this P.R. is necessary. We might still merge this because it seems useful on its own, but we should do so for the right reasons.

The main reason to distinguish between "all frames" and "used frames" is when a recipe has multiple output products, and the different output products use different input. The two main cases are:

  • The recipe has extra (optional) outputs. The example in PipelineProduct.save(): distinguish between allframes and usedframes #79 is that many recipes calculate an intermediate data product that represents the background. This background is usually discarded, but it is worthwhile to have an option to save the background to file (in addition to saving the primary output of the recipe). The primary output of the recipe would have used all input files, but the background would only have used a subset of those. So the file with the background should have fewer files listed as dependency than the primary output.
  • A recipe can process multiple files in parallel. For example the "basic reduce" recipes could for example process all science raws from a single template in parallel. A benefit would be that calibrations like the master flat and master dark would be shared between all of them, and thus would have to be loaded into memory only once. But all frames would require their own individual persistence correction. So each file should be saved with only one persistence correction as used frame.

In both cases I don't see how having the input as a set would help. But I could be misunderstanding what we are trying to achieve. Also, we can still merge this either way.

@sesquideus
Copy link
Contributor Author

I see. No, for now I only wanted to compare frames present in the SOF and those that are also loaded. If it should mean tracking which inputs were actually used for which product, it does not attempt to solve that.

Currently I am only trying to ensure that all inputs are present in the SOF and read (unless not required) and vice versa, that there are no extra / unused files in the SOF (I silently assume that the files are correct and whenever something is missing, it is the fault of whoever wrote the recipe). So it still is a tiny step in the right direction, but for validating the Inputs, not for tracking which are used in the Products.

But I guess we will have to add something similar to the Product class anyway. Then we can populate it manually from within process_images. The hierarchy can simplify it a lot, for instance a RawImageProcessor should be able to process and mark all RAWs as used too.

Again, it seems to me that this will be achieved more easily with sets rather than lists, but I will gladly defer this discussion to the future PR.

@sesquideus
Copy link
Contributor Author

I have found the error, it is the discrepancy between what pyesorex and edps use as their SOF. After removing LM_WCU_OFF_RAW from metis_det_detlin and gain maps and linearity from the other SOFs EDPS completes all its job.

However that makes other pyesorex tests fail. I can xfail the tests now, but to be consistent with the DRLD the files have to be there.

@sesquideus sesquideus requested a review from hugobuddel January 30, 2025 15:53
Copy link
Contributor

@hugobuddel hugobuddel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sesquideus , let's merge this and get everything working again!

Some thoughts:

  • I think we should write our recipes such that they will do the best they can even if input data is missing (or weird). E.g. metis_det_detlin might still be able to do something sensible if there is no LM_WCU_OFF_RAW. Or maybe depending on some flag that can be set. Such a 'just do it' flag might be especially useful for AIT because people might deliberately want to skip some calibration steps. (Persistence should always be optional though.)
  • The recipes should emit a warning on missing input so it is immediately clear what is wrong.
  • If a change in the recipes require a change in the workflow, then we can just make that change in the workflow too.

@sesquideus sesquideus merged commit 49896b0 into main Jan 31, 2025
2 checks passed
@sesquideus sesquideus deleted the mb/usedframes branch January 31, 2025 10:50
@sesquideus sesquideus restored the mb/usedframes branch January 31, 2025 20:35
@sesquideus sesquideus deleted the mb/usedframes branch January 31, 2025 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants