Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AIP-84 | Add Auth for Dags #47433

Merged
merged 15 commits into from
Mar 11, 2025
Merged

Conversation

jason810496
Copy link
Contributor

related: #42360

Why

Since #47062 was reverted in #47402, this PR will also address the full test issue here.

@boring-cyborg boring-cyborg bot added the area:API Airflow's REST/HTTP API label Mar 6, 2025
@Lee-W Lee-W added the full tests needed We need to run full set of tests for this PR to merge label Mar 6, 2025
@Lee-W Lee-W self-requested a review March 6, 2025 02:24
@Lee-W Lee-W force-pushed the feature/AIP-84/auth/dags branch from 5482c3e to e7c9e29 Compare March 6, 2025 09:36
@pierrejeambrun
Copy link
Member

pierrejeambrun commented Mar 6, 2025

As expected full tests break and need fixing. Also while testing the UI I realized that the server is not happy with this change (can't load dags internal error 500) and we will need to wait for @vincbeck fix. FabAuthManager is crashing on get_permitted_dag_ids. This is a bug in the auth manager implementation and not detected by our tests at the moment because unit tests use the SimpleAuthManager where the dev setup use FabAuthManager by default.

@jason810496
Copy link
Contributor Author

As expected full tests break and need fixing. Also while testing the UI I realized that the server is not happy with this change (can't load dags internal error 500) and we will need to wait for @vincbeck fix. FabAuthManager is crashing on get_permitted_dag_ids. This is a bug in the auth manager implementation and not detected by our tests at the moment because unit tests use the SimpleAuthManager where the dev setup use FabAuthManager by default.

Just to double-check, do you mean that for this PR, I don’t need to make any changes but should wait for FabAuthManager to be fixed and then rebase to verify everything works correctly?

If that’s the case, should #47388 also be reverted? ( Since you mentioned that the dev setup use FabAuthManager by default, I initially thought the issue was caused by not adding a JWT token to the API client. )

@pierrejeambrun
Copy link
Member

pierrejeambrun commented Mar 6, 2025

There are two problems:

  • this PR is breaking the k8s KubeExec integration tests, this need to be fixed
  • this PR uses get_permitted_dag_ids which breaks the front-end at the moment, Vincent will fix this I believe. Other auth PRs do not use get_permitted_dag_ids, this is only introduced here, so other PRs don't need to be reverted.

@jason810496
Copy link
Contributor Author

There are two problems:

  • this PR is breaking the k8s KubeExec integration tests, this need to be fixed
  • this PR uses get_permitted_dag_ids which breaks the front-end at the moment, Vincent will fix this I believe. Other auth PRs do not use get_permitted_dag_ids, this is only introduced here, so other PRs don't need to be reverted.

Thanks for confirmation, I will start fixing k8s integration tests.

@vincbeck
Copy link
Contributor

vincbeck commented Mar 6, 2025

I'll have a PR for FAB today

@vincbeck
Copy link
Contributor

vincbeck commented Mar 6, 2025

#47458

@Lee-W Lee-W force-pushed the feature/AIP-84/auth/dags branch from e7c9e29 to 3621398 Compare March 7, 2025 08:04
@Lee-W Lee-W force-pushed the feature/AIP-84/auth/dags branch 5 times, most recently from 5148e1a to c327620 Compare March 7, 2025 15:26
@jason810496 jason810496 force-pushed the feature/AIP-84/auth/dags branch from c327620 to 0813087 Compare March 7, 2025 18:43
@jason810496
Copy link
Contributor Author

jason810496 commented Mar 7, 2025

Finally fixed the Kubernetes tests locally!

Some notes:

  1. The API server still uses FAB Auth Manager, so when retrieving a JWT, the same session must be used to handle the CSRF token session issue. Otherwise, it results in:

    <html lang=en>
    <title>400 Bad Request</title>
    <h1>Bad Request</h1>
    <p>The CSRF session token is missing.</p>
    
  2. The current redirect_url from FAB is still http://localhost:8080/?token=..., but KUBERNETES_HOST_PORT is not localhost:8080.
    This can be fixed in a future PR.
    Update: Fixed in Fix Breeze K8s: Patch ConfigMap [api/base_url] with Port-Forwarded Host Port #47544

@jason810496 jason810496 force-pushed the feature/AIP-84/auth/dags branch from 0813087 to 7ca6fd4 Compare March 8, 2025 02:30
@jason810496
Copy link
Contributor Author

jason810496 commented Mar 8, 2025

I ran the Kubernetes tests locally yesterday, and they all passed. However, after rebasing today, they consistently fail.
It seems that triggering a dagRun results in a 404 error:

post_string = f"http://{host}/public/dags/{dag_id}/dagRuns"
print(f"Calling [start_dag]#2 {post_string}")
logical_date = datetime.now(timezone.utc).isoformat()
# Trigger a new dagrun
result = self.session.post(post_string, json={"logical_date": logical_date})

self = <requests.adapters.HTTPAdapter object at 0x119d835b0>, request = <PreparedRequest [POST]>, stream = False, timeout = Timeout(connect=None, read=None, total=None), verify = True, cert = None, proxies = OrderedDict()

# ...
    
        except MaxRetryError as e:
            if isinstance(e.reason, ConnectTimeoutError):
                # TODO: Remove this in 3.0.0: see #2811
                if not isinstance(e.reason, NewConnectionError):
                    raise ConnectTimeout(e, request=request)
    
            if isinstance(e.reason, ResponseError):
>               raise RetryError(e, request=request)
E               requests.exceptions.RetryError: HTTPConnectionPool(host='localhost', port=14476): Max retries exceeded with url: /public/dags/example_kubernetes_executor/dagRuns (Caused by ResponseError('too many 404 error responses'))

Tracing the issue on the API side, it looks like the DAG must be active to be triggered:

dm = session.scalar(select(DagModel).where(DagModel.is_active, DagModel.dag_id == dag_id).limit(1))
if not dm:
raise HTTPException(status.HTTP_404_NOT_FOUND, f"DAG with dag_id: '{dag_id}' not found")

However, in the patch DAG endpoint, only the is_paused field can be modified, while the is_active field cannot be changed:

dag = session.get(DagModel, dag_id)
if dag is None:
raise HTTPException(status.HTTP_404_NOT_FOUND, f"Dag with id: {dag_id} was not found")
fields_to_update = patch_body.model_fields_set
if update_mask:
if update_mask != ["is_paused"]:
raise HTTPException(
status.HTTP_400_BAD_REQUEST, "Only `is_paused` field can be updated through the REST API"
)


Update: This can be resolved by adding airflow dags reserialize before the test run.

@jason810496
Copy link
Contributor Author

jason810496 commented Mar 9, 2025

No error with Kubernetes tests / K8S System:KubernetesExecutor-3.9-v1.29.12-true
but Kubernetes tests / K8S System:KubernetesExecutor-3.9-v1.29.12-false always fail due to timeout.

=========================== short test summary info ============================
FAILED kubernetes_tests/test_kubernetes_executor.py::TestKubernetesExecutor::test_integration_run_dag_with_scheduler_failure - Failed: Timeout >300.0s
=================== 1 failed, 1 passed in 376.40s (0:06:16) ====================

Update:
I just looked into the Kubernetes test with the -false suffix, which indicates that useStandardNaming=false. I will try deploying with useStandardNaming=false and observe the differences when running the Kubernetes test against a cluster with this setting.

@Lee-W Lee-W force-pushed the feature/AIP-84/auth/dags branch from 7724236 to e62d885 Compare March 10, 2025 00:15
@jason810496
Copy link
Contributor Author

jason810496 commented Mar 10, 2025

Yep, what I meant was maybe we can extend the timeout time. but would like to confirm whether it was set that way for a reason

Sure. will increase it to 500 to ensure that the test failure is solely due to the timeout threshold.
I will push if this try still fail.

@jason810496
Copy link
Contributor Author

Base on the following traceback, I think we need to add retry mechanism for _get_jwt_token part, or maybe having HTTP Adapter for 401, 403 status code.

https://github.com/apache/airflow/actions/runs/13768914292/job/38504602702

@jason810496
Copy link
Contributor Author

New CI fail:Prepare breeze & CI image:3.9 probably related to rebase ?

cc @Lee-W

@Lee-W
Copy link
Member

Lee-W commented Mar 11, 2025

Yep, seems to be something starts to happen this morning and can be fixed by retriggering

@jason810496
Copy link
Contributor Author

jason810496 commented Mar 11, 2025

@rawwar also ran the k8s test locally and found that it only fails when running all tests but passes when running a specific test.

I'm looking into the reason now.


Update: Not found the root cause, but add the following mechanism

  • Add JWTRefreshAdapter to handle 401, 403 status code.
  • Only rollout restart airflow-api-server deployment if needed.

@jason810496
Copy link
Contributor Author

Only 1 k8s test fail flaky 🎉 , I will rebase main and run again.

kubernetes_tests/test_kubernetes_pod_operator.py::TestKubernetesPodOperator::test_kubernetes_pod_operator_active_deadline_seconds[10] - requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

@jason810496
Copy link
Contributor Author

Good news! All k8s tests are success in this run, I will rebase later ( there are conflicts with latest main ).

@Lee-W
Copy link
Member

Lee-W commented Mar 11, 2025

all green. merging

@jason810496
Copy link
Contributor Author

jason810496 commented Mar 11, 2025

All green!
I believe the Kubernetes tests have become more stable after this PR by adding additional HTTP retries at each stage of the API calls.

Lesson learned:
Retries are essential in distributed systems since components or connections can become unstable at any time.

@Lee-W Lee-W merged commit 41a8a9a into apache:main Mar 11, 2025
89 checks passed
@rawwar
Copy link
Collaborator

rawwar commented Mar 11, 2025

Lesson learned: Retries are essential in distributed systems since components or connections can become unstable at any time.

How many retries?
Answer: I'll take all of them

@Lee-W
Copy link
Member

Lee-W commented Mar 11, 2025

#protm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:API Airflow's REST/HTTP API full tests needed We need to run full set of tests for this PR to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants