This repository was archived by the owner on Mar 31, 2023. It is now read-only.
Alcor v1.0 Release Plan
lfu-ps edited this page Jan 28, 2022 · 53 revisions
Tentative date: 01/31/2021
Release: Alcor v1.0
Release link: TBD
- E2E performance
- VPC API throughput reaches RPS 500 | Status: Achieved
- Subnet API throughput reaches RPS 500 | Status: Achieved
- Port API throughput reaches RPS 1000 (current RPS = 800) | Status: RPS reaches 1300
- Port API latency cut down to 200 ms when RPS = 100 and the database is populated with 100K VPCs and 1M ports | Status: Reaches 200~250 ms
- VM boot throughput | Status: Launched 1500 VMs at once (success rate = 100%)
- ACA + NCM
- Performance testing for Hazelcast, Ignite and ETCD | Status: test report ongoing (ETA: 1/30)
- Cut latency for 1 Million OVS rules/flows (local ports + neighbor ports) to single host from ~60 sec to 20 sec (50K ports/second) | Status: Achieved, reduced to 13 seconds
- Measure one host ACA on-demand throughput (Initial goal: 100K) and latency (under 100ms) | Status: Achieved, single-host throughput reaches 300K
- Measure NCM on-demand throughput and latency with stress gRPC clients | Status: Testing ongoing (ETA: 1/28)
- Large VPC provisioning | Status: Research ongoing, spans multiple releases
- Build a large-scale emulation framework for large VPC up to 1M ports per VPC (MiniNet, MaxiNet, DistriNet)
- MiniNet v2.0 stress test on a single server (set up a custom tree topology, stress test with Ryu controller)
- Set up a Distrinet cluster on multiple servers and build basic test cases.
- Microservice Development
- Goal State v2 E2E
- DPM v2.1 to support GS v2
- new gRPC clients based on GS v2
- new programming path from DPM to NCM
- ACA to support GS v2 for routing rule update
- DPM to support L2/L3 neighbor, and L3 routing rule update end-to-end
- Cache/DB schema redesign to improve latency for large scale
- Port/IP/VPC cache improvement
- DPM cache new design and implementation
- Design doc (James/Dahai)
- DPM subnetPortCache optimization
- VPC/Subnet/Route Manager v2.0 for higher concurrency and throughput
- VPC manager performance improvement
- Subnet manager performance improvement
- Route manager performance improvement
- Host network configuration optimization
- Topic I: Topology-aware policy based route reachability detection and lookup
- Phase I: fundamental lookup data structure of consolidated reachability-adjusted routes
- Phase II: (enhancement) lookup path trimming and optimization (stretch goal)
- Topic II: Multi types input policy conflict detection and routes optimization (stretch goal)
- Run Alcor performance test via Rally (Alcor perf test plan)
- Consolidate Rally test script to run all test at once - Yan (ETA: 1/30)
- Upgrade Rally version
- Update Rally scripts to be compatible with new Rally
- Subnet/VPC API perf test
- Alcor performance bugs (Issue list # - Dahai 2/2)
- Database optimization
- Ignite version upgraded to v2.10
- Optimization techniques
- Prefix query
- NCM to Ignite profiling and latency data collection for both goal state provisioning and on-demand requests (?)
- NCM Ignite batch write
- ACA major refactor & On-demand workflow perf profiling
- ACA state computation/orchestration layer that processes Goal State and orchestrates programming jobs to data plane
- Evaluation of a logic-agnostic, high-performance C++ framework
- ACA threading model redesign to support high concurrency
- Migration of the existing goal-state-to-flow manipulation to the new mechanism
- Implementation and perf test 1 million ports (target: 50+ seconds to 20 seconds)
- Alcor benchmarking framework based on CBench
- Investigate CBench (an OF controller benchmarking tool) and leverage its packet in/out test mechanism for ACA and on-demand
- Throughput measurement for on-demand ACA standalone (within ACA, local cache + threading model)
- Throughput measurement for ACA current threading model, find bottleneck
- Throughput measurement evaluation and comparison, across upper layer threading pool candidates
- Performance measurement for ACA upper layer (GS communication) and medium layer (state programming)
- Alcor benchmarking framework based on Dubhe
- Throughput measurement for on-demand NCM standalone
- Measure grpc round trip latency
- Measure NCM internal latency
- Throughput measurement for on-demand E2E (ACA + NCM) (stretch goal)
- ACA perf profiling for aca/ovs interaction improvement
- perf profiling for new channel mechanism of ovs connection/flows
- Driver communication layer that exchanges commands and events
- lib-fluid library (for connection control) importing and wrapper migration
- openvswitch library (for flow control) importing and wrapper migration
- integration with existing ovs control based on vconn
- support of normal flow operations (add/mod/del etc.)
- support of advanced feature of bundling (requires OF1.4 and above)
- Fix ACA crash issue when concurrently processing a large number of GoalStates
- Build a large-scale emulation framework for large VPC up to 1M ports per VPC (MiniNet, MaxiNet, DistriNet)
- MiniNet v2.0 stress test on single server
- Setup a custom tree topology
- Stress test with Ryu controller (reaches 1K nodes per server reliably)
- Stress test with Alcor controller
- Distrinet stress test on multiple servers
- Deploy Distrinet on multiple VMs (one master + 2 worker nodes)
- Deploy Distrinet on multiple servers
- Stress test on a customized tree topology with Alcor controller
- OpenStack Cross-Service Request-Level Profiling
- Environment Setup
- Use Jaeger's OpenTracing features in Alcor
- Jaeger support in DPM (PR under review)
- Jaeger support in ACA (PR under review)
- Tool and script development - move to next release
- OSProfiler to Jaeger trace converter
- Alcor DevOps and CI/CD enhancement
- Alcor autoscaling support | status: pending on Yan's test, move to next release
- Fix bugs
- VPC-based implementation for Message Queue scale path (Min Chen/Luyao Luo)
- 10/13 Status:
- Min: Submit PR #1 to upgrade GS v1.0 to GS v2.0 by 10/16 (Sat.); submit PR #2 to add Pulsar support to DPM by 10/18 (Monday)
- Luyao: Submit PR to ACA repo by 10/17 (Sunday)
- Target: Start integration test by 10/20 (Wed.)
- 10/20 Status:
- Min: PR #1 merged and Jenkins passed; PR #2 submitted (#695)
- Luyao: ACA PR submitted and under review
- Integration Test Plan:
- Step 1: Use PostMan to test basic API functionality (pub & sub)
- Step 2: Basic E2E test cases: port creation/update, routing rule etc.
- Step 3: Run Jenkins jobs (ping Liguang & Prasad on Slack)
- Scalability test framework for 1M nodes regions and 100K ports VPC (Jiawei Liu/Hanfeng Zhan/Jing Fan)
- 10/13 Status:
- Jiawei/Hanfeng: Set up Maxinet on 3 nodes following the quick setup guide; hitting a blocking issue of a missing folder "pox"
- Target: Prepare a Maxinet demo by next meeting 10/20 (Wed.)
- 10/20 Status:
- Jiawei/Hanfeng: Maxinet worker node registration failure; Maxinet is not well maintained and works only on Ubuntu 14.04
- Next Step: Distrinet is a good alternative; try MiniNet to get the max # of containers per host; reimage containers with truncated ACA.
- ML-based on-demand programming (Yan Yu/Shuang Liang/Chen Min)
- 10/13 Status:
- Min: Present a draft design based on online retail recommendation system and leverage VM-VM similarity
- Shuang: Working on historical data modeling for VM-VM connectivity that covers network neighborhood, routing, security group
- Yu: Working on modeling historical data as vectors and GoalState recommendation algorithm
- Target:
- Start bi-weekly paper presentations from next week, 10/20
- Project ETA: 10/20
- 10/20 Status:
- Yu: Present a paper for online retail recommendation system and its ML algorithm (static catalog + customer purchase history)
- Shuang: Discuss one example of public cloud deployment
- Host network configuration optimization
- Topic I: Rule optimizer for fast lookup
- Topic II: Topology-aware state reconciliation (stretch goal)
- Topic III: Multi-service rule conflict detection and resolution (stretch goal)
- Fix existing Rally test cases and Alcor bugs
- Router-related resource cleanup issue
- Separate Neutron and VPC router scenarios for getSubnetRouteTable
- Issue 666: Ip address allocation not found, 404 error by IP manager (Under Test)
- Issue 667: 412 Precondition Failed Ip address conflict with exist (Under Test)
- Alcor Consolidated APIs (stretch goal)
- Investigate Nova-Neutron APIs workflow
- Design new Alcor APIs to simplify VM boot process
- Write a Python Nova Alcor client to utilize the new APIs
- Alcor Monitoring by Prometheus and Grafana
- Metrics collector for K8s services, container, and bare metal
- Set up Prometheus and Grafana
- Microservice development
- UT enhancement for routing rule update in ACA and DPM
- Host network configuration optimization
- Topic I: Topology-aware policy based route reachability detection and lookup
- Phase II: (enhancement) lookup path trimming and optimization
- Topic II: Multi types input policy conflict detection and routes optimization
- Run Alcor performance test via Rally (Alcor perf test plan)
- Routing rule update new scenario E2E
- Subnet-level routing table update
- Network-level routing table update
- Database optimization
- Ignite version upgraded to v2.10
- Optimization techniques
- Thin client Java API - async API
- Thin client continuous query
- ACA major refactor & On-demand workflow perf profiling
- ACA perf profiling for aca/ovs interaction improvement
- perf profiling for new channel mechanism of ovsdb connection/records
- ACA memory footprint investigation and optimization
- Coding style alignment across ACA codes
- On-demand test automation script (stretch goal)
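
As a sanity check on the 1 million OVS rules/flows goal listed under ACA + NCM, the per-second rates implied by the quoted durations (~60 sec baseline, 20 sec target, 13 sec achieved) can be computed directly. This is only an illustrative sketch: the helper name `rate_per_second` is hypothetical and not part of Alcor, and it assumes the rate is simply the rule count divided by the elapsed time.

```python
# Illustrative arithmetic only; `rate_per_second` is a hypothetical helper,
# not an Alcor API. Durations are the ones quoted in the release goals.

def rate_per_second(items: int, seconds: float) -> float:
    """Return the average number of items processed per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return items / seconds


ONE_MILLION_RULES = 1_000_000

baseline_rate = rate_per_second(ONE_MILLION_RULES, 60)  # ~60 sec before the work
target_rate = rate_per_second(ONE_MILLION_RULES, 20)    # 20 sec release target
achieved_rate = rate_per_second(ONE_MILLION_RULES, 13)  # 13 sec reported status

for label, rate in [
    ("baseline", baseline_rate),
    ("target", target_rate),
    ("achieved", achieved_rate),
]:
    print(f"{label}: {rate:,.0f} rules/second")
```

The 20-second target works out to the 50K per second quoted in the goal, and the reported 13 seconds corresponds to roughly 77K per second.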