Upgrade Storage Backends to V2 Storage API #6458

Open · 5 tasks
mahadzaryab1 opened this issue Jan 1, 2025 · 27 comments

@mahadzaryab1 (Collaborator) commented Jan 1, 2025

This is a project proposed as part of the LFX Mentorship term; see #6470 (read this first).

Background

Jaeger is an open-source, distributed tracing platform designed to monitor and troubleshoot microservices-based systems. A critical component of Jaeger is its storage backends, where traces captured by Jaeger are persisted for querying.

Currently, Jaeger uses a v1 Storage API, which operates on a data model specific to Jaeger. Each storage backend implements this API, requiring transformations between Jaeger's proprietary model and the OpenTelemetry Protocol (OTLP) data model, which is now the industry standard.

As part of #5079, Jaeger has introduced the more efficient v2 Storage API, which natively supports the OpenTelemetry data model (OTLP) and allows batching of writes and streaming of results. This effort is part of a broader alignment with the OpenTelemetry Collector framework, tracked under #4843.
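For orientation, here is a minimal sketch of the two v2 interfaces this paragraph describes. It is abridged and partly assumed: the authoritative definitions live under internal/storage/v2/api/tracestore in the repo, the real method sets are larger, and GetTraceParams is simplified here.

```go
package sketch

import (
	"context"
	"iter"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// GetTraceParams is simplified here; the real struct carries more fields.
type GetTraceParams struct {
	TraceID pcommon.TraceID
}

// Writer accepts whole OTLP batches, which enables bulk writes.
type Writer interface {
	WriteTraces(ctx context.Context, td ptrace.Traces) error
}

// Reader streams results back via an iterator instead of one big slice.
type Reader interface {
	GetTraces(ctx context.Context, traceIDs ...GetTraceParams) iter.Seq2[[]ptrace.Traces, error]
}
```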

Objective

Upgrade Jaeger storage backends to natively implement the v2 Storage API.

  • Memory
  • Elasticsearch / OpenSearch
  • Badger
  • Cassandra
  • gRPC / Remote

The chosen storage backend should be upgraded to fully implement the v2 Storage API in place. For a rough idea of how to upgrade from the v1 model to the OTLP data model, take a look at the PRs in the following issues that do a similar upgrade for other components of Jaeger:

Desired Outcomes

Upgrade Memory and Elasticsearch backends

We prioritize these two backends as they are the most frequently used with Jaeger, and upgrading them paves the way for upgrading the others.

Testing

  • The storage implementations should be fully covered by unit tests
  • There are already integration tests for the storage backends, so all of them should pass without modification

Bonus: Upgrade Other Backends

If time permits, upgrade Badger and Cassandra storage backends.

Risks / Open Questions

  • The v2 storage API does not distinguish between primary and archive storage, but v1 does. The ultimate plan is to remove the archive distinction from the v1 implementation as well; that effort is tracked in Phase out the distinction between primary and archive storage #6065. We should think about how to handle upgrades for backends that implement archive storage in v1 while that removal is in progress. One option is to simply ignore the archive part of those storages until the aforementioned issue is resolved, if that is possible.
mahadzaryab1 added a commit that referenced this issue Jan 5, 2025
…o v1 (#6485)

## Which problem is this PR solving?
- Resolves #6480

## Description of the changes
- This PR implements a reverse adapter (`SpanReader`) that wraps a
native v2 storage interface (`tracestore.Reader`) and downgrades it to
implement the v1 storage interface (`spanstore.Reader`).
- The reverse adapter was integrated with the v1 query service. This
code path will only get executed once we start upgrading the existing
storage implementations to implement the new `tracestore.Reader`
interface as a part of
#6458

## How was this change tested?
- CI 
- Added new unit tests

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [x] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `npm run lint` and `npm run test`

---------

Signed-off-by: Mahad Zaryab <[email protected]>
@Devaansh-Kumar

Hello @mahadzaryab1. I am Devaansh Kumar and I am interested in applying for this project under LFX'25. I have some prior open-source experience, having participated in GSoC'24 under Kubernetes.

I have a few questions:

  1. What was the reason behind removing the distinction between primary and archive storage in the v2 API? It seems intuitive to me to have both in v2.
  2. For contributors new to Jaeger’s codebase, are there specific areas or smaller tasks you’d recommend starting with to get familiarized?

Thank you

@yurishkuro (Member)

@Devaansh-Kumar

  1. Phase out the distinction between primary and archive storage #6065
  2. https://www.jaegertracing.io/get-involved/

@Devaansh-Kumar

@yurishkuro I had one more question. Do you expect to move the above storages completely to v2, or to have both v1 and v2 present side by side and give the user a choice of which API version to use?

@yurishkuro (Member)

Move completely. We're about to upgrade the write pipeline to use the v2 storage API anyway, so we won't need the v1 API in the future.

@yp969803

@yurishkuro I would love to work on this project in LFX 2025!

@zhengkezhou1 (Contributor) commented Feb 8, 2025

@yurishkuro @mahadzaryab1 I have gone through the READ path, where we process two types of requests: HTTP requests and gRPC requests. We have set up two paths to handle read requests.

Let's focus on HTTP: the UI sends a request to the HTTP server, and the http_handler routes it to a method, which then calls the queryservice to retrieve the required data. Currently, we are still using v1 methods and models. For example, in http_handler.search, we use model.Trace and qs.spanReader.FindTraces instead of ptrace.Traces and qs.traceReader.FindTraces. The retrieved model is then serialized into a JSON object.

So in the full request flow today, trace data undergoes two conversions:
Database model → Jaeger v1 model → JSON object

Questions:

  1. Is it necessary to refactor http_handler and grpc_handler to v2 before working on this issue?
  2. Based on Jaeger v2 being built on the OTEL Collector, the read path allows us to convert the OTEL model directly into a JSON object. Can we achieve the same for our backend storages? I found that the following storages might support this approach: Elasticsearch, OpenSearch, Kafka, ClickHouse, Cassandra.

@Manik2708 (Contributor)

> Is it necessary to refactor http_handler and grpc_handler to v2 before working on this issue?

This has already been changed to follow the v2 query service. See these files: https://github.com/jaegertracing/jaeger/blob/main/cmd/query/app/apiv3/http_gateway.go and https://github.com/jaegertracing/jaeger/blob/main/cmd/query/app/apiv3/grpc_handler.go

@Manik2708 (Contributor)

@mahadzaryab1 @yurishkuro I have a doubt regarding this issue. Let's take the example of upgrading the Writer from v1 to v2. The v1 Writer writes only one span, whereas the v2 Writer takes whole traces. So while implementing the writer for Elasticsearch, what is expected? Do we have to use Elasticsearch's bulk APIs or something else? Please see https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html, because we are not storing traces but spans. A similar question for memory storage: in memory, do we only need to iterate the traces and store the spans, or something like batch writes?

@zhengkezhou1 (Contributor)

> @mahadzaryab1 @yurishkuro I have a doubt regarding this issue. Let's take the example of upgrading the Writer from v1 to v2. […]

Elasticsearch already implemented a method for writing traces into storage.

@Manik2708 (Contributor)

@zzzk1 Thanks for this! So are you saying to take inspiration from this and implement our own method? Because I don't think we can use this directly in Jaeger. And for memory we have to discuss the implementation, because that one is Jaeger-specific!

@zhengkezhou1 (Contributor) commented Feb 9, 2025

> @zzzk1 Thanks for this! So are you saying to take inspiration from this and implement our own method? […]

Yes. The pushTraceData method demonstrates how to send traces to Elasticsearch (ES). In Jaeger v2, built on the OpenTelemetry (OTEL) Collector, data is first collected from the pipeline and then written to storage efficiently (without model conversion) using the v2 Writer interface via the WriteTraces(ctx context.Context, td ptrace.Traces) method.
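To make the memory-storage half of the question concrete, here is a minimal sketch of what a v2-native WriteTraces could look like for an in-memory store. All names here (Store, NewStore, spansByTrace) are hypothetical illustrations, not Jaeger's actual implementation:

```go
package memory

import (
	"context"
	"sync"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// Store is a hypothetical v2-native in-memory store keyed by OTLP trace ID.
type Store struct {
	sync.Mutex
	spansByTrace map[pcommon.TraceID][]ptrace.Span
}

func NewStore() *Store {
	return &Store{spansByTrace: map[pcommon.TraceID][]ptrace.Span{}}
}

// WriteTraces walks the OTLP batch and indexes each span under its trace ID.
func (s *Store) WriteTraces(_ context.Context, td ptrace.Traces) error {
	s.Lock()
	defer s.Unlock()
	resources := td.ResourceSpans()
	for i := 0; i < resources.Len(); i++ {
		scopes := resources.At(i).ScopeSpans()
		for j := 0; j < scopes.Len(); j++ {
			spans := scopes.At(j).Spans()
			for k := 0; k < spans.Len(); k++ {
				span := spans.At(k)
				// Production code would deep-copy the span; pdata values
				// share the batch's backing memory.
				s.spansByTrace[span.TraceID()] = append(s.spansByTrace[span.TraceID()], span)
			}
		}
	}
	return nil
}
```

Whether to key by trace ID like this or to store whole ptrace.Traces batches is exactly the kind of choice the v2 API leaves to each backend.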

@yurishkuro (Member) commented Feb 9, 2025

> Is it necessary to refactor http_handler and grpc_handler to v2 before working on this issue?

@zzzk1 No, it's not. Yes, both of these handlers operate on querysvc/v1, but internally that service is already instantiated with v2 storage and just downgrades it to v1 if necessary:

```go
func NewQueryService(traceReader tracestore.Reader, dependencyReader depstore.Reader, options QueryServiceOptions) *QueryService {
	spanReader, ok := v1adapter.GetV1Reader(traceReader)
	if !ok {
		// if the spanstore.Reader is not available, downgrade the native
		// tracestore.Reader to a spanstore.Reader
		spanReader = v1adapter.NewSpanReader(traceReader)
	}
```

@Manik2708 (Contributor)

@yurishkuro I understand that we want to get rid of the conversion from the OTEL model, but do we need to change the ID that is stored in the database? If yes, how can we ensure backward compatibility? Or should we move forward by converting the OTEL span ID to a Jaeger-specific ID?

@yurishkuro (Member)

The IDs are compatible, and besides, Jaeger treats them as completely opaque (same as OTEL).
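To illustrate why the IDs line up: a v1 model.TraceID is a pair of uint64 halves, and an OTLP trace ID is the same value as 16 opaque bytes. A sketch of the mapping, assuming big-endian packing with the high half first (the helper name here is hypothetical; Jaeger's v1adapter package contains the real equivalents):

```go
package main

import (
	"encoding/binary"
	"fmt"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

// v1TraceID mirrors Jaeger's model.TraceID: two uint64 halves.
type v1TraceID struct {
	High, Low uint64
}

// toOTLP packs the two halves into the 16-byte OTLP trace ID, high half first.
func toOTLP(id v1TraceID) pcommon.TraceID {
	var b [16]byte
	binary.BigEndian.PutUint64(b[:8], id.High)
	binary.BigEndian.PutUint64(b[8:], id.Low)
	return pcommon.TraceID(b)
}

func main() {
	fmt.Println(toOTLP(v1TraceID{High: 1, Low: 2}))
}
```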

@adityachopra29 (Contributor) commented Feb 15, 2025

@yurishkuro Is there any specific reason why the Tenant struct's services set is implemented as a map whose values are empty structs, instead of a simpler array? Similarly, the operations map's values are themselves maps whose values are empty structs.

// Tenant is an in-memory store of traces for a single tenant

@yurishkuro (Member)

map[string]struct{} is a common way to model a set.
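A quick illustration of the pattern and why it beats a slice for this use:

```go
package main

import "fmt"

func main() {
	// map[string]struct{} gives O(1) membership tests and automatic
	// de-duplication; the empty struct{} value occupies zero bytes.
	services := map[string]struct{}{}
	services["frontend"] = struct{}{}
	services["frontend"] = struct{}{} // idempotent: still a single entry

	_, seen := services["frontend"]
	fmt.Println(len(services), seen) // prints: 1 true
}
```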

@Manik2708 (Contributor)

@yurishkuro The changes in memory storage are not expected to be backward compatible, right? I can only imagine restarting the storage after the update, which will eventually lead to loss of old data (because it's in-memory). Am I right? Or is there any backup mechanism?

@yurishkuro (Member)

@Manik2708 Being backwards compatible has nothing to do with persistence / backup mechanisms. Memory storage is transient; it's a cache. Upgrading it to v2 is just changing its API, not its features.

@metabiswadeep

@yurishkuro I am interested in this opportunity, both this project and the UI one. Can I submit an application for both?

@Manik2708 (Contributor)

> @Manik2708 Being backwards compatible has nothing to do with persistence / backup mechanisms. Memory storage is transient; it's a cache. Upgrading it to v2 is just changing its API, not its features.

Actually, the Tenant structure looks like this:

```go
type Tenant struct {
	sync.RWMutex
	ids        []*model.TraceID
	traces     map[model.TraceID]*model.Trace
	services   map[string]struct{}
	operations map[string]map[spanstore.Operation]struct{}
	deduper    adjuster.Adjuster
	config     Configuration
	index      int
}
```

For changing the API we need to change the Jaeger-specific models to OTEL-specific models. Or am I understanding this wrong? Do we just need to change the APIs and then convert the OTEL models to Jaeger models, keeping the store intact and the same?

@yurishkuro (Member)

The Tenant struct is not part of the API. The API is https://github.com/jaegertracing/jaeger/blob/main/internal/storage/v1/api/spanstore/interface.go, which is to be changed to https://github.com/jaegertracing/jaeger/tree/main/internal/storage/v2/api/tracestore
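For contrast, an abridged sketch of the two Writer shapes involved in that change; a stub type stands in for the Jaeger model here, and the linked files remain the authoritative definitions:

```go
package sketch

import (
	"context"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

// Span stands in for Jaeger's v1 model.Span in this sketch.
type Span struct{}

// V1Writer mirrors spanstore.Writer: one Jaeger-model span per call.
type V1Writer interface {
	WriteSpan(ctx context.Context, span *Span) error
}

// V2Writer mirrors tracestore.Writer: one whole OTLP batch per call,
// with no Jaeger-model conversion in between.
type V2Writer interface {
	WriteTraces(ctx context.Context, td ptrace.Traces) error
}
```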

@ary82 (Contributor) commented Feb 16, 2025

> The Tenant struct is not part of the API. The API is https://github.com/jaegertracing/jaeger/blob/main/internal/storage/v1/api/spanstore/interface.go, which is to be changed to https://github.com/jaegertracing/jaeger/tree/main/internal/storage/v2/api/tracestore

@yurishkuro But for that, wouldn't we also have to change how the storage backend stores traces? For instance, the memory backend stores traces in the Tenant.traces field, which is currently map[model.TraceID]*model.Trace; for OTLP we could change this to something like map[pcommon.TraceID][]*ptrace.Span.

If this is the wrong path, do we instead want to change how models such as model.Trace are generated, and have the OTLP []ptrace.Span (and also other OTLP data such as resource attributes and scope tags) be contained by model.Trace?

@yurishkuro (Member)

Yes, we want to change the internal implementation. It's up to the storage backend what format it stores the data in, as long as the external API is OTLP.
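As one hypothetical illustration of such an internal change for the memory backend (none of these names are the actual implementation), the Tenant layout could become OTLP-keyed while keeping its set-style indexes:

```go
package memory

import (
	"sync"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// Tenant is a hypothetical v2-native layout; field names are illustrative.
// The external API speaks OTLP, while the set-style indexes stay as before.
type Tenant struct {
	sync.RWMutex
	ids        []pcommon.TraceID
	traces     map[pcommon.TraceID][]ptrace.Span
	services   map[string]struct{}
	operations map[string]map[string]struct{} // service name -> operation names
}
```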

@Manik2708 (Contributor)

@yurishkuro I have a doubt about the design of the v2 APIs. Currently we return traces in this form:

```go
GetTraces(ctx context.Context, traceIDs ...GetTraceParams) iter.Seq2[[]ptrace.Traces, error]
```

but shouldn't it be:

```go
// Returning an iterator of traces rather than an iterator of slices of traces
GetTraces(ctx context.Context, traceIDs ...GetTraceParams) iter.Seq2[ptrace.Traces, error]
```

Why do I think so? I have been reading about iterators in Go for a while, and the reason iterators are considered better than slices is that we don't have to wait for the whole slice: we can get and process each trace as it arrives and do whatever we want with it (for example, stream it directly to the client). If the iterator yields a whole slice as one chunk, what is the point of using an iterator? As far as I can tell there will only ever be a single chunk, and to yield it we still have to wait for the whole slice to be assembled (that is, for all the traces). So I think using an iterator of slices defeats the purpose of an iterator (I may be wrong, please correct me if so). Please see this:

```go
// Current approach
for traces, err := range GetTraces(ctx, params) {
	// Do something with the slice
}

// My proposed approach
for trace, err := range GetTraces(ctx, params) {
	// Here a single trace is a chunk. Either we can stream this trace
	// directly to the client or process it further.
}
```

@ary82 (Contributor) commented Feb 17, 2025

> but shouldn't it be:
>
> GetTraces(ctx context.Context, traceIDs ...GetTraceParams) iter.Seq2[ptrace.Traces, error]

@Manik2708 I may be wrong here, but I think it's because a single trace might span multiple OTLP ptrace.Traces objects. A ptrace.Traces contains ptrace.Spans grouped under a particular Resource/Service, but a trace can cross multiple services, so multiple ptrace.Traces may be required for one complete trace.
Let's wait for confirmation though.

@Manik2708 (Contributor)

Maybe you are right! That's why I wrote "processing the traces". GetTraces takes multiple trace IDs (earlier it was a single trace ID). This confused me, because a trace ID is unique, so I wondered why we would need multiple traces for a single trace ID. But now I suspect that is possible if the distinguishing factor is not the ID but only the resource/service.

@yurishkuro (Member)

The method returns a slice in each iteration because the API is optimized for storage communication, and the storage may find it more efficient to return a single payload with multiple traces, e.g. one iteration with 10 traces.
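A short consumer-side sketch of that contract, mirroring the interface shape quoted earlier in the thread; consumeAll and process are hypothetical placeholders:

```go
package sketch

import (
	"context"
	"iter"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

// Reader and GetTraceParams mirror the shapes quoted earlier in this thread.
type GetTraceParams struct{}

type Reader interface {
	GetTraces(ctx context.Context, traceIDs ...GetTraceParams) iter.Seq2[[]ptrace.Traces, error]
}

// consumeAll drains the iterator; each yielded slice may carry several
// traces, e.g. one storage round-trip that returns 10 traces at once.
func consumeAll(ctx context.Context, reader Reader, params ...GetTraceParams) error {
	for batch, err := range reader.GetTraces(ctx, params...) {
		if err != nil {
			return err
		}
		for _, td := range batch {
			process(td) // placeholder: stream to the client, aggregate, etc.
		}
	}
	return nil
}

func process(td ptrace.Traces) {
	_ = td // placeholder for real handling
}
```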

github-merge-queue bot pushed a commit that referenced this issue Mar 6, 2025
## Which problem is this PR solving?
Fixes a part of #6458 

## Description of the changes
- The JsonSpanWriter accepts the JSON span instead of the Jaeger span, which
makes it reusable in v2 storage APIs

## How was this change tested?
- Unit Tests

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [x] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `npm run lint` and `npm run test`

---------

Signed-off-by: Manik2708 <[email protected]>
github-merge-queue bot pushed a commit that referenced this issue Mar 9, 2025
…make them reusable for v2 APIs (#6828)

## Which problem is this PR solving?
Fixes a part of: #6458 

## Description of the changes
- Refactored SpanReader to make it reusable for v2 APIs

## How was this change tested?
- Unit Tests

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [x] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `npm run lint` and `npm run test`

---------

Signed-off-by: Manik2708 <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Co-authored-by: Yuri Shkuro <[email protected]>
yurishkuro pushed a commit that referenced this issue Mar 11, 2025
… for v2 APIs (#6831)

## Which problem is this PR solving?
Fixes a part of: #6458

## Description of the changes
- Refactoring of FindTraceIDs

## How was this change tested?
- Unit Tests

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [x] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `npm run lint` and `npm run test`

---------

Signed-off-by: Manik2708 <[email protected]>
github-merge-queue bot pushed a commit that referenced this issue Mar 14, 2025
## Which problem is this PR solving?
Fixes a part of #6458 

## Description of the changes
- As discussed in the comment on #6845 (comment), the legacy trace ID
handling is moved behind a feature gate

## How was this change tested?
- Unit and Integration tests

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [x] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `npm run lint` and `npm run test`

---------

Signed-off-by: Manik2708 <[email protected]>