New GET /v2/status API endpoint requested #618

jcheroske · 2018-11-20T22:00:20Z

The courtesy node grant team is requesting a new API endpoint. This endpoint will help us support several use cases that have arisen during our development:

Configure our load balancer health checks with a GET URL that returns a proper 200 when everything is working correctly.

Atm, we are just making a GET against /v2 and checking to see that we got a JSON payload back with the following structure:

{"jsonrpc":"2.0","id":null,"error":{"code":-32600,"message":"Invalid Request"}}

This call returns a 400, which is really not a correct return code when everything is working correctly.

Our monitoring bot needs a full-featured status report from the node that allows us to glean the following information:
- What is the state of the node: WAITING, IGNORE, SYNCED, etc?
- Is the node following minutes?
- What is the node's minute?
In addition, other applications would probably find the following information useful:
- Identity
- Node type: LEADER, AUDIT, FOLLOWER
- Current block <-- Anton says we already have that, which is true, but it might be nice to have it here as well.

I would like to discuss the structure of a JSON return payload that could support these use cases.

The text was updated successfully, but these errors were encountered:

carryforward · 2018-11-28T03:48:49Z

This is a good question, but it might open a can of worms. Some things are clearly invalid, and some are more subjective. Things might look valid to a node but become invalid. The other way is possible too. For example, if a commit is submitted to the node, but it doesn't have enough balance, it might be considered an error, because it is not doing what the API user expected, but ECs might be purchased a few seconds later which make the commit valid. At that point it would be sent out (after having sit in the queue for a few seconds) an would make it to the blockchain. Determining errors at the API level can be tricky. Some things, you are right, might be better to pair the json error with an HTTP error. I don't have a strong opinion on this. I would love to hear other's opinions on this.

2 & 3. This is going to be part of the diagnostic API. This is one of the deliverables in the protocol grant, and is being worked on. I don't know the release timing of that though.

The json structure we figured would be added to over time as more things were added to the API return object. JSON allows for more things to be added without breaking reverse compatibility.

Here is some partial work on the diagnostic API. I would love some feedback on the json structure.
https://github.com/FactomProject/factomd/compare/FD-705_DiagnosticAPI

jcheroske · 2018-11-28T04:57:34Z

I don't think the status should involve itself in issues such as EC balance. The node is healthy but a user is using it incorrectly.

Even if, at a minimum, there was just a way to hit the API with nothing more than a GET and all we got back was a 200 and a little JSON object that said "Hello World!" that would be an improvement over the current situation of getting a 400 back. We can wait for the diagnostic API to do the more heavy lifting, but the way it's working right now it's not really supporting generic health checks from services like CloudFlare in an elegant way. We've got a hack in place to make it work, but it's not ideal.

carryforward · 2018-11-28T05:16:36Z

ah, I see where the problem lies now. I had never considered cloudflare being a frontend for the factomd API. This is completely new thinking for me.

I need to disappear for a bit, but I think I see where you are going with this. I don't know how easy it would be to have what you are asking for. it might be easy if there is something specific in one of the calls when something goes wrong. What are you trying to get a 200 code to represent?

If you want to know if everything is working correctly, then that is something the node cannot know, ont for sure at least. It would involve something like watching the End Of Minute messages and seeing them come in at a regular interval. "operating properly" sounds like a fairly big ask. Getting some elements of that in the diagnostic API will certainly be a thing, but this sounds like a hard problem. I suppose it can know if some things are incorrect, like if it is booting up.

There seems like a spectrum of (node definately not ready) through (everything seems to be chugging along fine as of now). Where to throw an error (since the Federated servers might be the ones having the problem, and your node is fine) is a hard problem.

jcheroske · 2018-11-28T05:47:05Z

Yeah, determining whether the node is healthy is complex, and requires information from outside the node. That's not really what we're looking for here. We just want a superficial, "I'm ok!" endpoint for now.

We have two different API clients that need information from the node:

Cloudflare: Just a superficial OK is all it can handle.
Node Pool Monitoring Bot: It wants detailed info from each node, so it can make a determination as to whether or not a given node belongs in the pool.

So two different clients with different needs. One just needs a simple GET endpoint, the other will use the diagnostic API when it comes along.

jcheroske · 2019-01-01T09:57:08Z

I'm curious if there has been any forward progress with this? Not having a GET health check URI on the API port is increasingly problematic for me. It's not just CloudFlare being difficult, but now Kubernetes also needs an endpoint that can be hit easily. Here is my new proposal:

Change the response code on the API port's / URI to be 200

Right now, it returns a 404. That's it. If you could do that, I would have everything I need.

stackdump · 2019-05-06T19:49:16Z

This is still on the radar - there is a version that returns status here

factomd/wsapi/wsapi.go

Lines 850 to 868 in aa0da08

    
           jsonResp.Result = statusResponse{ 
        
           	s.GetFactomdVersion(), 
        
           	s.GetFactomNodeName(), 
        
           	s.GetBootTime(), 
        
           	s.GetCurrentTime(), 
        
           	s.GetCurrentBlockStartTime(), 
        
           	s.GetCurrentMinuteStartTime(), 
        
           	s.GetCurrentMinute(), 
        
           	s.GetLeaderHeight(), 
        
           	s.GetHighestSavedBlk(), 
        
           	s.GetHighestKnownBlock(), 
        
           	s.GetEntryBlockDBHeightComplete(), 
        
           	s.GetEntryBlockDBHeightProcessing(), 
        
           	s.IsSyncing(), 
        
           	s.IsSyncingEOMs(), 
        
           	s.IsSyncingDBSigs(), 
        
           	s.Running(), 
        
           	s.GetIgnoreDone(), 
        
           	role,

NOTE: it was mentioned that it may be risky to poll the state repeatedly - it may impact performance

In the control panel we keep a cache of current state data

factomd/state/stateDisplay.go

Lines 25 to 58 in aa0da08

    
           type DisplayState struct { 
        
           	NodeName string 
        
           	ControlPanelPort    int 
        
           	ControlPanelSetting int 
        
           	// DB Info 
        
           	CurrentNodeHeight   uint32 
        
           	CurrentLeaderHeight uint32 
        
           	CurrentEBDBHeight   uint32 
        
           	LeaderHeight        uint32 
        
           	LastDirectoryBlock  interfaces.IDirectoryBlock 
        
           	// Identity Info 
        
           	IdentityChainID interfaces.IHash 
        
           	Identities      []*Identity 
        
           	Authorities     []*Authority 
        
           	PublicKey       *primitives.PublicKey 
        
           	// Process List 
        
           	PLFactoid []FactoidTransaction 
        
           	PLEntry   []EntryTransaction 
        
           	// DataDump 
        
           	RawSummary          string 
        
           	PrintMap            string 
        
           	ProcessList0        string 
        
           	ProcessList         string 
        
           	ProcessList2        string 
        
           	Election            string 
        
           	SimElection         string 
        
           	SyncingState        [256]string 
        
           	SyncingStateCurrent int 
        
           }

Currently it's not an easy refactor to expose this to the RPC port but this source of data is safer to poll repeatedly than the global state object

Also note: as mentioned above it is is not straightforward to decide if a node is ready to use - however as part of FD-869 see #720 we extended the control panel to report Ignore mode alongside the other heights - I believe https://github.com/BedrockSolutions/factomd-api-proxy will be able to make use of this in the short term.

In the long term as part of a major refactor we'll try to get the proper status reported by the RCP endpoint

stackdump · 2019-06-25T12:31:41Z

@Emyrk may have some fresh thoughts about this.
In discussing the items for refactor - on of the takeaways was wanting better isolation of state variables
from the RPC API & Control Panel - we also still have support for hashicorp go plugins which will require a similar interface

Emyrk · 2019-06-25T16:30:22Z

@jcheroske as far as a 200 status, would this suffice? https://docs.factom.com/api#properties

As for accessing the ready state of the node, we do have plans in the refactor to expose more state related information, but determining a node is "ready" from a network point of view is difficult. I know some suggestions were made, and there is a new current-minute call. You could use this to see when a node begins to sync minutes, indicating it is "alive" in some sense. I would need to look over the implementation, however there is information to determine if the minute is on a normal cadence or slow:
https://docs.factom.com/api#current-minute

stackdump · 2019-10-18T15:00:30Z

going into a rewrite - this is still on our radar

jcheroske mentioned this issue Nov 28, 2018

API port responds to undocumented http GET verb #622

Closed

stackdump added this to the WAX milestone Feb 24, 2020

PaulBernier removed this from the WAX milestone Aug 14, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New GET /v2/status API endpoint requested #618

New GET /v2/status API endpoint requested #618

jcheroske commented Nov 20, 2018 •

edited

Loading

carryforward commented Nov 28, 2018

jcheroske commented Nov 28, 2018

carryforward commented Nov 28, 2018

jcheroske commented Nov 28, 2018

jcheroske commented Jan 1, 2019

stackdump commented May 6, 2019 •

edited

Loading

stackdump commented Jun 25, 2019

Emyrk commented Jun 25, 2019

stackdump commented Oct 18, 2019

New GET /v2/status API endpoint requested #618

New GET /v2/status API endpoint requested #618

Comments

jcheroske commented Nov 20, 2018 • edited Loading

carryforward commented Nov 28, 2018

jcheroske commented Nov 28, 2018

carryforward commented Nov 28, 2018

jcheroske commented Nov 28, 2018

jcheroske commented Jan 1, 2019

stackdump commented May 6, 2019 • edited Loading

stackdump commented Jun 25, 2019

Emyrk commented Jun 25, 2019

stackdump commented Oct 18, 2019

jcheroske commented Nov 20, 2018 •

edited

Loading

stackdump commented May 6, 2019 •

edited

Loading