public abstract class ResourceManager<WorkerType extends ResourceIDRetrievable> extends org.apache.flink.runtime.rpc.FencedRpcEndpoint<ResourceManagerId> implements DelegationTokenManager.Listener, ResourceManagerGateway
The ResourceManager offers the following methods as part of its RPC interface for remote interaction:
registerJobMaster(JobMasterId, ResourceID, String, JobID, Time) registers a JobMaster at the resource manager.
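The JobMasterId passed to registerJobMaster is described further down as the fencing token for the JobMaster leader. As a rough illustration of what a fencing token is for, the sketch below rejects a registration whose token does not match the expected leader. `Registrar`, its `register` method, and the use of `UUID` as the token are hypothetical stand-ins, not Flink classes, and this page does not say how the ResourceManager performs its own check.

```java
import java.util.Objects;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in types; none of these are Flink classes.
class FencingTokenSketch {

    /** Accepts a registration only if the caller presents the expected leader token. */
    static final class Registrar {
        private final UUID expectedLeaderToken;

        Registrar(UUID expectedLeaderToken) {
            this.expectedLeaderToken = Objects.requireNonNull(expectedLeaderToken);
        }

        CompletableFuture<String> register(UUID presentedToken, String address) {
            if (!expectedLeaderToken.equals(presentedToken)) {
                // A stale or foreign leader is rejected by failing the returned future.
                CompletableFuture<String> rejected = new CompletableFuture<>();
                rejected.completeExceptionally(
                        new IllegalStateException("Fencing token mismatch for " + address));
                return rejected;
            }
            return CompletableFuture.completedFuture("registered " + address);
        }
    }

    public static void main(String[] args) {
        UUID leader = UUID.randomUUID();
        Registrar registrar = new Registrar(leader);

        // The current leader registers successfully.
        System.out.println(registrar.register(leader, "jm-1").join());

        // A caller holding a different token gets an exceptionally completed future.
        boolean rejected =
                registrar.register(UUID.randomUUID(), "jm-2").isCompletedExceptionally();
        System.out.println("stale caller rejected: " + rejected);
    }
}
```

Failing the returned future, rather than throwing, keeps the rejection on the same asynchronous path as a successful response, which matches the CompletableFuture return types used throughout this interface.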
| Modifier and Type | Field and Description |
|---|---|
| protected BlocklistHandler | blocklistHandler |
| protected Executor | ioExecutor |
| static String | RESOURCE_MANAGER_NAME |
| protected ResourceManagerMetricGroup | resourceManagerMetricGroup |
| Constructor and Description |
|---|
| ResourceManager(org.apache.flink.runtime.rpc.RpcService rpcService, UUID leaderSessionId, ResourceID resourceId, HeartbeatServices heartbeatServices, DelegationTokenManager delegationTokenManager, SlotManager slotManager, ResourceManagerPartitionTrackerFactory clusterPartitionTrackerFactory, BlocklistHandler.Factory blocklistHandlerFactory, JobLeaderIdService jobLeaderIdService, ClusterInformation clusterInformation, org.apache.flink.runtime.rpc.FatalErrorHandler fatalErrorHandler, ResourceManagerMetricGroup resourceManagerMetricGroup, org.apache.flink.api.common.time.Time rpcTimeout, Executor ioExecutor) |
| Modifier and Type | Method and Description |
|---|---|
| protected void | closeJobManagerConnection(org.apache.flink.api.common.JobID jobId, org.apache.flink.runtime.resourcemanager.ResourceManager.ResourceRequirementHandling resourceRequirementHandling, Exception cause) This method should be called by the framework once it detects that a currently registered job manager has failed. |
| protected Optional<WorkerType> | closeTaskManagerConnection(ResourceID resourceID, Exception cause) This method should be called by the framework once it detects that a currently registered task executor has failed. |
| CompletableFuture<Acknowledge> | declareRequiredResources(JobMasterId jobMasterId, ResourceRequirements resourceRequirements, org.apache.flink.api.common.time.Time timeout) Declares the absolute resource requirements for a job. |
| CompletableFuture<Acknowledge> | deregisterApplication(ApplicationStatus finalStatus, String diagnostics) Cleans up the application and shuts down the cluster. |
| void | disconnectJobManager(org.apache.flink.api.common.JobID jobId, org.apache.flink.api.common.JobStatus jobStatus, Exception cause) Disconnects a JobManager specified by the given job ID from the ResourceManager. |
| void | disconnectTaskManager(ResourceID resourceId, Exception cause) Disconnects a TaskManager specified by the given resourceID from the ResourceManager. |
| CompletableFuture<List<ShuffleDescriptor>> | getClusterPartitionsShuffleDescriptors(IntermediateDataSetID intermediateDataSetID) Get the shuffle descriptors of the cluster partitions ordered by partition number. |
| Optional<InstanceID> | getInstanceIdByResourceId(ResourceID resourceID) |
| CompletableFuture<Integer> | getNumberOfRegisteredTaskManagers() Gets the currently registered number of TaskManagers. |
| protected abstract CompletableFuture<Void> | getReadyToServeFuture() Get the ready to serve future of the resource manager. |
| protected abstract ResourceAllocator | getResourceAllocator() |
| CompletableFuture<Void> | getStartedFuture() Completion of this future indicates that the resource manager is fully started and is ready to serve. |
| protected WorkerType | getWorkerByInstanceId(InstanceID instanceId) |
| protected abstract Optional<WorkerType> | getWorkerNodeIfAcceptRegistration(ResourceID resourceID) Get worker node if the worker resource is accepted. |
| CompletableFuture<Void> | heartbeatFromJobManager(ResourceID resourceID) Sends the heartbeat to resource manager from job manager. |
| CompletableFuture<Void> | heartbeatFromTaskManager(ResourceID resourceID, TaskExecutorHeartbeatPayload heartbeatPayload) Sends the heartbeat to resource manager from task manager. |
| protected abstract void | initialize() Initializes the framework specific components. |
| protected abstract void | internalDeregisterApplication(ApplicationStatus finalStatus, String optionalDiagnostics) The framework specific code to deregister the application. |
| protected void | jobLeaderLostLeadership(org.apache.flink.api.common.JobID jobId, JobMasterId oldJobMasterId) |
| CompletableFuture<Map<IntermediateDataSetID,DataSetMetaInfo>> | listDataSets() Returns all datasets for which partitions are being tracked. |
| CompletableFuture<Acknowledge> | notifyNewBlockedNodes(Collection<BlockedNode> newNodes) Notify new blocked node records. |
| void | notifySlotAvailable(InstanceID instanceID, SlotID slotId, AllocationID allocationId) Sent by the TaskExecutor to notify the ResourceManager that a slot has become available. |
| protected void | onFatalError(Throwable t) Notifies the ResourceManager that a fatal error has occurred and it cannot proceed. |
| void | onNewTokensObtained(byte[] tokens) Callback invoked when new delegation tokens are obtained. |
| void | onStart() |
| CompletableFuture<Void> | onStop() |
| protected void | onWorkerRegistered(WorkerType worker, WorkerResourceSpec workerResourceSpec) |
| CompletableFuture<RegistrationResponse> | registerJobMaster(JobMasterId jobMasterId, ResourceID jobManagerResourceId, String jobManagerAddress, org.apache.flink.api.common.JobID jobId, org.apache.flink.api.common.time.Time timeout) Register a JobMaster at the resource manager. |
| protected void | registerMetrics() |
| CompletableFuture<RegistrationResponse> | registerTaskExecutor(TaskExecutorRegistration taskExecutorRegistration, org.apache.flink.api.common.time.Time timeout) Register a TaskExecutor at the resource manager. |
| CompletableFuture<Void> | releaseClusterPartitions(IntermediateDataSetID dataSetId) Releases all partitions associated with the given dataset. |
| protected void | removeJob(org.apache.flink.api.common.JobID jobId, Exception cause) |
| CompletableFuture<Void> | reportClusterPartitions(ResourceID taskExecutorId, ClusterPartitionReport clusterPartitionReport) Report the cluster partitions status in the task executor. |
| CompletableFuture<ProfilingInfo> | requestProfiling(ResourceID taskManagerId, int duration, ProfilingInfo.ProfilingMode mode, java.time.Duration timeout) Requests the profiling instance from the given TaskExecutor. |
| CompletableFuture<ResourceOverview> | requestResourceOverview(org.apache.flink.api.common.time.Time timeout) Requests the resource overview. |
| CompletableFuture<TaskExecutorThreadInfoGateway> | requestTaskExecutorThreadInfoGateway(ResourceID taskManagerId, org.apache.flink.api.common.time.Time timeout) Requests the TaskExecutorGateway. |
| CompletableFuture<TaskManagerInfoWithSlots> | requestTaskManagerDetailsInfo(ResourceID resourceId, org.apache.flink.api.common.time.Time timeout) Requests detail information about the given TaskExecutor. |
| CompletableFuture<TransientBlobKey> | requestTaskManagerFileUploadByName(ResourceID taskManagerId, String fileName, org.apache.flink.api.common.time.Time timeout) Request the file upload from the given TaskExecutor to the cluster's BlobServer. |
| CompletableFuture<TransientBlobKey> | requestTaskManagerFileUploadByNameAndType(ResourceID taskManagerId, String fileName, FileType fileType, java.time.Duration timeout) Request the file upload from the given TaskExecutor to the cluster's BlobServer. |
| CompletableFuture<TransientBlobKey> | requestTaskManagerFileUploadByType(ResourceID taskManagerId, FileType fileType, org.apache.flink.api.common.time.Time timeout) Request the file upload from the given TaskExecutor to the cluster's BlobServer. |
| CompletableFuture<Collection<TaskManagerInfo>> | requestTaskManagerInfo(org.apache.flink.api.common.time.Time timeout) Requests information about the registered TaskExecutor. |
| CompletableFuture<Collection<LogInfo>> | requestTaskManagerLogList(ResourceID taskManagerId, org.apache.flink.api.common.time.Time timeout) Request log list from the given TaskExecutor. |
| CompletableFuture<Collection<org.apache.flink.api.java.tuple.Tuple2<ResourceID,String>>> | requestTaskManagerMetricQueryServiceAddresses(org.apache.flink.api.common.time.Time timeout) Requests the paths for the TaskManager's MetricQueryService to query. |
| CompletableFuture<Collection<ProfilingInfo>> | requestTaskManagerProfilingList(ResourceID taskManagerId, java.time.Duration timeout) Request profiling list from the given TaskExecutor. |
| CompletableFuture<ThreadDumpInfo> | requestThreadDump(ResourceID taskManagerId, org.apache.flink.api.common.time.Time timeout) Requests the thread dump from the given TaskExecutor. |
| CompletableFuture<Acknowledge> | sendSlotReport(ResourceID taskManagerResourceId, InstanceID taskManagerRegistrationId, SlotReport slotReport, org.apache.flink.api.common.time.Time timeout) Sends the given SlotReport to the ResourceManager. |
| protected void | setFailUnfulfillableRequest(boolean failUnfulfillableRequest) Sets whether the SlotManager fails unfulfillable slot requests. |
| void | stopWorkerIfSupported(WorkerType worker) Stops the given worker if supported. |
| protected abstract void | terminate() Terminates the framework specific components. |
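The summary above lists final lifecycle methods (onStart, onStop) next to abstract hooks (initialize, terminate) and a getStartedFuture accessor. That combination suggests a template-method design, which the sketch below illustrates. `BaseManager` and `RecordingManager` are hypothetical stand-ins. The assumptions that onStart calls initialize, that onStop calls terminate, and that the started future completes right after initialization are illustrative only and are not confirmed by this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in types; none of these are Flink classes.
class LifecycleSketch {

    /** Base class that owns the lifecycle and delegates to framework-specific hooks. */
    abstract static class BaseManager {
        private final CompletableFuture<Void> startedFuture = new CompletableFuture<>();

        // Final, so subclasses cannot bypass the lifecycle bookkeeping.
        public final void onStart() throws Exception {
            try {
                initialize();
                startedFuture.complete(null);
            } catch (Exception e) {
                startedFuture.completeExceptionally(e);
                throw e;
            }
        }

        public final CompletableFuture<Void> onStop() {
            try {
                terminate();
                return CompletableFuture.completedFuture(null);
            } catch (Exception e) {
                CompletableFuture<Void> failed = new CompletableFuture<>();
                failed.completeExceptionally(e);
                return failed;
            }
        }

        public CompletableFuture<Void> getStartedFuture() {
            return startedFuture;
        }

        protected abstract void initialize() throws Exception;

        protected abstract void terminate() throws Exception;
    }

    /** Subclass that only records which hooks were invoked. */
    static final class RecordingManager extends BaseManager {
        final List<String> events = new ArrayList<>();

        @Override
        protected void initialize() {
            events.add("initialize");
        }

        @Override
        protected void terminate() {
            events.add("terminate");
        }
    }

    public static void main(String[] args) throws Exception {
        RecordingManager manager = new RecordingManager();
        manager.onStart();
        manager.getStartedFuture().join();
        manager.onStop().join();
        System.out.println(manager.events); // prints [initialize, terminate]
    }
}
```

Keeping onStart and onStop final means a framework-specific subclass only supplies its own setup and teardown, while ordering and error propagation stay in one place.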
Inherited methods: callAsync, closeAsync, getAddress, getEndpointId, getHostname, getMainThreadExecutor, getRpcService, getSelfGateway, getTerminationFuture, internalCallOnStart, internalCallOnStop, isRunning, registerResource, runAsync, scheduleRunAsync, scheduleRunAsync, start, stop, unregisterResource, validateRunsInMainThread

public static final String RESOURCE_MANAGER_NAME
protected final ResourceManagerMetricGroup resourceManagerMetricGroup
protected final Executor ioExecutor
protected final BlocklistHandler blocklistHandler
public ResourceManager(org.apache.flink.runtime.rpc.RpcService rpcService,
UUID leaderSessionId,
ResourceID resourceId,
HeartbeatServices heartbeatServices,
DelegationTokenManager delegationTokenManager,
SlotManager slotManager,
ResourceManagerPartitionTrackerFactory clusterPartitionTrackerFactory,
BlocklistHandler.Factory blocklistHandlerFactory,
JobLeaderIdService jobLeaderIdService,
ClusterInformation clusterInformation,
org.apache.flink.runtime.rpc.FatalErrorHandler fatalErrorHandler,
ResourceManagerMetricGroup resourceManagerMetricGroup,
org.apache.flink.api.common.time.Time rpcTimeout,
Executor ioExecutor)
public final void onStart() throws Exception
Overrides: onStart in class org.apache.flink.runtime.rpc.RpcEndpoint
Throws: Exception

public CompletableFuture<Void> getStartedFuture()
Completion of this future indicates that the resource manager is fully started and is ready to serve.

public final CompletableFuture<Void> onStop()
Overrides: onStop in class org.apache.flink.runtime.rpc.RpcEndpoint

public CompletableFuture<RegistrationResponse> registerJobMaster(JobMasterId jobMasterId, ResourceID jobManagerResourceId, String jobManagerAddress, org.apache.flink.api.common.JobID jobId, org.apache.flink.api.common.time.Time timeout)
Description copied from interface: ResourceManagerGateway
Register a JobMaster at the resource manager.
Specified by: registerJobMaster in interface ResourceManagerGateway
Parameters:
jobMasterId - The fencing token for the JobMaster leader
jobManagerResourceId - The resource ID of the JobMaster that registers
jobManagerAddress - The address of the JobMaster that registers
jobId - The Job ID of the JobMaster that registers
timeout - Timeout for the future to complete

public CompletableFuture<RegistrationResponse> registerTaskExecutor(TaskExecutorRegistration taskExecutorRegistration, org.apache.flink.api.common.time.Time timeout)
Description copied from interface: ResourceManagerGateway
Register a TaskExecutor at the resource manager.
Specified by: registerTaskExecutor in interface ResourceManagerGateway
Parameters:
taskExecutorRegistration - the task executor registration.
timeout - The timeout for the response.

public CompletableFuture<Acknowledge> sendSlotReport(ResourceID taskManagerResourceId, InstanceID taskManagerRegistrationId, SlotReport slotReport, org.apache.flink.api.common.time.Time timeout)
Description copied from interface: ResourceManagerGateway
Sends the given SlotReport to the ResourceManager.
Specified by: sendSlotReport in interface ResourceManagerGateway
Parameters:
taskManagerRegistrationId - id identifying the sending TaskManager
slotReport - which is sent to the ResourceManager
timeout - for the operation
Returns: Future which is completed with Acknowledge once the slot report has been received.

protected void onWorkerRegistered(WorkerType worker, WorkerResourceSpec workerResourceSpec)
public CompletableFuture<Void> heartbeatFromTaskManager(ResourceID resourceID, TaskExecutorHeartbeatPayload heartbeatPayload)
Description copied from interface: ResourceManagerGateway
Sends the heartbeat to resource manager from task manager.
Specified by: heartbeatFromTaskManager in interface ResourceManagerGateway
Parameters:
resourceID - unique id of the task manager
heartbeatPayload - payload from the originating TaskManager

public CompletableFuture<Void> heartbeatFromJobManager(ResourceID resourceID)
Description copied from interface: ResourceManagerGateway
Sends the heartbeat to resource manager from job manager.
Specified by: heartbeatFromJobManager in interface ResourceManagerGateway
Parameters:
resourceID - unique id of the job manager

public void disconnectTaskManager(ResourceID resourceId, Exception cause)
Description copied from interface: ResourceManagerGateway
Disconnects a TaskManager specified by the given resourceID from the ResourceManager.
Specified by: disconnectTaskManager in interface ResourceManagerGateway
Parameters:
resourceId - identifying the TaskManager to disconnect
cause - for the disconnection of the TaskManager

public void disconnectJobManager(org.apache.flink.api.common.JobID jobId, org.apache.flink.api.common.JobStatus jobStatus, Exception cause)
Description copied from interface: ResourceManagerGateway
Disconnects a JobManager specified by the given job ID from the ResourceManager.
Specified by: disconnectJobManager in interface ResourceManagerGateway
Parameters:
jobId - JobID for which the JobManager was the leader
jobStatus - status of the job at the time of disconnection
cause - for the disconnection of the JobManager

public CompletableFuture<Acknowledge> declareRequiredResources(JobMasterId jobMasterId, ResourceRequirements resourceRequirements, org.apache.flink.api.common.time.Time timeout)
Description copied from interface: ResourceManagerGateway
Declares the absolute resource requirements for a job.
Specified by: declareRequiredResources in interface ResourceManagerGateway
Parameters:
jobMasterId - id of the JobMaster
resourceRequirements - resource requirements

public void notifySlotAvailable(InstanceID instanceID, SlotID slotId, AllocationID allocationId)
Description copied from interface: ResourceManagerGateway
Sent by the TaskExecutor to notify the ResourceManager that a slot has become available.
Specified by: notifySlotAvailable in interface ResourceManagerGateway
Parameters:
instanceID - TaskExecutor's instance id
slotId - The SlotID of the freed slot
allocationId - to which the slot has been allocated

public CompletableFuture<Acknowledge> deregisterApplication(ApplicationStatus finalStatus, @Nullable String diagnostics)
Cleans up the application and shuts down the cluster.
Specified by: deregisterApplication in interface ResourceManagerGateway
Parameters:
finalStatus - of the Flink application
diagnostics - diagnostics message for the Flink application or null

public CompletableFuture<Integer> getNumberOfRegisteredTaskManagers()
Description copied from interface: ResourceManagerGateway
Gets the currently registered number of TaskManagers.
Specified by: getNumberOfRegisteredTaskManagers in interface ResourceManagerGateway

public CompletableFuture<Collection<TaskManagerInfo>> requestTaskManagerInfo(org.apache.flink.api.common.time.Time timeout)
Description copied from interface: ResourceManagerGateway
Requests information about the registered TaskExecutor.
Specified by: requestTaskManagerInfo in interface ResourceManagerGateway
Parameters:
timeout - of the request

public CompletableFuture<TaskManagerInfoWithSlots> requestTaskManagerDetailsInfo(ResourceID resourceId, org.apache.flink.api.common.time.Time timeout)
Description copied from interface: ResourceManagerGateway
Requests detail information about the given TaskExecutor.
Specified by: requestTaskManagerDetailsInfo in interface ResourceManagerGateway
Parameters:
resourceId - identifying the TaskExecutor for which to return information
timeout - of the request

public CompletableFuture<ResourceOverview> requestResourceOverview(org.apache.flink.api.common.time.Time timeout)
Description copied from interface: ResourceManagerGateway
Requests the resource overview.
Specified by: requestResourceOverview in interface ResourceManagerGateway
Parameters:
timeout - of the request

public CompletableFuture<Collection<org.apache.flink.api.java.tuple.Tuple2<ResourceID,String>>> requestTaskManagerMetricQueryServiceAddresses(org.apache.flink.api.common.time.Time timeout)
Description copied from interface: ResourceManagerGateway
Requests the paths for the TaskManager's MetricQueryService to query.
Specified by: requestTaskManagerMetricQueryServiceAddresses in interface ResourceManagerGateway
Parameters:
timeout - for the asynchronous operation

public CompletableFuture<TransientBlobKey> requestTaskManagerFileUploadByType(ResourceID taskManagerId, FileType fileType, org.apache.flink.api.common.time.Time timeout)
Description copied from interface: ResourceManagerGateway
Request the file upload from the given TaskExecutor to the cluster's BlobServer. The corresponding TransientBlobKey is returned.
Specified by: requestTaskManagerFileUploadByType in interface ResourceManagerGateway
Parameters:
taskManagerId - identifying the TaskExecutor to upload the specified file
fileType - type of the file to upload
timeout - for the asynchronous operation
Returns: Future which is completed with the TransientBlobKey after uploading the file to the BlobServer.

public CompletableFuture<TransientBlobKey> requestTaskManagerFileUploadByName(ResourceID taskManagerId, String fileName, org.apache.flink.api.common.time.Time timeout)
Description copied from interface: ResourceManagerGateway
Request the file upload from the given TaskExecutor to the cluster's BlobServer. The corresponding TransientBlobKey is returned. To upload a file of a different type by name, use requestTaskManagerFileUploadByNameAndType(org.apache.flink.runtime.clusterframework.types.ResourceID, java.lang.String, org.apache.flink.runtime.taskexecutor.FileType, java.time.Duration) instead.
Specified by: requestTaskManagerFileUploadByName in interface ResourceManagerGateway
Parameters:
taskManagerId - identifying the TaskExecutor to upload the specified file
fileName - name of the file to upload
timeout - for the asynchronous operation
Returns: Future which is completed with the TransientBlobKey after uploading the file to the BlobServer.

public CompletableFuture<TransientBlobKey> requestTaskManagerFileUploadByNameAndType(ResourceID taskManagerId, String fileName, FileType fileType, java.time.Duration timeout)
Description copied from interface: ResourceManagerGateway
Request the file upload from the given TaskExecutor to the cluster's BlobServer. The corresponding TransientBlobKey is returned.
Specified by: requestTaskManagerFileUploadByNameAndType in interface ResourceManagerGateway
Parameters:
taskManagerId - identifying the TaskExecutor to upload the specified file
fileName - name of the file to upload
fileType - type of the file to upload
timeout - for the asynchronous operation
Returns: Future which is completed with the TransientBlobKey after uploading the file to the BlobServer.

public CompletableFuture<Collection<LogInfo>> requestTaskManagerLogList(ResourceID taskManagerId, org.apache.flink.api.common.time.Time timeout)
Description copied from interface: ResourceManagerGateway
Request log list from the given TaskExecutor.
Specified by: requestTaskManagerLogList in interface ResourceManagerGateway
Parameters:
taskManagerId - identifying the TaskExecutor to get log list from
timeout - for the asynchronous operation

public CompletableFuture<Void> releaseClusterPartitions(IntermediateDataSetID dataSetId)
Description copied from interface: ClusterPartitionManager
Releases all partitions associated with the given dataset.
Specified by: releaseClusterPartitions in interface ClusterPartitionManager
Parameters:
dataSetId - dataset for which all associated partitions should be released

public CompletableFuture<Void> reportClusterPartitions(ResourceID taskExecutorId, ClusterPartitionReport clusterPartitionReport)
Description copied from interface: ClusterPartitionManager
Report the cluster partitions status in the task executor.
Specified by: reportClusterPartitions in interface ClusterPartitionManager
Parameters:
taskExecutorId - The id of the task executor.
clusterPartitionReport - The status of the cluster partitions.

public CompletableFuture<List<ShuffleDescriptor>> getClusterPartitionsShuffleDescriptors(IntermediateDataSetID intermediateDataSetID)
Description copied from interface: ClusterPartitionManager
Get the shuffle descriptors of the cluster partitions ordered by partition number.
Specified by: getClusterPartitionsShuffleDescriptors in interface ClusterPartitionManager
Parameters:
intermediateDataSetID - The id of the dataset.

public CompletableFuture<Map<IntermediateDataSetID,DataSetMetaInfo>> listDataSets()
Description copied from interface: ClusterPartitionManager
Returns all datasets for which partitions are being tracked.
Specified by: listDataSets in interface ClusterPartitionManager

public CompletableFuture<ThreadDumpInfo> requestThreadDump(ResourceID taskManagerId, org.apache.flink.api.common.time.Time timeout)
Description copied from interface: ResourceManagerGateway
Requests the thread dump from the given TaskExecutor.
Specified by: requestThreadDump in interface ResourceManagerGateway
Parameters:
taskManagerId - taskManagerId identifying the TaskExecutor to get the thread dump from
timeout - timeout of the asynchronous operation

public CompletableFuture<Collection<ProfilingInfo>> requestTaskManagerProfilingList(ResourceID taskManagerId, java.time.Duration timeout)
Description copied from interface: ResourceManagerGateway
Request profiling list from the given TaskExecutor.
Specified by: requestTaskManagerProfilingList in interface ResourceManagerGateway
Parameters:
taskManagerId - identifying the TaskExecutor to get profiling list from
timeout - for the asynchronous operation

public CompletableFuture<ProfilingInfo> requestProfiling(ResourceID taskManagerId, int duration, ProfilingInfo.ProfilingMode mode, java.time.Duration timeout)
Description copied from interface: ResourceManagerGateway
Requests the profiling instance from the given TaskExecutor.
Specified by: requestProfiling in interface ResourceManagerGateway
Parameters:
taskManagerId - taskManagerId identifying the TaskExecutor to get the profiling from
duration - profiling duration
mode - profiling mode ProfilingInfo.ProfilingMode
timeout - timeout of the asynchronous operation

public CompletableFuture<TaskExecutorThreadInfoGateway> requestTaskExecutorThreadInfoGateway(ResourceID taskManagerId, org.apache.flink.api.common.time.Time timeout)
Description copied from interface: ResourceManagerGateway
Requests the TaskExecutorGateway.
Specified by: requestTaskExecutorThreadInfoGateway in interface ResourceManagerGateway
Parameters:
taskManagerId - identifying the TaskExecutor.

public CompletableFuture<Acknowledge> notifyNewBlockedNodes(Collection<BlockedNode> newNodes)
Description copied from interface: BlocklistListener
Notify new blocked node records.
Specified by: notifyNewBlockedNodes in interface BlocklistListener
Parameters:
newNodes - the new blocked node records

protected void registerMetrics()
protected void closeJobManagerConnection(org.apache.flink.api.common.JobID jobId, org.apache.flink.runtime.resourcemanager.ResourceManager.ResourceRequirementHandling resourceRequirementHandling, Exception cause)
This method should be called by the framework once it detects that a currently registered job manager has failed.
Parameters:
jobId - identifying the job whose leader shall be disconnected.
resourceRequirementHandling - indicating how existing resource requirements for the corresponding job should be handled
cause - The exception that caused the JobManager to fail.

protected Optional<WorkerType> closeTaskManagerConnection(ResourceID resourceID, Exception cause)
This method should be called by the framework once it detects that a currently registered task executor has failed.
Parameters:
resourceID - Id of the TaskManager that has failed.
cause - The exception that caused the TaskManager to fail.
Returns: The WorkerType of the closed connection, or empty if already removed.

protected void removeJob(org.apache.flink.api.common.JobID jobId, Exception cause)
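
closeTaskManagerConnection is documented as returning the WorkerType of the closed connection, or empty if it was already removed. The sketch below shows how such an Optional-returning close can make repeated calls harmless. `Registry`, its methods, and the string resource IDs are hypothetical stand-ins rather than Flink types, and the real method presumably does more than remove a map entry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in types; none of these are Flink classes.
class CloseConnectionSketch {

    /** Tracks registered workers by a resource id. */
    static final class Registry<W> {
        private final Map<String, W> workers = new HashMap<>();

        void register(String resourceId, W worker) {
            workers.put(resourceId, worker);
        }

        /** Returns the removed worker, or empty if no worker was registered under the id. */
        Optional<W> close(String resourceId, Exception cause) {
            // Map.remove returns null for an unknown key, which maps to Optional.empty().
            return Optional.ofNullable(workers.remove(resourceId));
        }
    }

    public static void main(String[] args) {
        Registry<String> registry = new Registry<>();
        registry.register("tm-1", "worker-A");

        Exception cause = new Exception("heartbeat timeout");
        Optional<String> first = registry.close("tm-1", cause);
        Optional<String> second = registry.close("tm-1", cause);

        // Only the first call yields a worker, so follow-up cleanup runs once.
        first.ifPresent(worker -> System.out.println("cleaning up " + worker));
        System.out.println("second call empty: " + !second.isPresent());
    }
}
```

Returning an Optional lets the caller decide whether any follow-up work is needed without a separate existence check.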
protected void jobLeaderLostLeadership(org.apache.flink.api.common.JobID jobId, JobMasterId oldJobMasterId)

@VisibleForTesting public Optional<InstanceID> getInstanceIdByResourceId(ResourceID resourceID)

protected WorkerType getWorkerByInstanceId(InstanceID instanceId)

protected void onFatalError(Throwable t)
Notifies the ResourceManager that a fatal error has occurred and it cannot proceed.
Parameters:
t - The exception describing the fatal error

protected abstract void initialize() throws ResourceManagerException
Initializes the framework specific components.
Throws:
ResourceManagerException - which occurs during initialization and causes the resource manager to fail.

protected abstract void terminate() throws Exception
Terminates the framework specific components.
Throws:
Exception - which occurs during termination.

protected abstract void internalDeregisterApplication(ApplicationStatus finalStatus, @Nullable String optionalDiagnostics) throws ResourceManagerException
The framework specific code to deregister the application.
This method also needs to make sure all pending containers that are not registered yet are returned.
Parameters:
finalStatus - The application status to report.
optionalDiagnostics - A diagnostics message or null.
Throws:
ResourceManagerException - if the application could not be shut down.

protected abstract Optional<WorkerType> getWorkerNodeIfAcceptRegistration(ResourceID resourceID)
Get worker node if the worker resource is accepted.
Parameters:
resourceID - The worker resource id

public void stopWorkerIfSupported(WorkerType worker)
Stops the given worker if supported.
Parameters:
worker - The worker.

protected abstract CompletableFuture<Void> getReadyToServeFuture()
Get the ready to serve future of the resource manager.
protected abstract ResourceAllocator getResourceAllocator()

protected void setFailUnfulfillableRequest(boolean failUnfulfillableRequest)
Sets whether the SlotManager fails unfulfillable slot requests.
Parameters:
failUnfulfillableRequest - whether to fail unfulfillable requests

public void onNewTokensObtained(byte[] tokens) throws Exception
Description copied from interface: DelegationTokenManager.Listener
Callback invoked when new delegation tokens are obtained.
Specified by: onNewTokensObtained in interface DelegationTokenManager.Listener
Throws: Exception

Copyright © 2014–2024 The Apache Software Foundation. All rights reserved.