PEP

Version	Date	Author	Change Description
0.1	11 Apr 2008	Venkat Puvvada	Initial Submission
0.2	21 Apr 2008	Venkat Puvvada	Added more design details, removed CIM_IndicationServiceSettingData class, removed modification to CIM_IndicationService and CIM_IndicationServiceCapability classes.
0.3	1 May 2008	Venkat Puvvada	Full rewrite using implementation experience; removed indication persistence.
0.4	5 May 2008	Venkat Puvvada	Decided to move retry logic from IndicationService to HandlerService. Disabling or removing subscriptions does not affect indications in the retry queue.
0.5	8 May 2008	Venkat Puvvada	Added RetryThread Algorithm
0.6	15 May 2008	Venkat Puvvada	Modified RetryThread Algorithm; decided to retry the indication DeliveryRetryAttempts + 1 times when initial delivery was not attempted because a retry queue already exists.
0.7	19 May 2008	Venkat Puvvada	Added flowchart/design picture
0.8	14 July 2009	Venkat Puvvada	Full rewrite using approved concept PEP 299
1.0	21 May 2010	Venkat Puvvada	Rewrite using the DSP1054 ver 1.1.0
1.1	04 June 2010	Venkat Puvvada	Incorporated review comments, ballot version.

Abstract: This PEP implements indication delivery retry using the CIM_IndicationService DeliveryRetryAttempts and DeliveryRetryInterval properties when indication delivery has failed because of 'temporary' errors in the protocol. The proposed implementation is based on the DMTF Indications Profile (DSP1054) version 1.1.0 and requires CIM Schema version 2.24.1 (the minimum CIM Schema version required) or above.

Definition of the Problem

Schedule

Future Work

Discussion

Comments on version 0.1

(r_kumpf) What about this PEP is specific to CIM-XML? Why would the IndicationService care about the type of the listener destination?
(venkat_puvvada) There is no reason not to support the other listener destination types. I am not sure how best to map these parameters to the Email and SNMP handlers, so I decided not to include support for them at this stage.

(r_kumpf) Do indications continue to be retried for delivery after the associated CIM_IndicationSubscription and CIM_ListenerDestination instances are deleted?
(venkat_puvvada) No, indications will be discarded.

(r_kumpf) What is the rationale for persisting indications across cimserver restarts? When the cimserver is stopped, indications will cease to be generated. When it is restarted, the listener may receive stale indications and not receive more current ones for events that occurred while the cimserver was stopped. This could result in an administrator getting paged about a critical problem that was fixed months earlier.
(venkat_puvvada) The reason for persisting indications is that the client may not want to lose any indications. The client must be intelligent enough to discard out-of-date indications by looking at the timestamp of the delivered indication.

(r_kumpf) The traceFilePath is a poor choice for a directory to persist data needed for CIM Server operation. This directory is generally world writable.
(venkat_puvvada) Yes, I agree; this needs to be discussed.

(r_kumpf) What are the contents and format of this file? How is compatibility protected on CIM Server upgrade?
(venkat_puvvada) The file will contain Handler, Subscription, and Indication instances in XML form (with the content language list added to the indication instance). Indications are saved per subscription under each listener destination. The file will remain compatible across CIM Server upgrades.

Comments on version 0.2

(r_kumpf) Doesn't the CIMHandleIndicationRequestMessage already contain all the information that is needed to deliver the indication? The only extra data the IndicationService should need to track is related to the retry algorithm. What am I missing?
(venkat_puvvada) CIMHandleIndicationRequestMessage does not have the following information:
subscriptionInstanceNames
providerName
pendingRetryCount
These are required to construct the CIMProcessIndicationRequestMessage request again.

(r_kumpf) How is it determined which exceptions indicate an indication delivery failure? For example, why does CannotCreateSocketException cause a retry but not bad_alloc?
(venkat_puvvada) Though it is difficult to examine the CannotCreateSocketException, it is possible to retry when socket() fails with errno set to ENOBUFS or ENOMEM, which means that resources at the TCP/IP layer or memory are exhausted and the operation can be retried later.
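
The errno check described in this answer could look roughly like the sketch below. The function name isTemporarySocketError is hypothetical and does not come from the PEP or from the Pegasus sources; only the two errno values are taken from the answer above.

```cpp
#include <cerrno>

// Hypothetical helper (not part of the PEP): classify the errno value left by
// a failed socket() call. ENOBUFS and ENOMEM signal exhausted buffers or
// memory, a condition that may clear up, so the delivery is worth retrying.
bool isTemporarySocketError(int err)
{
    return err == ENOBUFS || err == ENOMEM;
}
```

Any other errno value is treated as non-retryable in this sketch; whether further values should count as temporary is not stated in the discussion.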

Comments on version 0.3

(k_schopmeyer) Nit. This is only one component in moving from 'sort of best effort' to reliable delivery. I suggest that this is simply improving the protocol so that deliveries can be accomplished in case of 'temporary' errors in the protocol and not really reliable delivery.
(venkat_puvvada) ok

(r_kumpf) Why is the HandlerRetryQueue logic in the IndicationService? Retrying delivery seems like it should be the IndicationHandlerService's job. I think it is more of a protocol-level thing than an indication processing thing.
(venkat_puvvada) Yes, I agree. Actually, we have decided to discard the indications on the RetryQueue when the matching subscription is removed/disabled. If we implement this in the IndicationService, we can directly access the ActiveSubscriptionTable to see whether the subscription is active. This will have a performance benefit. Keeping this implementation in the HandlerService requires either checking the repository for subscription validity or having the IndicationService send a message to the HandlerService when a subscription is removed/disabled.

(r_kumpf) What happens when a delivery retry fails? Is the indication put back on the queue? At the beginning or the end? What if the queue is full?
(venkat_puvvada) If a delivery retry fails, the indication is inserted at the front of the queue. When the queue is full, the indication at the front of the queue is removed and the new indication is added at the back of the queue.
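
A minimal sketch of the queue behavior described in this answer, assuming an indication can be represented by a string. The class RetryQueue and its method names are hypothetical and are used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical bounded retry queue; not the PEP's actual data structure.
class RetryQueue
{
public:
    explicit RetryQueue(std::size_t maxSize) : _maxSize(maxSize)
    {
        assert(maxSize > 0);
    }

    // New indication: added at the back. When the queue is full, the
    // indication at the front (the oldest) is removed first.
    // Returns true if an indication was discarded to make room.
    bool enqueueNew(const std::string& indication)
    {
        bool dropped = false;
        if (_queue.size() >= _maxSize)
        {
            _queue.pop_front();
            dropped = true;
        }
        _queue.push_back(indication);
        return dropped;
    }

    // Failed retry: the indication goes back to the front, so it is the
    // next one to be retried. The discussion does not say what happens if
    // the queue filled up in the meantime; this sketch does not check.
    void requeueFailed(const std::string& indication)
    {
        _queue.push_front(indication);
    }

    // Removes the front indication into 'out'; returns false when empty.
    bool dequeue(std::string& out)
    {
        if (_queue.empty())
        {
            return false;
        }
        out = _queue.front();
        _queue.pop_front();
        return true;
    }

    std::size_t size() const { return _queue.size(); }

private:
    std::size_t _maxSize;
    std::deque<std::string> _queue;
};
```

Reinserting a failed indication at the front keeps the delivery order intact, at the cost of retrying the same indication until it succeeds or is discarded.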

(r_kumpf) Is a new exception class the best way for a handler to communicate the delivery status? It might make sense to change the CIMHandler::handleIndication return type from void to a status value. Possible values could be Success, Error, and FatalError, for example. An interesting question here is what is the behavior when the handler throws an exception which is not DeliveryFailedException? Is it assumed that the delivery was successful or permanently failed?
(venkat_puvvada) This is a good idea. We can have the possible values Success, Error, and FatalError.
Success - Delivery succeeded.
Error - Temporary error; delivery can be retried later.
FatalError - Permanent failure; no retry.

If the handler throws an exception other than DeliveryFailedException, that is either a permanent failure or a post-delivery failure; we do not retry in those cases.
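
The status values and the exception rule from this exchange can be combined in a small sketch. Everything here is illustrative: the wrapper deliver(), the helper shouldRetry(), and the stand-in DeliveryFailedException type are assumptions, not the actual Pegasus handler interface.

```cpp
#include <functional>
#include <stdexcept>

// Status values proposed in the discussion above.
enum class DeliveryStatus
{
    Success,    // delivery succeeded
    Error,      // temporary failure, may be retried later
    FatalError  // permanent failure, no retry
};

// Stand-in for the DeliveryFailedException named in the discussion.
struct DeliveryFailedException : std::runtime_error
{
    explicit DeliveryFailedException(const char* what)
        : std::runtime_error(what)
    {
    }
};

// Hypothetical wrapper: runs a handler and maps its outcome to a status.
// Any exception other than DeliveryFailedException is treated as a
// permanent (or post-delivery) failure, as stated in the answer above.
DeliveryStatus deliver(const std::function<void()>& handleIndication)
{
    try
    {
        handleIndication();
        return DeliveryStatus::Success;
    }
    catch (const DeliveryFailedException&)
    {
        return DeliveryStatus::Error;
    }
    catch (...)
    {
        return DeliveryStatus::FatalError;
    }
}

// Only a temporary error qualifies for the retry queue.
bool shouldRetry(DeliveryStatus status)
{
    return status == DeliveryStatus::Error;
}
```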

Comments on version 0.4

(k_schopmeyer) Should we consider some maximum limit on the number of retry queues? This is just another possible memory protector.
(venkat_puvvada) I am ok with that, need to discuss this.

(r_kumpf) I presume these test cases will get pretty interesting. Do you have thoughts about how they will work?
(venkat_puvvada) Create the subscription, but do not start the listener. The provider generates 'n' indications. Now start the listener; it should receive the 'n' indications generated by the provider once the DeliveryRetryInterval has expired.

Comments on version 0.5

(r_kumpf) Can you characterize the threading implications here? If each of the retries is done by the RetryThread, that would mean the DestinationQueueTable would potentially be locked for a long time. If the delivery retry fails, the IndicationHandlerService will need to put the indication back into the DestinationQueueTable. Will deadlock occur?
If a new thread is started for each delivery retry, that would cause a spike of activity on each interval, affecting the delivery of indications to listeners that have not experienced failures.
(venkat_puvvada) No deadlock will occur. It works in the following way.
1. Take the lock on the queue table.
2. Iterate through the queue table, get one indication from each queue, and store them in an array.
3. Release the lock on the table.
4. Send each indication in the array to the HandlerService using the SendAsync() method.
5. If the delivery retry fails, the HandlerService puts the indication back onto the queue.
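
The locking pattern in this answer can be sketched as follows. This is a simplified illustration, not the PEP's implementation: indications are strings, the table is a std::map keyed by a destination name, and the send callback is synchronous and returns false on failure, whereas the answer describes an asynchronous SendAsync() call. The names QueueTable, collectBatch, and retryPass are hypothetical. Reinserting at the front follows the answer given for version 0.3.

```cpp
#include <deque>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

using Indication = std::string;
using QueueTable = std::map<std::string, std::deque<Indication>>;
using Batch = std::vector<std::pair<std::string, Indication>>;

// Holds the table lock only while taking one indication from each queue.
Batch collectBatch(QueueTable& table, std::mutex& tableMutex)
{
    Batch batch;
    std::lock_guard<std::mutex> lock(tableMutex);
    for (auto& entry : table)
    {
        if (!entry.second.empty())
        {
            batch.emplace_back(entry.first, entry.second.front());
            entry.second.pop_front();
        }
    }
    return batch;  // lock is released when this function returns
}

// Sends the collected indications without holding the table lock. A failed
// send takes the lock briefly to put the indication back on its queue.
template <typename SendFn>
void retryPass(QueueTable& table, std::mutex& tableMutex, SendFn send)
{
    for (const auto& item : collectBatch(table, tableMutex))
    {
        if (!send(item.first, item.second))
        {
            std::lock_guard<std::mutex> lock(tableMutex);
            table[item.first].push_front(item.second);
        }
    }
}
```

Because the lock is never held during a send, a slow or unreachable listener cannot block other threads that need the table, and the requeue step cannot deadlock against the collection step.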

(r_kumpf) Is the specification clear about the meaning of the DeliveryRetryAttempts value? It seems like it should be the number of delivery retry attempts made AFTER an initial failed delivery attempt. Karl volunteered to follow up with the DMTF on this item.

Comments on version 0.7

(r_kumpf) Shouldn't the lastRetryTime be tracked per indication rather than per queue?
(dmitry_mikulin) If lastRetryTime is per queue, how are you going to tell which indications are ready to be re-tried?
(venkat_puvvada) If we maintain the lastRetryTime for each indication, we cannot deliver the indications in sequence. For example, if there are many indications in the retry queue and we deliver them according to each indication's lastRetryTime, it is possible that we deliver later indications in the queue before earlier ones.

(r_kumpf) This step seems like it would unnecessarily delay the delivery of queued indications once the intermittent problem (network error, for example) is resolved.
(venkat_puvvada) The RETRY_THREAD_WAIT_TIME value is configurable. This also prevents a spike of activity when all clients/listeners suddenly come up, and solves the problem where consumers are too slow to receive the indications.

(b_whiteley) I would prefer to see a solution where all handler types are supported. In addition to extending this functionality to the other handler types, I suspect this implementation would be cleaner.
(venkat_puvvada) Yes, this can be tried in the next stage of implementation.

(b_whiteley) I'm not very familiar with the current Indication Handler Service, so I apologize for the lack of specifics. As I read through this PEP, my gut feeling is that the approach proposed in this PEP will introduce a lot of problems and instability.
It doesn't seem right to have the Handler Service hand indications to other components that will ultimately hand the indications back to the Handler Service.
I would prefer a design that incorporates the following:
* Refactor the HandlerService itself to handle all of the delivery retry logic, rather than having a separate component reinsert indications into the HandlerService.
* Enhance the Handler interface so that delivery retry is applicable to all types of Handlers, not just CIM-XML.
* Design it in a way that is consistent with turning Handlers into Handler Providers at a later date, so that new handlers can be added just as instance providers are added today.

Comments on version 0.8

(k_schopmeyer) The trace is primarily a development tool. Should we not be logging something when we throw indications away? The original use of the discarded data was for 'abnormal' discards, those things that were probably due to Pegasus problems. This is a normal event: queue too big, discard.
(venkat_puvvada) Yes, I agree. Discarded indications will just be logged.

(k_schopmeyer) Since we are now going to have a mechanism that uses memory to store data for possibly long periods of time, can cause log entries when indications are discarded, and also is going to ask the administrator to set config variables, I think we are going to have to have some tools so that the admin can figure out what is happening. Are there indications in retry, how many, how long, possibly which destinations, what are the high-water marks, etc. Without this type of information, the admin will not really understand when his server develops memory issues because of large numbers of retries in queue and will not have any real clue how to set the config variables.
(venkat_puvvada) Yes, I will add a class like PG_IndicationDeliveryQueue, which will have properties such as name, size, creation time, last delivery time, number of indications discarded, and number of indications successfully delivered. The user can enumerate instances of PG_IndicationDeliveryQueue to check the number of delivery queues and their status.

(k_schopmeyer) At this point, we are getting to where we will have a number of different 'scheduled' thread mechanisms (provider unload, pull operations timers, etc.), and I wonder if it is not time to define a simple scheduler instead of everybody doing their own thread/wait mechanism. This should not be too difficult: one thread to run the scheduler and an API to enter new timed events in the scheduler.
(venkat_puvvada) With the proposed solution, the DispatcherThread is created only when there are delivery failures, and the thread terminates automatically when there are no indications to be delivered. Having a scheduler is a nice idea; we can definitely have it proposed and discussed in a separate PEP.

Comments on version 1.0

(k_schopmeyer) While this PEP covers only CIM-XML, I am having problems determining 1) why, and 2) what the common part is, so that we know what has to be added to other handlers to make them 'reliable'.
(venkat_puvvada) We need to assess the feasibility of implementing indication delivery retry for other handlers. This implementation does not prohibit extending delivery retry to other handlers; that would be a future enhancement.

(k_schopmeyer) Since this is defined as a statistics class, should we not start keeping some more statistics that will give the admin information on how much this queue is being used? Currently the statistical information is effectively about overruns, but how about things like a high-water mark (i.e. the highest point the queue reached), average, etc.?
(venkat_puvvada) The average rate of indications arriving at the DestinationQueue can be derived from the CreationTime and NextSequenceNumber properties. This implementation provides no mechanism to find the highest size the queue reached. To some extent this can be inferred from the QueueFullDroppedIndications property, which shows whether the queue reached its maximum size.
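
As a rough illustration of the rate calculation mentioned in this answer, the sketch below assumes that NextSequenceNumber starts at zero, increases by one per arriving indication, and has not wrapped around, so that its value equals the number of indications received since CreationTime. These assumptions and the function name averageArrivalRate are not stated in the PEP.

```cpp
// Hypothetical helper: average indications per second for one queue.
// 'secondsSinceCreation' is the current time minus the queue's CreationTime.
double averageArrivalRate(
    unsigned long long nextSequenceNumber,
    double secondsSinceCreation)
{
    if (secondsSinceCreation <= 0.0)
    {
        return 0.0;  // queue was just created; no meaningful rate yet
    }
    return static_cast<double>(nextSequenceNumber) / secondsSinceCreation;
}
```

For example, a queue created 60 seconds ago whose NextSequenceNumber is 120 has averaged 2 indications per second under these assumptions.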

Copyright (c) 2006 Hewlett-Packard Development Company, L.P.; IBM Corp.;
EMC Corporation; Symantec Corporation; The Open Group.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to
deal in the Software without restriction, including without limitation the
rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
sell copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

THE ABOVE COPYRIGHT NOTICE AND THIS PERMISSION NOTICE SHALL BE INCLUDED IN
ALL COPIES OR SUBSTANTIAL PORTIONS OF THE SOFTWARE. THE SOFTWARE IS PROVIDED
"AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT
LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Template last modified: March 26th 2006 by Martin Kirk
Template version: 1.11