Dealing with Workflow Failed On Start (retrying) errors in Microsoft SharePoint

I don’t know why I expect SharePoint to work without problems. In almost every project I’ve done, some unforeseeable problem pops up that seems to make little sense. Maybe, after beating my head against the wall so many times, I’ve developed SharePoint amnesia, but I always expect the project to be a pleasant experience. I guess that’s just my inner masochist rearing its ugly head.

Recently, I had a working Visual Studio workflow for SharePoint 2007 which was designed to set access permissions at the list item level based on the metadata assigned to the document. When a document is added or updated in the library, the workflow will update the permissions associated with the document. Simple stuff.

Then we had a new requirement. Since permissions are assigned base on metadata values assigned to documents, what happens when the SharePoint groups and/or users associated with a value needs to change? With the current workflow, it would be necessary to manually run the workflow for each and every document. In a small document library, this would not be a problem. But when you’re dealing with a document library containing several thousand documents, this would be a life sentence.

In order to accommodate the new requirement, I had to modify the workflow to be able to loop through all of the documents in the library and re-apply permissions. Making the change was easy enough; the code that sets the permissions on one document simply had to be wrapped in a foreach loop of all documents in the library. Problem solved, right?

Failed On Start (retrying)
It's lying!

I deployed the workflow to the test environment and started it. It ran for several minutes with no apparent problems. The workflow history list was logging all of the tracking messages for each document, and permissions were being set as expected. Then all of a sudden, the workflow status switched to “Failed on Start (retrying).” Workflow history logs stopped appearing, and the workflow was certainly not “retrying”.

After several Google searches came up largely empty, I began making changes to the code to see if I could get a different result. Nothing was working. No matter what I changed, the workflow would run for several minutes and fail again. Then I decided to add more workflow history logging to pinpoint a particular line of code that may be causing the error.

With the new workflow history logs added into the code, I redeployed and ran the workflow again. This time I noticed the workflow was much quicker to fail. It seemed that the workflow history list had something to do with the problem. To test my hypothesis, I commented out the workflow history logs and tried again. Sure enough, the workflow ran to completion, without any problems. After I verified on a random selection of documents that permissions had been set correctly, I did my happy dance. Problem solved.

So what happened?

My best explanation of what the problem could have been is that the workflow was logging too many workflow task items, which eventually caused the workflow to crash. Since the workflow was originally written to run on a single document, it was adding 5-10 workflow history logs per document, depending on the values of the metadata tied to the document. Multiply this by 5000, and you get 25,000 to 50,000 new list items per workflow instance.

It is generally recommended to keep list sizes below a few thousand list items in SharePoint 2007, so the plethora of workflow history items was essentially crushing the workflow under its own weight.

CampfireThe moral of this story is this: Logging is great for debugging and ensuring proper operation of your workflows, but don’t go overboard. It’s like they always say: “Too many logs in the fire means no more roasted marshmallows.” Or something like that.

 

 

  • John Adams

    Thank you for your post!  I have been pounding my head trying to figure this problem out. 

  • SharePoint Jedi

    I’m also an MCPD in SharePoint and .NET (although technically you can only have one MCPD). I had to laugh about the part that says you no longer become frustrated by anything. This platform does build up a frustration tolerance…its helped me a lot on my drive home in rush hour traffic. Peace.