Extracting versioned data from SharePoint Lists using SSIS

Why?

I used SSIS to extract data from SharePoint lists for use elsewhere in the enterprise. For "flat" lists, where we may want to extract data from a single list into a single table, this is straightforward to setup using the SharePoint connector available from CodePlex
https://sqlsrvintegrationsrv.codeplex.com/

Technical debt

I came across a poor decision that was made in the past to use SharePoint version history to store changes in metadata, a temporal representation of lists data over time, expecting this to be extracted for reuse. The original solution directly queried SharePoint internal content database to get list data, so whilst it was reasonably fast, it gave me headache to read and is at risk of being destroyed the next time Microsoft release a SharePoint service pack.

In the past we all did crazy things like stage diving, drinking snake bite and querying SharePoint content databases directly. WE DON'T DO THAT NOW! Well querying SharePoint content databases directly at least.

Using SSIS to replace this we need to:

  1. Extract the list data from the SharePoint list using the SharePoint SSIS connector, so that we have a table containing “un-versioned” data
  2. Select the ids of each SharePoint list items into another table and output to a resultset
  3. Using this table of ids, use a SSIS control flow ForEach loop, iterating a variable mapping, pointing to the resultset configured in step 2
  4. Assuming that we have a spreadsheet containing the mappings of the SharePoint internal name and the destination column, we make a connection to it
  5. SharePointColumnDestinationColumn
    TitleTitle
    User1User
    Postal_x0020_AddressPostal Address
    Favourite_x0020_ColourFavourite Colour
    Part_x0020_NumberPart Number
  6. Now that we have three parameters to make the web service call, the list Id, item id, an internal column name, we can make the web service call.
  7. You may have noticed that each field and each row are needed to make a single webservice call to get version data, so for instance if you have 10 rows, 5 fields, you will need to make 50 web service calls! There is no batching mechanism, this is is! Perhaps, the reason why Microsoft do not offer any restful mechanism or batching is that using versioning data to store essentially relational data, is so ridiculous that no one would do it.
  8. The resultant xml is processed using a script task inside a data flow process, into an output buffer
    <Versions>  
      <Version PostalAddress="1 Acacia Ave, London" Modified="2015-05-11T15:09:36Z" Editor="…" />
      <Version PostalAddress ="1 First Ave, Chicago" Modified="2015-01-27T14:44:30Z" Editor="…" />
      <Version PostalAddress ="1 Champs Elysees, Paris" Modified="2015-01-27T13:34:34Z" Editor="…" />
    </Versions>  
    
  9. Finally the outputbuffer is merged into the “unversioned” table in step one

Endnote

  1. Link back to Marc D Anderson's blog on the subject of SOAP replacement of versions service, with RESTful equivalent. However it is most probably the wrong use case to use version history in order to store metadata. Use relational lists! http://sympmarc.com/