Department for Education

National Pupil Database Privacy Controlling API - Alpha

Incomplete applications: 8 (5 SME, 3 large)

Completed applications: 13 (11 SME, 2 large)

Important dates
Opportunity attribute name Opportunity attribute value
Published Monday 17 July 2017
Deadline for asking questions Monday 24 July 2017 at 11:59pm GMT
Closing date for applications Monday 31 July 2017 at 11:59pm GMT

Overview

Summary of the work Supplier will:
Create, and test, synthetic NPD dataset on which users can develop queries prior to using the API.
Prototype, and test, a privacy preserving query API.
Clearly articulate benefits of API and its role in delivering DfE's vision to 'distribute access, not data'.
Meet expectations of end Alpha assessment
Latest start date Friday 1 September 2017
Expected contract length Up to 4 months
Location No specific location, eg they can work remotely
Organisation the work is for Department for Education
Budget range

About the work

Why the work is being done In the form of the National Pupil Database (NPD), the Department for Education (DfE) has one of the best data sources in government. However, how we provide access to the data needs to be modernised to realise its potential benefit - with an emphasis on 'distributing access, not data'.

Initial discovery has been completed on a Privacy Controlling API. We want to explore this option further with users via an Alpha.
Problem to be solved The key problem we are looking to solve can be summarised as:

'How can the DfE maximise the benefit of our data, initially NPD, by providing fast, safe and secure access to those who use it to add/create value – whilst reducing cost and maximising information security'.

In particular, we want to establish whether a privacy controlling API can interact successfully with the sophisticated statistical/research methods applied to our data in order to meet the needs of existing NPD users.
Who the users are and what they need to do External:

As an external researcher/analyst
I need to have access to the National Pupil Database (NPD)
So that I can conduct essential research (that may help to shape future DfE policy)

Internal:

As the NPD data owner
I need to provide secure and controlled access to the NPD
So that internal & external researchers/analysts can produce analysis/findings that add value to the DfE, whilst minimising information risk.
Early market engagement We have not conducted any early market engagement.
Any work that’s already been done We have completed the discovery phase and are looking to move into Alpha. The discovery identified the potential for using a privacy protecting, query based, API.
Existing team The supplier will be working with the DfE NPD Access Development lead
Current phase Alpha

Work setup

Address where the work will take place Key DfE stakeholders for this work are based in Sheffield & London
Working arrangements 1) We are looking for flexibility in the approach to completing this alpha
2) We will be working to Agile principles - i.e. regular stand ups, Show & Tells etc
3) Necessary expenses will be paid in line with DfE standard policies
Security clearance In order to generate the synthetic data, a supplier may need to access a sample of real NPD data. As such, any successful supplier would be expected to meet the security terms associated with applying for an NPD Extract, outlined here:

https://www.gov.uk/guidance/national-pupil-database-apply-for-a-data-extract

Additional information

Additional terms and conditions

Skills and experience

Buyers will use the essential and nice-to-have skills and experience to help them evaluate suppliers’ technical competence.

Essential skills and experience
  • Have experience of applying differential privacy
  • Be able to identify, and arrange research/testing, with users
  • Have experience of designing user centric services
  • Be able to produce and test prototypes with users
  • Work in an Agile way - in line with GDS service standards
Nice-to-have skills and experience Previous experience of working in the public sector

How suppliers will be evaluated

How many suppliers to evaluate 3
Proposal criteria
  • How a differential privacy solution meets user needs
  • Technical solution
  • Value for Money
  • Approach & methodology
Cultural fit criteria
  • Share knowledge and experience with other team members
  • Be transparent and collaborative
  • Can work with clients with low technical expertise
Payment approach Fixed price
Assessment methods
  • Written proposal
  • Work history
  • Presentation
Evaluation weighting

Technical competence

75%

Cultural fit

5%

Price

20%

Questions asked by suppliers

Supplier question Buyer answer
1. What is the process after the initial submission of skills and evidence? The full process is outlined on gov.uk - https://www.gov.uk/guidance/digital-outcomes-and-specialists-buyers-guide#shortlist-interested-suppliers

We have chosen the following assessment methods:

• Written proposal
• Work history
• Presentation
2. Is the intention for this API to be web-based e.g. access via Key To Success? What technology is the NPD based on? Can you elaborate on "sophisticated statistical/research methods" that you apply to your data? It is anticipated the API will be web-based.
The NPD is based on SQL.
The education data we hold contains the longitudinal movement of pupils through the education system over the last c15 years. Rather than just ‘sums, counts and percentages’ being applied to the data, users also typically:

- Want to match data together with other sources using unique numbers and fuzzy matching techniques
- Evaluate long term impact of ‘treatments’ by undertaking Propensity Score Matching techniques
- Understand longitudinal trends within data, perhaps using regression or other appropriate statistical techniques to test for strengths of relationships between data
3. In order to alleviate risks and delay associated with gaining access to sensitive data, would it be possible or preferable for appropriately cleared personnel to operate at Department for Education office site and / or using DfE equipment? We are continuing to assess and investigate other ways in which we can provide access to NPD data. We have limited 'onsite' estate capacity to allow appropriately cleared personnel to operate at a Department for Education office site and / or using DfE equipment.
4. What, in your opinion based on what we know today, is the biggest risk faced by this project? What could stop this project from moving forward? Finding a supplier with the skills and experience that meet the requirements.
5. Please can we ask whether there is an incumbent supplier? No, there is not an incumbent supplier.
6. Can you confirm the full scope of the Alpha phase? Is it covering just the solution or actually standing up a secure service as well? The GDS Service Manual (link below) explains how the Alpha stage works.

https://www.gov.uk/service-manual/agile-delivery/how-the-alpha-phase-works
7. What is the existing back-end data store used to hold the Student details? The back end data store is a single SQL server.
8. Assuming that large scale data retrieval is required by third parties to perform data analytics, are there any indications of data volumes? This piece of work is looking to reduce the number of bespoke data extracts we produce (i.e. large scale data retrievals). Please let me know if I have misunderstood your question.
9. Will the queries be of a specific type as they are now (i.e. the result format is consistent per query type and only the criteria are provided), or is it a more generic ad-hoc query service where users can request which data fields to return in the query? A large number of potential and current use cases of NPD data extracts require only aggregates across groups and so could be served by a privacy preserving query interface that supports: (a) flexible tabulation of counts, sums, and averages in the data; and (b) common statistical analysis functions, each of which can be run without viewing raw data.

Types of query identified.

1. Aggregate, SQL-style queries.
2. Specific pupil group queries: used to evaluate the success of an intervention administered to a particular group of pupils
3. Queries for longitudinal studies: used to track the progress of cohorts over time.
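
As an illustration of the first query type, the sketch below tabulates counts, sums and averages by group without returning any individual rows. It is a minimal sketch only: SQLite stands in for the NPD's SQL Server, and the table `pupils`, its columns and its values are hypothetical placeholders rather than real NPD fields or data. No privacy noise is applied here.

```python
import sqlite3

# In-memory stand-in database; table, columns and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pupils (pupil_id INTEGER, region TEXT, score REAL)")
conn.executemany(
    "INSERT INTO pupils VALUES (?, ?, ?)",
    [
        (1, "North", 52.0),
        (2, "North", 61.0),
        (3, "South", 47.0),
        (4, "South", 70.0),
        (5, "South", 66.0),
    ],
)

# Only columns on this allow-list may be used for grouping, so the column
# name can be placed in the SQL text safely.
ALLOWED_GROUP_COLUMNS = {"region"}


def tabulate(connection, group_by):
    """Return (group, count, sum, average) rows; no pupil-level rows are exposed."""
    if group_by not in ALLOWED_GROUP_COLUMNS:
        raise ValueError(f"grouping by {group_by!r} is not permitted")
    sql = (
        f"SELECT {group_by}, COUNT(*), SUM(score), AVG(score) "
        f"FROM pupils GROUP BY {group_by} ORDER BY {group_by}"
    )
    return connection.execute(sql).fetchall()


for row in tabulate(conn, "region"):
    print(row)
```

The allow-list reflects one possible design choice: the caller chooses how to group, but never writes SQL against the raw table directly.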
10. Has the Discovery phase been undertaken internally by DfE or by an external supplier? The Discovery has been undertaken by a supplier.
11. Do the 5 essential criteria have equal weighting and if not please can you confirm the weighting? They will have an equal weighting.
12. Could you please share the outcomes of discovery (i.e. recommendations for alpha)? & would you share the complete set of discovery findings with the selected supplier? The full set of discovery findings will be shared with the selected supplier. High level recommendations were to:

1) Create, and test, synthetic NPD dataset on which users can develop queries prior to using the API.
2) Prototype, and test, a privacy preserving query API.
3) Clearly articulate benefits of API and its role in delivering DfE's vision to 'distribute access, not data'.
4) Meet expectations of end Alpha assessment
13. What options have you evaluated to achieve secure and controlled access to the national pupil database? We have evaluated, and continue to evaluate, several options - including the Office for National Statistics (ONS) hosting NPD data in their secure Data Labs.
14. What was the process used to evaluate the options that led you to identify differential privacy as a solution and what other options have you considered? We conducted a Discovery - based on GDS guidelines - and one of the recommendations was to consider a Privacy Protecting API as a potential solution. This alpha is to understand more about how such an API would meet user needs. We have considered, and continue to consider, other potential solutions - including the Office for National Statistics (ONS) hosting NPD data in their secure Data Labs.
15. Are there any other technology constraints? The link below provides useful information on technologies.

https://www.gov.uk/service-manual/technology/choosing-technology-an-introduction

Technologies would have to align to the technology code of practice:

https://www.gov.uk/government/publications/technology-code-of-practice/technology-code-of-practice
16. Are you able to give an indication of the budget for this project? No, not at this stage.
17. Q1. Is there any unstructured or semi-structured data as part of NPD dataset? If yes, would the selected supplier need to synthesize this part of Alpha?
Q2. Is NPD structured as a relational DB or as a Datawarehouse?
Q3. What is the approximate size of the NPD dataset? We are looking to understand if it is in 10s, 100s or 1000s of GBs, TBs or PBs.
A1. I'm not entirely sure what you mean. Can you please explain in more detail? At present all data is ‘structured’ to meet business need.

A2. There are multiple relational DBs but not a formal Data Warehouse.

A3. The NPD is held on a server that can hold 3.2 TB of data. The ‘live’ data totals about 2.8 TB.
18. With what backend systems does the API component have to communicate (e.g. database)? I'm not entirely sure what you mean. The NPD is stored on an SQL server with multiple databases that are queried using Transact-SQL, stored procedures and scripts. Does that answer your question?
19. What server platform does it need to run on? Can you clarify what you are asking please? What server platform does what need to run on?
20. Has the DfE decided the query language to be used in the API, or is this an area for development/evaluation as part of this project? This will be an area for development and evaluation within this project.
21. Do you have any restrictions if the development of APIs took place outside UK, given that all staff who have access to sample NPD data are working from your offices and have appropriate clearances? The NPD cannot be sent outside of the UK so I'm not sure how the API can be developed outside of the UK - given that suppliers will need to develop synthetic data using a real extract? Please correct me if I am wrong.
22. Does the Department have a solution in mind, which they want a contractor to prototype and test, or are they looking for a solution “from the ground up”? Initial discovery has been completed on a Privacy Controlling API. We want to explore this option further with users via an Alpha. We want a supplier to:
1) Create, and test, synthetic NPD dataset on which users can develop queries prior to using the API
2) Prototype, and test, a privacy preserving query API.
3) Clearly articulate benefits of API and its role in delivering DfE's vision to 'distribute access, not data'.
4) Meet expectations of end Alpha assessment
23. The literature is clear that there is a trade-off between the noise applied to the data and the accuracy of the statistical results. Furthermore greater accuracy may require the imposition of a 'privacy budget'. Is it envisaged that this will form part of the research? Understanding the 'noise' applied to the data, and its impact on the accuracy of statistical results, will form part of the research and testing. It is reasonable to assume that the Alpha will also look to understand more about a 'privacy budget'.
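
The trade-off raised in this question can be sketched with the Laplace mechanism, a standard differential privacy technique. The notice does not say which mechanism the Alpha would use, so this is an illustration under that assumption, and the names `PrivacyBudget` and `laplace_scale` are hypothetical. A smaller epsilon per query spends less of the budget but gives a larger noise scale, and so a less accurate answer.

```python
class PrivacyBudget:
    """Tracks how much of a total epsilon allowance has been spent (hypothetical helper)."""

    def __init__(self, total_epsilon: float) -> None:
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        """Deduct epsilon for one query, refusing if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject an exact spend.
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

    @property
    def remaining(self) -> float:
        return max(self.total - self.spent, 0.0)


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Scale b of the Laplace noise; the expected absolute error of the answer equals b."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


budget = PrivacyBudget(total_epsilon=1.0)
for eps in (0.1, 0.5):
    budget.charge(eps)
    # A counting query has sensitivity 1: one pupil changes the count by at most 1.
    print(f"epsilon={eps}: noise scale={laplace_scale(1.0, eps)}, remaining={budget.remaining:.2f}")
```

Simple summation of epsilon across queries is the most basic way to account for a budget; tighter composition methods exist and could be evaluated during the Alpha.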
24. How does the Department define synthetic data? For example, should they reflect real world distribution as per our article http://blog.gide.net/en/how-to-impute-a-fictitious-raw-dataset-from-aggregate-results-part-1-3/ We would want the synthetic dataset to mimic the real data in certain ways; the key similarities would be in the qualities that allow a user to get the look and feel of data. These qualities include the following: column names, number of rows, approximate ranges of numeric variables, common options of categorical variables, frequency of missing or null values.
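
The qualities listed in this answer (column names, number of rows, approximate ranges, common categorical options, frequency of missing values) can be captured in a simple column profile and sampled from. The sketch below assumes such a profile has already been derived; every column name, range, category and missing rate in `PROFILE` is a hypothetical placeholder, not a real NPD field or statistic. It samples each column independently, so it reproduces the 'look and feel' only and not relationships between columns.

```python
import random

# Hypothetical column profile; none of these values come from the real NPD.
PROFILE = {
    "pupil_id": {"kind": "id"},
    "year_group": {"kind": "int", "low": 1, "high": 13, "missing": 0.0},
    "attainment_score": {"kind": "float", "low": 0.0, "high": 100.0, "missing": 0.05},
    "region": {"kind": "cat", "options": ["North", "Midlands", "South"], "missing": 0.02},
}


def synthesise(profile, n_rows, seed=0):
    """Generate n_rows fake records matching the profile's names, ranges and missing rates."""
    rng = random.Random(seed)  # seeded so the synthetic dataset is reproducible
    rows = []
    for i in range(n_rows):
        row = {}
        for name, spec in profile.items():
            if spec["kind"] == "id":
                row[name] = i + 1  # sequential fake identifier
            elif rng.random() < spec.get("missing", 0.0):
                row[name] = None  # mimic the frequency of missing values
            elif spec["kind"] == "int":
                row[name] = rng.randint(spec["low"], spec["high"])
            elif spec["kind"] == "float":
                row[name] = round(rng.uniform(spec["low"], spec["high"]), 1)
            else:
                row[name] = rng.choice(spec["options"])
        rows.append(row)
    return rows


synthetic = synthesise(PROFILE, n_rows=1000)
print(synthetic[0])
```

Because no real record is copied, a dataset built this way lets users develop and debug queries before running them through the API.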
25. Will the Department provide a pool of users from which to convene the user research? Or will the supplier provide these users from their own research contacts? The DfE can provide a pool of NPD users for research - including those involved in the initial discovery.
26. Does the work involve both the implementation of the API and the implementation of a reference client interface to use that API? i.e. is DfE interested in evaluating both the functionality of the API, and the mechanics of how non-technical users could use that API, or just in the API functionality? During the Alpha we will need to assess the usability of the service as a whole. The Alpha will require us to build and test many prototypes to enable us to establish which designs offer the best user experience.

In summary, in the alpha phase we will need to:

• build prototypes of the service
• test prototypes with users
• demonstrate that the service you want to build is technically possible
27. Is the intention that the API only provides access to aggregated results, or is there a requirement from users to have access to pupil-level datasets via the API? The API will enable users to run queries against the raw data, without exposing the raw data to the users. The outputs of the queries have 'noise' added to protect the data.
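
A minimal sketch of that behaviour, assuming a counting query protected with Laplace noise (the notice does not specify the mechanism): the function sees the raw rows, but the caller only ever receives the noisy aggregate. The function names, the `eligible` field and the sample rows are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(rows, predicate, epsilon, rng=None):
    """Count rows matching predicate and return the count with noise added.

    The raw rows never leave this function; only the noisy total is returned.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for row in rows if predicate(row))
    # Adding or removing one pupil changes a count by at most 1 (sensitivity 1).
    scale = 1.0 / epsilon
    return true_count + laplace_noise(scale, rng)


# Hypothetical raw data held on the server side.
raw_rows = [{"eligible": True}] * 30 + [{"eligible": False}] * 70

print(noisy_count(raw_rows, lambda row: row["eligible"], epsilon=0.5))
```

Each call returns a different value near the true count of 30; repeated identical queries would need to be limited or accounted for, since averaging many noisy answers recovers the true value.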