This opportunity is closed for applications

The deadline was Thursday 17 November 2022
Defence Science and Technology Laboratory

Media Authentication Evaluation Datasets

4 Incomplete applications

2 SME, 2 large

5 Completed applications

5 SME, 0 large

Important dates

Published
Thursday 3 November 2022
Deadline for asking questions
Thursday 10 November 2022 at 11:59pm GMT
Closing date for applications
Thursday 17 November 2022 at 11:59pm GMT

Overview

Off-payroll (IR35) determination
Contracted out service: the off-payroll rules do not apply
Summary of the work
This task should produce media authentication evaluation datasets - labelled, well-structured datasets of real and falsified media, across multiple modalities. These should include deepfakes, GAN-generated imagery, diffusion model outputs, image splicing, generated text, generated audio, and image-caption pairs. Additional subsets should be created by applying anti-forensic techniques.
Latest start date
Monday 9 January 2023
Expected contract length
3-4 months
Location
No specific location, for example they can work remotely
Organisation the work is for
Defence Science and Technology Laboratory
Budget range
£300-350k

About the work

Why the work is being done
The Machine Speed Strategic Analysis (MSSA) project seeks to apply AI to ISR of the sub-threshold information environment. Important aspects of this environment are online news and social media. Falsified content can be used within these domains to, e.g., spread disinformation or gather intelligence. We believe AI can be used to help check and validate content at scale, to help analysts in the open source intelligence space. To ensure that such AI techniques are trustworthy, we need to evaluate their performance using high-quality, unseen validation datasets.
Problem to be solved
Media authentication methods, including deepfake detection, often suffer from poor cross-dataset generalisation. They can appear to perform well on the standard datasets on which they are trained, but then lose effectiveness when applied to 'in the wild' data. Before we can give any media authentication tools to analysts, we need to ensure that they will generalise well. To do this, we need large, bespoke datasets created using a variety of media synthesis methods, including deepfakes, diffusion models, and text generation. These data must not match those found in standard datasets (such as DFDC), and should include some anti-forensic techniques. See SOR for further detail.
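As an illustration only, the generalisation problem described above can be expressed as the difference between a detector's accuracy on data resembling its training set and its accuracy on unseen data. The sketch below is not part of the requirement; the function names, the toy detector and the sample identifiers are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

# (media item identifier, True if the item is falsified) - hypothetical format
Sample = Tuple[str, bool]


def accuracy(detector: Callable[[str], bool], samples: Sequence[Sample]) -> float:
    """Fraction of samples where the detector's verdict matches the label."""
    if not samples:
        raise ValueError("samples must not be empty")
    correct = sum(detector(item) == label for item, label in samples)
    return correct / len(samples)


def generalisation_gap(
    detector: Callable[[str], bool],
    in_dataset: Sequence[Sample],
    unseen: Sequence[Sample],
) -> float:
    """Accuracy on familiar data minus accuracy on unseen data."""
    return accuracy(detector, in_dataset) - accuracy(detector, unseen)


def filename_detector(item: str) -> bool:
    """Toy detector that has only learned a naming quirk of its training set."""
    return item.startswith("fake_")


seen = [("fake_001.mp4", True), ("real_001.mp4", False)]
held_back = [("clip_a.mp4", True), ("clip_b.mp4", False)]
gap = generalisation_gap(filename_detector, seen, held_back)
```

A large positive gap suggests that the detector has learned artefacts of a particular dataset rather than properties of falsified media in general, which is the failure mode that bespoke, unseen evaluation data is intended to expose.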
Who the users are and what they need to do
The users will be Dstl technical staff. We will use the datasets to evaluate the performance of tools and techniques produced separately from this contract. To ensure effectiveness, the datasets will need to be well-labelled and split into subsets corresponding to separate modalities. Further, each modality should comprise three subsets: clean / unsynthesised, synthesised (no anti-forensics), and synthesised (with anti-forensics).
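One possible way to represent the labelling and subset structure described above is a simple manifest with one row per media item. This is a minimal sketch under assumed conventions: the column names, subset identifiers and file paths are hypothetical and are not specified by the requirement.

```python
import csv
import io
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Subset(Enum):
    # The three subsets named in the requirement; identifiers are assumed.
    CLEAN = "clean"
    SYNTHESISED = "synthesised"
    SYNTHESISED_ANTI_FORENSIC = "synthesised_anti_forensic"


@dataclass(frozen=True)
class ManifestRow:
    path: str
    modality: str   # e.g. "image", "audio", "text"
    subset: Subset
    method: str     # synthesis method; empty for clean items

    @property
    def is_falsified(self) -> bool:
        return self.subset is not Subset.CLEAN


def parse_manifest(text: str) -> List[ManifestRow]:
    """Parse CSV text; an unknown subset value raises ValueError."""
    return [
        ManifestRow(r["path"], r["modality"], Subset(r["subset"]), r["method"])
        for r in csv.DictReader(io.StringIO(text))
    ]


def count_by_subset(rows: List[ManifestRow]) -> Dict[Tuple[str, str], int]:
    """Count items per (modality, subset) to check that coverage is complete."""
    counts: Dict[Tuple[str, str], int] = {}
    for row in rows:
        key = (row.modality, row.subset.value)
        counts[key] = counts.get(key, 0) + 1
    return counts


EXAMPLE = """path,modality,subset,method
image/clean/0001.png,image,clean,
image/synthesised/0001.png,image,synthesised,diffusion
image/synthesised_anti_forensic/0001.png,image,synthesised_anti_forensic,diffusion
audio/clean/0001.wav,audio,clean,
"""

rows = parse_manifest(EXAMPLE)
counts = count_by_subset(rows)
```

Keeping the subset as a closed enumeration means a mislabelled row fails at parse time, and the per-modality counts make a missing subset easy to spot before evaluation.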
Early market engagement
N/A
Any work that’s already been done
N/A
Existing team
No additional supplier. Regular correspondence with Dstl technical partner.
Current phase
Not applicable

Work setup

Address where the work will take place
N/A
Working arrangements
Remote work, with possibility for occasional face-to-face meetings if required. Regular catch-up via MS Teams (likely every two weeks or as discussed with Dstl technical partner).
Security clearance
BPSC (work will be carried out at OFFICIAL)

Additional information

Additional terms and conditions
N/A

Skills and experience

Buyers will use the essential and nice-to-have skills and experience to help them evaluate suppliers’ technical competence.

Essential skills and experience
  • AI & data science experience - track record of applying state-of-the-art techniques, particularly in the fields of computer vision, image processing, and natural language generation.
  • Experience in the creation and curation of datasets, including an appreciation of licensing considerations and constraints.
  • Experience with the verification and validation of AI outputs (to ensure that high quality data are created).
Nice-to-have skills and experience
  • AI & data science experience - track record of applying state-of-the-art techniques, particularly in the fields of computer vision, image processing, and natural language generation.
  • Experience in the creation and curation of datasets, including an appreciation of licensing considerations and constraints.
  • Experience with the verification and validation of AI outputs (to ensure that high quality data are created).
  • Previous experience working with Dstl through other frameworks.
  • Access to proprietary media dataset creation techniques, from which unique outputs can be produced.

How suppliers will be evaluated

All suppliers will be asked to provide a written proposal.

How many suppliers to evaluate
5
Proposal criteria
  • Technical solution
  • Approach and methodology
  • How the approach or solution meets user needs
  • How the approach or solution meets your organisation's policy or goal
  • Estimated timeframes for the work
  • How they've identified risks and dependencies and offered approaches to manage them
  • Team structure
  • Value for money
Cultural fit criteria
  • A willingness to hold regular catch-up / sprint meetings with Dstl technical staff, and the ability to adjust work based on feedback in such meetings.
  • Evidence of transparent and collaborative decision making
  • Willingness to use MS Teams for remote meetings
Payment approach
Fixed price
Additional assessment methods
Evaluation weighting

Technical competence
70%
Cultural fit
10%
Price
20%

Questions asked by suppliers

1. Please confirm that only datasets will be a required output.
Deliverables will be:
  • A review of literature and open source tools, to enable a plan for producing novel and unseen data.
  • Labelled, well-structured datasets of real and falsified media, across multiple modalities.
  • A report and presentation summarising the work undertaken.
2. Please confirm the expected classification level that curated and/or synthesised datasets would be held at.
The classification is OFFICIAL.
3. The first three questions in the nice to have section are the same as the essential skills question. Can you confirm if both sets of questions are to be answered or if the questions will be changed?
Apologies, the essential skills are:
1) AI & data science experience - track record of applying state-of-the-art techniques, particularly in the fields of computer vision, image processing, and natural language generation.
2) Experience in the creation and curation of datasets, including an appreciation of licensing considerations and constraints.
3) Experience with the verification and validation of AI outputs (to ensure that high quality data are created).

Nice to have are:
1) Previous experience working with Dstl through other frameworks.
2) Access to proprietary media dataset creation techniques, from which unique outputs can be produced.
4. Please confirm that the repetition of the ‘essential skills and experience’ attributes under the ‘nice-to-have skills and experience’ criteria (with the addition of two further attributes) was made in error. Can we assume that repeating our responses from the previous heading will be acceptable?
This was made in error; apologies. The essential skills are:
1) AI & data science experience - track record of applying state-of-the-art techniques, particularly in the fields of computer vision, image processing, and natural language generation.
2) Experience in the creation and curation of datasets, including an appreciation of licensing considerations and constraints.
3) Experience with the verification and validation of AI outputs (to ensure that high quality data are created).
5. Please also confirm your preference for how these are to be delivered – i.e. will a data platform for storage/retrieval be required?
Third-party data is not required. Open source, unaltered data can be used to create the synthetic datasets. However, the use of data that has not previously been used (e.g. in deepfake datasets) would be advantageous, ensuring that the authentication datasets are as representative of ‘in-the-wild’ conditions as possible. What we do not want is for our held-back evaluation data to contain a large amount of data that are also likely to be used in the training of authentication algorithms. A data platform should not be required; we anticipate that a simple structure (e.g. zip archives) will be sufficient. Data may be delivered using an encrypted drive.
6. Please confirm whether plans for purchasing 3rd party data is a must-have in the proposal, or whether a mixture of open-source and synthetic datasets would be sufficient.
Third-party data is not required. Open source, unaltered data can be used to create the synthetic datasets. However, the use of data that has not previously been used (e.g. in deepfake datasets) would be advantageous, ensuring that the authentication datasets are as representative of ‘in-the-wild’ conditions as possible. What we do not want is for our held-back evaluation data to contain a large amount of data that are also likely to be used in the training of authentication algorithms. A data platform should not be required; we anticipate that a simple structure (e.g. zip archives) will be sufficient. Data may be delivered using an encrypted drive.
7. Of the questions posed, 3 of them seem to be repeated; 3 of the ‘essential skills’ are repeated in the nice-to-haves. Can you clarify whether this is intentional and if we should respond to both sets?
Apologies, the essential skills are:
1) AI & data science experience - track record of applying state-of-the-art techniques, particularly in the fields of computer vision, image processing, and natural language generation.
2) Experience in the creation and curation of datasets, including an appreciation of licensing considerations and constraints.
3) Experience with the verification and validation of AI outputs (to ensure that high quality data are created).

Nice to have are:
1) Previous experience working with Dstl through other frameworks.
2) Access to proprietary media dataset creation techniques, from which unique outputs can be produced.
8. Cost of procuring datasets: who will bear this, or do we factor it into the £300-350k budget?
All costs should be factored into the £300-350k budget. To be clear, the use of existing open data for the category of ‘real’ datasets is not prohibited, though the addition of unseen data would be considered beneficial.
9. Would you be able to provide further context on the statement ‘including an appreciation of licensing considerations and constraints’?
Some datasets, and some data generation / synthesis methods, have non-commercial clauses in their licence agreements that prevent their use in this task. In some cases, alternative implementations of the same or similar techniques exist which avoid this constraint. Further, some tools are copyleft, meaning that we cannot modify any code, or use any code to create a new tool, without open-sourcing the output (not an acceptable option for us). As such, these copyleft tools can only be used ‘as-is’ to create data (which should largely not be a problem for us, as this is a dataset creation task).
10. Regarding ‘previous experience working with Dstl through other frameworks’: can you confirm whether this includes previous work with Dstl on DOS5?
Yes, the ‘other’ should be ignored in this instance. Experience working with Dstl through any commercial framework is welcomed.
11. ‘Access to proprietary media dataset creation techniques’. Please could you clarify if you have any proprietary techniques in mind, or whether you have already engaged with prospective suppliers who have proprietary techniques?
This was not written with any particular techniques in mind, nor have we engaged with suppliers in this regard. Rather, this means that if a supplier has already developed a proprietary generation technique (e.g., a deepfake creation algorithm) this would expedite the creation of bespoke dataset(s).
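
The answers to questions 5 and 6 ask that held-back evaluation data should not largely duplicate data likely to be used for training. One simple check a supplier might run, shown here as a sketch only, is to compare content hashes of candidate files against hashes of files from known public datasets. The function names and sample contents are hypothetical, and the set of known hashes is assumed to have been compiled separately.

```python
import hashlib
from typing import Dict, List, Set


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def find_overlap(candidates: Dict[str, bytes], known_hashes: Set[str]) -> List[str]:
    """Names of candidate files whose bytes match a known public file."""
    return sorted(
        name for name, data in candidates.items()
        if sha256_hex(data) in known_hashes
    )


# Hypothetical stand-ins for file contents read from disk.
known = {sha256_hex(b"clip from a public dataset")}
candidates = {
    "eval/video_0001.mp4": b"clip from a public dataset",
    "eval/video_0002.mp4": b"newly captured clip",
}
overlap = find_overlap(candidates, known)
```

Exact hashing only catches byte-identical copies. Re-encoded, resized or cropped copies would need a perceptual or near-duplicate comparison, which this sketch does not attempt.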