⭐ Star us on GitHub — it motivates us a lot!
- Prerequisites
- Quickstart
- Getting Started
- Environment Variables
- Task Configuration
- Admin Dashboard
- HITs Allocation
- Quality Checks
- Local Development
- Task Performing
- Task Results
- FAQ & Troubleshooting
- Contributing
- Release & CI Policy
- Original Article
- AWS CLI v2
- Node.js 20 LTS
- Yarn (via Corepack)
- Miniconda (to create the Python 3.11 environment)
- Docker (optional; only if `enable_solver=true`)
AWS: use a profile that can create/use S3 and DynamoDB:

```bash
aws configure --profile your_iam_user
aws sts get-caller-identity --profile your_iam_user
```

```bash
# 1) Clone and enter the repo
git clone https://github.com/Miccighel/Crowd_Frame.git
cd Crowd_Frame

# 2) Enable Yarn via Corepack and install deps
corepack enable
yarn install --immutable

# 3) Create and activate the Python env (Python 3.11 is installed here)
conda env create -f environment.yml
conda activate Crowd_Frame

# 4) Prepare env and initialize
cd data
# Create .env as shown below
python init.py

# 5) (Optional) sanity checks
aws --version; node -v; yarn -v; python --version

# 6) Open the deployed task
# https://<deploy_bucket>.s3.<region>.amazonaws.com/<task_name>/<batch_name>/index.html
```

See task examples: `examples/`
- Create an AWS account.
- Create a new IAM user called `your_iam_user`.
- For quick testing, attach the `AdministratorAccess` policy to `your_iam_user`. (For production use, replace this with a least-privilege policy.)

  ```json
  {
    "Version": "2012-10-17",
    "Statement": [
      { "Effect": "Allow", "Action": "*", "Resource": "*" }
    ]
  }
  ```

- Generate an access key pair for `your_iam_user`.
- Configure an AWS CLI profile and verify it works:

  ```bash
  aws configure --profile your_iam_user
  aws sts get-caller-identity --profile your_iam_user
  ```

- Credentials file locations:
  - Windows: `C:\Users\<your_os_user>\.aws\credentials`
  - macOS/Linux: `~/.aws/credentials`

  Example credentials entry:

  ```ini
  [your_iam_user]
  aws_access_key_id=your_key
  aws_secret_access_key=your_secret
  ```
- Clone the repo and enter it:

  ```bash
  git clone https://github.com/Miccighel/Crowd_Frame.git
  cd Crowd_Frame
  ```

- Enable Yarn via Corepack and install Node dependencies:

  ```bash
  corepack enable
  yarn install --immutable
  ```

- Install Miniconda (or Anaconda), then create the Python environment:

  ```bash
  conda env create -f environment.yml
  conda activate Crowd_Frame
  ```

- Move to the `data` folder: `cd data`
- Create `data/.env` (the file must be named exactly `.env`) and set the required variables (a sanity-check sketch follows these steps):

  ```ini
  mail_contact=your_email_address
  budget_limit=your_usd_budget_limit
  task_name=your_task_name
  batch_name=your_batch_name
  admin_user=your_admin_username
  admin_password=your_admin_password
  server_config=none
  aws_region=your_aws_region
  aws_private_bucket=your_private_bucket_name
  aws_deploy_bucket=your_deploy_bucket_name
  profile_name=your_iam_user
  ```

- Run the initializer: `python init.py`
- This script will:
  - read `data/.env`
  - set up the AWS infrastructure
  - generate an empty task configuration
  - deploy the task to the public bucket
- Open your task:
  `https://<deploy_bucket>.s3.<region>.amazonaws.com/<task_name>/<batch_name>/index.html`
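Before running `init.py`, you can sanity-check the file. The sketch below is not part of Crowd_Frame; it only verifies that `data/.env` exists and that the variables marked mandatory in the table further below are set. Note that the table also marks `platform` as mandatory, even though it is absent from the minimal example above.

```python
from pathlib import Path

# Variables the table below marks as mandatory.
REQUIRED = [
    "mail_contact", "platform", "budget_limit", "task_name", "batch_name",
    "admin_user", "admin_password", "aws_region",
    "aws_private_bucket", "aws_deploy_bucket", "server_config",
]

env_path = Path("data/.env")  # run from the repo root
assert env_path.exists(), "data/.env is missing (the name must be exactly .env)"

# Parse simple KEY=VALUE lines, ignoring blanks and comments.
pairs = {}
for line in env_path.read_text().splitlines():
    line = line.strip()
    if line and not line.startswith("#") and "=" in line:
        key, _, value = line.partition("=")
        pairs[key.strip()] = value.strip()

missing = [key for key in REQUIRED if not pairs.get(key)]
print("Missing mandatory variables:", missing or "none")
```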
Crowd_Frame uses AWS services (e.g., S3, DynamoDB, optionally CloudFront) to deploy tasks and store data. Usage may fall within the AWS Free Tier, depending on your account and workload.
You can cap spending via budget_limit; when it’s reached, the scripts stop creating new resources/deployments (existing AWS charges may still apply).
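`init.py` manages the budget for you based on `budget_limit` and `mail_contact`. Purely for reference, here is a hedged boto3 sketch of how such a monthly USD cost budget with an email notification can be created; the budget name, account ID, and email are placeholders, not values Crowd_Frame uses.

```python
import boto3

# Illustrative only: init.py creates the budget for you.
client = boto3.client("budgets", region_name="us-east-1")
client.create_budget(
    AccountId="123456789012",  # placeholder account ID
    Budget={
        "BudgetName": "crowd-frame-budget",  # placeholder name
        "BudgetLimit": {"Amount": "5.0", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 100.0,
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "you@example.com"}],
    }],
)
```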
The following table lists the variables you can set in your_repo_folder/data/.env.
| Variable | Description | Mandatory | Value |
|---|---|---|---|
| `profile_name` | IAM profile name created in Step 2. Defaults to `default` if unspecified. | ❌ | `your_iam_user` |
| `mail_contact` | Contact email for AWS budget notifications. | ✅ | Valid email address |
| `platform` | Deployment platform. Use `none` for manual recruitment. | ✅ | `none`, `mturk`, `prolific`, `toloka` |
| `budget_limit` | Monthly budget cap in USD (e.g., `5.0`). | ✅ | Positive float |
| `task_name` | Task identifier. | ✅ | Any string |
| `batch_name` | Batch identifier. | ✅ | Any string |
| `task_title` | Custom task title. | ❌ | Any string |
| `batch_prefix` | Prefix to group/filter multiple batches. | ❌ | Any string |
| `admin_user` | Admin username. | ✅ | Any string |
| `admin_password` | Admin password. | ✅ | Any string |
| `aws_region` | AWS region (e.g., `us-east-1`). | ✅ | Valid region |
| `aws_private_bucket` | Private S3 bucket for configuration and data. | ✅ | Unique string |
| `aws_deploy_bucket` | Public S3 bucket used to deploy the task. | ✅ | Unique string |
| `aws_dataset_bucket` | Optional S3 bucket for additional datasets. | ❌ | Unique string |
| `server_config` | Worker logging backend: `aws` (managed), `custom` (your endpoint), or `none` (disabled). | ✅ | `aws`, `custom`, `none` |
| `enable_solver` | Enable the local HIT solver (automatic allocation). Requires Docker. | ❌ | `true` or `false` |
| `enable_crawling` | Enable crawling of search results retrieved in-task. | ❌ | `true` or `false` |
| `prolific_api_token` | Prolific API token used to create studies via the Researcher API (`platform=prolific`). | ❌ | String |
| `prolific_project_id` | Prolific project ID for new studies. If unset, the user’s `current_project_id` is used. | ❌ | String |
| `prolific_completion_code` | Custom Prolific completion code. If unset, a deterministic code based on task/batch is used. | ❌ | String |
| `toloka_oauth_token` | Toloka API token (required if `platform=toloka` and you use API operations). | ❌ | String |
| `ip_info_token` | Token for ipinfo.com. | ❌ | String |
| `ip_geolocation_api_key` | API key for ipgeolocation.io. | ❌ | String |
| `ipapi_api_key` | API key for ipapi.com. | ❌ | String |
| `user_stack_token` | API key for userstack.com (user-agent parsing). | ❌ | String |
| `brave_api_key` | API key for Brave Search Web API. | ❌ | String |
| `google_api_key` | API key for Google Custom Search JSON API. | ❌ | String |
| `google_cx` | Google Programmable Search Engine ID (Custom Search Engine `cx`). | ❌ | String |
| `pubmed_api_key` | API key for NCBI PubMed eUtils (used to increase rate limits; optional but recommended). | ❌ | String |
Crowd_Frame can optionally use external search providers inside the task UI
(Brave Search, Google Custom Search, PubMed).
All of these integrations are optional: if a key is not configured, that provider
cannot be selected in the Generator (Step 5).
- Create an account and log into the Brave Search API dashboard.
- Create an API key for the Web search endpoint.
- Set the key as `brave_api_key` in `data/.env`.
- In the Google Cloud Console, create a project and enable the Custom Search API.
- Under APIs & Services → Credentials, create an API key and set it as `google_api_key`.
- Create a Programmable Search Engine (Custom Search Engine) and copy its `cx` identifier.
- Set this identifier as `google_cx` in `data/.env`.
- Log into My NCBI (or create an NCBI account).
- Generate an API key for NCBI E-utilities from your account settings.
- Set this key as `pubmed_api_key` in `data/.env`. If omitted, PubMed still works, but with the default (lower) rate limits.
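To verify that a key works, you can call E-utilities directly. A minimal sketch (the search term is arbitrary; the `api_key` parameter raises NCBI's per-second rate limit):

```python
import requests

# Minimal NCBI E-utilities search against the pubmed database.
response = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={
        "db": "pubmed",
        "term": "crowdsourcing",
        "retmode": "json",
        "api_key": "your_pubmed_api_key",  # value of pubmed_api_key in data/.env
    },
    timeout=10,
)
print(response.json()["esearchresult"]["count"])
```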
Use the Generator (admin panel) to configure your deployed task:
- Open the admin panel by appending `admin` to the task URL, e.g.
  `https://<deploy_bucket>.s3.<region>.amazonaws.com/<task_name>/<batch_name>/admin`
- Sign in with the admin credentials from `data/.env` (`admin_user`, `admin_password`).
- Click the `Generator` tab.
- Go through each configuration step and upload the final configuration.
- Questionnaires — Create one or more pre/post questionnaires workers complete before or after the task.
- Evaluation Dimensions — Define what each worker will assess for every HIT element.
- Task Instructions — Provide general instructions shown before the task starts.
- Evaluation Instructions — Provide in-task instructions shown while workers perform the task.
- Search Engine — Choose the search provider and (optionally) filter domains from results.
- Task Settings — Configure max tries, time limits (`time_assessment`), and the annotation interface; also upload the HITs file for the task.
- Worker Checks — Configure additional checks on workers.
The following table details the content of each configuration file.
| File | Description |
|---|---|
| `hits.json` | Contains the whole set of HITs of the task. |
| `questionnaires.json` | Contains the definition of each questionnaire of the task. |
| `dimensions.json` | Contains the definition of each evaluation dimension of the task. |
| `instructions_general.json` | Contains the general instructions of the task. |
| `instructions_evaluation.json` | Contains the evaluation instructions of the task. |
| `search_engine.json` | Contains the configuration of the custom search engine. |
| `task.json` | Contains several general settings of the task. |
| `workers.json` | Contains settings concerning worker access to the task. |
Note — Blocking & reset: Crowd_Frame enforces limits via a DynamoDB ACL. Workers are blocked if they exceed max tries or the time limit (`time_assessment`). For testing, use a high `time_assessment` to avoid accidental blocks. To reset, clear the task’s ACL records — this irreversibly deletes access history, so use only when starting from a clean slate.
Open the console by adding /admin to your task URL.
- ACL (default) — review current access entries, see who holds which units, and resolve or release entries when needed.
- Data — look up a worker’s submissions and open any row to see the full details.
- Private bucket — browse the files for the current task/batch and remove items you no longer need.
The HITs for a task must be stored in a special JSON file, which you can upload manually while configuring the task. The file must comply with a format that satisfies five requirements (a validator sketch follows the example fragment below):

- There must be an array of HITs (also called units);
- Each HIT must have a unique input token attribute;
- Each HIT must have a unique output token attribute;
- The number of elements must be specified for each HIT;
- Each element must have an attribute named `id`.

Beyond these requirements, each element can have an arbitrary number of attributes.
The following fragment shows a valid configuration of a crowdsourcing task with 1 HIT.
```json
[
    {
        "unit_id": "unit_0",
        "token_input": "ABCDEFGHILM",
        "token_output": "MNOPQRSTUVZ",
        "documents_number": 1,
        "documents": [
            {
                "id": "identifier_1",
                "text": "Lorem ipsum dolor sit amet"
            }
        ]
    }
]
```

Note that:

- Initially, the deploy script creates an empty configuration.
- You can upload the HITs during configuration step 6 (Task Settings).
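Not part of Crowd_Frame, but a minimal Python sketch of a validator for the five requirements, assuming the layout shown in the fragment above:

```python
import json

def validate_hits(path):
    """Check the five format requirements for a hits.json file."""
    with open(path) as handle:
        hits = json.load(handle)
    assert isinstance(hits, list), "top level must be an array of HITs"
    tokens_in = [hit["token_input"] for hit in hits]
    tokens_out = [hit["token_output"] for hit in hits]
    assert len(set(tokens_in)) == len(tokens_in), "input tokens must be unique"
    assert len(set(tokens_out)) == len(tokens_out), "output tokens must be unique"
    for hit in hits:
        docs = hit["documents"]
        assert hit["documents_number"] == len(docs), \
            f"{hit['unit_id']}: documents_number does not match"
        assert all("id" in doc for doc in docs), \
            f"{hit['unit_id']}: every element needs an id"
    print(f"{len(hits)} HIT(s) valid")

validate_hits("hits.json")
```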
HITs can also be built manually.
Start by choosing an attribute whose values divide the dataset into classes.
Pools of elements are then created, one per class, and four parameters are defined:
- total number of elements to allocate,
- number of elements per HIT,
- number of elements per class,
- number of repetitions per element.
Each pool is updated to include the required repetitions. HITs are then built iteratively by sampling elements from each class until a duplicate-free sample is obtained. Selected elements are removed
from the pool once used. The total number of HITs is determined by these parameters.
The allocation matrix can be serialized for later reference, and the final HITs exported in the required format.
The first algorithm below shows the main allocation procedure, while the second details the `singleHit(...)` sub-procedure used to sample a set of unique elements.
```
Algorithm: Allocate dataset into HITs

elementsFiltered ← filterElements(attribute, valuesChosen)
classes ← valuesChosen
pools ← List()
for class in classes do
    elementsClass ← findElements(elementsFiltered, class)
    pool ← unique(elementsClass)
    pools.append(pool)
end for
totalElements ← len(elementsFiltered)
classElementsNumber ← m          // elements sampled per class in each HIT
hitElementsNumber ← k            // elements per HIT
repetitionsElement ← p           // repetitions per element
for pool in pools do
    pool ← extendPool(repetitionsElement)
end for
poolsDict ← mergePools(pools, classes)
hits ← List()
for index in range((totalElements * repetitionsElement) / hitElementsNumber) do
    hitSample ← singleHit(poolsDict)
    hitSample ← shuffle(hitSample)
    hits.append(hitSample)
end for
hits.serialize(pathAssignments)
hitsFinal ← List()
for hit in hits do
    index ← index(hit)
    unitId ← concat("unit_", index)
    tokenInput ← randomString(11)
    tokenOutput ← randomString(11)
    hitObject ← BuildJSON(unitId, tokenInput, tokenOutput, hitElementsNumber)
    for indexElem in range(hitElementsNumber) do
        hitObject["documents"][indexElem] ← hit[indexElem]
    end for
    hitsFinal.append(hitObject)
end for
hitsFinal.serialize(pathHits)
```
```
Algorithm: singleHit (sample without duplicates)

containsDuplicates ← True
while containsDuplicates do
    sample ← List()
    for class in classes do
        for indexClass in range(classElementsNumber) do
            element ← random(poolsDict[class])
            sample.append(element)
        end for
    end for
    if checkDuplicates(sample) == False then
        containsDuplicates ← False
    end if
end while
for s in sample do
    for c in classes do
        if s ∈ poolsDict[c] then
            poolsDict[c].remove(s)
        end if
    end for
end for
return sample
```
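A compact Python rendering of the two procedures above, under the same assumptions: elements are dicts carrying an `id` plus the chosen class attribute, and (like the pseudocode) the retry loop assumes the chosen parameters admit a feasible, duplicate-free allocation.

```python
import random
import string

def random_token(length=11):
    return "".join(random.choices(string.ascii_uppercase, k=length))

def allocate_hits(elements, attribute, classes, per_class, repetitions):
    # One pool per class; each element is repeated `repetitions` times.
    pools = {c: [e for e in elements if e[attribute] == c] * repetitions
             for c in classes}
    hit_size = per_class * len(classes)
    hits = []
    for index in range((len(elements) * repetitions) // hit_size):
        # singleHit: re-sample until the HIT is duplicate-free.
        while True:
            sample = [random.choice(pools[c])
                      for c in classes for _ in range(per_class)]
            if len({e["id"] for e in sample}) == len(sample):
                break
        for element in sample:                    # consume the used copies
            pools[element[attribute]].remove(element)
        random.shuffle(sample)
        hits.append({
            "unit_id": f"unit_{index}",
            "token_input": random_token(),
            "token_output": random_token(),
            "documents_number": hit_size,
            "documents": sample,
        })
    return hits
```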
Crowd_Frame supports custom quality checks per evaluation dimension (enable them in the configuration). Implement your logic in the static method performGoldCheck in data/build/skeleton/goldChecker.ts.
Quality checks run only on elements marked as gold. Mark an element by prefixing its id with GOLD. In the example below, the second document in a HIT is marked for the check.
The snippet below shows the default checker generated by the initializer. goldConfiguration is an array where each entry includes the gold document, the worker answers for dimensions with checks enabled, and optional notes. Write your control between the two comment blocks; return one boolean per gold element (true = pass, false = fail).
Note: with strict TypeScript checks enabled (
noUnusedLocals/noUnusedParameters), the generated stub includes a small no-op helper (markAsUsed) to avoid TS6133 while placeholders are still unused. You can remove the call once your implementation uses all variables/parameters.
```json
[
    {
        "unit_id": "unit_0",
        "token_input": "ABCDEFGHILM",
        "token_output": "MNOPQRSTUVZ",
        "documents_number": 2,
        "documents": [
            {
                "id": "identifier_1",
                "text": "..."
            },
            {
                "id": "GOLD-identifier",
                "text": "..."
            }
        ]
    }
]
```

```typescript
export class GoldChecker {

    private static markAsUsed(...values: unknown[]): void {
        void values;
    }

    static performGoldCheck(goldConfiguration: Array<Object>, taskType = null) {
        let goldChecks = new Array<boolean>()
        /* If there are no gold elements there is nothing to be checked */
        if (goldConfiguration.length <= 0) {
            goldChecks.push(true)
            return goldChecks
        }
        for (let goldElement of goldConfiguration) {
            /* Element attributes */
            let document = goldElement["document"]
            /* Worker's answers for each gold dimension */
            let answers = goldElement["answers"]
            /* Worker's notes */
            let notes = goldElement["notes"]
            GoldChecker.markAsUsed(taskType, document, answers, notes);
            let goldCheck = true
            /* CONTROL IMPLEMENTATION STARTS HERE */
            /* Write your code; the check for the current element holds if goldCheck remains set to true */
            /* CONTROL IMPLEMENTATION ENDS HERE */
            /* Push goldCheck inside goldChecks array for the current gold element */
            goldChecks.push(goldCheck)
        }
        return goldChecks
    }

}
```

You can edit and test the configuration locally without deploying the full AWS stack.
- Go to the environments folder: `cd your_repo_folder/data/build/environments`
- Open the dev environment file: `environment.ts`
- Set `configuration_local` to `true` and adjust the values you want to test.
Example (`environment.ts`):

```typescript
export const environment = {
    production: false,
    configuration_local: true,
    platform: "mturk",
    taskName: "your_task_name",
    batchName: "your_batch_name",
    ...
};
```

Note: each time you run `init.py`, this file may be overwritten. Keep a backup of local edits if needed.
Security: Do not commit `environment.ts` or `.env` if they contain AWS keys or secrets. Add them to `.gitignore`.
To publish a task, choose how you will recruit workers: via a supported platform or manually. The publishing steps vary by option. Pick one of the subsections below and follow its instructions.
To recruit workers manually (no platform integration):
- Set `platform = none` in `data/.env`.
- (Optional) Generate and assign each worker an identifier, e.g., `randomWorkerId`.
- Append the identifier to the task URL as a GET parameter (exact casing `workerId`): `?workerId=randomWorkerId`
- Share the full link with each worker (example):
  `https://<deploy_bucket>.s3.<region>.amazonaws.com/<task_name>/<batch_name>/index.html?workerId=randomWorkerId`
- Wait for completion.
Note: Steps 2–3 are optional. If you share the base URL without workerId, Crowd_Frame generates one on first access.
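If you do assign identifiers yourself, a short sketch for minting unique links (the bucket, region, task, and batch placeholders are yours to fill in):

```python
import secrets

base = "https://<deploy_bucket>.s3.<region>.amazonaws.com/<task_name>/<batch_name>/index.html"
# One unique link per worker; any unique string works as workerId.
links = [f"{base}?workerId={secrets.token_hex(8)}" for _ in range(10)]
print("\n".join(links))
```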
To recruit via MTurk:
- Set `platform = mturk` in `data/.env`.
- In MTurk, create the task and set its general parameters and criteria.
- Go to the build output for MTurk: `data/build/mturk/`
- Copy the wrapper HTML: `data/build/mturk/index.html`
- Paste it into the MTurk Design Layout box.
- Preview and save the task project.
- Publish the task and upload the tokens file: `data/build/mturk/tokens.csv`
  (Client-side validation will only enable Submit when a pasted token matches.)
- Review submission statuses in the Manage tab.
Security: Keep `data/build/mturk/tokens.csv` private. It is used for submission validation.
To recruit via Prolific:
- Enable Prolific in Crowd_Frame. In `data/.env`:
  - `platform = prolific`
  - (optional) `prolific_api_token` – Prolific Researcher API token. If set, the init script can auto-create a draft study.
  - (optional) `prolific_project_id` – Prolific project where the study is created. If unset, the user’s `current_project_id` is used.
  - (optional) `prolific_completion_code` – Explicit completion code. If unset, Crowd_Frame uses a deterministic `<TASK_NAME>_<BATCH_NAME>_OK`.
- If `prolific_api_token` is set (API mode):
  - On first deploy, the init script:
    - Looks for a study with `internal_name = "<task_name>_<batch_name>"`.
    - If it exists → logs it and does not modify it.
    - If it does not exist → creates a draft study in the chosen project with:
      - an external link pointing to `https://<deploy_bucket>.s3.<region>.amazonaws.com/<task_name>/<batch_name>/index.html?...`;
      - URL parameters storing Prolific IDs;
      - a description taken from the general instructions config;
      - a `COMPLETED` completion code (from `prolific_completion_code` or the auto-generated one).
  - Then open the draft in Prolific, configure audience and cost, and publish as usual. (A rough sketch of the underlying API call follows.)
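For reference only, a hedged sketch of the kind of Researcher API request the init script issues when creating a draft study. Field names follow Prolific's public API documentation; the places, reward, and time values are placeholders, so check Prolific's current docs before relying on any of them.

```python
import requests

API = "https://api.prolific.com/api/v1"
headers = {"Authorization": "Token your_prolific_api_token"}

draft = {
    "name": "your_task_name your_batch_name",
    "internal_name": "your_task_name_your_batch_name",
    "description": "Taken from the general instructions config.",
    "external_study_url": (
        "https://<deploy_bucket>.s3.<region>.amazonaws.com/"
        "<task_name>/<batch_name>/index.html?workerID={{PROLIFIC_PID}}&platform=prolific"
    ),
    "prolific_id_option": "url_parameters",
    "completion_code": "your_task_name_your_batch_name_OK",
    "completion_option": "url",
    "total_available_places": 10,   # placeholder
    "estimated_completion_time": 10,  # minutes, placeholder
    "reward": 100,                  # minor currency units, placeholder
}
response = requests.post(f"{API}/studies/", headers=headers, json=draft)
print(response.status_code, response.json().get("id"))
```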
- If you do not use the API:
  - Create the study manually in Prolific:
    - Data collection → External study link:
      `https://<deploy_bucket>.s3.<region>.amazonaws.com/<task_name>/<batch_name>/index.html?workerID={{PROLIFIC_PID}}&platform=prolific`
    - Enable URL parameters so `PROLIFIC_PID` is passed as `workerID`.
    - Configure the completion redirect to `https://app.prolific.com/submissions/complete?cc=<COMPLETION_CODE>`, using the same `<COMPLETION_CODE>` value as `prolific_completion_code` (or the auto-generated one printed by the init script).
  - Set audience, places, and cost as you normally would in Prolific.
To recruit via Toloka:
- Set `platform = toloka` in `data/.env`.
- In Toloka, create the project and set its general parameters.
- Go to the build output for Toloka: `data/build/toloka/`
- Copy the wrapper files:
  - HTML: `data/build/toloka/interface.html`
  - JS: `data/build/toloka/interface.js`
  - CSS: `data/build/toloka/interface.css`
- In Toloka’s Task Interface, paste each file into the corresponding HTML / JS / CSS editors.
- Copy the data specifications:
  - Input: `data/build/toloka/input_specification.json`
  - Output: `data/build/toloka/output_specification.json`
- Paste them into the Data Specification fields.
- Copy the general instructions: `data/build/task/instructions_general.json`
- Paste the text into Instructions for Tolokers (source-code edit mode).
- Create a pool and define audience parameters and reward.
- Publish and upload the tokens file for the pool: `data/build/toloka/tokens.tsv`
- Review submission statuses from each pool’s page.
Security: Keep `data/build/toloka/tokens.tsv` private. It is used for submission validation.
Use the download script to fetch all results for a deployed task.
- Go to the project root: `cd ~/path/to/project`
- Enter the `data` folder: `cd data`
- Run the downloader: `python download.py`
The script will:
- download per-worker snapshots of raw data;
- refine raw data into tabular files;
- save the deployed task configuration;
- generate support files with worker IP and user-agent attributes;
- (if `enable_crawling=true`) crawl pages retrieved by the in-task search engine.
All outputs are stored under:
data/result/<task_name>/
where <task_name> matches your environment variable. The folder is created if it does not exist.
| Folder | Description |
|---|---|
| `Data` | Per-worker snapshots of raw data. |
| `Dataframe` | Tabular, analysis-ready files derived from raw data. |
| `Resources` | Two support files per worker with IP and user-agent attributes. |
| `Task` | Backup of the task configuration. |
| `Crawling` | Source and metadata for pages retrieved by the in-task search engine (created only if crawling is on). |
Privacy: IP and user-agent data in `Resources` may be personal data. Handle according to your organization’s policies and applicable laws (for example GDPR).
The Task folder contains a backup of the task configuration.
The Data folder stores a per-worker snapshot of everything the system recorded.
Each worker has a JSON file whose top level is an array. The download script adds one object per batch the worker participated in.
The source_* attributes reference the originating DynamoDB tables and the local source path.
```json
[
{
"source_path": "result/Your_Task/Data/ABEFLAGYVQ7IN4.json",
"source_data": "Crowd_Frame-Your-Task_Your-Batch_Data",
"source_acl": "Crowd_Frame-Your-Task_Your-Batch_ACL",
"source_log": "Crowd_Frame-Your-Task_Your-Batch_Logger",
"task": {
"...": "..."
},
"worker": {
"id": "ABEFLAGYVQ7IN4"
},
"ip": {
"...": "..."
},
"uag": {
"...": "..."
},
"checks": [
"..."
],
"questionnaires_answers": [
"..."
],
"documents_answers": [
"..."
],
"comments": [
"..."
],
"logs": [
"..."
],
"questionnaires": {
"...": "..."
},
"documents": {
"...": "..."
},
"dimensions": {
"...": "..."
}
}
]
```

The Resources folder contains two JSON files per worker.
For worker ABEFLAGYVQ7IN4, these are ABEFLAGYVQ7IN4_ip.json and ABEFLAGYVQ7IN4_uag.json.
- `<worker>_ip.json`: reverse lookup of IPs (geolocation, provider, headers).
- `<worker>_uag.json`: parsed user-agent details (browser/OS/device).
Examples (subset):
```json
{
    "203.0.113.42": {
        "country_name": "Kenya",
        "country_code_iso2": "KE",
        "region_name": "Nairobi",
        "timezone_name": "Africa/Nairobi",
        "provider_name": "Example ISP",
        "content_type": "text/html; charset=utf-8",
        "status_code": 200
    }
}
```

```json
{
    "Mozilla/5.0 (Linux; Android 11; ... )": {
        "browser_name": "Chrome",
        "browser_version": "115.0",
        "os_family": "Android",
        "device_brand": "Samsung",
        "device_type": "mobile",
        "device_max_touch_points": 5,
        "ua_type": "mobile-browser"
    }
}
```

The Crawling folder stores captures of the web pages retrieved by the in-task search engine.
Crawling is optional and is enabled via the enable_crawling environment variable.
- The download script creates two subfolders: `Metadata/` and `Source/`.
- Each retrieved page is assigned a UUID (for example `59c0f70f-c5a6-45ec-ac90-b609e2cc66d7`).
- The script attempts to download the page source. If successful, the raw content is saved to `Source/<UUID>_source.<ext>` (the extension depends on the content type).
- Metadata for each fetch is written to `Metadata/<UUID>_metadata.json` (always; success or failure).
```
result/
└─ Crawling/
   ├─ Metadata/
   │  ├─ 59c0f70f-c5a6-45ec-ac90-b609e2cc66d7_metadata.json
   │  └─ ...
   └─ Source/
      ├─ 59c0f70f-c5a6-45ec-ac90-b609e2cc66d7_source.html
      └─ ...
```
Each `<UUID>_metadata.json` includes, at minimum:

- `response_uuid`, `response_url`, `response_timestamp`
- `response_status_code`, `response_error_code` (if any)
- `response_content_type`, `response_content_length`, `response_encoding`
- `response_source_path`, `response_metadata_path`
- a `data` object with selected response headers
```json
{
    "attributes": {
        "response_uuid": "59c0f70f-c5a6-45ec-ac90-b609e2cc66d7",
        "response_url": "...",
        "response_timestamp": "...",
        "response_error_code": null,
        "response_source_path": "...",
        "response_metadata_path": "...",
        "response_status_code": 200,
        "response_encoding": "utf-8",
        "response_content_length": 125965,
        "response_content_type": "text/html; charset=utf-8"
    },
    "data": {
        "date": "Wed, 08 Jun 2022 22:33:12 GMT",
        "content_type": "text/html; charset=utf-8",
        "content_length": "125965",
        "...": "..."
    }
}
```

Notes
- Non-HTML resources (PDF, images) are saved with the appropriate extension. Metadata is still JSON.
- If crawling is disabled, the `Crawling/` directory is not created.
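A small sketch for pairing metadata with saved sources after a download. `<task_name>` is your task's name; the attribute names come from the metadata example above, and the source-file lookup relies on the `<UUID>_source.<ext>` naming convention.

```python
import json
from pathlib import Path

crawling = Path("data/result/<task_name>/Crawling")
for meta_path in sorted((crawling / "Metadata").glob("*_metadata.json")):
    attributes = json.loads(meta_path.read_text())["attributes"]
    # Source files are named <UUID>_source.<ext>; the extension varies by content type.
    source = next(
        (crawling / "Source").glob(f"{attributes['response_uuid']}_source.*"), None
    )
    print(attributes["response_status_code"], attributes["response_url"], "->", source)
```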
The Dataframe folder contains a refined, tabular view of each worker snapshot.
Data are loaded into DataFrames (2-D tables with labeled rows and columns) and exported as CSV files. The number of exported files (up to about 10) depends on your configuration.
Granularity varies by file. For example, `workers_urls` has one row per result returned by the search engine for each query, element, and try, while `workers_answers` has one row per element and try with the values for the evaluation dimensions. Use care when interpreting each file’s grain.
The following fragments show (i) a sample access-control snapshot for a single worker and (ii) two answer rows for two elements of an assigned HIT.
```csv
worker_id,paid,task_id,batch_name,unit_id,try_last,try_current,action,time_submit,time_submit_parsed,doc_index,doc_id,doc_fact_check_ground_truth_label,doc_fact_check_ground_truth_value,doc_fact_check_source,doc_speaker_name,doc_speaker_party,doc_statement_date,doc_statement_description,doc_statement_text,doc_truthfulness_value,doc_accesses,doc_time_elapsed,doc_time_start,doc_time_end,global_outcome,global_form_validity,gold_checks,time_spent_check,time_check_amount
ABEFLAGYVQ7IN4,False,Task-Sample,Batch-Sample,unit_1,1,1,Next,"Wed, 09 Nov 2022 10:19:16 GMT",2022-11-09 10:19:16 00:00,0.0,conservative-activist-steve-lonegan-claims-social-,false,1,Politifact,Steve Lonegan,REP,2022-07-12,"stated on October 1, 2011 in an interview on News 12 New Jersey's Power & Politics show:","Today, the Social Security system is broke.",10,1,2.1,1667989144,1667989146.1,False,False,False,False,False
ABEFLAGYVQ7IN4,False,Task-Sample,Batch-Sample,unit_1,1,1,Next,"Wed, 09 Nov 2022 10:19:25 GMT",2022-11-09 10:19:25 00:00,1,yes-tax-break-ron-johnson-pushed-2017-has-benefite,true,5,Politifact,Democratic Party of Wisconsin,DEM,2022-04-29,"stated on April 29, 2022 in News release:","The tax carve out (Ron) Johnson spearheaded overwhelmingly benefited the wealthiest, over small businesses.",100,1,10.27,1667989146.1,1667989156.37,False,False,False,False,False
```
Rules of thumb (keep in mind when analyzing; a pandas sketch follows this list):

- `paid` appears in most files. Use it to separate completed vs. not-completed work. Failures can be insightful.
- `batch_name` appears in some files. Analyze results per batch when needed.
- `try_current` and `try_last` (where present) split data by attempts. `try_last` marks the most recent. Account for multiple tries per worker.
- `action` (when present) is one of `Back`, `Next`, `Finish`. Only `Next` and `Finish` rows reflect the latest answer for an element.
- `index_selected` (in `workers_urls`) marks results the worker clicked (`-1` means not selected). A value of `4` means three results had already been selected, `7` means six, and so on.
- `type` (in `workers_logs`) identifies the log record type. Logs are globally time-sorted.
- `workers_acl` holds useful worker-level info. Join it to other files on `worker_id`.
- `workers_urls` lists all retrieved results; `workers_crawling` contains crawling info. Join them on `response_uuid`.
- `workers_dimensions_selection` shows the time order in which answers were chosen. Rows for one worker can be interleaved with others if multiple workers act concurrently.
- `workers_comments` contains final comments. It is optional, so it may be empty.
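A hedged pandas sketch applying the rules above: final answers only from `workers_answers.csv`, and clicked results joined to crawled pages. Paths assume the `data/result/<task_name>/Dataframe/` layout described earlier; column names come from this README.

```python
import pandas as pd

base = "data/result/<task_name>/Dataframe"
answers = pd.read_csv(f"{base}/workers_answers.csv")

# Completed work only, most recent try only, and only rows carrying final answers.
final = answers[
    answers["paid"]
    & (answers["try_current"] == answers["try_last"])
    & answers["action"].isin(["Next", "Finish"])
]
print(final.groupby("worker_id").size())

# Join clicked search results to crawled pages on response_uuid.
urls = pd.read_csv(f"{base}/workers_urls.csv")
crawling = pd.read_csv(f"{base}/workers_crawling.csv")
clicked = urls[urls["index_selected"] != -1]
joined = clicked.merge(crawling, on="response_uuid", how="left")
```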
Produced files (may vary by configuration):
| Dataframe | Description |
|---|---|
| `workers_acl.csv` | Snapshots of raw access/control data per worker. |
| `workers_ip_addresses.csv` | IP address information per worker. |
| `workers_user_agents.csv` | Parsed user-agent information per worker. |
| `workers_answers.csv` | Answers per evaluation dimension. |
| `workers_documents.csv` | Elements evaluated during the task. |
| `workers_questionnaire.csv` | Questionnaire answers. |
| `workers_dimensions_selection.csv` | Temporal order of dimension selections. |
| `workers_notes.csv` | Text annotations from workers. |
| `workers_urls.csv` | Search queries and retrieved results. |
| `workers_crawling.csv` | Data about crawled web pages. |
| `workers_logs.csv` | Logger events produced during the task. |
| `workers_comments.csv` | Final comments from workers. |
| `workers_mturk_data.csv` | MTurk worker/assignment exports. |
| `workers_prolific_study_data.csv` | Prolific study/submission exports. |
| `workers_prolific_demographic_data.csv` | Prolific worker demographics. |
| `workers_toloka_data.csv` | Toloka project/submission exports. |
- Paths with special characters
  Avoid special characters in project paths (for example `º`). Some tools (CLI, Docker, Python) may fail to resolve these paths reliably.
- VS Code and working directory
  VS Code terminals and tasks can start in the wrong working directory. Run scripts from an external shell (Terminal, PowerShell, cmd) at the repo root (or the expected subfolder), or set `File > Preferences > Settings > terminal.integrated.cwd`.
- `.env` filename
  The file must be named exactly `.env` (not `.env.txt`). On Windows, enable “File name extensions” in Explorer; on macOS, use “Get Info” to verify the name.
- ImportError: cannot import name `VerifyTypes` from `httpx._types` (toloka-kit)
  Pin `httpx` to `<0.28` (for example `0.27.2`), which is compatible with `toloka-kit==1.2.3`:

  ```bash
  pip install "httpx==0.27.2" "httpcore<2,>=1.0" --upgrade
  ```

  This is already handled if you install from the provided `requirements.txt`.
- How do I reset a task that is blocked or unresponsive?
  If workers cannot continue or the admin panel will not accept changes, you can restore a clean state by deleting the task’s records from the DynamoDB tables: `*_ACL`, `*_Data`, `*_Logger`. This unlocks the task so you can reconfigure or redeploy.
  Warning: Deleting these records is irreversible and permanently erases progress and submissions. Do this only if you intend to restart from scratch.
- Docker on Windows (`pypiwin32` error): on certain Windows-based Python distributions, the `docker` package triggers the following exception because the `pypiwin32` dependency fails to run its post-install script:

  ```
  NameError: name 'NpipeHTTPAdapter' is not defined. Install pypiwin32 package to enable npipe:// support
  ```

  To fix this, run the following command from an elevated command prompt:

  ```bash
  python your_python_folder/Scripts/pywin32_postinstall.py -install
  ```
Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/dev-branch`)
- Commit your Changes (`git commit -m 'Add some Feature'`)
- Push to the Branch (`git push origin feature/dev-branch`)
- Open a Pull Request
- Source-only releases — tagged versions publish the repository source as GitHub Releases; no prebuilt `dist/` is shipped.
- CI scope — CI runs on pull requests:
  - `quality`: type-check (no emit) and lint
  - `build`: Angular production build (runs only if `quality` passes)
- Rationale — builds depend on user-provided environment and platform settings (AWS buckets, base href, etc.), so prebuilt bundles are not portable.
This software was presented at the Fifteenth ACM International Conference on Web Search and Data Mining (WSDM 2022).
If you use Crowd_Frame in your research, please cite the paper below. A repository-level citation file (CITATION.cff) is included, so you can also use GitHub’s Cite this repository button to
export in multiple formats.
DOI: https://doi.org/10.1145/3488560.3502182
```bibtex
@inproceedings{conference-paper-wsdm2022,
author = {Soprano, Michael and Roitero, Kevin and Bombassei De Bona, Francesco and Mizzaro, Stefano},
title = {Crowd_Frame: A Simple and Complete Framework to Deploy Complex Crowdsourcing Tasks Off-the-Shelf},
booktitle = {Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining},
series = {WSDM '22},
year = {2022},
pages = {1605--1608},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
doi = {10.1145/3488560.3502182},
url = {https://doi.org/10.1145/3488560.3502182},
isbn = {9781450391320},
keywords = {framework, crowdsourcing, user behavior}
}
```

Soprano, M., Roitero, K., Bombassei De Bona, F., & Mizzaro, S. (2022). Crowd_Frame: A Simple and Complete Framework to Deploy Complex Crowdsourcing Tasks Off-the-Shelf. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (WSDM ’22) (pp. 1605–1608). Association for Computing Machinery. https://doi.org/10.1145/3488560.3502182