I am building an expert evaluation instrument (Google Form) for a cybersecurity research paper being submitted to ACSAC 2026 (a CORE A-ranked applied security conference). The system automatically maps Cyber Threat Intelligence (CTI) report sentences to MITRE ATT&CK techniques using a neural reranker, achieving 94.52% Precision@1. I need 2-3 independent CTI practitioners to evaluate 50 system outputs. The evaluation has two parts: (1) an appropriateness rating section where evaluators see a CTI passage and the system's predicted ATT&CK technique and rate whether the mapping is appropriate (42 queries), and (2) a blind A/B comparison section where evaluators see a CTI passage and two competing technique mappings (the system's prediction vs. the gold-standard human annotation) in randomized order and choose which is more appropriate (8 queries where the system disagreed with the gold standard). I need comprehensive guidance on ALL of the following:
SURVEY DESIGN: Best practices for Likert scale design in expert evaluation studies: should I use 3-point, 5-point, or 7-point scales? How to word questions to avoid acquiescence bias and leading questions. Question order effects and whether to randomize. Whether to include attention-check questions (e.g., a deliberately wrong mapping to verify evaluators are paying attention). How to handle evaluator fatigue for a 50-question survey: optimal survey length, whether to break into sessions, warmup questions.
BLIND COMPARISON DESIGN: Best practices for A/B comparison studies in NLP/information retrieval evaluation. How to counterbalance randomization (force equal A/B assignment vs. true random). How to prevent evaluators from identifying which option is the system output vs. human annotation from subtle cues. Whether evaluators should complete both sections or whether different evaluators should do different sections to prevent cross-contamination.
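The "force equal A/B assignment" option mentioned under BLIND COMPARISON DESIGN can be sketched as follows. This is only an illustration, not an established protocol: the function name `counterbalanced_positions`, the variable `system_slot`, and the seed value are hypothetical. It fixes the number of items where the system output appears as option A and randomizes only the order.

```python
import random


def counterbalanced_positions(n_items, seed):
    """Assign the system output to slot 'A' or 'B' for each item.

    Half of the items (rounded down) place the system output in slot 'A';
    the remaining items place it in slot 'B'. Only the order is random,
    so the A/B split is balanced by construction.
    """
    n_a = n_items // 2
    positions = ["A"] * n_a + ["B"] * (n_items - n_a)
    # A seeded generator keeps the assignment reproducible for the paper.
    random.Random(seed).shuffle(positions)
    return positions


# 8 disagreement queries, as in the study description; the seed is arbitrary.
system_slot = counterbalanced_positions(8, seed=2026)
```

With true random assignment, 8 items could easily end up 6/2 or 7/1, which is why a fixed split may be preferable for so few comparisons. The mapping from item to slot needs to be stored outside the form so answers can be decoded afterwards.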
NLP/IR EXPERT EVALUATION STANDARDS: How many annotators/evaluators are considered sufficient for expert evaluation at top venues (ACL, EMNLP, ACSAC, USENIX Security). Standard inter-annotator agreement metrics (Cohen's kappa, Fleiss' kappa, Krippendorff's alpha): which to use and what thresholds constitute good/moderate/poor agreement. How to report expert evaluation results in a conference paper: standard phrasing, expected detail level, whether to include the evaluation instrument as an appendix.
MITRE ATT&CK EVALUATION PRECEDENT: How prior ATT&CK mapping papers (TTPDrill, AttacKG, TTPXHunter, rcATT, TRAM) evaluated their systems. How they handled the inherent ambiguity where multiple ATT&CK techniques can be valid for one CTI passage. Whether any established protocols exist for expert evaluation of ATT&CK mappings specifically.
GOOGLE FORMS SPECIFICS: Whether Google Forms supports question randomization within sections. Maximum question/description length limits. How to format long technique descriptions readably. Whether respondents can be prevented from going back to change answers. How to structure multi-section forms with different question types per section.
STATISTICAL ANALYSIS: How to compute agreement rates and Cohen's kappa from a small expert evaluation (2-3 evaluators, 50 queries). Confidence intervals for small-sample agreement studies. How to handle the case where evaluators agree with each other but disagree with the gold standard (which would support our annotation ambiguity hypothesis). Appropriate statistical tests for comparing evaluator-system agreement vs. evaluator-gold agreement.
ETHICS AND CONSENT: Whether a small expert evaluation like this requires IRB approval at a US university. Standard consent/disclosure language for evaluators. How to acknowledge evaluators in the paper (named vs. anonymous). ACSAC's specific requirements for human evaluation studies, if any.
PRESENTING RESULTS FOR MAXIMUM IMPACT: How to frame expert evaluation findings for an applied security venue that values deployment readiness. The most impactful way to present results when expert agreement exceeds gold-standard agreement (supporting the annotation ambiguity hypothesis). Standard sentence templates used in security papers for reporting expert evaluation outcomes.
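For the agreement statistics asked about under STATISTICAL ANALYSIS, a minimal sketch of raw agreement, Cohen's kappa for two raters, and a Wilson score interval for a small-sample proportion might look like this. The ratings below are made-up toy data, not study results, and the names `cohens_kappa`, `wilson_interval`, `rater_1`, and `rater_2` are hypothetical. With three evaluators, pairwise kappas or Fleiss' kappa would be needed instead.

```python
import math
from collections import Counter


def cohens_kappa(ratings_1, ratings_2):
    """Cohen's kappa for two raters labelling the same items."""
    if len(ratings_1) != len(ratings_2) or not ratings_1:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_1)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(ratings_1, ratings_2)) / n
    # Chance agreement from each rater's marginal label frequencies.
    c1, c2 = Counter(ratings_1), Counter(ratings_2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (p_o - p_e) / (1 - p_e)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Toy ratings for 10 items (illustrative only, NOT real evaluation data).
rater_1 = ["ok", "ok", "ok", "bad", "ok", "bad", "ok", "ok", "bad", "ok"]
rater_2 = ["ok", "ok", "bad", "bad", "ok", "bad", "ok", "ok", "ok", "ok"]

n_agree = sum(a == b for a, b in zip(rater_1, rater_2))
kappa = cohens_kappa(rater_1, rater_2)
low, high = wilson_interval(n_agree, len(rater_1))
```

The Wilson interval is used here as one common choice for small samples; it is an assumption of this sketch, not a venue requirement. When nearly all mappings are rated appropriate, kappa can be low despite high raw agreement, so reporting both numbers is usually more informative.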
| metric | Gemini | Parallel |
|---|---|---|
| format | prose | prose |
| word count | 3,373 | 1,648 |
| sources | 0 | 0 |
| processing time | 0s | 244s |
| has images | no | no |
| has tables | no | no |
| citation style | — | — |