How we measure and define performance shapes the systems we build. Current Natural Language Processing (NLP) benchmarks often prioritize leaderboard scores over practical utility, failing to capture how models behave in real-world, socially situated contexts. This PhD project treats evaluation methodology as a research problem in its own right, advancing both the conceptual foundations and the computational tools for assessing NLP systems in ways that are reliable, valid, and human-centred. Application domains include multilingual settings such as machine translation, as well as emerging agentic and interactive multimodal NLP systems involving human-AI collaboration, which present frontier evaluation challenges. The ultimate goal is to develop evaluation frameworks that capture not only overall system performance but also real-world utility, fairness, and responsiveness to the needs of diverse stakeholders.