Before ChatGPT: The AI Craze Hacker News Dubbed The Em Dash Obsession

The Synopsis

Before ChatGPT, Hacker News reveled in AI leaderboards. Users competed to rank AI models and agents, fueled by a desire for quantitative proof of performance. This "em dash obsession" predated large language models, reflecting an earlier era of AI development focused on specific benchmarks.

The air in the typical Hacker News thread of that era crackled with a specific kind of energy. It wasn't the frenzied excitement of a new product launch or the intellectual sparring over a complex algorithm. It was something more primal: the raw, competitive thrill of the leaderboard. Before the LLM revolution reshaped our understanding of artificial intelligence, users on the venerable tech forum were consumed by a singular obsession: ranking AI models, agents, and benchmarks. This wasn't about the cutting-edge capabilities of giants like OpenAI or Google; it was about a granular, often niche, measurement of AI performance, frequently communicated through the humble em dash.

This phenomenon, which I've come to call the "em dash obsession," manifested in dozens of "Show HN" and "Launch HN" posts. These were often simple, sometimes clunky, interfaces designed to showcase a user's attempt to quantify AI. Whether it was tracking the historical scores of Hacker News posts themselves, building arenas for OCR models, or creating leaderboards for AI agents, the drive was clear: to create a definitive ranking, a public declaration of AI supremacy, however temporary or narrow the scope.

But what fueled this pre-ChatGPT fervor? And what does its subsequent quietude tell us about the evolution of the AI landscape? To understand the present, we must revisit these digital battlegrounds where developers once warred with em dashes, seeking not just to build AI, but to prove it, publicly and quantitatively.

The Em Dash Ascendancy: Ranking AI in the Wild

Show HN: Hacker News Em Dash User Leaderboard

It began, as many internet curiosities do, with a simple proposition. A post titled "Show HN: Hacker News em dash user leaderboard pre-ChatGPT" appeared, offering a way to track the aggregated performance of Hacker News users. This wasn't about AI prowess directly, but it tapped into the same competitive spirit that would soon define the AI leaderboard craze. The thread quickly garnered attention, racking up 266 comments and 377 points on Hacker News, a sign of strong community interest in quantifiable metrics and rankings.

This foundational thread set the stage. It demonstrated a clear appetite for tools that could slice and dice data, presenting it in an easily digestible ranked format. The 'em dash' wasn't just a stylistic choice; it became a symbol of the project's DIY, no-frills approach to data presentation, a hallmark of many such 'Show HN' initiatives. It was a prelude to the more complex AI-specific leaderboards that would soon follow.
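
To make the idea of a ranked, no-frills leaderboard concrete, here is a minimal sketch in Python. It assumes one reading of the post's title, namely that users are ranked by how many em dashes appear in their comments; the article does not describe the actual scoring rule. The function name `em_dash_leaderboard`, the `(username, text)` input shape, and the sample comments are all hypothetical.

```python
from collections import Counter

EM_DASH = "\u2014"  # the em dash character, written as an escape for clarity


def em_dash_leaderboard(comments, top_n=3):
    """Rank users by the total number of em dashes in their comments.

    `comments` is an iterable of (username, text) pairs. Both the input
    shape and the scoring rule are assumptions made for illustration.
    """
    counts = Counter()
    for user, text in comments:
        counts[user] += text.count(EM_DASH)
    # Highest count first; ties are broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical sample data, not taken from Hacker News.
sample_comments = [
    ("alice", "Benchmarks matter \u2014 but only some \u2014 of the time."),
    ("bob", "I prefer commas, honestly."),
    ("carol", "One dash \u2014 that's all."),
    ("alice", "Another point \u2014 briefly."),
]

for rank, (user, count) in enumerate(em_dash_leaderboard(sample_comments), start=1):
    print(f"{rank}. {user}: {count}")
```

A real version would need a data source and a cutoff date; both are left out here because the article does not specify them.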

The Drive for Quantitative Proof

What drove this fascination with leaderboards? In an era before sophisticated LLMs became ubiquitous, developers and enthusiasts sought concrete, albeit sometimes narrow, ways to demonstrate progress. The "Show HN: Agent Skills Leaderboard" thread, for instance, with its 44 comments and 135 points, aimed to quantify the capabilities of AI agents. This reflected a desire to move beyond theoretical discussions and into demonstrable performance metrics.

These leaderboards served as public benchmarks, allowing creators to showcase their work and users to discover new tools. They provided a critical (and, in retrospect, perhaps naive) sense of objective truth in the rapidly evolving AI space. As we explored in "AI's Blazing Speed: The Dawn of Ubiquitous Intelligence," the push for quantifiable metrics was a way to map the uncharted territory of AI capabilities.

Beyond Leaderboards: AI's Tangible, Often Humorous, Experiments

Our LLM-Controlled Office Robot Fails to Pass Butter

The AI leaderboard obsession wasn't the only manifestation of the community's engagement with early AI. Humorous, yet revealing, experiments also captured the zeitgeist. "Our LLM-controlled office robot can't pass butter" became a viral post (117 comments, 229 points on Hacker News), illustrating the gap between AI's potential and its practical, everyday applications. This story highlighted the limitations of even advanced AI when faced with simple, real-world tasks.

The 'butter-passing' anecdote, much like the leaderboard craze, underscored a period of intense experimentation and a collective effort to understand AI's boundaries. It was a time of both impressive feats and comical failures, a period where the very definition of 'intelligence' was being tested in every conceivable way, from high-level agent skills to the simple mechanics of a robot arm.

Playing with OCR Models: The Arena Approach

The impulse to rank and compare extended to more specialized AI domains. "Show HN: OCR Arena – A playground for OCR models" (63 comments, 216 points on Hacker News) presented a platform for users to test and compare various Optical Character Recognition systems. This initiative provided a valuable resource for developers working with text recognition technologies.

The OCR Arena was a microcosm of the broader trend: a focused effort to create a controlled environment where different AI models could be directly compared. This approach, while specific, mirrored the ambition of the more general AI leaderboards – to provide a clear, data-driven hierarchy of performance in a burgeoning field.
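
The article does not say how OCR Arena turns comparisons into a ranking. One common approach on arena-style sites is to collect head-to-head votes and update Elo ratings; the sketch below illustrates that technique only, and should not be read as OCR Arena's method. The model names, the starting rating of 1000, the K-factor of 32, and the votes are all hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Update ratings in place after a single head-to-head vote."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)  # a surprising win moves ratings more
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical models and votes; each vote is (winner, loser).
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]

for winner, loser in votes:
    update_elo(ratings, winner, loser)

leaderboard = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Because every update adds to one rating exactly what it subtracts from the other, the total rating pool stays constant, which makes the scores comparable only relative to each other.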

The Illusion of the Leaderboard

When Rankings Mislead

Not all comparisons, however, provided genuine insight. The post "The Leaderboard Illusion" (51 comments, 184 points on Hacker News) cautioned against over-reliance on such rankings. It argued that a single leaderboard often fails to capture the full complexity of AI performance, especially when benchmarks are too narrow or easily gamed.

This critique was prescient. As AI capabilities expanded dramatically with the advent of powerful LLMs, the utility of simple, unified leaderboards diminished. The nuances of model architecture, training data, and specific task performance became paramount, rendering a single "best" score increasingly meaningless. This is a theme we've touched upon in "AI's Blazing Speed: The Dawn of Ubiquitous Intelligence."

The Shifting Sands of AI Metrics

The more recent post "LLM leaderboard – Comparing models from OpenAI, Google, DeepSeek and others" (39 comments, 64 points on Hacker News) signals a more sophisticated, though still evolving, approach to model comparison. Unlike the earlier, more DIY leaderboards, this one focuses on directly comparing LLMs from major players.

However, even these comparisons face inherent challenges. The "best" LLM can depend heavily on the specific task. A model excelling at creative writing might falter in logical reasoning, and vice versa. This complexity is why understanding specialized benchmarks, like the one discussed in "Show HN: DesignArena – crowdsourced benchmark for AI-generated UI/UX" (29 comments, 89 points on Hacker News), becomes crucial. The leaderboard craze, it seems, was a stepping stone towards more nuanced AI evaluation.
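
The point that the winner depends on the task can be shown with a few lines of Python. This is a toy sketch: the model names and scores are invented for illustration and do not come from any real benchmark.

```python
# Hypothetical per-task scores in the range 0..1 (higher is better).
scores = {
    "model_x": {"creative_writing": 0.90, "logical_reasoning": 0.60},
    "model_y": {"creative_writing": 0.70, "logical_reasoning": 0.85},
}


def rank_by_task(scores, task):
    """Order models from best to worst on a single task."""
    return sorted(scores, key=lambda model: scores[model][task], reverse=True)


def rank_by_mean(scores):
    """Order models by their unweighted average across all tasks."""
    def mean(model):
        values = scores[model].values()
        return sum(values) / len(values)
    return sorted(scores, key=mean, reverse=True)


for task in ("creative_writing", "logical_reasoning"):
    print(task, "->", rank_by_task(scores, task))
print("overall mean ->", rank_by_mean(scores))
```

With these made-up numbers, the single averaged ranking puts one model on top even though the other is clearly stronger at creative writing, which is the kind of detail a unified leaderboard hides.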

Training AI for Long Horizons

The Challenge of Long-Horizon Agents

The quest to measure AI performance also extended to the development of more capable AI agents. "Show HN: Terminal-Bench-RL: Training long-horizon terminal agents with RL" (12 comments, 125 points on Hacker News) tackled a significant challenge in AI: training agents to perform complex, multi-step tasks over extended periods. Reinforcement learning (RL) is key here: the agent learns complex behaviors from reward signals through trial and error, loosely analogous to how humans learn.
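
As a rough illustration of trial-and-error learning with a delayed reward, the sketch below runs tabular Q-learning on a toy corridor where the only reward comes at the far end. It is not the method used by Terminal-Bench-RL, which the article does not describe; the environment, the hyperparameters, and the function name `train` are all assumptions chosen to keep the example small.

```python
import random

N_STATES = 6        # positions 0..5; the only reward is for reaching state 5
ACTIONS = (-1, +1)  # action 0 steps left, action 1 steps right


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a Q-table for the corridor with epsilon-greedy Q-learning."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore
            else:
                best = max(q[state])       # exploit, breaking ties at random
                action = rng.choice([i for i, v in enumerate(q[state]) if v == best])
            # Move, clamping to the ends of the corridor.
            nxt = min(max(state + ACTIONS[action], 0), N_STATES - 1)
            done = nxt == N_STATES - 1
            reward = 1.0 if done else 0.0
            # Q-learning target: immediate reward plus discounted best future value.
            target = reward if done else gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


q_table = train()
# Greedy policy for every non-terminal state.
policy = ["right" if right > left else "left" for left, right in q_table[:-1]]
print(policy)
```

The reward is only seen at the last step, so its value has to propagate backwards through the table over many episodes. That delay is a small-scale version of what makes long-horizon tasks hard.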

This pursuit of "long-horizon" capabilities is critical for developing AI that can handle intricate workflows, which matters more as AI agents take on complex task execution. As explored in "Frontier AI Agents Are Breaking Rules: The KPI Problem Exposed," sustaining performance across long, complex tasks remains a significant hurdle.

The Role of Benchmarking Hubs

Projects like "Launch HN: Strata (YC X25) – One MCP server for AI to handle thousands of tools" (66 comments, 133 points on Hacker News) represent a more ambitious vision: infrastructure designed to manage and deploy a vast array of AI tools. Such platforms are essential for the systematic testing and benchmarking of AI agents and models at scale.

The development of standardized platforms and benchmarks is crucial for accelerating AI progress. It moves the field beyond individual experiments toward a more collaborative and systematic understanding of what AI can do and, importantly, what it struggles with. This infrastructure is vital for the kind of widespread AI adoption discussed in "AI Everywhere: Your Path to a Ubiquitous Future."

The Evolution of AI Measurement

From DIY to Sophisticated Platforms

The "Hacker News historic upvote and score data" post (45 comments, 78 points on Hacker News) offered another layer of meta-analysis, providing data on the platform's own content performance. While not directly an AI benchmark, it speaks to the community's persistent interest in quantifiable, ranked data.

This evolution from individual 'Show HN' projects to more integrated platforms and sophisticated leaderboards reflects the maturing of the AI field itself. What started as a grassroots fascination with ranking has steadily transformed into a more established ecosystem of evaluation and comparison.

The Post-ChatGPT Landscape

The seismic shift brought about by ChatGPT fundamentally altered the AI landscape. Suddenly, the focus wasn't on niche leaderboards but on the general-purpose capabilities of large language models. The 'em dash obsession' gradually faded, replaced by a new set of benchmarks and a broader understanding of AI's potential and limitations.

While the specific leaderboards may have receded from Hacker News's front page, the underlying desire for clear, measurable progress in AI remains. The current era demands more sophisticated evaluation methods, moving beyond simple rankings to assess safety, ethics, and nuanced performance across a vast array of tasks, a challenge explored in "OpenAI Removed "Safely" from Mission: A New Era for AI Development?"

Popular AI Leaderboards and Benchmarking Projects (Pre-ChatGPT Era)

Platform	Pricing	Best For	Main Feature
Show HN: Hacker News em dash user leaderboard	Free	Tracking user engagement on Hacker News	User score aggregation
Show HN: Agent Skills Leaderboard	Free	Evaluating AI agent capabilities	Quantified skill assessment
Show HN: OCR Arena	Free	Comparing OCR model performance	Model benchmarking playground
LLM leaderboard	Free	Comparing major LLMs	Model performance comparison
Show HN: Terminal-Bench-RL	Free	Training long-horizon AI agents	Reinforcement learning benchmark

Frequently Asked Questions

What was the "Hacker News AI leaderboard craze"?

The craze, which this article calls the "em dash obsession," was a trend observed on Hacker News before the widespread adoption of powerful LLMs like ChatGPT. It describes the community's intense interest in leaderboards and quantitative rankings for various AI projects, agents, and even user performance, often presented with an informal, DIY aesthetic symbolized by the em dash.

Why were Hacker News users so interested in AI leaderboards?

Hacker News users, being technically inclined, have a strong interest in quantifiable metrics and objective comparisons. In the pre-ChatGPT era, leaderboards provided a way to demonstrate and compare the progress and capabilities of nascent AI technologies in a competitive and easily digestible format. It tapped into the developer community's inherent drive for benchmarks and performance analysis.

Did these leaderboards accurately reflect AI capabilities?

Not always. As highlighted by "The Leaderboard Illusion" discussion, these rankings could sometimes be misleading. Performance on specific, narrow benchmarks doesn't always translate to broad real-world utility. Moreover, benchmarks could be gamed or become outdated quickly as AI technology advanced. The complexity of AI performance meant that a single leaderboard often failed to tell the whole story.

What happened to these AI leaderboards after ChatGPT?

The advent of powerful, general-purpose LLMs like ChatGPT shifted the focus of AI development and discussion. While benchmarking remains crucial, the emphasis moved from niche, DIY leaderboards to more comprehensive evaluations of large language models and their sophisticated capabilities. The 'em dash' era of intensely specific AI rankings largely gave way to broader, more complex comparisons and discussions about AI's societal impact and safety.

Are there still AI benchmarking projects today?

Yes, AI benchmarking is more active than ever, but it has evolved significantly. Projects like the LLM leaderboard mentioned in the article are common, comparing major models on a variety of tasks. Additionally, there's a growing focus on evaluating AI safety, ethics, and real-world applicability, moving beyond simple performance metrics. Tools and platforms for specialized AI tasks, like those for UI/UX generation or agent training, also continue to be developed.

How was the "Hacker News em dash user leaderboard" different from AI-specific leaderboards?

The 'Hacker News em dash user leaderboard' focused on ranking the collective activity and impact of users on the Hacker News platform itself, rather than evaluating AI models directly. However, it shared the same spirit of quantitative comparison and ranking that characterized the AI-specific leaderboards of the time, indicating a broader community interest in such metrics.

Sources

Show HN: Hacker News em dash user leaderboard pre-ChatGPT (news.ycombinator.com)
Our LLM-controlled office robot can't pass butter (news.ycombinator.com)
Show HN: OCR Arena – A playground for OCR models (news.ycombinator.com)
The Leaderboard Illusion (news.ycombinator.com)
Show HN: Agent Skills Leaderboard (news.ycombinator.com)
Launch HN: Strata (YC X25) – One MCP server for AI to handle thousands of tools (news.ycombinator.com)
Show HN: Terminal-Bench-RL: Training long-horizon terminal agents with RL (news.ycombinator.com)
Show HN: DesignArena – crowdsourced benchmark for AI-generated UI/UX (news.ycombinator.com)
Show HN: Hacker News historic upvote and score data (news.ycombinator.com)
LLM leaderboard – Comparing models from OpenAI, Google, DeepSeek and others (news.ycombinator.com)

Explore more trends in AI development and benchmarking on AgentCrunch.