Prompts as Functions: Reliable, Reusable, and Ready to Run
Tired of copy-pasting prompts? Frustrated by inconsistent results? Here’s how I built a reliable(ish) ANOVA workflow using GPT-4o’s built-in Python stack—pandas, statsmodels, and plotly.
TL;DR: Got tired of the repetitive grind of basic stats analysis (like ANOVA). Discovered GPT-4's Code Interpreter has a surprisingly rich set of pre-installed Python data science libraries (pandas, statsmodels, plotly, etc.). Had an idea: Could writing hyper-specific prompts that explicitly call these known libraries make the AI's analysis execution more reliable and consistent? Built an experiment: a detailed text prompt that acts like a callable function, guiding the AI through a full one-way ANOVA (data loading, assumption checks, ANOVA, post-hoc, interactive plots, code annex) with just the prompt file + data file as input. Result: the ANOVA Prompt Runner, aiming to automate the grunt work and make AI analysis more transparent. Check it out on GitHub!
There’s a particular kind of digital weariness I think many of us who work with data recognize. It’s not the intellectual strain of complex problem-solving, the kind that energizes you even when it's tough. No, it's the other kind: the low-grade, persistent friction of repetitive tasks. Setting up the same analysis structure, running the same assumption checks, formatting the same outputs... again, and again. It’s necessary, meticulous work, the bedrock of sound analysis, but let's be honest, it can feel like cognitive lint – accumulating slowly, clogging the mental filters needed for deeper thinking and interpretation.
The actual spark for this particular experiment didn't come from a grand strategic vision, but from a moment of simple, almost playful curiosity. I was exploring GPT-4's Code Interpreter environment – the feature that lets it run Python code – and got to wondering: what tools does it actually have access to? What's pre-installed in that sandboxed environment? So, I coaxed it into listing out its installed Python packages.
And honestly, I was surprised by the depth of the list it spat back. It wasn't just the bare essentials. It had a genuinely comprehensive toolkit for data analysis and scientific computing baked right in: pandas for data manipulation, numpy for numerical operations, scipy for scientific and technical computing (including scipy.stats), statsmodels for statistical modeling and testing, scikit-learn for machine learning tasks, a whole suite of visualization libraries like matplotlib, seaborn, and even plotly for creating interactive plots, plus things like beautifulsoup4 for web scraping... the works. It was a serious data science environment, ready to go.
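If you want to replicate that check, nothing exotic is required. Here's a minimal sketch, using only the standard library, of the kind of snippet you can ask Code Interpreter to run; it's an illustration, not the exact wording I used:

```python
# Minimal sketch: enumerate whatever packages happen to be installed in the
# sandbox. The output depends entirely on the environment it runs in.
from importlib.metadata import distributions

installed = sorted(
    {dist.metadata["Name"] for dist in distributions() if dist.metadata["Name"]},
    key=str.lower,
)
print(f"{len(installed)} packages available")
print(", ".join(installed))
```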
Seeing that list flicked a switch in my head. One of the persistent challenges with using Large Language Models for analytical tasks is their consistency and reliability. They can sometimes hallucinate functions that don't exist, misunderstand ambiguous requests, grab the wrong statistical test for the situation, or simply miss a crucial step in a workflow. We've all seen it happen.
But seeing that list of concrete, installed packages made me think: What if the ambiguity wasn't just in the AI's understanding, but in our requests? What if, instead of vaguely asking for "an ANOVA analysis," I could leverage this known, pre-existing environment with surgical precision? What if I wrote instructions that explicitly told the AI which library and function to use for each specific step? Use pandas.read_csv for loading. Use scipy.stats.levene for the homogeneity check. Use statsmodels.formula.api.ols and statsmodels.stats.anova.anova_lm for the ANOVA calculation. Use plotly.express.box for the visualization.
Could directing it so precisely, almost like dictating the import statements and function calls in a Python script but using structured natural language, force a more reliable, predictable, and reproducible execution of the entire workflow? Could it reduce the wiggle room for the AI to go off-script?
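To make that concrete, here is roughly the Python those natural-language instructions map onto. This is my own sketch for illustration, not code lifted from the prompt or from a GPT-4 transcript, and the column names ("group", "value") are placeholders for whatever your data actually contains:

```python
# A sketch of the specific calls the prompt names explicitly; not the code the
# AI generates. "group" and "value" are hypothetical column names.
import pandas as pd
import plotly.express as px
import statsmodels.formula.api as smf
from scipy.stats import levene
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("sample-data.csv")                    # pandas.read_csv for loading

# scipy.stats.levene for the homogeneity-of-variances check
samples = [g["value"].dropna() for _, g in df.groupby("group")]
levene_stat, levene_p = levene(*samples)

# statsmodels.formula.api.ols + anova_lm for the ANOVA calculation
model = smf.ols("value ~ C(group)", data=df).fit()
anova_table = anova_lm(model, typ=2)

# plotly.express.box for the interactive visualization
fig = px.box(df, x="group", y="value", points="all")
```

The point isn't that you'd run this by hand; it's that spelling out each call in the prompt leaves the model very little room to improvise.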
This led directly to the core idea, the central hypothesis of this little experiment: What if we could treat a detailed AI prompt less like a vague wish whispered into the void, and more like a callable function? Think of it like a detailed recipe handed to a skilled chef who already has a fully stocked pantry and all the necessary kitchen tools (the installed Python libraries). The recipe doesn't just say "make soup"; it specifies ingredients, quantities, tools, and the exact sequence of steps.
Could I craft such a "prompt recipe" for ANOVA? A set of instructions so detailed and unambiguous that GPT-4, leveraging its built-in tools, could take it, ingest a dataset, and perform that entire standard workflow – from data loading to assumption checks to the final report – without needing constant supervision or clarification? No manual coding steps in the chat window, no back-and-forth about which variable is which (unless truly ambiguous), no guessing about which plot to make. Just: prompt file + data file = analysis report.
That's the core idea behind a little experiment I've been tinkering with: the ANOVA Prompt Runner.
It’s essentially a text file (prompt-anova.txt in the repo) containing those hyper-specific instructions for the AI. It guides the model step by step, telling it exactly how to:
Load data (CSV or XLSX) using pandas.
Figure out (or ask for clarification on) the dependent and independent variables by inspecting data types and unique values.
Run the crucial assumption checks: Levene's test for homogeneity of variances using scipy.stats.levene, and generate an interactive Q-Q plot of residuals (derived from a statsmodels OLS fit) using plotly. (A rough Python sketch of these core statistical steps appears just after this list.)
Execute the one-way ANOVA using statsmodels and calculate Eta Squared (η²) for effect size.
Conditionally run a Tukey HSD post-hoc test using statsmodels.stats.multicomp.pairwise_tukeyhsd only if the main ANOVA result crosses the significance threshold (alpha).
Generate an interactive box plot using plotly.express.
Assemble everything – descriptive stats, assumption test results, ANOVA table, effect size, interpretations, the interactive plots – into a nicely structured and styled HTML report. And, crucially, include a full code annex showing the exact Python code the AI generated and executed for each step, ensuring transparency.
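For the statistical core of that list, here is a rough, self-contained sketch of what the effect-size calculation, Q-Q plot, and conditional post-hoc step can look like. It reuses the placeholder column names from the earlier snippet and is my illustration of the logic, not the code the prompt actually produces:

```python
# Sketch of eta squared, a plotly Q-Q plot of residuals, and a Tukey HSD that
# only runs when the omnibus ANOVA is significant. Column names are placeholders.
import pandas as pd
import plotly.graph_objects as go
import statsmodels.formula.api as smf
from scipy import stats
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

alpha = 0.05
df = pd.read_csv("sample-data.csv")
model = smf.ols("value ~ C(group)", data=df).fit()
anova_table = anova_lm(model, typ=2)

# Eta squared: between-group sum of squares over the total sum of squares.
eta_squared = anova_table.loc["C(group)", "sum_sq"] / anova_table["sum_sq"].sum()

# Interactive Q-Q plot of the residuals, assembled manually so plotly renders it.
(osm, osr), (slope, intercept, _) = stats.probplot(model.resid, dist="norm")
qq = go.Figure()
qq.add_trace(go.Scatter(x=osm, y=osr, mode="markers", name="Residuals"))
qq.add_trace(go.Scatter(x=osm, y=slope * osm + intercept, mode="lines", name="Normal reference"))

# Tukey HSD post-hoc only if the omnibus result crosses the significance threshold.
if anova_table.loc["C(group)", "PR(>F)"] < alpha:
    tukey = pairwise_tukeyhsd(endog=df["value"], groups=df["group"], alpha=alpha)
    print(tukey.summary())
```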
The intended user experience? Simplicity itself. Drag the prompt-anova.txt file and your data file (like sample-data.csv) into a ChatGPT session (needs GPT-4 with Code Interpreter enabled). Wait a moment while the AI processes the instructions. Get a link to download the resulting anova_report.html. Done. The analysis runs autonomously based on the prompt's logic.
(You can grab the prompt, sample data, and an example report from the GitHub repo.)
Now, let's be clear, this isn't about replacing dedicated statistical software like R or SPSS, nor is it about supplanting the critical thinking and domain expertise needed for proper statistical modeling and interpretation. Far from it. This experiment is aimed at something slightly different:
Modularity & Reusability: The prompt becomes the analysis module. If you have another dataset needing the same type of analysis, you reuse the same prompt file. It standardizes the process.
Reducing Cognitive Friction: It's explicitly designed to automate the grunt work, the repetitive setup and execution, freeing up valuable mental bandwidth for the parts that actually require human insight: interpreting the meaning of the results in context, assessing practical significance, planning next steps. It directly scratches that itch of wanting smoother, less draining workflows, especially beneficial for those of us (like me, with my ADHD wiring) who sometimes struggle with the activation energy required for sequences of mundane-but-necessary tasks.
Enhancing Reproducibility & Transparency: The automatically generated code annex is key here. It directly addresses a major concern with using LLMs for analysis – the "black box" problem. By showing the exact code executed, it makes the AI's process transparent, verifiable, and reproducible by others (or your future self).
Exploring a New Mode of AI Interaction: Fundamentally, it’s an exploration of how we interact with these powerful models. Can we move beyond conversational prompting towards more structured, instruction-based execution for complex, multi-step tasks? Treating the LLM less like a research assistant you chat with, and more like a highly capable, instruction-following engine that reliably executes well-defined analytical recipes. It's a step towards truly automating thought processes, or at least the mechanical execution of them.
Of course, this is very much an experiment, a version 0.1 born from personal need and sustained by curiosity. It currently only handles the specific case of one-way ANOVA between groups. It relies entirely on the AI correctly interpreting and executing the detailed steps laid out in the prompt – and honestly, wrestling the prompt into a state where GPT-4 consistently follows the logic without deviation took significant iteration and refinement. LLMs, even the best ones, still possess a certain... creative unpredictability that needs careful channeling for tasks requiring high fidelity. It also explicitly assumes the core statistical assumption of independence of observations is met by your study design, as that's something the AI can't easily check just from the data file. And there's always the possibility of edge cases or unexpected dataset quirks causing issues, requiring human oversight.
But despite the limitations, the potential feels intriguing. What other common analytical workflows could be encapsulated in this "prompt-as-function" paradigm? Could we build prompt recipes for T-tests, regressions, chi-square analyses, maybe even basic data cleaning pipelines or exploratory data analysis (EDA) routines? Could we eventually have libraries of these validated "prompt functions" tailored for specific research domains or common tasks?
I'd genuinely love to hear your thoughts on this. Does this idea of "prompts as functions" resonate with your own experiences or frustrations? Have you tried something similar? Can you see potential applications or significant pitfalls I might be overlooking? Feel free to clone the repo, try it out (carefully, perhaps with non-critical data first!), break it, improve it, and let me know how it goes.
We're all just eternal experimenters here, after all, poking at the boundaries of what's possible.
www.linkedin.com/in/raph1
https://github.com/Galent-git