Liferature: Writing Life with Protein Language Models

I. Beyond Representation: Writing Matter

Increasingly sophisticated protein language models don’t just analyze proteins, they create them.

The ancient human urge to capture life in symbols now inverts: symbols themselves generate life. This shift from representing to creating marks a transition from literature to what might be called "liferature" — the writing of life itself, not merely its description.

Ancient mystical texts speak of words that create worlds; contemporary protein language models translate creation myths into biological reality. The boundary between author and creator blurs.

Evolution of Protein Language Models

PLMs didn't just spontaneously emerge, they have been growing in strength and dexterity alongside generative AI.

As early as 2011 Computational Protein Design existed. In 2020, ProGen: Language Modeling for Protein Generation emerged; followed by ProGen2: Exploring the Boundaries of Protein Language Models in June 2022. A month later, ProtGPT2: a deep unsupervised language model for protein design.

AlphaFold, while not primarily a generative model, revolutionized protein structure prediction in 2018; and again with AlphaFold2 in 2020, demonstrating that AI could solve a 50-year-old problem in biochemistry. In May 2024, DeepMind announced that AlphaFold 3 predicts the structure and interactions of all of life's molecules.

ESM (Evolutionary Scale Modeling) appeared in 2021 from Meta AI Research (formerly Facebook AI), applying transformer architecture to protein sequences. In June 2024, a spinoff company formed by members of the original ESM paper released Evolutionary Scale · ESM3: Simulating 500 million years of evolution with a language model "With ESM3 we were able to design esmGFP, a novel version of the Green Fluorescent Protein. Generated by ESM3 with chain-of-thought prompting, esmGFP is a vast evolutionary departure from natural fluorescent proteins. It would have taken nature 500 million years to evolve this protein."

February 2024, ForceGen: End-to-end de novo protein generation based on nonlinear mechanical unfolding responses using a language diffusion model "maps mechanical unfolding responses to create proteins. Via full-atom molecular simulations for direct validation, we demonstrate that the designed proteins are de novo, and fulfill the targeted mechanical properties, including unfolding energy and mechanical strength, as well as the detailed unfolding force-separation curves. Our model offers rapid pathways to explore the enormous mechanobiological protein sequence space unconstrained by biological synthesis"

EVO (Evolutionary-scale Variation Optimizer), released in November 2024 by Arc Institute with Stanford, learned protein design principles from evolutionary microbial data, generating functional proteins. EVO 2, released in preprint on Feb 19, 2025 by Arc Institute with Stanford and NVIDIA, dramatically scaled capabilities with 7B and 40B parameter versions, demonstrating that the scaling laws observed in text models apply to protein sequence modeling. The research title offers: Genome modeling and design across all domains of life with Evo 2 (on bioRxiv). And Arc states: "Evo 2 is a genomic foundation model capable of generalist prediction and design tasks across DNA, RNA, and proteins. It uses a frontier deep learning architecture to enable modeling of biological sequences at single-nucleotide resolution with near-linear scaling of compute and memory relative to context length."

Proteina (API released March 4th 2025 by NVIDIA) focuses specifically on therapeutic applications, targeting proteins that interact with disease-relevant biological pathways. Proteina is a SOTA model. It achieves "state-of-the-art designable and diverse protein backbone generation performance." The March 2nd, 2025 prepress, Proteina: Scaling Flow-based Protein Structure Generative Models, states "Recently, diffusion- and flow-based generative models of protein structures have emerged as a powerful tool for de novo protein design. Here, we develop Proteina, a new large-scale flow-based protein backbone generator that utilizes hierarchical fold class labels for conditioning and relies on a tailored scalable transformer architecture with up to 5x as many parameters as previous models … Proteina achieves state-of-the-art performance on de novo protein backbone design and produces diverse and designable proteins at unprecedented length, up to 800 residues."

This rapid progression reveals not simply technological advancement but an ontological shift — AI tools of symbolic-genomic manipulation directly authoring biological reality.

Hallucinations as Productive Creativity in Science and Literature

The tendency of AI systems to hallucinate — to generate plausible but factually incorrect information — parallels processes long essential to literature.

The fever dreams that produced Samuel Taylor Coleridge's "Kubla Khan", the mystical visions that inspired William Blake, Virginia Woolf's flights of sensorial intensity, Clarice Lispector's existential dissolutions, and William Burroughs' cut-up technique intentionally disrupted logical sequence to access hidden meanings operating at the boundary between hallucination and insight.

Protein language models, likewise, may occasionally generate sequences that seem plausible but wouldn't function — yet these "hallucinations" occasionally stumble upon viable structures that natural evolution never discovered. Errors probe the latent space.

Neural efflorescence also gave birth to numerous scientific breakthroughs, -- benzene's structure in August Kekulé's dream, DNA's double-helix structure in Rosalind Franklin's intuition, transposons envisioned by Barbara McClintock (the Nobel Prize-winning geneticist whose work on maize cytogenetics fundamentally transformed our understanding of genetic transposition), or mitochondria imagined by Lynn Margulis as primal symbiogenesis.

Paradigm-shifting discoveries in science, and radical works of literature/art dissolve the boundary between hallucination and insight. What appears as error from one perspective represent essential exploration from another.

Literary Pioneers of Biological Writing

The merging of biology and literature isn't entirely unprecedented. SymbioticA, the biological arts research laboratory at the University of Western Australia, has fostered explorations of living systems as artistic media since 2000. Eduardo Kac's transgenic art has, since the late 1990s, used genetic engineering as a literary and artistic medium. Kac's Genesis (1999) involved "a synthetic gene that was created by Kac" that mutated (via online participants turning a UV light on) over the course of the exhibit.

Most ambitious perhaps is poet Christian Bök's Xenotext project, begun in 2007 and still ongoing. Bök encoded a poem into the DNA of an extremophile bacterium, engineered so the organism's cellular machinery would produce a protein that encodes a response poem. This decade-plus endeavor represents perhaps the most literal attempt to create a living poem, one that metabolizes and reproduces through biological processes.

These pioneers anticipated what protein language models now make increasingly accessible: the ability to write in, on and with the alphabet of life itself.

Liferature: Speculative Futures

As autonomous, agentic artificial general intelligence merges with advanced protein language models, entirely new creative domains become possible. Future systems will design not just individual proteins but entire cellular systems, tissues, or even organisms that develop and respond to their environment according to authored principles.

What would a "novel" look like if composed of engineered cellular systems that develop, respond, and evolve according to designed narratives? Perhaps future literary experiences will unfold biologically, where plot, character, and theme emerge from actual metabolic processes rather than symbolic representation.

This suggests a profound shift in our understanding of authorship. If traditional literature encodes sensory and emotional experiences into abstract symbols for others to decode, liferature might directly compose experiences into living matter, creating narratives that unfold through actual biological processes.

The shift from literature to liferature marks an ontological transformation. Symbols no longer merely represent life but generate it. As protein language models evolve from prediction to creation, the ancient dream of "the word made flesh" realized through computational biology extends authorship beyond meaning-making into matter-making, and reading becomes not interpretation but witnessing.

II. Implications: Literary Theory Meets Biological Reality

The transition from literature to liferature challenges fundamental concepts in literary theory. Roland Barthes' "death of the author" takes on new dimensions when the text becomes a living entity with its own emergent behaviors. Donna Haraway's concept of "material-semiotic actors" anticipates this merger where signs and cell-signaling become indistinguishable.

Traditional literary analysis has long deployed biological metaphors – we speak of texts having a "body," of narratives that "evolve," of meanings that "reproduce." In liferature, these metaphors become literal. Literary critic becomes quasi-biologist, analyzing how authored genetic sequences express themselves through metabolism and development.

Ethical Dimensions

If literature has traditionally been valued for its capacity to extend empathy through representation, what responsibilities emerge when creation replaces representation? What constitutes responsible liferature authorship?

Many questions arise. Here are a few:

What rights exist for biologically authored entities?
How do we evaluate the aesthetic or literary merit of liferature?
What boundaries, if any, will/should exist between symbolic representation and biological creation?

Ontological Challenge

Beyond practical ethics lies a more profound ontological challenge: if written symbols can generate living matter, the hierarchy that privileged ideas over matter, mind over body, author over text begins to dissolve. The philosopher Gilbert Simondon's concept of technological "individuation" becomes relevant here, as does Jane Bennett's vibrant matter, and ancient philosophical systems of non-duality (Advaita, Dzog Chen, Tao) – the idea that beings are not discrete entities but ongoing processes of becoming, where information and matter continually shape each other.

Liferature suggests a radical continuity between symbolic codification and biological phenotypes – a perspective more aligned with Indigenous and Eastern philosophical traditions that have long recognized the generative capacity of language to shape reality, not merely represent it.

Selected References

Abramson, Josh, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, et al. (2024). Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3. Nature 630, no. 8016 (June 2024): 493–500. https://doi.org/10.1038/s41586-024-07487-w.

Albanese, Katherine I., Sophie Barbe, Shunsuke Tagami, Derek N. Woolfson, and Thomas Schiex. (2025). Computational Protein Design. Nature Reviews Methods Primers 5, no. 1 (February 27, 2025): 1–28. https://doi.org/10.1038/s43586-025-00383-1.

Anishchenko, Ivan, Samuel J. Pellock, Tamuka M. Chidyausiku, Theresa A. Ramelot, Sergey Ovchinnikov, Jingzhou Hao, Khushboo Bafna, et al. (2021). De Novo Protein Design by Deep Network Hallucination. Nature 600, no. 7889 (December 2021): 547–52. https://doi.org/10.1038/s41586-021-04184-w.

Blaabjerg, Lasse M., Nicolas Jonsson, Wouter Boomsma, Amelie Stein, and Kresten Lindorff-Larsen. (2024). SSEmb: A Joint Embedding of Protein Sequence and Structure Enables Robust Variant Effect Predictions. Nature Communications 15, no. 1 (November 7, 2024): 9646. https://doi.org/10.1038/s41467-024-53982-z.

Callaway, Ewen. (2025). Meet the 'Woolly Mouse': Why Scientists Doubt It's a Big Step towards Recreating Mammoths. Nature, March 4, 2025. https://doi.org/10.1038/d41586-025-00684-1.

Chen, Rui, Kanokwan Srirattana, Melissa L. Coquelin, Rafael Vilar Sampaio, Raphael Wilson, Rakesh Ganji, Jacob Weston, et al. (2025). Multiplex-Edited Mice Recapitulate Woolly Mammoth Hair Phenotypes. bioRxiv, March 4, 2025. https://doi.org/10.1101/2025.03.03.641227.

Correia, Bruno E., Yih-En Andrew Ban, Della J. Friend, Katharine Ellingson, Hengyu Xu, Erica Boni, Tyler Bradley-Hewitt, et al. (2011). Computational Protein Design Using Flexible Backbone Remodeling and Resurfacing: Case Studies in Structure-Based Antigen Design. Journal of Molecular Biology 405, no. 1 (January 7, 2011): 284–97. https://doi.org/10.1016/j.jmb.2010.09.061.

Dauparas, J., I. Anishchenko, N. Bennett, H. Bai, R. J. Ragotte, L. F. Milles, B. I. M. Wicky, et al. (2022). Robust Deep Learning–Based Protein Sequence Design Using ProteinMPNN. Science 378, no. 6615 (October 7, 2022): 49–56. https://doi.org/10.1126/science.add2187.

"Evo: DNA Foundation Modeling from Molecular to Genome Scale | Arc Institute," (2024). February 27, 2024. https://arcinstitute.org/news/blog/evo.

"Evolutionary Scale · ESM3: Simulating 500 Million Years of Evolution with a Language Model." (2025). Accessed March 5, 2025. https://www.evolutionaryscale.ai/blog/esm3-release.

"EvolutionaryScale Launches with ESM3: A Milestone AI Model for Biology," (2024). June 25, 2024. https://www.businesswire.com/news/home/20240625717839/en/EvolutionaryScale-Launches-with-ESM3-A-Milestone-AI-Model-for-Biology.

Ferruz, Noelia, Steffen Schmidt, and Birte Höcker. (2022). ProtGPT2 Is a Deep Unsupervised Language Model for Protein Design. Nature Communications 13, no. 1 (July 27, 2022): 4348. https://doi.org/10.1038/s41467-022-32007-7.

Gligorijević, Vladimir, P. Douglas Renfrew, Tomasz Kosciolek, Julia Koehler Leman, Daniel Berenberg, Tommi Vatanen, Chris Chandler, et al. (2021). Structure-Based Protein Function Prediction Using Graph Convolutional Networks. Nature Communications 12, no. 1 (May 26, 2021): 3168. https://doi.org/10.1038/s41467-021-23303-9.

Hsu, Chloe, Robert Verkuil, Jason Liu, Zeming Lin, Brian Hie, Tom Sercu, Adam Lerer, and Alexander Rives. (2022). Learning Inverse Folding from Millions of Predicted Structures. In Proceedings of the 39th International Conference on Machine Learning, 8946–70. PMLR, 2022. https://proceedings.mlr.press/v162/hsu22a.html.

Levine, Daniel, Syed Asad Rizvi, Sacha Lévy, Nazreen Pallikkavaliyaveetil, David Zhang, Xingyu Chen, Sina Ghadermarzi, et al. (2024). Cell2Sentence: Teaching Large Language Models the Language of Biology. bioRxiv, October 29, 2024. https://doi.org/10.1101/2023.09.11.557287.

Lin, Zeming, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, et al. (2023). Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model. Science 379, no. 6637 (March 17, 2023): 1123–30. https://doi.org/10.1126/science.ade2574.

Listov, Dina, Casper A. Goverde, Bruno E. Correia, and Sarel Jacob Fleishman. (2024). Opportunities and Challenges in Design and Optimization of Protein Function. Nature Reviews Molecular Cell Biology 25, no. 8 (August 2024): 639–53. https://doi.org/10.1038/s41580-024-00718-y.

Madani, Ali, Ben Krause, Eric R. Greene, Subu Subramanian, Benjamin P. Mohr, James M. Holton, Jose Luis Olmos, et al. (2023). Large Language Models Generate Functional Protein Sequences across Diverse Families. Nature Biotechnology 41, no. 8 (August 2023): 1099–1106. https://doi.org/10.1038/s41587-022-01618-2.

Madani, Ali, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R. Eguchi, Po-Ssu Huang, and Richard Socher. (2020). ProGen: Language Modeling for Protein Generation. arXiv, March 8, 2020. https://doi.org/10.48550/arXiv.2004.03497.

Nguyen, Eric, Michael Poli, Matthew G. Durrant, Brian Kang, Dhruva Katrekar, David B. Li, Liam J. Bartie, et al. (2024). Sequence Modeling and Design from Molecular to Genome Scale with Evo. Science 386, no. 6723 (November 15, 2024): eado9336. https://doi.org/10.1126/science.ado9336.

Ni, Bo, David L. Kaplan, and Markus J. Buehler. (2024). ForceGen: End-to-End de Novo Protein Generation Based on Nonlinear Mechanical Unfolding Responses Using a Language Diffusion Model. Science Advances 10, no. 6 (February 7, 2024): eadl4000. https://doi.org/10.1126/sciadv.adl4000.

Nijkamp, Erik, Jeffrey Ruffolo, Eli N. Weinstein, Nikhil Naik, and Ali Madani. (2022). ProGen2: Exploring the Boundaries of Protein Language Models. arXiv, June 27, 2022. https://doi.org/10.48550/arXiv.2206.13517.

Pinson, Anneline, Lei Xing, Takashi Namba, Nereo Kalebic, Jula Peters, Christina Eugster Oegema, Sofia Traikov, et al. (2022). Human TKTL1 Implies Greater Neurogenesis in Frontal Neocortex of Modern Humans than Neanderthals. Science 377, no. 6611 (September 9, 2022): eabl6422. https://doi.org/10.1126/science.abl6422.

"Proteína: Scaling Flow-Based Protein Structure Generative Models." (2025). Accessed March 5, 2025. https://research.nvidia.com/labs/genair/proteina/.

Quijano-Rubio, Alfredo, Hsien-Wei Yeh, Jooyoung Park, Hansol Lee, Robert A. Langan, Scott E. Boyken, Marc J. Lajoie, et al. (2021). De Novo Design of Modular and Tunable Protein Biosensors. Nature 591, no. 7850 (March 2021): 482–87. https://doi.org/10.1038/s41586-021-03258-z.

Rao, Roshan, Joshua Meier, Tom Sercu, Sergey Ovchinnikov, and Alexander Rives. (2020). Transformer Protein Language Models Are Unsupervised Structure Learners. 2020. https://openreview.net/forum?id=fylclEqgvgd.

Riesenberg, Stephan, Nelly Helmbrecht, Philipp Kanis, Tomislav Maricic, and Svante Pääbo. (2022). Improved gRNA Secondary Structures Allow Editing of Target Sites Resistant to CRISPR-Cas9 Cleavage. Nature Communications 13, no. 1 (January 25, 2022): 489. https://doi.org/10.1038/s41467-022-28137-7.

Rives, Alexander, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, et al. (2021). Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences. Proceedings of the National Academy of Sciences 118, no. 15 (April 13, 2021): e2016239118. https://doi.org/10.1073/pnas.2016239118.

Shanker, Varun R., Theodora U. J. Bruun, Brian L. Hie, and Peter S. Kim. (2024). Unsupervised Evolution of Protein and Antibody Complexes with a Structure-Informed Language Model. Science 385, no. 6704 (July 5, 2024): 46–53. https://doi.org/10.1126/science.adk8946.

Shin, Jung-Eun, Adam J. Riesselman, Aaron W. Kollasch, Conor McMahon, Elana Simon, Chris Sander, Aashish Manglik, Andrew C. Kruse, and Debora S. Marks. (2021). Protein Design and Variant Prediction Using Autoregressive Generative Models. Nature Communications 12, no. 1 (April 23, 2021): 2403. https://doi.org/10.1038/s41467-021-22732-w.

Strokach, Alexey, David Becerra, Carles Corbi-Verge, Albert Perez-Riba, and Philip M. Kim. (2020). Fast and Flexible Protein Design Using Deep Graph Neural Networks. Cell Systems 11, no. 4 (October 21, 2020): 402-411.e4. https://doi.org/10.1016/j.cels.2020.08.016.

Watson, Joseph L., David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E. Eisenach, Woody Ahern, et al. (2023). De Novo Design of Protein Structure and Function with RFdiffusion. Nature 620, no. 7976 (August 2023): 1089–1100. https://doi.org/10.1038/s41586-023-06415-8.

Wu, Zachary, Kadina E. Johnston, Frances H. Arnold, and Kevin K. Yang. (2021). Protein Sequence Design with Deep Generative Models. Current Opinion in Chemical Biology, Mechanistic Biology * Machine Learning in Chemical Biology, 65 (December 1, 2021): 18–27. https://doi.org/10.1016/j.cbpa.2021.04.004.

Xie, Wen Jun, and Arieh Warshel. (2023). Harnessing Generative AI to Decode Enzyme Catalysis and Evolution for Enhanced Engineering. bioRxiv, October 12, 2023. https://doi.org/10.1101/2023.10.10.561808.

Yang, Xiaodong, Guole Liu, Guihai Feng, Dechao Bu, Pengfei Wang, Jie Jiang, Shubai Chen, et al. (2023). GeneCompass: Deciphering Universal Gene Regulatory Mechanisms with Knowledge-Informed Cross-Species Foundation Model. bioRxiv, September 28, 2023. https://doi.org/10.1101/2023.09.26.559542.

APPENDIX: Molecular Linguistic Creation: Protein Language Models as Poetics in the Post-AlphaFold Era

The chronological unfolding of protein language models between 2023-2025 reveals not merely technical progression but a transformation in our relationship to biological matter. This period witnesses the emergence of what might be termed a molecular poetics—where computational systems do not simply predict but actively compose within the grammar of life itself.

The evolutionary trajectory begins with AlphaFold's remarkable contribution to structure prediction, which, while revolutionary, remained fundamentally interpretive rather than generative. AlphaFold's approach to protein structure resembled a critic capable of deciphering textual meaning with unprecedented precision, yet unable to author original works.

The paradigmatic shift toward generative capacity emerged through RFdiffusion Joint (Baydin et al., 2023), which introduced a dialogical relationship between sequence and structure—no longer conceived as sequential translations but as simultaneous manifestations of a unified molecular logic. This conceptual reframing established the theoretical groundwork for subsequent innovations.

ForceGen (Gligorijevic et al., 2024) extended this integrative approach further by dissolving traditional boundaries between sequence, structure, and function. Where previous models maintained these as separate, if related, domains, ForceGen reconceived them as inseparable dimensions of protein physics—akin to how contemporary linguistic theory increasingly resists artificial divisions between syntax, semantics, and pragmatics.

The release of AlphaFold 3 marked another crucial inflection point—achieving atomic-level accuracy at genomic scale. Yet its primary orientation remained predictive rather than generative, establishing a comprehensive lexicon of protein structures without fundamentally addressing the compositional challenge.

Against this backdrop, the 2025 release of NVIDIA's Proteina represents a profound methodological innovation. By reconceptualizing proteins through foundation model principles, Proteina achieves what might be termed molecular fluency—a capacity to generate novel protein sequences with functional intent.

The contemporaneous evolution of Evo2 from Arc Institute establishes a complementary approach, one deeply rooted in evolutionary logic yet transcending mere recapitulation. Where Proteina might be characterized as embracing a more architectural approach to protein design, Evo2 embodies an evolutionary perspective—not merely mimicking natural selection's products but internalizing its processes. The system engages with proteins not as static entities but as manifestations of dynamic evolutionary trajectories.

Considered collectively, these developments suggest a profound epistemological shift. Where AlphaFold represented a remarkable achievement in prediction, the ForceGen-Proteina-Evo2 epoch initiates what might be termed a computational biogenesis—the capacity not merely to understand but to participate in the ongoing creation of the protein universe. The computational linguist becomes molecular poet; the algorithm becomes co-creator in the continuing evolution of biological possibility.

This transformation carries implications beyond technical achievement, inviting philosophical reconsideration of fundamental categories like natural/artificial and found/designed. As these systems continue their development, we witness not simply technological evolution but the emergence of new modalities of biological creativity—where computational intelligence becomes an active participant in the composition of life itself.