Interspecies Communication—Fantasy or Reality?

Steven Shepard

If you would like to listen to this as an audio essay, complete with the calls of marine mammals, please go here.

In the inaugural issue of National Geographic in 1888, Gardiner Hubbard wrote, “When we embark on the great ocean of discovery, the horizon of the unknown advances with us and surrounds us wherever we go. The more we know, the greater we find is our ignorance.” Hubbard was the founder and first president of the National Geographic Society, the first president of the Bell Telephone Company, and a founder of the prestigious journal Science. He knew what he was talking about.

104 years later, NASA scientist Chris McKay had this to say: “If some alien called me up and said, ‘Hello, this is Alpha, and we’d like to know what kind of life you have,’ I’d say, water-based. Earth organisms figure out how to make do without almost anything else. The single non-negotiable thing life requires is water.”

Several years ago I met Alaska-based sound recordist and author Hank Lentfer. During one of several conversations, he asked me to imagine that all the knowledge we each have is ‘inside of this circle,’ which he drew in the air. “Everything outside the circle is what we don’t know,” he told me. “But as we learn, the circle gets bigger as more knowledge is added to it. And notice that as the circle grows, so does its circumference, which touches more of what we don’t know.” In other words, the more we know, the more we don’t. As Hubbard said in 1888, “The more we know, the greater we find is our ignorance.”

I love science, and one of the things I love most about it is how it constantly refreshes itself in terms of what’s new and fresh and exciting and worthy of exploration as a way to add to what’s inside Hank’s circle. Think about it: Over the last twenty years, scientists have mapped the entire human genome; developed CRISPR/Cas9 gene editing; created synthetic cells and DNA; discovered the Higgs boson, gravitational waves, and water on Mars; developed cures or near-cures for HIV, some forms of cancer, and hepatitis C; and created functional AI and reusable rockets. And those are just the things I chose to include.

One of the themes in my new novel, “The Sound of Life,” is interspecies communication—not in a Doctor Dolittle kind of way (that’s silly), but in a more fundamental way, using protocols that involve far more listening on our part than speaking.

The ability to communicate with other species has long been a dream among scientists, which is why I’m beyond excited by the fact that we are close to engaging in a form of two-way communication with other species. So, I want to tell you a bit about where we are, how we got here, and why 2026 is widely believed to be the year we make contact, to steal a line from Arthur C. Clarke.

Some listeners may remember when chimps, bonobos, and gorillas were taught American Sign Language with varying degrees of success. Koko the gorilla, for example, who was featured on the cover of National Geographic, learned to use several dozen American Sign Language signs, but the degree to which she actually understood what she was saying remains a hotly debated topic more than 50 years later.

But today is a very different story. Until now, research in interspecies communication has been based on trying to teach human language to non-human species. Current efforts turn that model on its head: researchers are using Artificial Intelligence to meet animals on their own terms—making sense of their natural communications rather than forcing them to use ours. Said another way, it’s time we learned to shut up and listen for a change. And that’s what researchers are doing.

There have been significant breakthroughs in the last few years, many of them the result of widely available AI that can be trained to search for patterns in non-human communications. Now, before I go any further with this, I should go on record. Anybody who’s a regular listener to The Natural Curiosity Project Podcast knows that I don’t take AI with a grain of salt—I take it with a metric ton of it. As a technologist, I believe that AI is being given far more credit than it deserves. I’m not saying it won’t get there—far from it—but I think humans should take a collective breath here. 

I’ve also gone on record many times with the observation that ‘AI’ as an abbreviation has been assigned to the wrong words. Instead of standing for Artificial Intelligence, I think AI should stand for Accelerated Insight, because that’s the deliverable it makes available to us when we use it properly. It takes long, slow, complex, and, let’s face it, boring jobs (usually jobs that involve searching for patterns in a morass of data) and makes them enormously faster. Here’s an example that I’ve used many times. A dermatologist who specializes in skin cancers has thousands of photographs of malignant skin lesions. She knows that the various forms of skin cancer can differ in terms of growth rate, shape, color, topology, texture, surface characteristics, and a host of other identifiers. She wants to look at these lesions collectively to find patterns that might link cause to disease. Now she has a choice. She can sit down at her desk with a massive stack of photographs and a notepad, and months later she may have identified repeating patterns. Or she can ask an AI model to do it for her and get the same results in five minutes. It’s all about speed.
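For the technically curious, here’s a rough sketch of what that kind of machine-assisted pattern-finding can look like in code. It’s purely illustrative: the folder name is made up, the features are crude color histograms, and no dermatologist would stop here, but it shows how a few lines can group thousands of images into recurring patterns in minutes rather than months.

```python
# A minimal, illustrative sketch of machine-assisted pattern finding.
# Assumes a folder of lesion photographs called "lesions/" (a made-up path).
from pathlib import Path

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def simple_features(path, bins=16):
    """Reduce one photograph to a small color-histogram feature vector."""
    img = np.asarray(Image.open(path).convert("RGB").resize((128, 128)))
    hist = [np.histogram(img[..., c], bins=bins, range=(0, 255), density=True)[0]
            for c in range(3)]
    return np.concatenate(hist)

paths = sorted(Path("lesions").glob("*.jpg"))
X = np.stack([simple_features(p) for p in paths])

# Group visually similar lesions; the clusters are the "repeating patterns"
# a human reviewer would otherwise hunt for by hand.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
for path, label in zip(paths, labels):
    print(label, path.name)
```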

That’s a perfect application for AI, because it takes advantage of AI’s ability to quickly and accurately identify patterns hidden within a chaos of data. And that’s why research into interspecies communication today is increasingly turning to AI as a powerful tool—with many promising outcomes, and a few spellbinding surprises. 

Let’s start with the discovery of the “Sperm Whale Phonetic Alphabet.” Project CETI, the Cetacean Translation Initiative, has produced what bioacoustics researchers are calling the “Rosetta Stone” for marine interspecies communication. Here’s what we know. Researchers have identified structural elements in the sounds generated by sperm whales that are similar to human vowels, like a, e, i, o, and u, and diphthongs, like the ‘ow’ in the word sound, the ‘oy’ in noise, and the ‘oo’ in tour. They’ve also identified a characteristic called “rubato,” a measurable variation in tempo that conveys meaning, and another called “ornamentation,” the addition of extra clicks. Together, these suggest that sperm whales may have what’s called a combinatorial grammar that could transmit enormous amounts of information. Combinatorial grammar: let me explain. “He had a bad day” is a perfectly acceptable statement. “He had a no-good, horrible, terrible, very bad day” is an example of combinatorial grammar. It adds richness and nuance to whatever’s being said.

This is the first time researchers have ever found a non-human communication system that relies on the same kinds of phonetic building blocks that human speech relies on. This is a very big deal.

So: Using machine learning, scientists have analyzed almost 9,000 codas, which are uniquely identifiable click sequences, and in the process have discovered that sperm whale communication is enormously more complicated and nuanced than we previously believed.

So, how did they do it? Well, in the same way that ChatGPT is trained on huge databases of human text, new models are being trained on the sounds of the natural world. For example, NatureLM-audio is a system launched by the Earth Species Project in 2025. It’s the first large audio-language foundation model built specifically for bioacoustics. Not only can it identify species, it can also determine an animal’s life stage and its emotional state at the time it was recorded—for example, whether the creature was stressed, playing, relaxed, and so on. And it can do this across thousands of species, simultaneously.

Then there’s WhAM, the Whale Acoustics Model. This is a transformer-based model that can generate synthetic, contextually accurate whale codas, which could someday lead to two-way, real-time engagement with whales.

I should probably explain what a transformer-based model is, because it’s important. In bioacoustics, a transformer-based model uses a technique called the self-attention mechanism, borrowed from natural language processing, to analyze animal sounds. The self-attention mechanism asks a question: To truly understand the context and meaning of this particular word (or in this case, sound), what do I need to know about the other words that are being used by the speaker at the same time? This lets the system capture long-range patterns in audio spectrograms, which allows for highly accurate species identification, sound event detection (like bird calls or bat echolocation), and other identifiers, especially when the data to be analyzed is limited. Models like the Audio Spectrogram Transformer and custom systems like animal2vec convert captured audio into small segments called patches, then process them to identify patterns.

In bioacoustics—such as studying the meaning and context of whale song, or in the case of animal2vec, the vocalizations of meerkats—the raw audio is converted into visual representations called spectrograms, which display the changing frequency of the recording against elapsed time. These are then broken into smaller patches. Each patch then gets a unique “position” tag so the model knows the order of the sounds in the sequence. This is called Positional Encoding.
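If you’re curious what that looks like in practice, here’s a minimal sketch of the spectrogram-to-patches-to-positional-encoding step using PyTorch and torchaudio. The file name, patch size, and model dimensions are placeholders I made up for illustration; this is not the actual code behind the Audio Spectrogram Transformer or animal2vec.

```python
# A minimal sketch: raw audio -> spectrogram -> patches -> positional encoding.
# Sizes are illustrative only; "whale_clip.wav" is a hypothetical file.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("whale_clip.wav")

# 1. Raw audio -> mel spectrogram (frequency content over elapsed time).
spec = torchaudio.transforms.MelSpectrogram(sample_rate, n_mels=64)(waveform)
spec = spec.log1p()[0]                        # shape: (64 mel bins, T frames)

# 2. Slice the spectrogram into fixed-size patches along the time axis.
patch_frames, d_model = 8, 128
T = spec.shape[1] - spec.shape[1] % patch_frames
patches = spec[:, :T].reshape(64, -1, patch_frames).permute(1, 0, 2)
patches = patches.reshape(patches.shape[0], -1)    # (num_patches, 64 * 8)

# 3. Project each patch to the model dimension and add a positional encoding
#    so the model knows where each patch sits in the sequence.
embed = torch.nn.Linear(64 * patch_frames, d_model)
pos = torch.nn.Embedding(patches.shape[0], d_model)
tokens = embed(patches) + pos(torch.arange(patches.shape[0]))
print(tokens.shape)   # (num_patches, d_model), ready for self-attention
```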

Next, the system unleashes the Self-Attention Mechanism, which allows the model to weigh the importance of different sound patches relative to each other, creating a better understanding of context and relationships across long audio segments.
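Here’s a minimal sketch of that weighing step, again with stand-in numbers rather than real whale data. The attention weights it prints are exactly those “how much does this patch matter to that one” scores.

```python
# A minimal sketch of self-attention over patch tokens (batch of 1).
# The weights say, for each patch, how much every other patch matters.
import torch

d_model, num_patches = 128, 20
tokens = torch.randn(1, num_patches, d_model)     # stand-in patch embeddings

attn = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=4,
                                   batch_first=True)
contextual, weights = attn(tokens, tokens, tokens, need_weights=True)

print(contextual.shape)  # (1, 20, 128): each patch, re-expressed in context
print(weights.shape)     # (1, 20, 20): patch-to-patch attention weights
print(weights[0, 5])     # how much patch 5 "listens to" every other patch
```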

The next step is Feature Extraction. The model learns deep, complex acoustic features, such as the subtle nuances in bird songs or bat calls, which can be tied to different species or behaviors.

Finally, the model classifies the sounds, in the process identifying unique species, or detecting specific identifiable events, such as a predator call.
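And here’s a sketch of those last two steps, feature extraction and classification, using a small transformer encoder and a linear classification head. The class names are made-up examples, and the random tensor stands in for the patch tokens produced earlier.

```python
# A minimal sketch of feature extraction and classification: a small
# transformer encoder pools the patch tokens into one feature vector, and a
# linear head maps it to a few made-up example classes.
import torch

d_model, num_patches = 128, 20
classes = ["sperm_whale_coda", "bird_call", "bat_echolocation"]  # illustrative

encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                     batch_first=True),
    num_layers=2,
)
head = torch.nn.Linear(d_model, len(classes))

tokens = torch.randn(1, num_patches, d_model)     # patch tokens from earlier
features = encoder(tokens).mean(dim=1)            # deep acoustic features
logits = head(features)                           # one score per class
print(classes[logits.argmax(dim=-1).item()])
```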

The implications of all this are significant. First, contextual understanding is created because the system captures long-range dependencies among the audio patches, which are crucial for understanding complex animal vocalizations. Second, these systems outperform every other model type, including Convolutional Neural Networks, which are considered to be at the forefront of AI learning. Third, they work well in what is called a Few-Shot Learning environment, one in which the amount of labeled data available for analysis is limited. And finally, because of the Self-Attention Mechanism, the system offers a high degree of interpretability.

These are tools that have utility well beyond the moonshot project of interspecies communication. They can be used to monitor wildlife populations through sound alone; they can detect and identify bird and bat calls; and they can even be used to identify bee species from the frequency and tonality of the buzz their wings create when they fly past a microphone. Remember: Karl von Frisch won a Nobel Prize for deciphering the dance of the honeybee and how that dance conveys complex, species-specific information to other members of the hive.

All of these are important in the world of ecology and habitat monitoring and protection. 

Here’s another fascinating example that has gotten a lot of attention. Recent field studies have revealed remarkably sophisticated vocal behavior in both elephants and carrion crows. In the case of elephants, researchers have used machine learning tools to show that wild African elephants use unique vocal labels to address each other, a form of naming. Unlike dolphins, who mimic other dolphins’ whistles, elephants appear to use arbitrary names for other elephants—a sign of advanced abstract thought.

In the case of crows, researchers using miniature biologgers—essentially tiny microphones and recorders about the size of a pencil eraser that are attached to wild animals—have discovered that carrion crows have a secret “low-volume” vocabulary that they use for intimate family communication, very different from the loud, raucous sounds that are used for territory protection and alarm calls.

Finally, we’re seeing breakthroughs in animal welfare practices in the farming and ranching industries because of bioacoustics. In the poultry business, for example, a “chicken translator” is now in use that can identify specific distress calls, which allows farmers to locate sick or stressed birds among thousands, significantly improving the welfare of the flock.

Before I continue with this discussion, let’s talk about why all this is happening now, in the final days of 2025, and why scientists believe we may be on the verge of a major breakthrough in interspecies communications. It has to do with three factors. 

First, we have Big Data, both as a theory and as a hard practice. The idea that patterns can be found in massive volumes of data has been around for a while, but we’re just now developing reliable tools that can predictably find those patterns and make sense of them. Initiatives like the Earth Species Project are aggregating millions of hours of animal audio into a single database, which can then be analyzed.

Second, we have data aggregation techniques and mechanisms that allow for data to be collected around the clock, regardless of climate or weather. The tiny biologgers I mentioned earlier are examples, as are weatherproof field recorders that can record for weeks on a single memory card and set of batteries.

Finally, we have one of the basic characteristics of AI, which is unsupervised learning—the ability to find patterns in vast stores of data without being told what to look for.

I’m going to add a fourth item to this list, which is growing professional recognition that sound is as good an indicator of ecosystem details as sight. I may not be able to see that whale in the ocean, but I can hear it, which means it’s there.

Okay, let’s move on and talk about the nitty-gritty: how do those sperm whale vowel sounds that I described earlier actually work? And to make sure you know what I’m talking about, here’s what they sound like. This recording comes from Mark Johnson, and it can be found at the “Discovery of Sound in the Sea” Web site.

Amazing, right? Some scientists say it’s the loudest natural sound in the ocean. Anyway, to answer this question about how the sperm whale vowel sounds work, we have to stop thinking about sound as a “message” and start looking at its internal architecture. Here’s what I mean. For decades, researchers believed that the clicks made by sperm whales, the codas, were like Morse Code: a simple sequence of on/off pulses, kind of like a binary data transmission. However, in 2024 and 2025, Project CETI discovered that the clicks made by sperm whales have a sophisticated internal structure that functions exactly the way vowels do in human speech.

In the same way that human speech is made up of small units of sound called phonemes, whale codas are characterized by four specific “acoustic dimensions.” By analyzing thousands of hours of recorded whale song, researchers using AI determined that whales mix and match these dimensions to create thousands of unique signals. The four dimensions are rhythm, the basic pattern of the clicks; tempo, the overall speed of the coda; rubato, the subtle stretching or squeezing of time between clicks; and ornamentation, the addition of short “extra” clicks at the end of a sequence, similar to a suffix or punctuation mark.
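To make the “mix and match” idea concrete, here’s a toy sketch of a coda described by those four dimensions. The category counts are numbers I invented for illustration, not Project CETI’s published figures; the point is simply how fast a handful of building blocks multiplies into a large space of distinct signals.

```python
# A toy sketch of a coda described by the four acoustic dimensions.
# The category counts below are illustrative only, not CETI's figures.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Coda:
    rhythm: str          # basic click pattern
    tempo: str           # overall speed of the coda
    rubato: str          # stretching/squeezing of inter-click timing
    ornamentation: bool  # extra click appended, like a suffix

rhythms = [f"rhythm_{i}" for i in range(18)]   # made-up counts
tempos = [f"tempo_{i}" for i in range(5)]
rubatos = ["none", "slight", "strong"]

codas = [Coda(r, t, ru, o)
         for r, t, ru, o in product(rhythms, tempos, rubatos, [False, True])]
print(len(codas))   # 18 * 5 * 3 * 2 = 540 distinct signals from a few parts
```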

That discovery was a game-changer, and it really knocked the researchers back on their heels. But the most important discovery, which happened in late 2024, was the identification of formants in whale speech. In human language, formants are the specific resonant frequencies created in the throat and mouth that result in the A, E and O vowel sounds. Well, researchers discovered that whales use their “phonic lips,” which are vocal structures in their nose, to modulate the frequency of their clicks in the same way that humans do with their lips and mouth. For example, the a-vowel is a click with a specific resonant frequency peak. The i-vowel is a click with two distinct frequency peaks. Whales can even “slide” the frequency in the middle of a click to create a rising or falling sound similar to the “oi” in “noise” or the “ou” in “trout.” These are called diphthongs.
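Here’s a tiny sketch of what “finding a vowel” means acoustically: count the resonant peaks in a click’s frequency spectrum. The clicks below are synthetic, built from decaying sine waves, so the whole thing runs without any whale recordings; real analysis is far more careful than this.

```python
# A minimal sketch of vowel-like resonance detection on synthetic "clicks."
# One dominant spectral peak stands in for an a-like click, two for i-like.
import numpy as np
from scipy.signal import find_peaks

fs = 48_000
t = np.arange(0, 0.01, 1 / fs)                    # a 10 ms "click"
decay = np.exp(-t * 400)
one_peak = decay * np.sin(2 * np.pi * 3_000 * t)                  # "a-like"
two_peaks = decay * (np.sin(2 * np.pi * 3_000 * t)
                     + np.sin(2 * np.pi * 9_000 * t))             # "i-like"

def count_resonances(click):
    """Count prominent peaks in the click's magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(click))
    peaks, _ = find_peaks(spectrum, height=spectrum.max() * 0.3)
    return len(peaks)

print(count_resonances(one_peak))   # ~1 dominant resonance
print(count_resonances(two_peaks))  # ~2 dominant resonances
```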

So, how does this actually work? It turns out that whale vocalization is based on what linguists call Source-Filter Theory, and the similarities to human speech are eerie. In human speech, air passes through the vocal cords to create sound; in whales, it passes through what are called phonic lips. In human speech, variation is accomplished by changing the shape of the mouth and tongue; in sperm whales, it happens using the spermaceti organ and nasal sacs.

In humans, the result is recognizably unique vowels, like A, E, I, O, and U; in whales, the result is a variety of spectral patterns. And in terms of complexity, there isn’t much difference between the two. Humans generate thousands of words; whales generate thousands of codas.
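For the curious, here’s a bare-bones sketch of Source-Filter Theory itself: a pulsed “source” pushed through resonant “filters.” Nothing in it models actual whale anatomy, and the frequencies are arbitrary; it just shows how one excitation signal becomes different “vowels” depending on the filter that shapes it.

```python
# A minimal sketch of Source-Filter Theory: a raw pulsed source (vocal cords /
# phonic lips) shaped by resonant filters (mouth and tongue / spermaceti organ
# and nasal sacs). Frequencies are arbitrary; no real anatomy is modeled.
import numpy as np
from scipy.signal import butter, lfilter

fs = 16_000
duration = 0.5
source = np.zeros(int(fs * duration))
source[:: fs // 100] = 1.0            # 100 Hz pulse train: the raw excitation

def resonant_filter(signal, low_hz, high_hz):
    """Band-pass 'vocal tract' that emphasizes one resonant region."""
    b, a = butter(2, [low_hz, high_hz], btype="band", fs=fs)
    return lfilter(b, a, signal)

vowel_a_like = resonant_filter(source, 600, 1_000)      # one resonant region
vowel_i_like = (resonant_filter(source, 200, 400)
                + resonant_filter(source, 2_200, 2_800))  # two regions

print(vowel_a_like.shape, vowel_i_like.shape)  # same source, different "vowels"
```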

So … the ultimate question. Why do we care? Why does this research matter? Why is it important? Several reasons. 

First, before these most recent discoveries about the complexity of animal communication, scientists believed that animal “language”—and I use that word carefully—was without nuance. In other words, one sound meant one thing, like ‘food’ or ‘danger’ or ‘come to me.’ But the discovery of these so-called “whale vowels” now makes us believe that their language is far more complex and is in fact combinatorial—they aren’t just making a sound; they’re “building” a meaningful signal out of smaller parts, what we would call phonemes. This ability is a prerequisite for true language, because it allows for the creation of an almost infinite variety of meanings from a limited set of sounds.

So: one of the requirements for true communication is the ability to anticipate what the other person is going to say before they say it. This is as true for humans as it is for other species. To predict what a whale is going to say next, researchers use a specialized Large Language Model called WhaleLM. It’s the equivalent of ChatGPT for the ocean: In the same way that ChatGPT uses the context of previous words in a conversation to predict the next word in a sentence, WhaleLM predicts the next coda or whale song based on the “conversation history” of the pod to which the individual belongs. Let me explain how it works.

Large Language Models are trained on massive databases and rely on a process called ‘tokenization.’ A token is a basic unit of the system—a word, for example, or in the case of sperm whales, the clicks they make. Since whale clicks sound like a continuous stream of broadband noise to humans, researchers use AI to “tokenize” the whale audio into unique, recognizable pieces. The difference, of course, is that they don’t feed text into the LLM, because text isn’t relevant for whales. Instead, they feed it the acoustic dimensions we talked about earlier: rhythm, tempo, rubato, and ornamentation.
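Here’s a toy sketch of that tokenization step. Each coda is reduced to a four-number feature vector, one number per acoustic dimension, and then quantized into a discrete token ID. The feature values below are random stand-ins and the vocabulary size is arbitrary; a real pipeline is considerably more sophisticated.

```python
# A toy sketch of "tokenizing" codas: reduce each coda to the four acoustic
# dimensions, then quantize into discrete token IDs. Values are random
# stand-ins, not real measurements; the vocabulary size is arbitrary.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# One row per recorded coda: [rhythm, tempo, rubato, ornamentation] features.
coda_features = rng.random((9_000, 4))

vocab_size = 64
tokenizer = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
token_ids = tokenizer.fit_predict(coda_features)

print(token_ids[:10])   # a whale "conversation" as a sequence of token IDs
```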

Next comes the creation of a vocabulary. From analysis of the four acoustic dimensions, the AI identifies specific sound sequences, which are then treated as the vocabulary of the pod that uttered the sounds in the first place.

Next comes the creation of context, or meaning. WhaleLM made a critical discovery in late 2024: the identification of what are called long-range dependencies, described in what researchers call the “Eight Coda Rule.” Scientists determined conclusively that a whale’s next call is heavily influenced by the previous eight codas in the conversation, which is typically about 30 seconds of conversation time.
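Here’s a deliberately simple sketch of the Eight Coda Rule as a prediction problem: given the previous eight coda tokens, guess the next one. WhaleLM does this with a transformer; a plain frequency table over toy data makes the idea visible in a dozen lines.

```python
# A toy sketch of the "Eight Coda Rule": condition the next coda token on the
# previous eight. Real systems use a transformer; a frequency table over
# random toy tokens is enough to show the idea.
from collections import Counter, defaultdict

import numpy as np

rng = np.random.default_rng(1)
conversation = list(rng.integers(0, 64, size=2_000))   # toy coda token IDs

CONTEXT = 8
table = defaultdict(Counter)
for i in range(CONTEXT, len(conversation)):
    context = tuple(conversation[i - CONTEXT:i])
    table[context][conversation[i]] += 1

def predict_next(last_eight):
    """Most likely next coda token given the previous eight, if seen before."""
    counts = table.get(tuple(last_eight))
    return counts.most_common(1)[0][0] if counts else None

print(predict_next(conversation[100:108]))   # a context the model has seen
```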

WhaleLM also has the benefit of multi-whale awareness. It doesn’t track the “speech” of a single whale; it tracks and analyzes the sounds uttered by all whales in the pod and the extent to which they take turns vocalizing. If Whale A says “X,” the model can predict with high accuracy whether Whale B will respond with “Y” or “Z.” But here’s a very cool thing that the researchers uncovered: Not only does WhaleLM predict a sound that will soon follow, it also predicts actions that the sounds are going to trigger. For example, researchers identified a specific sequence of codas, called the diving motif, that indicates with high accuracy—around 86 percent—that if uttered by all the whales in an exchange, the pod is about to dive to hunt for food. In other words, these sound sequences aren’t just noise—the equivalent of whales humming, for example—they’re specific instructions shared among themselves with some intended action to follow. I don’t know about you, but I find that pretty mind-blowing.

The natural next step, of course, is to ask how we might use this analytical capability to carry on a rudimentary conversation with a non-human creature. Because researchers can now predict what a “natural” response should be, they can use WhaleLM to design what are called Playback Experiments. Here’s how they work. Researchers play an artificial coda, generated by WhaleLM, to a wild whale to see if the whale responds the way the AI predicts it might. If the whale does respond, it confirms that the researchers have successfully decoded a legitimate whale grammar rule.

Let’s be clear, though. We don’t have a “whale glossary of terms” yet that we can use to translate back and forth between human language and whale language. What we have are the rules. We’re still in the early stage of understanding syntax—how the signals are put together. We aren’t yet into the semantics phase—what the signals mean.

In the leadership workshops I used to deliver I would often bring up what I called “The Jurassic Park Protocol.” It simply said, just because you CAN make dinosaurs doesn’t mean you SHOULD. And we know they shouldn’t have, because there are at least six sequels to the original movie and they all end badly.

The same rule applies to interspecies communication. Just because we may have cracked the code on some elements of whale communication doesn’t mean that we should inject ourselves into the conversation. This is heady stuff, and the likelihood of unintended consequences is high. In 2025, researchers from Project CETI and the More-Than-Human Life Program at NYU, MOTH, introduced a formal Ethical Roadmap known as the PEPP Framework. PEPP stands for Prepare, Engage, Prevent, and Protect, and it treats whales as “subjects” with rights rather than “objects” to be studied.

So, PEPP stipulates four inviolable commitments that researchers must meet before they’re allowed to engage in cross-species conversations using AI-generated signals. The first is PREPARE: Before a sound is played back to a whale, researchers must prove they have minimized the potential for risk to the animal by doing so. For example, scientists worry that if they play an AI-generated whale call, they might inadvertently say something that causes panic, disrupts a hunt, or breaks a social bond. Similarly, PEPP requires that researchers use equipment that doesn’t add noise pollution that interferes with the whales’ natural sonar. We’ll talk more about that in a minute.

The next commitment is ENGAGE. To the best of our current knowledge, whales don’t have the ability to give us permission to engage with them, so PEPP requires researchers to look for any kind of identifiable behavioral consent. If the whale demonstrates evasive behavior such as diving, moving away, or issuing a coda rhythm that indicates distress, the experiment must stop immediately. The ultimate goal is to move toward a stage called Reciprocal Dialog, in which the whale has the right and ability to end the conversation at any time.

The third pillar of the PEPP protocol is PREVENT. This is very complicated stuff: researchers must take deliberate steps to ensure that they do not inadvertently become members of the pod. There is concern, for example, that whales might become “addicted” to interacting with the AI, or that it might change how they teach their calves to speak. A related concern is Cultural Preservation. Different whale pods have different “dialects,” and PEPP forbids researchers from playing foreign dialects to groups of whales—for example, playing a recording captured in the Caribbean to a pod of whales in the Pacific Ocean—because it could contaminate their own vocal culture.

The final commitment is PROTECT, and it has less to do with the process of establishing communication and more to do with what occurs after it happens. The PEPP protocol argues that if we prove whales have a language, then we’re ethically and morally obligated to grant them legal rights. And, since AI can now “eavesdrop” on private pod conversations, PEPP establishes data privacy rules for the whales, ensuring their locations aren’t shared with commercial fisheries or whaling interests.

There’s an old joke about what a dog would do if it ever caught the car it was chasing. The same question applies to the domain of interspecies communication. If we are successful, what should we say? Most researchers agree that first contact should not be a casual meet-and-greet, but should instead take the form of what are called mirroring experiments. One of these is called the Echo Test, in which the AI listens to a whale and repeats back a slightly modified version of the same coda. The intent is not to tell the whale something new, but to see if it recognizes that the “voice” in the water is following the rules of its grammar. It’s a way of asking “Do you hear me?” instead of “How you doin’?”

Researchers have identified three major risks that must be avoided during conversational engagement with whales. The first is the risk of social disruption. To avoid this, only “low-stakes” social codas can be used for playback, never alarm or hunt calls. 

The second risk is human bias. To guard against it, the AI is trained only on wild recordings, so that no “human-sounding” accents creep into the whales’ language.

Finally, we have the very real risk of exploitation. To prevent this from happening, the data is open-source but “de-identified” to protect whale locations from poachers.

The discovery of vowels in whale speech has given lawyers who advocate for whale rights significant power in the courtroom. For centuries, whales have been classified as property—as things rather than as sentient creatures. Recently, though, lawyers have begun to argue that whales meet the criteria for legal personhood. They base this on several hard-to-deny criteria. For example, lawyers from the More-Than-Human Life Program at NYU and the Nonhuman Rights Project are moving away from general “sentience” arguments to specific “communication” arguments. If an animal has a complex language, it possesses autonomy—the ability to make choices and have preferences. In many legal systems, autonomy is the primary qualification for having rights.

Another argument makes the case that by proving that whales use combinatorial grammar—the vowels we’ve been discussing—scientists have provided evidence that whale thoughts are structured and abstract. Lawyers argue that the law can’t logically grant rights to a human with limited communication skills, like a baby, while at the same time denying them to a whale with a sophisticated “phonetic alphabet.” 

In March 2024, Indigenous leaders from the Māori of New Zealand, Tahiti, and the Cook Islands signed a treaty which recognizes whales as legal persons with the right to “cultural expression.” That includes language. Because we now know that whales have unique “regional dialects,” the treaty argues that whales have a right to their culture. This means that destroying a pod isn’t just killing animals; it amounts to the “cultural genocide” of a unique linguistic group.

Then, there’s the issue of legal representation of whales in a court of law. We have now seen the first attempts to use AI-translated data as evidence in maritime court cases. For example, in late 2025, a landmark paper in the Ecology Law Quarterly argued that human-made sonar and shipping noise amounts to “torture by noise” and is the acoustic equivalent of “shining a blinding light into a human’s eyes 24 hours a day.” And instead of relying on the flimsy argument that whales can just swim away from noise (clearly demonstrating a complete ignorance of marine acoustics and basic physics), lawyers are using WhaleLM data to demonstrate how human noise disrupts their vowels, making it impossible for whales to communicate with their families. And the result? We’re moving from a world where we protect whales because they’re pretty, to a world where we protect them because they’re peers. 

Human-generated noise has long been a problem in the natural world. Whether it’s the sound of intensive logging in a wild forest, or noise generated by shipping or mineral exploration in the ocean, there’s significant evidence that those noises have existentially detrimental effects on the creatures exposed to them—creatures that cannot escape the noise. The good news is that as awareness has risen, there have been substantial changes in how we design underwater technology so that it is friendlier to marine creatures like whales. Essentially, there is a shift underway toward Biomimetic Technology—hardware that mimics how whales communicate as a way to minimize the human acoustic footprint. This shift includes the development of acoustic modems that use transmission patterns modeled after whale and dolphin whistles instead of the loud sonar pings used in traditional systems. Whales and other creatures hear these transmissions as background noise.

Another advance is the use of the SOFAR channel. SOFAR is an acronym for Sound Fixing and Ranging, and it refers to a deep layer in the ocean, down around 3,300 feet, where sound travels much farther than in other regions of the ocean. The layer acts as a natural waveguide that traps low-frequency sounds, allowing them to travel thousands of miles and enabling long-distance monitoring of phenomena such as whale communication. Technology is now being designed to transmit within the SOFAR channel, allowing marine devices to use 80% less power by working with the ocean’s physics rather than against it, while being less disruptive to the creatures who live there.

Gardiner Hubbard said, “When we embark on the great ocean of discovery, the horizon of the unknown advances with us and surrounds us wherever we go. The more we know, the greater we find is our ignorance.” Interspecies communication is a great example of this. The more we learn, the more we unleash our truly awesome technologies on the challenge of listening to our non-human neighbors, the more we realize how much we don’t know. I’m good with that. Given the current state of things, it appears that 2026 may be the year when the great breakthrough happens. But the great question will be, when given the opportunity to shut up and listen, will we?
