As you begin to read this, have you ever wondered what natural language actually is? Why isn't this field of AI, which has grown in popularity over the last decade, simply called language processing? And what is not natural language, then?
In this article, you'll learn what natural language is and how algorithms can be trained to process, analyze, and generate natural language data.
Natural language is the main way we humans communicate, using words in a structured and conventional way; we can use language verbally, through gestures, or in written form. The key is that natural language evolves, well, "naturally", over years, decades, and centuries. It evolves with changes in demographics (such as migration), with important social and political events, and with new technologies.
Think about the words and expressions that were barely part of everyday conversation in the United States before 2020: Covid-19, nanoplastic, "Black Lives Matter". Or words that have taken on new meanings; for example, "bubble" or "pod", which during 2020 could also mean the small group of people you could interact with in person without contributing to a massive spread of the virus.
So the next logical question is: what is NOT natural language? Well, Python is not a natural language, and neither is Esperanto. Both are languages that were created and are updated in an intentional, systematic way; they don't evolve naturally. Esperanto is recognized as a constructed language, while Python and every other computer language is known as a formal language. Formal languages are built from symbols and letters that represent well-defined rules and processes. There are no ambiguities: specific syntax corresponds to very specific instructions.
Now that we “know” what natural language is, let’s get into natural language processing (NLP) and its origins.
Natural language processing is a subfield at the intersection of linguistics, computer science, and artificial intelligence. It is concerned with the interactions between computers and humans in natural language, which means giving computers the ability to understand natural language and to respond in it. More specifically, it is about how to program computers to process, analyze, and generate natural language data.
A brief historical view of NLP: interest in the field began around the 1950s, when international diplomacy was very active and there was interest in using computers to help automate language translation. With U.S. government funding, a project led jointly by IBM and Georgetown University showed success in automatically translating more than 60 sentences from Russian to English. This was accomplished with a complicated and intricate rules-based system. As further experiments failed to show substantial improvements, researchers realized NLP was harder than they thought, and by the end of the 1960s government funding for NLP and machine translation had dried up.
At around the same time, researchers, mainly in university settings, saw both the successes and the shortcomings of rules-based systems and began to explore new approaches to language tasks, and the first "chatbots" were born. Among them was ELIZA, which used pattern-matching and substitution rules to infer enough from natural language input to play the role of a psychotherapist in a quite convincing way.
However, it wasn't until the 1980s and '90s, when computing power and the storage needed for large amounts of natural language data became much cheaper, and the internet was born, that statistical approaches became widely used. Until then, natural language was locked up in books, documents, transcripts, and recordings, making it hard to collect and process.
Natural language processing is hard.
First, language is complex. We don't often have to think about this, but the meaning of many words depends on the context in which they appear. Take, for example, the word "book": it can be a noun ("I just finished a great book") or a verb ("I need to book a flight"), and only the surrounding context tells us which meaning is intended.
To make matters even harder, we use humor and sarcasm, and their interpretation depends heavily on a person's background and culture.
Take, for example:
"That girl is on fire!"
"Why do they call it rush hour when nothing moves?" (Robin Williams, actor)
"Just burned 2,000 calories. That's the last time I leave brownies in the oven while I nap."
And then there's the fact that computers don't process language; they process numbers. Even when we turn language into numbers, computers can't interpret it on their own. So we must find ways to relate words to each other so that a computer can then interpret "meaning".
Converting text into a mathematical representation is called text representation. The text representation scheme you choose depends on the requirements of the task at hand. If the task is as simple as counting how many times a word appears in a text, assigning a number to each distinct word is enough. However, if the task is more complex, such as extracting entities from a document and establishing the relationships between them, then more sophisticated text representation techniques need to be considered.
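To make the simple end of that spectrum concrete, here is a minimal bag-of-words sketch in plain Python. The tiny corpus and the names used (documents, vocabulary, bag_of_words) are made up purely for illustration, not taken from any particular library:

```python
from collections import Counter

# A tiny made-up corpus, purely for illustration.
documents = [
    "I just finished a great book",
    "I need to book a flight",
]

# Step 1: assign a unique integer id to each distinct word (the vocabulary).
vocabulary = {}
for doc in documents:
    for word in doc.lower().split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

# Step 2: represent each document as a vector of word counts over that
# vocabulary -- a "bag of words", which discards word order and context.
def bag_of_words(doc):
    counts = Counter(doc.lower().split())
    return [counts.get(word, 0) for word in vocabulary]

for doc in documents:
    print(doc, "->", bag_of_words(doc))
```

A count vector like this is enough for tasks such as measuring word frequency, but it throws away word order and context (notice that both senses of "book" get the same id), which is exactly why more complex tasks call for richer representations.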