Introduction to Natural Language Processing Basics

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that enables computers to understand, generate, and manipulate human language. NLP has many applications in the modern era, such as speech recognition, machine translation, and text reading.

Fundamental Concepts in NLP

  1. Tokenization

    Tokenization is the process of breaking text down into individual words or tokens.

    For example,
    Sentence: "I love Natural Language Processing!!"
    Tokens: ["I","love","Natural","Language","Processing","!!"]

  2. Stemming

    Stemming is the process of removing affixes (anything added before or after a base word) from a word so that only the stem of the word remains.

    For example,

    Words: "changing","change","changes"
    Stem: "chang","chang","chang"

    Words: "running","run","ran"
    Stem: "run","run","ran"

  3. Lemmatization

    Lemmatization is the process of reducing a word to its dictionary root form, or lemma, using the word's meaning to identify similarities.

    For example,
    Words: "changing","change","changes"
    Lemmas: "change","change","change"

    Words: "running","run","ran"
    Lemmas: "run","run","run"

  4. Text Representation

    Text representation is the process of converting text into a numerical format so that machine learning models can work with it.

    For example,
    Text 1: "I love AI"
    Text 2: "NLP is wow"

    Bag-of-words vocabulary: I   love   AI   NLP   is   wow
    Text 1:                  1   1      1    0     0    0
    Text 2:                  0   0      0    1     1    1

    Explanation: First, all the unique words across all the documents are collected into one vocabulary; this is called the bag of words. Then, for each document, a word is represented as 1 if it is present and 0 otherwise. For Text 1, "I" is present, so the first element of its vector is 1; for Text 2, "I" is absent, so its first element is 0. Notice that every vector has the same length as the bag of words.

    Tip: You can look up the bag-of-words concept in more detail to understand it better.
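
    A minimal sketch of bag-of-words with scikit-learn's CountVectorizer. Two caveats: the default token pattern drops single-letter words like "I", so a wider pattern is passed here, and the vectorizer orders its columns alphabetically rather than by first appearance, so the columns differ from the hand-built table above:

     from sklearn.feature_extraction.text import CountVectorizer

     texts = ["I love AI", "NLP is wow"]

     # Keep single-character tokens and the original casing
     vectorizer = CountVectorizer(lowercase=False, token_pattern=r"(?u)\b\w+\b")
     bow = vectorizer.fit_transform(texts)

     print(vectorizer.get_feature_names_out())  # ['AI' 'I' 'NLP' 'is' 'love' 'wow']
     print(bow.toarray())
     # [[1 1 0 0 1 0]
     #  [0 0 1 1 0 1]]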

  5. Part-of-Speech (POS) Tagging

    POS tagging is the process of assigning a grammatical category, such as noun, adjective, or verb, to each word in a sentence.
    It helps in understanding the syntactic structure of a sentence, which is crucial for extracting meaning.

    For example,
    Sentence: "I love ideas"
    POS Tagging:
    I- Pronoun("PRON")
    love- Verb("VERB")
    ideas- Noun("NOUN")
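
    A minimal sketch with NLTK's pos_tag; it emits Penn Treebank tags (PRP, VBP, NNS), which are finer-grained than the simplified labels above, and the tagger resource name may vary slightly across NLTK versions:

     import nltk

     nltk.download('averaged_perceptron_tagger')  # model used by pos_tag

     print(nltk.pos_tag(["I", "love", "ideas"]))
     # [('I', 'PRP'), ('love', 'VBP'), ('ideas', 'NNS')]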

  6. Named Entity Recognition (NER)

    NER is the process of identifying entities (names of people, objects, locations, organizations, etc.) in text and classifying them.

    For example,
    Sentence: "Balen Shah is mayor of Kathmandu, capital of Nepal."
    Named Entities:
    Person: "Balen Shah"
    Location: "Kathmandu, Nepal"

  7. Sentiment Analysis

    Sentiment Analysis is the process of determining the sentiment expressed in a piece of text, whether it is positive, negative, or neutral.

    For example,
    Sentence: "I like Apples."
    Sentiment: Positive Sentiment

    Sentence: "I don't prefer eating outside."
    Sentiment: Negative Sentiment

    Sentence: "I will study."
    Sentiment: Neutral Sentiment
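
    A minimal sketch with NLTK's rule-based VADER analyzer; its compound score summarizes the sentiment of the text:

     import nltk
     from nltk.sentiment import SentimentIntensityAnalyzer

     nltk.download('vader_lexicon')  # lexicon VADER scores words with

     sia = SentimentIntensityAnalyzer()
     scores = sia.polarity_scores("I like Apples.")
     print(scores)  # 'compound' > 0 is positive, < 0 negative, near 0 neutral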

Challenges in NLP

  • Ambiguity : Words often have multiple meanings, which makes NLP applications like sentiment analysis harder.

  • Context Understanding : It is vital for chatbots to understand context; otherwise, misinterpretations and errors follow.

  • Language Variations : Languages vary regionally and slang is common, which poses challenges for tasks like NER, sentiment analysis, and so on.

Tools and Libraries for NLP

  1. Natural Language ToolKit (NLTK)

    NLTK is a very comprehensive library for working with NLP tasks.
    It can be efficiently used for tasks like tokenization, stemming, lemmatization, POS Tagging, and so on.

     import nltk
     from nltk.tokenize import word_tokenize

     nltk.download('punkt')  # tokenizer models word_tokenize depends on

     text = "I love NLP."
     tokens = word_tokenize(text)
     print(tokens)  # Output: ['I', 'love', 'NLP', '.']
    
  2. SpaCy

    SpaCy is an open-source NLP library designed for production use. It is very efficient and easy to use.
    It provides pre-trained models for various NLP tasks such as POS tagging, NER and dependency parsing.

     import spacy

     # Requires the small English model: python -m spacy download en_core_web_sm
     nlp = spacy.load('en_core_web_sm')
     text = "I love Spacy too"
     doc = nlp(text)
    
     for token in doc:
         print(token.text, token.pos_)
    
     # Output : 
     # I PRON 
     # love VERB 
     # Spacy PROPN
     # too ADV
    
  3. TensorFlow

    TensorFlow (TF) is a popular open-source ML library developed by Google. It is widely used for building deep learning models, including those for NLP.

    Note: Don't worry if you don't understand the code below; you don't have to. It is shown just as an example.

    Keras (the high-level API for TF) is used in the code below.

     import numpy as np
     import tensorflow as tf
     from tensorflow.keras.preprocessing.text import Tokenizer
     from tensorflow.keras.preprocessing.sequence import pad_sequences

     texts = ["This is a positive review.", "This is a negative review."]
     labels = np.array([1, 0])  # 1 = positive, 0 = negative

     # Map each word to an integer index
     tokenizer = Tokenizer()
     tokenizer.fit_on_texts(texts)
     sequences = tokenizer.texts_to_sequences(texts)

     # Pad all sequences to the same length
     padded_sequences = pad_sequences(sequences)

     # Embedding -> average pooling -> one sigmoid unit for binary sentiment
     model = tf.keras.Sequential([
         tf.keras.layers.Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=16),
         tf.keras.layers.GlobalAveragePooling1D(),
         tf.keras.layers.Dense(1, activation='sigmoid')
     ])

     model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
     model.fit(padded_sequences, labels, epochs=10)
    

Machine Learning (ML) in NLP

ML is essential in NLP because it enables computers to learn from data and make predictions related to understanding and generating natural language.

  1. Supervised Learning

    Supervised learning can be used for classification and regression tasks in NLP.

    Applications:

    Text classification: For spam detection, a model is trained on emails labeled as spam or not, and new emails are classified based on the words they contain.

    NER: Identifying and classifying entities in a text.

    Code Example:

     from sklearn.model_selection import train_test_split
     from sklearn.feature_extraction.text import CountVectorizer
     from sklearn.naive_bayes import MultinomialNB
     from sklearn.metrics import accuracy_score

     # df is assumed to be a pandas DataFrame with 'text' and 'label' columns
     texts = df["text"]
     labels = df["label"]
     X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

     # Bag-of-words features: fit the vocabulary on the training split only
     vectorizer = CountVectorizer()
     X_train_vectorized = vectorizer.fit_transform(X_train)
     X_test_vectorized = vectorizer.transform(X_test)

     # Multinomial Naive Bayes is a common baseline for text classification
     model = MultinomialNB()
     model.fit(X_train_vectorized, y_train)

     predictions = model.predict(X_test_vectorized)
     accuracy = accuracy_score(y_test, predictions)

     print(f"Accuracy: {accuracy}")

     # Example output: Accuracy: 0.766
    
  2. Unsupervised Learning

    Unsupervised Learning is commonly used in NLP for discovering hidden patterns or grouping similar data points.

    Applications:

    Clustering : Grouping similar documents or words together (see the clustering sketch after the code example below).
    Topic Modeling : Identifying topics present in a collection of documents.

    Word Embeddings : Representing words in a continuous vector space, capturing semantic relationships.

    Code Example:

     from gensim.models import Word2Vec

     # sentences is expected to be a list of tokenized sentences;
     # a tiny toy corpus stands in here so the snippet runs on its own
     sentences = [["i", "love", "nlp"], ["this", "is", "an", "example"]]

     model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

     word_vector = model.wv['example']  # 100-dimensional vector for "example"
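
    The Clustering application listed above can also be sketched briefly. The sketch below uses scikit-learn with a made-up four-document corpus and a made-up cluster count of 2: each document is represented as a TF-IDF vector, and the vectors are grouped with k-means.

     from sklearn.cluster import KMeans
     from sklearn.feature_extraction.text import TfidfVectorizer

     docs = ["I love NLP", "NLP is great", "The weather is cold", "It rains a lot"]

     # Represent each document as a TF-IDF vector, then cluster the vectors
     X = TfidfVectorizer().fit_transform(docs)
     kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
     print(kmeans.labels_)  # cluster id assigned to each document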

Conclusion

So, these are the fundamentals of NLP, and they will pave your way towards advanced NLP concepts. The field of NLP is rapidly evolving with the development of large pre-trained language models like the Generative Pre-trained Transformer (GPT) and BERT. So, I encourage you to dive deeper into NLP with other courses and research papers to stay up to date with current trends.