NLP Basics Comprehensive Quiz & Projects
30 questions on the NLP Basics tutorial.
Question 1: What is the difference between Stemming and Lemmatization in text preprocessing?
- A. Stemming is used for semantic translation, while Lemmatization is used for syntax parsing.
- B. Stemming cuts off suffixes crudely (e.g. 'studies' -> 'studi'), while Lemmatization returns the dictionary lemma (e.g. 'studies' -> 'study') using vocabulary analysis. β (correct answer)
- C. Stemming works on sentences, while Lemmatization works only on characters.
- D. Stemming is faster and always yields grammatically correct base words.
Explanation: Lemmatization uses grammatical rules and dictionaries to find the proper base word. Stemming applies simple, heuristic chops.
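Example: the contrast can be sketched with a toy suffix chopper and a toy dictionary lookup. The rule list and the lemma table below are hypothetical and far smaller than what real stemmers (such as Porter) or real lemmatizers use; the sketch only shows why the two approaches give different outputs.

```python
# Toy illustration only: real stemmers and lemmatizers use much richer
# rules and dictionaries than these hypothetical tables.

# Hypothetical suffix rules, tried in order; the first match wins.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMA_TABLE = {"studies": "study", "studying": "study", "better": "good", "ran": "run"}


def crude_stem(word: str) -> str:
    """Chop a known suffix without checking that the result is a real word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


print(crude_stem("studies"))     # studi
print(toy_lemmatize("studies"))  # study
```

Note that the stemmer cannot relate 'better' to 'good', because no suffix rule connects them; only a dictionary-based approach can.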
Question 2: In Term Frequency-Inverse Document Frequency (TF-IDF), what does the IDF component measure?
- A. The number of times a target term appears in a single document.
- B. The grammatical complexity of the text.
- C. The importance or rarity of a word across the entire corpus of documents. β (correct answer)
- D. The length of the document.
Explanation: IDF down-weights terms that appear in many documents and up-weights terms that appear in few, so rare, distinctive words count for more than ubiquitous ones. Term frequency (TF) is the per-document count described in option A.
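Example: a minimal sketch of the IDF component, assuming the common unsmoothed form idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing the term. Libraries often use smoothed variants; the function name and the zero fallback for unseen terms are choices made for this illustration.

```python
import math


def inverse_document_frequency(term, documents):
    """Return log(N / df) for a term over a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc)
    if doc_freq == 0:
        return 0.0  # assumption: unseen terms score 0 instead of raising
    return math.log(n_docs / doc_freq)


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

for term in ("the", "cat", "dog"):
    print(term, round(inverse_document_frequency(term, docs), 3))
```

Here 'the' appears in every document and scores 0, while 'dog' appears in only one document and scores highest.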
Question 3: How do Word Embeddings (like Word2Vec) capture semantic meaning in vector space?
- A. By mapping each word to a unique random integer value.
- B. By positioning words that appear in similar contexts close to each other in high-dimensional vector space. β (correct answer)
- C. By alphabetical sorting matrices.
- D. By hashing words into 128-bit binary signatures.
Explanation: Word embeddings represent words as vectors. The distance and angle between vectors represent semantic and contextual relationships.
Question 4: What limitation of Recurrent Neural Networks (RNNs) led to the development of Transformer architectures?
- A. RNNs do not support training on text documents.
- B. RNNs struggle with long-term dependencies due to the vanishing gradient problem and process tokens sequentially, blocking parallel training. β (correct answer)
- C. RNNs require high GPU memory for basic tokenization.
- D. RNNs cannot be used for translation tasks.
Explanation: Transformers process all tokens of a sequence in parallel and use self-attention to relate any two tokens directly, however far apart they are within the model's context window.
Question 5: In Transformer models, what is the purpose of the Self-Attention mechanism?
- A. To monitor model accuracy during validation loops.
- B. To prioritize inputs from the system prompt over user files.
- C. To calculate how much focus or weight one token should place on every other token in the sequence when encoding meaning. β (correct answer)
- D. To clean spelling errors in raw input strings.
Explanation: Self-attention calculates correlation scores between all words in a sentence, capturing context dynamically (e.g. 'bank' of a river vs financial 'bank').
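Example: a minimal pure-Python sketch of scaled dot-product self-attention. As a simplifying assumption, the queries, keys, and values are all the raw input vectors; real Transformer layers first apply learned projection matrices and use multiple heads, which are omitted here.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Return (outputs, weights) with queries = keys = values = inputs."""
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Score this token against every token, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is the weighted average of all value vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(row, vectors)) for i in range(dim)]
        )
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
print(weights[0])
```

Row i of the weight matrix says how much token i attends to each token in the sequence; in this toy input the first token attends more to the identical third token than to the second.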
Question 6: What is Tokenization in NLP?
- A. Converting script files to binary formats.
- B. The process of splitting a continuous string of text into individual units (tokens) like words or subwords. β (correct answer)
- C. Checking grammatical errors.
- D. Generating security keys for APIs.
Explanation: Tokenization is the foundational first step: it turns raw text into a sequence of tokens that a model can process.
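Example: a minimal word-level tokenizer using a regular expression. The pattern is an assumption chosen for illustration (runs of word characters, or single punctuation marks); production tokenizers handle contractions, URLs, and subwords with more care.

```python
import re


def word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


print(word_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```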
Question 7: What are 'Stop Words' in text processing?
- A. Words that trigger syntax errors.
- B. Common words (like 'and', 'the', 'is') that are often filtered out before processing because they carry little semantic value. β (correct answer)
- C. Key vocabulary words in documents.
- D. Commands that stop text parsers.
Explanation: Removing stop words reduces vocabulary noise, allowing algorithms to focus on content words.
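Example: stop-word filtering reduces to a set lookup. The list below is a tiny hypothetical sample; real lists are language-specific and much longer.

```python
# Hypothetical, deliberately tiny stop word list for illustration.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "on", "of"}


def remove_stop_words(tokens):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


print(remove_stop_words(["The", "cat", "is", "on", "the", "mat"]))  # ['cat', 'mat']
```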
Question 8: What is a Named Entity Recognition (NER) task?
- A. Naming new variables in code files.
- B. Identifying and classifying key entities in text into predefined categories (e.g. Names, Dates, Organizations). β (correct answer)
- C. Parsing sentences to identify parts of speech.
- D. Translating words to different languages.
Explanation: NER models scan inputs to extract structure (e.g., extracting 'Apple' as an Organization).
Question 9: In NLP, what is a 'Corpus'?
- A. A text compiler software.
- B. A large, structured collection of text documents used for training and linguistic analysis. β (correct answer)
- C. A database table schema.
- D. The body of a single function.
Explanation: A corpus represents the dataset of text documents used to train language models.
Question 10: What does a Part-of-Speech (POS) Tagger do?
- A. Compiles text stylesheets.
- B. Analyzes words in a sentence and labels their grammatical class (noun, verb, adjective) based on context. β (correct answer)
- C. Searches documents for target terms.
- D. Translates sentences into binary code.
Explanation: POS tags map syntactic roles, crucial for understanding sentence patterns.
Question 11: What is the difference between Word2Vec's CBOW (Continuous Bag of Words) and Skip-gram architectures?
- A. CBOW is used only for database tables.
- B. CBOW predicts the target word from surrounding context, while Skip-gram predicts the surrounding context from a target word. β (correct answer)
- C. Skip-gram is slower and deprecated.
- D. CBOW works only on character tokens.
Explanation: CBOW is faster and works well for frequent words; Skip-gram performs better on rare words.
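Example: the two architectures differ in how training examples are built from the same context window. The sketch below only generates the (input, target) pairs; it does not train a model, and the function names are hypothetical.

```python
def context_window(tokens, index, window):
    """Return the tokens within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window=1):
    """CBOW: the context words are the input, the center word is the target."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window=1):
    """Skip-gram: the center word is the input, each context word is a target."""
    return [
        (tokens[i], context)
        for i in range(len(tokens))
        for context in context_window(tokens, i, window)
    ]


sentence = ["the", "cat", "sat"]
print(cbow_examples(sentence))
print(skipgram_examples(sentence))
```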
Question 12: What are N-grams in text mining?
- A. Mathematical formulas for measuring file sizes.
- B. Contiguous sequences of N items (words or characters) from a given sample of text. β (correct answer)
- C. The number of layers in a neural net.
- D. Key-value indices.
Explanation: Unigrams (1-grams), Bigrams (2-grams), and Trigrams (3-grams) capture local sequences.
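Example: a short sketch that extracts n-grams by sliding a window of size n over a sequence. It works on a list of words or on a string of characters.

```python
def ngrams(items, n):
    """Return every contiguous run of n items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


words = ["natural", "language", "processing"]
print(ngrams(words, 2))  # [('natural', 'language'), ('language', 'processing')]
print(ngrams("cat", 2))  # [('c', 'a'), ('a', 't')]
```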
Question 13: How is Cosine Similarity used in NLP?
- A. To check text file compression.
- B. To measure the semantic similarity between two word or document vectors by calculating the cosine of the angle between them. β (correct answer)
- C. To encrypt token streams.
- D. To route API requests.
Explanation: A value of 1 means the vectors point in the same direction, 0 means they are orthogonal (unrelated), and -1 means they point in opposite directions.
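Example: cosine similarity is the dot product divided by the product of the vector lengths. A minimal sketch on plain Python lists; raising an error for zero vectors is a choice made here, since the angle is undefined in that case.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1, 0], [0, 1]))   # 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0
```

Because only the angle matters, scaling a vector (for example, a longer document with the same word proportions) does not change the score.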
Question 14: What does the BLEU score evaluate in NLP applications?
- A. The processing latency of translators.
- B. The quality of machine-translated text by comparing it against human reference translations. β (correct answer)
- C. The storage capacity of text tables.
- D. The count of spelling errors.
Explanation: BLEU measures n-gram overlaps between model output and human reference standards.
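Example: one ingredient of BLEU is clipped (modified) n-gram precision, sketched below for a single reference. This is not the full metric: BLEU also combines several n-gram orders with a geometric mean and applies a brevity penalty, both omitted here.

```python
from collections import Counter


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


reference = "the cat is on the mat".split()
print(clipped_precision(["the"] * 7, reference))  # 2/7: 'the' is credited only twice
```

Clipping is what stops a degenerate output that repeats one common word from scoring perfectly.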
Question 15: What is the purpose of Text Normalization?
- A. Sorting files alphabetically.
- B. Standardizing text inputs (e.g., converting to lowercase, removing punctuation, expanding contractions) to reduce vocabulary variants. β (correct answer)
- C. Validating string lengths.
- D. Translating words to English.
Explanation: Normalization ensures that 'Car', 'car', and 'car!' all map to the same token.
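Example: a minimal normalizer that lowercases, strips ASCII punctuation, and collapses whitespace. Contraction expansion is left out of this sketch, and the exact set of steps is an assumption; real pipelines choose steps to suit the task.

```python
import string


def normalize(text):
    """Lowercase, remove ASCII punctuation, and collapse runs of whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


print([normalize(w) for w in ("Car", "car", "car!")])  # ['car', 'car', 'car']
```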
Question 16: Which task involves determining the emotional tone behind a body of text (e.g. positive, negative, neutral)?
- A. Language Modeling
- B. Sentiment Analysis β (correct answer)
- C. Syntax Parsing
- D. Tokenization
Explanation: Sentiment analysis classifies subjectivity and emotion in text and is widely used to analyze customer reviews.
Question 17: How do Recurrent Neural Networks (RNNs) capture sequential context in text?
- A. By processing words in parallel.
- B. By passing a hidden state vector forward through time steps, carrying memory of previous tokens. β (correct answer)
- C. By storing strings in database tables.
- D. By compiling text into binary blobs.
Explanation: The hidden state acts as memory, updating at each token step to capture context.
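Example: a scalar sketch of the recurrence h_t = tanh(w_x * x_t + w_h * h_(t-1) + b). The weight values are arbitrary placeholders chosen for illustration; a real RNN uses learned weight matrices and vector-valued states.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Run a one-unit RNN over a sequence and return every hidden state."""
    hidden = 0.0  # initial memory is empty
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        hidden = math.tanh(w_x * x + w_h * hidden + bias)
        states.append(hidden)
    return states


print(rnn_forward([1.0, 0.0, 0.0]))
```

The second and third inputs are zero, yet the hidden states stay non-zero: the state carries a fading trace of the first input. That fading is also why long-range information is hard for plain RNNs to keep.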
Question 18: What is a 'Lexicon' in linguistic processing?
- A. A text compiling program.
- B. A dictionary containing vocabulary words and their associated properties or sentiment scores. β (correct answer)
- C. A database table schema.
- D. An API routing controller.
Explanation: Lexicons store word properties (e.g., a lexicon of positive sentiment terms).
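Example: a lexicon-based sentiment scorer, assuming a hypothetical four-word lexicon with hand-picked scores. Real sentiment lexicons are far larger, and this sketch ignores negation ('not good') and intensity modifiers.

```python
# Hypothetical sentiment lexicon: word -> polarity score.
SENTIMENT_LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}


def lexicon_sentiment(tokens):
    """Sum the lexicon scores of the tokens and map the total to a label."""
    score = sum(SENTIMENT_LEXICON.get(token.lower(), 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_sentiment("The food was great".split()))       # positive
print(lexicon_sentiment("Terrible service today".split()))   # negative
```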
Question 19: What does the 'attention weight' represent in Seq2Seq models?
- A. The memory speed of GPU grids.
- B. A score indicating how much attention the decoder should pay to specific encoder input tokens when generating a target output token. β (correct answer)
- C. The priority of the system prompt.
- D. The size of the vocabulary.
Explanation: Attention allows the model to focus on relevant context words dynamically during generation.
Question 20: In NLP, what does 'Bag of Words' (BoW) represent?
- A. A folder containing text documents.
- B. A simple text representation that counts word frequencies, ignoring grammatical structure and word order. β (correct answer)
- C. A list of stop words.
- D. An encryption format for strings.
Explanation: BoW builds a vocabulary and counts how often each word occurs, without regard to word order.
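Example: a minimal Bag of Words vectorizer built on collections.Counter. The fixed vocabulary list is an assumption made for this illustration; in practice it is usually derived from the corpus.

```python
from collections import Counter


def bow_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vocabulary = ["bites", "dog", "man"]
print(bow_vector("dog bites man".split(), vocabulary))  # [1, 1, 1]
print(bow_vector("man bites dog".split(), vocabulary))  # [1, 1, 1]
```

The two sentences mean different things but get the same vector, which is exactly the information BoW discards.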
Question 21: What does Dependency Parsing accomplish?
- A. It checks dependencies in package.json files.
- B. It maps the grammatical relationships between words in a sentence, building a tree of head-dependent links. β (correct answer)
- C. It groups documents by topic.
- D. It translates words.
Explanation: Dependency trees show how verbs, nouns, and adjectives relate structurally.
Question 22: What is the difference between character-level and word-level tokenization?
- A. Character tokenization is synchronous, word is asynchronous.
- B. Character tokenization splits text into individual letters/symbols, reducing out-of-vocabulary terms but increasing sequence lengths. β (correct answer)
- C. Word tokenization is deprecated.
- D. Character tokenization is only used for databases.
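Explanation: Word-level tokenization splits text into whole words, which keeps sequences short but needs a large vocabulary and leaves unseen words out-of-vocabulary. Subword tokenization (e.g. Byte-Pair Encoding) sits in between, balancing vocabulary size against sequence length.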
Question 23: What is a 'Stop Word List'?
- A. A list of restricted database commands.
- B. A pre-compiled list of common words to be ignored during text processing tasks. β (correct answer)
- C. An index of document slugs.
- D. A list of system configuration keys.
Explanation: Stop word lists hold language-specific filler words (e.g., 'a', 'in', 'on').
Question 24: What is Language Modeling?
- A. Programming models to write code.
- B. The task of predicting the probability of a sequence of words (or predicting the next word in a sequence). β (correct answer)
- C. Translating text between domains.
- D. Checking grammar in documents.
Explanation: Language models learn probability distributions over word sequences (next-token prediction).
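Example: the simplest language model is a bigram model that estimates the probability of the next word from raw counts. This is a sketch of the idea without smoothing; the function names are hypothetical, and modern language models use neural networks over much longer contexts.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count, for each word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probability(model, prev, nxt):
    """Estimate P(nxt | prev) as count(prev, nxt) / count(prev, anything)."""
    followers = model.get(prev)
    if not followers:
        return 0.0  # assumption: unseen contexts get probability 0 (no smoothing)
    return followers[nxt] / sum(followers.values())


model = train_bigram_model("the cat sat on the mat".split())
print(next_word_probability(model, "the", "cat"))  # 0.5
```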
Question 25: What is the purpose of Word Sense Disambiguation (WSD)?
- A. Correcting spelling errors in strings.
- B. Identifying which semantic meaning of a word is intended based on the surrounding context (e.g. 'bass' fish vs 'bass' instrument). β (correct answer)
- C. Compacting vocabulary arrays.
- D. Encrypting token payloads.
Explanation: WSD resolves ambiguity by using the surrounding words as evidence for which sense is meant (e.g. 'caught a bass' suggests the fish, 'played the bass' the instrument).
Question 26: What does 'NLP' stand for?
- A. Network Layer Protocol
- B. Natural Language Processing β (correct answer)
- C. Numerical Log Parser
- D. Node Loop Process
Explanation: Natural Language Processing combines computer science and linguistics to enable computers to process human language.
Question 27: How does Byte-Pair Encoding (BPE) build a subword vocabulary?
- A. By mapping characters to random integer numbers.
- B. By iteratively merging the most frequent pairs of bytes or characters in a text corpus. β (correct answer)
- C. By checking dictionary definitions.
- D. By encrypting string payloads.
Explanation: BPE builds subwords dynamically, letting tokenizers handle unknown words gracefully.
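Example: a compact sketch of the BPE training loop at character level. It starts from single characters and repeatedly merges the most frequent adjacent pair. Production implementations add end-of-word markers, work on word frequency tables or bytes, and define tie-breaking precisely; those details are omitted here.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most common adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for symbols in words:
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += 1
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(symbols, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(corpus_words, num_merges):
    """Return the learned merges and the corpus re-segmented with them."""
    words = [list(word) for word in corpus_words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [merge_pair(symbols, pair) for symbols in words]
    return merges, words


merges, segmented = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(segmented)  # [['low'], ['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't']]
```

After two merges the frequent stem 'low' has become a single token, while the rarer endings remain split into characters.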
Question 28: What does a perplexity score measure in language models?
- A. The processing latency of queries.
- B. How well a probability model predicts a sample text (lower perplexity represents better prediction accuracy). β (correct answer)
- C. The number of layers in a neural net.
- D. The vocabulary array size.
Explanation: Perplexity is the exponentiated cross-entropy loss, indicating token predictability.
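Example: perplexity computed from the probabilities a model assigned to each actual next token, i.e. exp of the average negative log-probability. The probability lists below are made-up inputs for illustration.

```python
import math


def perplexity(token_probabilities):
    """Return exp(mean negative log-probability) over the observed tokens."""
    if not token_probabilities:
        raise ValueError("need at least one token probability")
    cross_entropy = -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)
    return math.exp(cross_entropy)


print(perplexity([1.0, 1.0, 1.0]))  # 1.0: every token was predicted with certainty
print(perplexity([0.5, 0.5]))       # close to 2: like guessing between two options
```

A perplexity of k can be read as the model being, on average, as uncertain as if it were choosing uniformly among k tokens.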
Question 29: What is the difference between extractive and abstractive summarization?
- A. Extractive summarization copies key sentences directly from the source, while abstractive summarization paraphrases and generates new sentences to summarize the text. β (correct answer)
- B. Abstractive is faster and uses less CPU memory.
- C. Extractive is unsupervised, while Abstractive is supervised.
- D. Extractive only works on database tables.
Explanation: Abstractive models require generative architectures to write new summaries.
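Example: a naive extractive summarizer that scores each sentence by the average corpus frequency of its words and returns the top-scoring sentences verbatim. The scoring rule is an assumption chosen for brevity; it is one simple heuristic, not a standard algorithm. An abstractive summarizer cannot be sketched this way because it must generate new text.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(word_freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return [s for s in sentences if s in top]


text = "Cats purr. Cats and dogs are pets. Birds fly."
print(extractive_summary(text))  # ['Cats purr.']
```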
Question 30: Which of the following is a popular Python library for standard NLP tasks like tokenization and parsing?
- A. NumPy
- B. NLTK (or spaCy) β (correct answer)
- C. PyTorch
- D. Flask
Explanation: NLTK and spaCy are standard libraries for natural language processing.