Anagrammed indexOf
Role: Software Engineer
Problem Overview
Implement a function that returns the starting index of the first substring of a given string that is an anagram (i.e., any permutation) of a query string.
Function Signature:

```python
def anagram_index(lookup_string: str, query: str) -> int
```

Parameters:
- lookup_string: The string to search within
- query: The pattern string whose anagram we're looking for

Returns:
- The starting index of the first substring in lookup_string that is an anagram of query
- -1 if no such substring exists
Key Concept: Two strings are anagrams if they contain the same characters with the same frequencies, regardless of order. For example, "tad", "atd", and "dat" are all anagrams of each other.
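This frequency-count view of anagrams can be checked directly with Python's `collections.Counter`; the helper name below is purely illustrative:

```python
from collections import Counter

def is_anagram(a: str, b: str) -> bool:
    # Two strings are anagrams exactly when their character counts match.
    return Counter(a) == Counter(b)

# Counter('tad'), Counter('atd'), and Counter('dat') all hold
# {'t': 1, 'a': 1, 'd': 1}, so these strings are anagrams of each other.
```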
Examples
Example 1: Basic anagram match
```python
anagram_index('databricks', 'tad')  # Returns: 0
anagram_index('databricks', 'atd')  # Returns: 0
anagram_index('databricks', 'dat')  # Returns: 0
```

Explanation: The substring "dat" at indices 0-2 contains the same characters as the queries "tad", "atd", and "dat" (all have 1 'd', 1 'a', 1 't').
Example 2: Anagram found later in string
```python
anagram_index('databricks', 'ribkc')  # Returns: 4
```

Explanation: The substring "brick" at indices 4-8 is an anagram of "ribkc" (both have 1 'r', 1 'i', 1 'b', 1 'k', 1 'c').
Example 3: Short query
```python
anagram_index('databricks', 'sk')  # Returns: 8
```

Explanation: The substring "ks" at indices 8-9 is an anagram of "sk".
Example 4: No match found
```python
anagram_index('databricks', 'abc')  # Returns: -1
```

Explanation: There is no substring of length 3 in "databricks" that contains all of 'a', 'b', and 'c'.
Example 5: Query longer than lookup string
```python
anagram_index('data', 'databricks')  # Returns: -1
```

Explanation: The query is longer than the lookup string, so no substring can possibly match.
Example 6: Empty strings
```python
anagram_index('databricks', '')  # Returns: 0 (empty string matches at start)
anagram_index('', 'query')       # Returns: -1 (can't find non-empty query in empty string)
anagram_index('', '')            # Returns: 0 (empty matches empty at index 0)
```

Approach 1: Sliding Window with Character Frequency Map
Algorithm
The most efficient approach uses a sliding window technique with character frequency counting:
1. Edge cases: handle empty strings and length mismatches.
2. Initialize frequency maps:
   - Count character frequencies in the query string
   - Count character frequencies in the first window of lookup_string
3. Sliding window:
   - Compare the two frequency maps
   - If they match, return the current index
   - Slide the window: remove the leftmost character, add the next character
   - Update the frequency map incrementally
4. Return -1 if no match is found.
Implementation
```python
def anagram_index(lookup_string: str, query: str) -> int:
    # Edge case: empty query matches at index 0
    if len(query) == 0:
        return 0

    # Edge case: query longer than lookup string
    if len(query) > len(lookup_string):
        return -1

    # Build frequency map for query
    query_freq = {}
    for char in query:
        query_freq[char] = query_freq.get(char, 0) + 1

    # Build frequency map for first window
    window_freq = {}
    for i in range(len(query)):
        char = lookup_string[i]
        window_freq[char] = window_freq.get(char, 0) + 1

    # Check if first window matches
    if window_freq == query_freq:
        return 0

    # Slide the window
    for i in range(len(query), len(lookup_string)):
        # Add new character (right side of window)
        new_char = lookup_string[i]
        window_freq[new_char] = window_freq.get(new_char, 0) + 1

        # Remove old character (left side of window)
        old_char = lookup_string[i - len(query)]
        window_freq[old_char] -= 1
        if window_freq[old_char] == 0:
            del window_freq[old_char]

        # Check if current window matches
        if window_freq == query_freq:
            return i - len(query) + 1

    return -1
```

Complexity Analysis
- Time Complexity: O(n + m)
  - O(m) to build the query frequency map, where m is the length of query
  - O(n) window positions across lookup_string, where n is the length of lookup_string
  - O(k) per window comparison, where k is the number of distinct characters (typically k ≤ 26 for lowercase English letters)
  - Overall: O(n × k + m), but k is constant for fixed alphabets, so effectively O(n + m)
- Space Complexity: O(k)
  - Two frequency maps, each containing at most k distinct characters
  - Typically O(1) for fixed alphabets like lowercase English letters
Approach 2: Optimized with Character Count Matching
Instead of comparing entire dictionaries, we can optimize by tracking the number of matching character counts:
```python
from collections import Counter

def anagram_index(lookup_string: str, query: str) -> int:
    if len(query) == 0:
        return 0
    if len(query) > len(lookup_string):
        return -1

    query_freq = Counter(query)
    window_freq = Counter(lookup_string[:len(query)])

    # Track how many distinct query characters currently have matching
    # counts in the window.  Because the window holds exactly len(query)
    # characters, all counts matching implies the window is an anagram,
    # so the per-step check is O(1) instead of a full map comparison.
    required_matches = len(query_freq)
    matches = sum(1 for char in query_freq
                  if window_freq[char] == query_freq[char])

    if matches == required_matches:
        return 0

    def adjust(char: str, delta: int) -> None:
        # Change window_freq[char] by delta, keeping `matches` in sync.
        nonlocal matches
        if char in query_freq and window_freq[char] == query_freq[char]:
            matches -= 1
        window_freq[char] += delta
        if char in query_freq and window_freq[char] == query_freq[char]:
            matches += 1

    for i in range(len(query), len(lookup_string)):
        adjust(lookup_string[i], +1)               # new character (right side)
        adjust(lookup_string[i - len(query)], -1)  # old character (left side)
        if matches == required_matches:
            return i - len(query) + 1

    return -1
```

Approach 3: Array-Based Frequency Count (For ASCII/Limited Charset)
If the character set is limited (e.g., only lowercase English letters), we can use fixed-size arrays for O(1) comparison:
```python
def anagram_index(lookup_string: str, query: str) -> int:
    if len(query) == 0:
        return 0
    if len(query) > len(lookup_string):
        return -1

    # Use arrays for frequency counting (assuming lowercase letters only)
    query_freq = [0] * 26
    window_freq = [0] * 26

    # Build query frequency array
    for char in query:
        query_freq[ord(char) - ord('a')] += 1

    # Build first window frequency array
    for i in range(len(query)):
        window_freq[ord(lookup_string[i]) - ord('a')] += 1

    # Check first window
    if query_freq == window_freq:
        return 0

    # Slide window
    for i in range(len(query), len(lookup_string)):
        # Add new character
        window_freq[ord(lookup_string[i]) - ord('a')] += 1
        # Remove old character
        window_freq[ord(lookup_string[i - len(query)]) - ord('a')] -= 1
        # Compare arrays
        if query_freq == window_freq:
            return i - len(query) + 1

    return -1
```

Optimization: This approach has O(1) comparison time if we track a "difference counter" instead of comparing entire arrays.
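The difference-counter idea can be sketched as follows: a single array stores the window count minus the query count per letter, and `mismatched` counts the letters where that difference is non-zero, so the window is an anagram exactly when `mismatched == 0`. The names `anagram_index_diff` and `bump` are illustrative, and the lowercase-letters-only assumption still applies:

```python
def anagram_index_diff(lookup_string: str, query: str) -> int:
    if len(query) == 0:
        return 0
    if len(query) > len(lookup_string):
        return -1

    diff = [0] * 26  # window count minus query count, per letter
    mismatched = 0   # number of letters with a non-zero difference

    def bump(char: str, delta: int) -> None:
        # Adjust one letter's difference, keeping `mismatched` in sync.
        nonlocal mismatched
        idx = ord(char) - ord('a')
        if diff[idx] == 0:
            mismatched += 1
        diff[idx] += delta
        if diff[idx] == 0:
            mismatched -= 1

    for char in query:
        bump(char, -1)

    for i, char in enumerate(lookup_string):
        bump(char, +1)                          # add right character
        if i >= len(query):
            bump(lookup_string[i - len(query)], -1)  # drop left character
        if i >= len(query) - 1 and mismatched == 0:
            return i - len(query) + 1
    return -1
```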
Edge Cases to Consider
- Empty query: Should return 0 (the empty string is considered to match at the beginning)
- Empty lookup_string with non-empty query: Should return -1
- Both strings empty: Should return 0
- Query longer than lookup_string: Should return -1
- No anagram exists: Should return -1
- Single-character strings: Should work correctly
- Entire lookup_string is an anagram of query: Should return 0
- Multiple matches: Should return the index of the first match
- Case sensitivity: Clarify whether 'A' and 'a' are considered the same
- Special characters and spaces: Clarify whether they should be included in matching
Test Cases
```python
def test_anagram_index():
    # Basic tests
    assert anagram_index('databricks', 'tad') == 0
    assert anagram_index('databricks', 'atd') == 0
    assert anagram_index('databricks', 'dat') == 0

    # Anagram found later
    assert anagram_index('databricks', 'ribkc') == 4
    assert anagram_index('databricks', 'sk') == 8

    # No match
    assert anagram_index('databricks', 'abc') == -1
    assert anagram_index('databricks', 'xyz') == -1

    # Query longer than string
    assert anagram_index('data', 'databricks') == -1

    # Empty strings
    assert anagram_index('databricks', '') == 0
    assert anagram_index('', 'query') == -1
    assert anagram_index('', '') == 0

    # Single character
    assert anagram_index('a', 'a') == 0
    assert anagram_index('a', 'b') == -1
    assert anagram_index('abc', 'c') == 2

    # Entire string is anagram
    assert anagram_index('abc', 'bca') == 0
    assert anagram_index('abc', 'cab') == 0

    # Repeated characters
    assert anagram_index('aabbcc', 'ab') == 1
    assert anagram_index('aabbcc', 'abb') == 1
    assert anagram_index('aabbcc', 'abc') == -1  # no length-3 window has one of each

    # Multiple possible matches (return first)
    assert anagram_index('abcabc', 'abc') == 0
    assert anagram_index('xyzabc', 'bca') == 3

    print("All tests passed!")

test_anagram_index()
```

Follow-Up Questions
Note: The following follow-up questions are common variations and extensions of the problem, but were not explicitly mentioned in the original interview notes.
Follow-Up 1: Find All Anagram Indices
Question: Modify the function to return a list of all starting indices where anagrams of the query are found, not just the first one.
```python
def find_all_anagram_indices(lookup_string: str, query: str) -> list[int]:
    # Similar sliding window approach, but collect all matching indices
    pass
```

Expected behavior:

```python
find_all_anagram_indices('cbaebabacd', 'abc')  # Returns: [0, 6]
find_all_anagram_indices('abab', 'ab')         # Returns: [0, 1, 2]
```

This is equivalent to LeetCode 438: Find All Anagrams in a String.
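One possible filling-in of the stub, reusing the sliding-window idea from Approach 1. Treating an empty query as matching nowhere is a design choice made here, not something the problem statement dictates:

```python
from collections import Counter

def find_all_anagram_indices(lookup_string: str, query: str) -> list[int]:
    # Design choice: an empty query yields no indices.
    if len(query) == 0 or len(query) > len(lookup_string):
        return []

    query_freq = Counter(query)
    window_freq = Counter(lookup_string[:len(query)])

    result = []
    if window_freq == query_freq:
        result.append(0)

    for i in range(len(query), len(lookup_string)):
        # Slide: add the entering character, drop the leaving one.
        window_freq[lookup_string[i]] += 1
        old_char = lookup_string[i - len(query)]
        window_freq[old_char] -= 1
        if window_freq[old_char] == 0:
            del window_freq[old_char]

        if window_freq == query_freq:
            result.append(i - len(query) + 1)
    return result
```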
Follow-Up 2: Case-Insensitive Matching
Question: Modify the function to perform case-insensitive anagram matching.
```python
anagram_index('DataBricks', 'TAD')  # Should return: 0
```

Hint: Convert both strings to lowercase before processing.
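A minimal sketch of the case-insensitive variant: normalize case once, then search as usual. For brevity this uses a straightforward O(n × m) scan rather than repeating the sliding-window code; `anagram_index_ci` is an illustrative name:

```python
from collections import Counter

def anagram_index_ci(lookup_string: str, query: str) -> int:
    # Normalize case first, then compare character counts per window.
    s, q = lookup_string.lower(), query.lower()
    if len(q) > len(s):
        return -1
    target = Counter(q)
    for start in range(len(s) - len(q) + 1):
        if Counter(s[start:start + len(q)]) == target:
            return start
    return -1
```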
Follow-Up 3: Ignore Spaces and Special Characters
Question: Modify the function to ignore spaces and non-alphabetic characters when matching anagrams.
```python
anagram_index('data bricks!', 'tad')  # Should return: 0 (matching "dat", ignoring the space)
```

Follow-Up 4: Optimize for Multiple Queries
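A sketch of one way to do this: filter non-alphabetic characters out of both strings, run the usual sliding-window search on the filtered text, and map any match back to an index in the original string so the return value still refers to lookup_string. `anagram_index_alpha` is an illustrative name:

```python
from collections import Counter

def anagram_index_alpha(lookup_string: str, query: str) -> int:
    # Keep only alphabetic characters, remembering original positions.
    positions = [i for i, c in enumerate(lookup_string) if c.isalpha()]
    filtered = ''.join(lookup_string[i] for i in positions)
    q = ''.join(c for c in query if c.isalpha())

    if len(q) == 0:
        return 0
    if len(q) > len(filtered):
        return -1

    query_freq = Counter(q)
    window_freq = Counter(filtered[:len(q)])
    if window_freq == query_freq:
        return positions[0]

    for i in range(len(q), len(filtered)):
        window_freq[filtered[i]] += 1
        old = filtered[i - len(q)]
        window_freq[old] -= 1
        if window_freq[old] == 0:
            del window_freq[old]
        if window_freq == query_freq:
            # Translate the filtered-string index back to the original.
            return positions[i - len(q) + 1]
    return -1
```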
Question: You need to find anagrams for multiple queries in the same lookup string. How would you optimize for this use case?
```python
def multi_query_anagram_index(lookup_string: str, queries: list[str]) -> dict[str, int]:
    # Returns a dictionary mapping each query to its first anagram index
    pass
```

Considerations:
- Can you preprocess the lookup string?
- What if queries have the same length?
- What data structures would help?
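One way to act on those considerations, sketched below: group the queries by length so that all queries of a given length share a single sliding-window pass, and key each group by a hashable frequency signature (a frozenset of Counter items). The function follows the stub's signature, but this grouping scheme is one possible design, not the only one:

```python
from collections import Counter

def multi_query_anagram_index(lookup_string: str, queries: list[str]) -> dict[str, int]:
    result = {q: -1 for q in queries}

    # Group queries by length, then by frequency signature, so every
    # window of a given length is examined once for all such queries.
    by_length: dict[int, dict[frozenset, list[str]]] = {}
    for q in queries:
        sig = frozenset(Counter(q).items())
        by_length.setdefault(len(q), {}).setdefault(sig, []).append(q)

    for length, groups in by_length.items():
        if length == 0:
            for qs in groups.values():
                for q in qs:
                    result[q] = 0  # empty query matches at index 0
            continue
        if length > len(lookup_string):
            continue

        window = Counter(lookup_string[:length])
        for start in range(len(lookup_string) - length + 1):
            if start > 0:
                # Slide: add the entering character, drop the leaving one.
                window[lookup_string[start + length - 1]] += 1
                old = lookup_string[start - 1]
                window[old] -= 1
                if window[old] == 0:
                    del window[old]
            sig = frozenset(window.items())
            if sig in groups:
                for q in groups.pop(sig):
                    result[q] = start
                if not groups:
                    break  # every query of this length is resolved
    return result
```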
Follow-Up 5: Longest Anagram Substring
Question: Instead of finding an anagram of a given query, find the longest substring of the lookup string that is an anagram of any substring of the query.
Example:
```python
longest_anagram_substring('databricks', 'abcdefghijk')
# Should find the longest substring of 'databricks' that can be formed
# using characters from 'abcdefghijk'
```

This is a harder variation involving more complex string analysis.
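Following the interpretation in the example's comment (treat the query as a pool of available characters), a variable-size sliding window works: grow the window on the right, and shrink it from the left whenever some character's count exceeds what the query supplies. This is a sketch of that reading only, not of the "anagram of any substring" phrasing:

```python
from collections import Counter

def longest_anagram_substring(lookup_string: str, query: str) -> str:
    # budget[c] is how many copies of c the query can supply.
    budget = Counter(query)
    best_start, best_len = 0, 0
    left = 0
    window = Counter()

    for right, char in enumerate(lookup_string):
        window[char] += 1
        # Shrink until the window no longer over-uses any character.
        while window[char] > budget[char]:
            window[lookup_string[left]] -= 1
            left += 1
        if right - left + 1 > best_len:
            best_start, best_len = left, right - left + 1

    return lookup_string[best_start:best_start + best_len]
```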