Back to Databricks questions
CodingSoftware Engineer

Anagrammed indexOf

Role: Software Engineer


Problem Overview

Implement a function that finds the first occurrence of a substring in a given string where the substring is an anagram (any permutation) of a query string.

Function Signature:

python
def anagram_index(lookup_string: str, query: str) -> int

Parameters:

  • lookup_string: The string to search within

  • query: The pattern string whose anagram we're looking for

Returns:

  • The starting index of the first substring in lookup_string that is an anagram of query

  • -1 if no such substring exists

Key Concept: Two strings are anagrams if they contain the same characters with the same frequencies, regardless of order. For example, "tad", "atd", and "dat" are all anagrams of each other.

Examples

Example 1: Basic anagram match

python
anagram_index('databricks', 'tad')  # Returns: 0
anagram_index('databricks', 'atd')  # Returns: 0
anagram_index('databricks', 'dat')  # Returns: 0

Explanation: The substring "dat" at index 0-2 contains the same characters as the queries "tad", "atd", and "dat" (all have 1 'd', 1 'a', 1 't').

Example 2: Anagram found later in string

python
anagram_index('databricks', 'ribkc')  # Returns: 4

Explanation: The substring "brick" at index 4-8 is an anagram of "ribkc" (both have 1 'r', 1 'i', 1 'b', 1 'k', 1 'c').

Example 3: Short query

python
anagram_index('databricks', 'sk')  # Returns: 8

Explanation: The substring "ks" at index 8-9 is an anagram of "sk".

Example 4: No match found

python
anagram_index('databricks', 'abc')  # Returns: -1

Explanation: There is no substring of length 3 in "databricks" that contains all of 'a', 'b', and 'c'.

Example 5: Query longer than lookup string

python
anagram_index('data', 'databricks')  # Returns: -1

Explanation: The query is longer than the lookup string, so no substring can possibly match.

Example 6: Empty strings

python
anagram_index('databricks', '')  # Returns: 0 (empty string matches at start)
anagram_index('', 'query')       # Returns: -1 (can't find non-empty query in empty string)
anagram_index('', '')            # Returns: 0 (empty matches empty at index 0)

Approach 1: Sliding Window with Character Frequency Map

Algorithm

The most efficient approach uses a sliding window technique with character frequency counting:

  • Edge Cases: Handle empty strings and length mismatches

  • Initialize Frequency Maps:

  • Count character frequencies in the query string

  • Count character frequencies in the first window of lookup_string

  • Sliding Window:

  • Compare the two frequency maps

  • If they match, return the current index

  • Slide the window: remove leftmost character, add next character

  • Update frequency map incrementally

  • Return -1 if no match found

Implementation

python
def anagram_index(lookup_string: str, query: str) -> int:
    # Edge case: empty query matches at index 0
    if len(query) == 0:
        return 0

    # Edge case: query longer than lookup string
    if len(query) > len(lookup_string):
        return -1

    # Build frequency map for query
    query_freq = {}
    for char in query:
        query_freq[char] = query_freq.get(char, 0) + 1

    # Build frequency map for first window
    window_freq = {}
    for i in range(len(query)):
        char = lookup_string[i]
        window_freq[char] = window_freq.get(char, 0) + 1

    # Check if first window matches
    if window_freq == query_freq:
        return 0

    # Slide the window
    for i in range(len(query), len(lookup_string)):
        # Add new character (right side of window)
        new_char = lookup_string[i]
        window_freq[new_char] = window_freq.get(new_char, 0) + 1

        # Remove old character (left side of window)
        old_char = lookup_string[i - len(query)]
        window_freq[old_char] -= 1
        if window_freq[old_char] == 0:
            del window_freq[old_char]

        # Check if current window matches
        if window_freq == query_freq:
            return i - len(query) + 1

    return -1

Complexity Analysis

  • Time Complexity: O(n + m)

  • O(m) to build the query frequency map, where m is the length of query

  • O(n) to slide the window across lookup_string, where n is the length of lookup_string

  • O(k) per window comparison, where k is the number of unique characters (typically k ≤ 26 for lowercase English letters)

  • Overall: O(n × k + m), but k is constant for fixed alphabets, so effectively O(n + m)

  • Space Complexity: O(k)

  • Two frequency maps, each containing at most k unique characters

  • Typically O(1) for fixed alphabets like lowercase English letters

Approach 2: Optimized with Character Count Matching

Instead of comparing entire dictionaries, we can optimize by tracking the number of matching character counts:

python
def anagram_index(lookup_string: str, query: str) -> int:
    if len(query) == 0:
        return 0

    if len(query) > len(lookup_string):
        return -1

    from collections import Counter

    query_freq = Counter(query)
    window_freq = Counter(lookup_string[:len(query)])

    # Count how many characters have matching frequencies
    def count_matches(freq1, freq2):
        matches = 0
        for char in freq1:
            if freq1[char] == freq2.get(char, 0):
                matches += 1
        return matches

    # Check if we need all characters in query_freq to match
    required_matches = len(query_freq)

    if window_freq == query_freq:
        return 0

    for i in range(len(query), len(lookup_string)):
        # Slide window
        new_char = lookup_string[i]
        old_char = lookup_string[i - len(query)]

        window_freq[new_char] += 1
        window_freq[old_char] -= 1

        if window_freq[old_char] == 0:
            del window_freq[old_char]

        # Check match
        if window_freq == query_freq:
            return i - len(query) + 1

    return -1

Approach 3: Array-Based Frequency Count (For ASCII/Limited Charset)

If the character set is limited (e.g., only lowercase English letters), we can use fixed-size arrays for O(1) comparison:

python
def anagram_index(lookup_string: str, query: str) -> int:
    if len(query) == 0:
        return 0

    if len(query) > len(lookup_string):
        return -1

    # Use array for frequency counting (assuming lowercase letters only)
    query_freq = [0] * 26
    window_freq = [0] * 26

    # Build query frequency array
    for char in query:
        query_freq[ord(char) - ord('a')] += 1

    # Build first window frequency array
    for i in range(len(query)):
        window_freq[ord(lookup_string[i]) - ord('a')] += 1

    # Check first window
    if query_freq == window_freq:
        return 0

    # Slide window
    for i in range(len(query), len(lookup_string)):
        # Add new character
        window_freq[ord(lookup_string[i]) - ord('a')] += 1
        # Remove old character
        window_freq[ord(lookup_string[i - len(query)]) - ord('a')] -= 1

        # Compare arrays
        if query_freq == window_freq:
            return i - len(query) + 1

    return -1

Optimization: This approach has O(1) comparison time if we track a "difference counter" instead of comparing entire arrays.

Edge Cases to Consider

  • Empty query: Should return 0 (empty string is considered to match at the beginning)

  • Empty lookup_string with non-empty query: Should return -1

  • Both strings empty: Should return 0

  • Query longer than lookup_string: Should return -1

  • No anagram exists: Should return -1

  • Single character strings: Should work correctly

  • Entire lookup_string is an anagram of query: Should return 0

  • Multiple matches: Should return the index of the first match

  • Case sensitivity: Clarify if 'A' and 'a' are considered the same

  • Special characters and spaces: Clarify if they should be included in matching

Test Cases

python
def test_anagram_index():
    # Basic tests
    assert anagram_index('databricks', 'tad') == 0
    assert anagram_index('databricks', 'atd') == 0
    assert anagram_index('databricks', 'dat') == 0

    # Anagram found later
    assert anagram_index('databricks', 'ribkc') == 4
    assert anagram_index('databricks', 'sk') == 8

    # No match
    assert anagram_index('databricks', 'abc') == -1
    assert anagram_index('databricks', 'xyz') == -1

    # Query longer than string
    assert anagram_index('data', 'databricks') == -1

    # Empty strings
    assert anagram_index('databricks', '') == 0
    assert anagram_index('', 'query') == -1
    assert anagram_index('', '') == 0

    # Single character
    assert anagram_index('a', 'a') == 0
    assert anagram_index('a', 'b') == -1
    assert anagram_index('abc', 'c') == 2

    # Entire string is anagram
    assert anagram_index('abc', 'bca') == 0
    assert anagram_index('abc', 'cab') == 0

    # Repeated characters
    assert anagram_index('aabbcc', 'abc') == 0
    assert anagram_index('aabbcc', 'bca') == 0
    assert anagram_index('aaabbbccc', 'abc') == 0

    # Multiple possible matches (return first)
    assert anagram_index('abcabc', 'abc') == 0
    assert anagram_index('xyzabc', 'bca') == 3

    print("All tests passed!")

test_anagram_index()

Follow-Up Questions

Note: The following follow-up questions are common variations and extensions of the problem, but were not explicitly mentioned in the original interview notes.

Follow-Up 1: Find All Anagram Indices

Question: Modify the function to return a list of all starting indices where anagrams of the query are found, not just the first one.

python
def find_all_anagram_indices(lookup_string: str, query: str) -> list[int]:
    # Similar sliding window approach, but collect all matching indices
    pass

Expected behavior:

python
find_all_anagram_indices('cbaebabacd', 'abc')  # Returns: [0, 6]
find_all_anagram_indices('abab', 'ab')         # Returns: [0, 1, 2]

This is equivalent to LeetCode 438: Find All Anagrams in a String.

Follow-Up 2: Case-Insensitive Matching

Question: Modify the function to perform case-insensitive anagram matching.

python
anagram_index('DataBricks', 'TAD')  # Should return: 0

Hint: Convert both strings to lowercase before processing.

Follow-Up 3: Ignore Spaces and Special Characters

Question: Modify the function to ignore spaces and non-alphabetic characters when matching anagrams.

python
anagram_index('data bricks!', 'tad')  # Should return: 0 (matching "dat" ignoring space)

Follow-Up 4: Optimize for Multiple Queries

Question: You need to find anagrams for multiple queries in the same lookup string. How would you optimize for this use case?

python
def multi_query_anagram_index(lookup_string: str, queries: list[str]) -> dict[str, int]:
    # Returns a dictionary mapping each query to its first anagram index
    pass

Considerations:

  • Can you preprocess the lookup string?

  • What if queries have the same length?

  • What data structures would help?

Follow-Up 5: Longest Anagram Substring

Question: Instead of finding an anagram of a given query, find the longest substring of the lookup string that is an anagram of any substring of the query.

Example:

python
longest_anagram_substring('databricks', 'abcdefghijk')
# Should find the longest substring of 'databricks' that can be formed
# using characters from 'abcdefghijk'

This is a harder variation involving more complex string analysis.