17 Python Interview Questions for Data Science

Slide 1: Python Dictionary Deep Dive

A dictionary is a mutable collection of key-value pairs in Python; since Python 3.7 it preserves insertion order. Dictionaries are hash tables under the hood, providing average constant-time lookups, insertions, and deletions, and they serve as the foundation for many other data structures.

# Creating and manipulating dictionaries
employee = {
    'name': 'John Smith',
    'age': 35,
    'department': 'Data Science',
    'skills': ['Python', 'SQL', 'Machine Learning']
}

# Dictionary operations
print(f"Employee name: {employee['name']}")
print(f"Skills: {', '.join(employee['skills'])}")

# Adding new key-value pair
employee['years_experience'] = 8

# Dictionary comprehension example
squared_nums = {x: x**2 for x in range(5)}
print(f"Squared numbers: {squared_nums}")

# Output:
# Employee name: John Smith
# Skills: Python, SQL, Machine Learning
# Squared numbers: {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
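A common interview follow-up is frequency counting with a plain dictionary. Here is a minimal sketch, using an illustrative word list, that relies on dict.get to supply a default count:

# Counting occurrences with a plain dictionary (illustrative data)
words = ['python', 'sql', 'python', 'pandas', 'sql', 'python']
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1  # default to 0 when the key is absent
print(f"Word counts: {counts}")

# Output:
# Word counts: {'python': 3, 'sql': 2, 'pandas': 1}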

Slide 2: Essential Python Libraries for Data Science

The Python ecosystem offers powerful libraries that form the backbone of data science workflows: NumPy provides fast N-dimensional array operations, Pandas handles tabular data manipulation, Scikit-learn offers machine learning tools, and Matplotlib/Seaborn enable data visualization.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# NumPy array operations
array = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Array shape: {array.shape}")

# Pandas DataFrame creation
df = pd.DataFrame({
    'A': np.random.randn(5),
    'B': np.random.randint(0, 100, 5)
})
print("\nDataFrame head:\n", df.head())

# Matplotlib visualization
plt.figure(figsize=(8, 4))
plt.plot(df['A'], df['B'], 'o-')
plt.title('Sample Plot')
plt.close()  # Closing to prevent display

Slide 3: Advanced Function Arguments

Python functions support several kinds of arguments: positional, keyword, variable-length positional arguments (*args), and variable-length keyword arguments (**kwargs). This flexibility enables highly adaptable and reusable code components for data processing and analysis.

def process_data(data, 
                threshold=0.5, 
                *additional_params,
                **config):
    """
    Example function demonstrating different argument types
    """
    print(f"Main data: {data}")
    print(f"Threshold: {threshold}")
    print(f"Additional parameters: {additional_params}")
    print(f"Configuration: {config}")
    
    return data * threshold

# Function usage examples
result = process_data(
    100,
    0.75,
    'extra1', 'extra2',
    normalize=True,
    verbose=False
)

# Output:
# Main data: 100
# Threshold: 0.75
# Additional parameters: ('extra1', 'extra2')
# Configuration: {'normalize': True, 'verbose': False}
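A related detail that comes up in interviews: parameters declared after a bare * (or after *args) are keyword-only. A minimal sketch with an illustrative function:

def scale(data, *, factor=2):
    """Multiply every element by factor; factor must be passed by keyword."""
    return [x * factor for x in data]

print(scale([1, 2, 3], factor=10))

# Output:
# [10, 20, 30]
# scale([1, 2, 3], 10) would raise TypeError because factor is keyword-only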

Slide 4: Conditional Logic Implementation

Python's if/elif/else statements provide control flow across multiple conditions and compound boolean expressions. Understanding complex conditional logic is crucial for implementing business rules and data filtering in data science applications.

def classify_data_point(value, threshold_low=10, threshold_high=50):
    """
    Classifies data points based on multiple thresholds
    """
    if not isinstance(value, (int, float)):
        raise TypeError("Value must be numeric")
    
    if value < threshold_low:
        category = 'low'
        risk_score = 0.2
    elif threshold_low <= value < threshold_high:
        category = 'medium'
        risk_score = 0.5
    else:
        category = 'high'
        risk_score = 0.8
        
    return {
        'value': value,
        'category': category,
        'risk_score': risk_score
    }

# Example usage
samples = [5, 25, 75]
results = [classify_data_point(x) for x in samples]
print("Classification results:", results)

Slide 5: Capital Letter Counter Implementation

This implementation demonstrates file handling, string manipulation, and character analysis in Python. The solution uses context managers for proper resource handling and provides detailed statistics about capital letters in text files.

def analyze_capital_letters(filename):
    """
    Analyzes capital letters in a text file
    Returns dictionary with statistics
    """
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            text = file.read()
            
        capital_counts = {}
        total_capitals = 0
        
        for char in text:
            if char.isupper():
                capital_counts[char] = capital_counts.get(char, 0) + 1
                total_capitals += 1
                
        return {
            'total_capitals': total_capitals,
            'unique_capitals': len(capital_counts),
            'distribution': capital_counts
        }
                
    except FileNotFoundError:
        return {"error": "File not found"}
    except Exception as e:
        return {"error": str(e)}

# Example usage with sample file
# Assuming 'sample.txt' contains: "Hello World! Python Programming"
result = analyze_capital_letters('sample.txt')
print(f"Analysis results: {result}")

Slide 6: Python Data Types Deep Dive

Understanding Python's data types is crucial for efficient memory usage and performance optimization in data science applications. Built-in types include numeric (int, float, complex), sequences (list, tuple, range), text sequence (str), and more specialized types.

def analyze_data_types():
    # Numeric types
    integer_val = 42
    float_val = 3.14159
    complex_val = 3 + 4j
    
    # Sequence types
    list_val = [1, 'text', 3.14]
    tuple_val = (1, 2, 3)
    range_val = range(5)
    
    # Text and binary types
    str_val = "Python"
    bytes_val = b"Python"
    
    # Set and mapping types
    set_val = {1, 2, 3}
    dict_val = {'key': 'value'}
    
    # Memory analysis
    type_sizes = {
        'integer': integer_val.__sizeof__(),
        'float': float_val.__sizeof__(),
        'complex': complex_val.__sizeof__(),
        'list': list_val.__sizeof__(),
        'tuple': tuple_val.__sizeof__(),
        'string': str_val.__sizeof__()
    }
    
    return type_sizes

# Example output
sizes = analyze_data_types()
for type_name, size in sizes.items():
    print(f"{type_name}: {size} bytes")

Slide 7: Lists vs Tuples Performance Analysis

Lists and tuples have distinct characteristics affecting performance and memory usage. Tuples are immutable and generally more memory-efficient, while lists offer flexibility for data modification but with additional memory overhead.

import sys
import timeit

def compare_sequences():
    # Create test data
    data = list(range(1000))
    
    # Memory comparison
    list_mem = sys.getsizeof(data)
    tuple_mem = sys.getsizeof(tuple(data))
    
    # Performance comparison
    list_time = timeit.timeit(
        lambda: [x * 2 for x in data],
        number=10000
    )
    
    tuple_time = timeit.timeit(
        lambda: tuple(x * 2 for x in data),
        number=10000
    )
    
    return {
        'memory': {
            'list': list_mem,
            'tuple': tuple_mem,
            'difference': list_mem - tuple_mem
        },
        'performance': {
            'list_operation': list_time,
            'tuple_operation': tuple_time,
            'difference': list_time - tuple_time
        }
    }

results = compare_sequences()
print(f"Memory and Performance Analysis:\n{results}")

Slide 8: Lambda Functions and Functional Programming

Lambda functions provide concise, anonymous function definitions crucial for data transformations and functional programming paradigms. They excel in data processing pipelines and when used with higher-order functions like map, filter, and reduce.

from functools import reduce
import pandas as pd

# Data processing pipeline using lambda functions
data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Complex data transformation pipeline
result = (data
    .pipe(lambda x: x * 2)      # Double values
    .apply(lambda x: x ** 2)    # Square values
    .loc[lambda x: x > 50]      # Keep large values (Series.filter selects by label, not value)
    .agg(['sum', 'mean', 'std']))  # Aggregate statistics

# Functional programming example
numbers = range(1, 11)
pipeline = reduce(
    lambda x, func: func(x),
    [
        lambda x: filter(lambda n: n % 2 == 0, x),
        lambda x: map(lambda n: n ** 2, x),
        lambda x: list(x)
    ],
    numbers
)

print(f"Pipeline result:\n{result}")
print(f"Functional result: {pipeline}")

Slide 9: List Comprehensions and Generator Expressions

List comprehensions and generator expressions provide elegant and efficient ways to process sequences. While list comprehensions create new lists in memory, generator expressions offer memory-efficient iteration for large datasets.

import sys
import timeit

def compare_list_processing():
    # Data preparation
    numbers = range(1000000)
    
    # Memory usage with list comprehension
    def using_list_comp():
        return sys.getsizeof(
            [x ** 2 for x in numbers if x % 2 == 0]
        )
    
    # Memory usage with generator expression
    def using_generator():
        return sys.getsizeof(
            (x ** 2 for x in numbers if x % 2 == 0)
        )
    
    # Performance comparison
    list_comp_time = timeit.timeit(
        lambda: [x ** 2 for x in range(1000) if x % 2 == 0],
        number=1000
    )
    
    gen_exp_time = timeit.timeit(
        lambda: list(x ** 2 for x in range(1000) if x % 2 == 0),
        number=1000
    )
    
    return {
        'memory': {
            'list_comprehension': using_list_comp(),
            'generator_expression': using_generator()
        },
        'performance': {
            'list_comprehension': list_comp_time,
            'generator_expression': gen_exp_time
        }
    }

results = compare_list_processing()
print(f"Comparison Results:\n{results}")

Slide 10: Understanding Negative Indexing

Negative indexing provides intuitive access to sequence elements from the end, enhancing code readability and reducing the need for length-based calculations. This feature is particularly useful in data preprocessing and analysis tasks.

def demonstrate_negative_indexing():
    # Sample sequence data
    sequence = list(range(10))
    
    # Dictionary to store different indexing examples
    indexing_examples = {
        'last_element': sequence[-1],
        'last_three': sequence[-3:],
        'reverse_slice': sequence[::-1],
        'skip_backwards': sequence[::-2],
        'complex_slice': sequence[-5:-2],
        'rotate_right': sequence[-3:] + sequence[:-3]  # rotate the list right by 3 positions
    }
    
    # Practical application: Rolling window calculation
    def rolling_window(data, window_size):
        return [
            data[max(i-window_size+1, 0):i+1] 
            for i in range(len(data))
        ]
    
    window_example = rolling_window(sequence, 3)
    
    return {
        'basic_examples': indexing_examples,
        'rolling_window': window_example
    }

results = demonstrate_negative_indexing()
print(f"Negative Indexing Examples:\n{results}")

Slide 11: Advanced Pandas Operations

Pandas provides sophisticated data manipulation capabilities essential for data science. Understanding DataFrame operations, including handling missing values, merging datasets, and performing complex transformations, is crucial for effective data analysis.

import pandas as pd
import numpy as np

def advanced_pandas_demo():
    # Create sample datasets
    df1 = pd.DataFrame({
        'ID': range(1, 6),
        'Value': np.random.randn(5),
        'Category': ['A', 'B', 'A', 'C', 'B']
    })
    
    df2 = pd.DataFrame({
        'ID': range(3, 8),
        'Score': np.random.randint(60, 100, 5)
    })
    
    # Advanced operations
    results = {
        # Group by operations with multiple aggregations
        'group_stats': df1.groupby('Category').agg({
            'Value': ['mean', 'std', 'count']
        }),
        
        # Complex merge operation
        'merged_data': pd.merge(
            df1, df2, 
            on='ID', 
            how='outer'
        ).fillna({'Score': df2['Score'].mean()}),
        
        # Window functions
        'rolling_stats': df1.assign(
            rolling_mean=df1['Value'].rolling(
                window=2, 
                min_periods=1
            ).mean()
        )
    }
    
    return results

demo_results = advanced_pandas_demo()
for key, df in demo_results.items():
    print(f"\n{key}:\n", df)

Slide 12: Missing Value Analysis in Pandas

Missing value handling is a critical aspect of data preprocessing. Pandas offers multiple strategies for detecting, analyzing, and handling missing values through various imputation techniques and filtering methods.

def missing_value_analysis():
    """
    Comprehensive missing value analysis and handling
    """
    # Create sample dataset with missing values
    df = pd.DataFrame({
        'A': [1, np.nan, 3, np.nan, 5],
        'B': [np.nan, 2, 3, 4, 5],
        'C': [1, 2, np.nan, 4, 5],
        'D': [1, 2, 3, 4, np.nan]
    })
    
    analysis = {
        # Missing value count per column
        'missing_count': df.isnull().sum(),
        
        # Missing value percentage
        'missing_percentage': (df.isnull().sum() / len(df)) * 100,
        
        # Pattern analysis
        'missing_patterns': df.isnull().value_counts(),
        
        # Correlation of missingness
        'missing_correlation': df.isnull().corr(),
        
        # Various imputation methods
        'mean_imputed': df.fillna(df.mean()),
        'forward_filled': df.ffill(),
        'backward_filled': df.bfill(),
        
        # Interpolation
        'interpolated': df.interpolate(method='linear')
    }
    
    return analysis

results = missing_value_analysis()
for key, value in results.items():
    print(f"\n{key}:\n", value)

Slide 13: DataFrame Column Selection and Manipulation

Efficient column selection and manipulation are fundamental skills in data analysis. This implementation demonstrates various methods for selecting, filtering, and transforming DataFrame columns using Pandas.

import pandas as pd
import numpy as np

def demonstrate_column_operations():
    # Create sample employees DataFrame
    employees = pd.DataFrame({
        'Department': ['IT', 'HR', 'Finance', 'IT', 'Marketing'],
        'Age': [28, 35, 42, 30, 45],
        'Salary': [75000, 65000, 85000, 78000, 72000],
        'Experience': [3, 8, 12, 5, 15]
    })
    
    operations = {
        # Basic column selection
        'basic_selection': employees[['Department', 'Age']],
        
        # Conditional selection
        'filtered_selection': employees.loc[
            employees['Age'] > 35,
            ['Department', 'Salary']
        ],
        
        # Column creation with transformation
        'derived_columns': employees.assign(
            Salary_Category=lambda x: pd.qcut(
                x['Salary'],
                q=3,
                labels=['Low', 'Medium', 'High']
            ),
            Experience_Years=lambda x: x['Experience'].astype(str) + ' years'
        ),
        
        # Complex transformation
        'calculated_metrics': employees.assign(
            Salary_per_Year_Experience=lambda x: x['Salary'] / x['Experience'],
            Above_Average_Age=lambda x: x['Age'] > x['Age'].mean()
        )
    }
    
    return operations

results = demonstrate_column_operations()
for operation, df in results.items():
    print(f"\n{operation}:\n", df)

Slide 14: Adding Columns with Complex Logic

This implementation showcases techniques for adding columns to DataFrames with complex business logic, using vectorized operations such as np.where, np.select, and pd.qcut instead of row-wise loops to keep performance high.

import pandas as pd
import numpy as np

def enhance_employee_data():
    # Create sample DataFrame
    df = pd.DataFrame({
        'employee_id': range(1001, 1006),
        'base_salary': [60000, 75000, 65000, 80000, 70000],
        'years_experience': [2, 5, 3, 7, 4],
        'department': ['IT', 'Sales', 'IT', 'Marketing', 'Sales'],
        'performance_score': [85, 92, 78, 95, 88]
    })
    
    # Add multiple columns with complex logic
    enhanced_df = df.assign(
        # Salary adjustment based on experience
        experience_multiplier=lambda x: np.where(
            x['years_experience'] > 5,
            1.5,
            1.2
        ),
        
        # Complex bonus calculation
        bonus=lambda x: (
            x['base_salary'] * 
            (x['performance_score'] / 100) * 
            (x['years_experience'] / 10)
        ),
        
        # Department-specific allowance
        dept_allowance=lambda x: np.select(
            [
                x['department'] == 'IT',
                x['department'] == 'Sales',
                x['department'] == 'Marketing'
            ],
            [5000, 4000, 3000],
            default=2000
        ),
        
        # Performance category
        performance_category=lambda x: pd.qcut(
            x['performance_score'],
            q=3,
            labels=['Improving', 'Meeting', 'Exceeding']
        )
    )
    
    # Calculate total compensation
    enhanced_df['total_compensation'] = (
        enhanced_df['base_salary'] * 
        enhanced_df['experience_multiplier'] +
        enhanced_df['bonus'] +
        enhanced_df['dept_allowance']
    )
    
    return enhanced_df

result = enhance_employee_data()
print("Enhanced Employee Data:\n", result)

Slide 15: Data Visualization with Python

Advanced data visualization techniques using matplotlib and seaborn for creating insightful visualizations of employee data distributions and relationships between variables.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def create_employee_visualizations(df):
    # Set a seaborn theme for better-looking visualizations
    sns.set_theme(style='whitegrid')
    
    # Create figure with subplots
    fig = plt.figure(figsize=(15, 10))
    
    # Age distribution
    plt.subplot(2, 2, 1)
    sns.histplot(
        data=df,
        x='Age',
        bins=20,
        kde=True
    )
    plt.title('Age Distribution')
    
    # Salary by Department
    plt.subplot(2, 2, 2)
    sns.boxplot(
        data=df,
        x='Department',
        y='Salary',
        palette='viridis'
    )
    plt.title('Salary Distribution by Department')
    
    # Experience vs Salary
    plt.subplot(2, 2, 3)
    sns.scatterplot(
        data=df,
        x='Experience',
        y='Salary',
        hue='Department',
        size='Age',
        sizes=(50, 200)
    )
    plt.title('Experience vs Salary')
    
    # Performance Score Distribution
    plt.subplot(2, 2, 4)
    sns.violinplot(
        data=df,
        x='Department',
        y='performance_score',
        palette='magma'
    )
    plt.title('Performance Score Distribution')
    
    plt.tight_layout()
    return fig

# Example usage with sample data
sample_df = pd.DataFrame({
    'Age': np.random.normal(35, 8, 100),
    'Salary': np.random.normal(75000, 15000, 100),
    'Experience': np.random.randint(1, 20, 100),
    'Department': np.random.choice(['IT', 'Sales', 'HR'], 100),
    'performance_score': np.random.normal(85, 10, 100)
})

visualization = create_employee_visualizations(sample_df)
plt.close()  # Close to prevent display
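If the dashboard should be persisted rather than discarded, the figure can be written to disk before the close call; a one-line sketch (the filename is illustrative):

visualization.savefig('employee_dashboard.png', dpi=150, bbox_inches='tight')  # call before plt.close()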

Slide 16: Popular Python IDEs for Data Science

A comparison of popular Python IDEs for data science work, with illustrative feature lists and ratings, focusing on capabilities that enhance productivity in data analysis and machine learning tasks.

def analyze_ide_features():
    ide_comparison = {
        'jupyter_lab': {
            'features': [
                'Interactive notebooks',
                'Integrated plots',
                'Cell-based execution',
                'Rich media output'
            ],
            'best_for': 'Data exploration and visualization',
            'performance_score': 9.0,
            'memory_usage': 'Medium'
        },
        'pycharm': {
            'features': [
                'Advanced debugging',
                'Git integration',
                'Database tools',
                'Scientific mode'
            ],
            'best_for': 'Large scale projects',
            'performance_score': 8.5,
            'memory_usage': 'High'
        },
        'vscode': {
            'features': [
                'Jupyter integration',
                'Extensions ecosystem',
                'Remote development',
                'Integrated terminal'
            ],
            'best_for': 'All-purpose development',
            'performance_score': 9.5,
            'memory_usage': 'Low'
        }
    }
    
    # Convert to DataFrame for better visualization
    ide_df = pd.DataFrame.from_dict(
        ide_comparison,
        orient='index'
    )
    
    return ide_df

ide_analysis = analyze_ide_features()
print("IDE Comparison:\n", ide_analysis)
