Extract query times from the HiveServer2 Interactive log

Ever been asked to extract Hive queries along with the time each one took to execute? Here is one way to do it from the HiveServer2 Interactive logs.

Steps

  1. Run the two extracts below against the log files (the three-argument match() requires GNU awk)
    awk 'match($0, "^([^ ]+).*Completed executing command\\(queryId=([0-9a-z_-]+)\\); Time taken: (.*)", a) {print "COMPLETE\t" a[1] "\t" a[2] "\t" a[3]}' hiveserver2Interactive.log* > complete.txt
    awk 'match($0, "^([^ ]+).*Executing command\\(queryId=([0-9a-z_-]+)\\): (.*)", a) {print "EXECUTING\t" a[1] "\t" a[2] "\t" a[3]}' hiveserver2Interactive.log* > executing.txt
  2. Import the data into an MS Access database
    Prepare two tables, Complete and Executing, from complete.txt and executing.txt respectively
  3. Run the query below to join the two tables and produce the report (Field2 is the timestamp, Field3 the queryId, and Field4 the query text in Executing / the time taken in Complete)
    SELECT e.[Field2], c.[Field2], c.[Field4], e.[Field4] FROM [Complete] AS c INNER JOIN [Executing] AS e ON e.[Field3] = c.[Field3] WHERE e.[Field4] LIKE 'SELECT%';
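The same extract-and-join can also be sketched in pure Python, without awk or Access, using the two regular expressions from step 1. The sample log lines below are made up to illustrate the format the patterns expect; real log lines carry more detail between the fields:

```python
import re

# Made-up sample lines in the shape the awk patterns above target.
LOG_LINES = [
    "2019-09-23T10:00:01,123 INFO ... Executing command("
    "queryId=hive_20190923100001_abc-123): SELECT * FROM t",
    "2019-09-23T10:00:05,456 INFO ... Completed executing command("
    "queryId=hive_20190923100001_abc-123); Time taken: 4.2 seconds",
]

EXECUTING = re.compile(
    r"^([^ ]+).*Executing command\(queryId=([0-9a-z_-]+)\): (.*)")
COMPLETE = re.compile(
    r"^([^ ]+).*Completed executing command\(queryId=([0-9a-z_-]+)\); Time taken: (.*)")


def match_queries(lines):
    """Pair each executing entry with its completion, keyed on queryId."""
    executing, complete = {}, {}
    for line in lines:
        m = COMPLETE.search(line)
        if m:
            complete[m.group(2)] = m.group(3)                 # queryId -> time taken
            continue
        m = EXECUTING.search(line)
        if m:
            executing[m.group(2)] = (m.group(1), m.group(3))  # queryId -> (ts, sql)
    return [(qid, ts, sql, complete[qid])
            for qid, (ts, sql) in executing.items() if qid in complete]
```

This mirrors the Access join in step 3: queries that never completed simply drop out, like rows lost to the inner join.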


Additional code for analysis

Once the files from step 1 have been generated, we can run the Python code below to match the executing/complete pairs by queryId and export the result.

*Bonus* – the output is also filtered on a query match criterion 😉

import os
from os import path

import pandas as pd


def write_to_excel(data_frame, excel_filename):
    sheet_name = 'Sheet1'

    if path.exists(excel_filename):
        # Append below the last used row of the existing sheet
        # (mode='a' with if_sheet_exists='overlay' needs pandas >= 1.4).
        with pd.ExcelWriter(excel_filename, engine='openpyxl', mode='a',
                            if_sheet_exists='overlay') as writer:
            start_row = writer.sheets[sheet_name].max_row
            data_frame.to_excel(writer, sheet_name=sheet_name,
                                startrow=start_row, index=False, header=False)
    else:
        data_frame.to_excel(excel_filename, sheet_name=sheet_name,
                            index=False, header=False)


def process_hive_logs(complete_file, executing_file, query_match_criteria):
    # Load the two tab-separated extracts produced by the awk commands.
    df_e = pd.read_csv(executing_file, header=None, sep='\t',
                       names=['e_state', 'e_start_time', 'hive_id', 'sql_query'])
    df_c = pd.read_csv(complete_file, header=None, sep='\t',
                       names=['c_state', 'c_start_time', 'hive_id', 'time_taken'])

    # Join executing/complete rows on queryId.
    df_inner = pd.merge(df_e, df_c, on='hive_id', how='inner')

    # Keep only the queries matching the criteria (na=False drops empty rows).
    return df_inner[df_inner['sql_query'].str.match(query_match_criteria,
                                                    case=False, na=False)]


def main():
    c1 = r'c:\data-exchange\hive_log_analysis\2019-09-23\old\complete.txt'
    e1 = r'c:\data-exchange\hive_log_analysis\2019-09-23\old\executing.txt'

    df = process_hive_logs(c1, e1, 'select.*my_database.*')
    write_to_excel(df, os.path.join(os.path.dirname(c1), 'output.xlsx'))


if __name__ == "__main__":
    main()
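To sanity-check the merge-and-filter logic without real log files, here is a self-contained rerun of it on a couple of made-up rows (qid_1/qid_2 and the table contents are invented; my_database is just a placeholder database name):

```python
import io

import pandas as pd

# Made-up in-memory stand-ins for executing.txt / complete.txt; the real
# files are the tab-separated extracts produced by the awk commands in step 1.
executing_txt = io.StringIO(
    "EXECUTING\t2019-09-23T10:00:01\tqid_1\tSELECT a FROM my_database.t\n"
    "EXECUTING\t2019-09-23T10:00:02\tqid_2\tINSERT INTO t VALUES (1)\n"
)
complete_txt = io.StringIO(
    "COMPLETE\t2019-09-23T10:00:05\tqid_1\t4.2 seconds\n"
    "COMPLETE\t2019-09-23T10:00:06\tqid_2\t1.0 seconds\n"
)

df_e = pd.read_csv(executing_txt, header=None, sep='\t',
                   names=['e_state', 'e_start_time', 'hive_id', 'sql_query'])
df_c = pd.read_csv(complete_txt, header=None, sep='\t',
                   names=['c_state', 'c_start_time', 'hive_id', 'time_taken'])

# Inner join on queryId, then keep only the matching SELECTs.
df = pd.merge(df_e, df_c, on='hive_id', how='inner')
df = df[df['sql_query'].str.match('select.*my_database', case=False, na=False)]
print(df[['hive_id', 'sql_query', 'time_taken']].to_string(index=False))
```

Only qid_1 survives: qid_2 completes too, but the INSERT is filtered out by the match criterion.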
