Extract query time from hiveserver2 Interactive log
Got in a situation where you were asked to extract hive queries and the time they took to execute?
Steps
- On log files run below 2 extracts
awk 'match($0, "^([^ ]+).*Completed executing command\\(queryId=([0-9a-z_-]+)\\); Time taken: (.*)", a) {print "COMPLETE\t" a[1] "\t" a[2] "\t" a[3]}' hiveserver2Interactive.log* > complete.txt awk 'match($0, "^([^ ]+).*Executing command\\(queryId=([0-9a-z_-]+)\\): (.*)", a) {print "EXECUTING\t" a[1] "\t" a[2] "\t" a[3]}' hiveserver2Interactive.log* > executing.txt
- Import data in ms access dbs
Prepare 2 tables complete and executing - Run below query to join the tables and prepare report
SELECT e.`Field2`, c.`Field2`, c.`Field4`, e.`Field4` FROM `Complete` c JOIN `Executing` e ON e.`Field3` = c.`Field3` WHERE e.`Field4` LIKE 'SELECT%';
Additional code for analysis
Once the files are generated from step 1 above. We could run the below python code to match and output the executing-complete queryid pairs,
*Bonus* – also filter output on match criteria 😉
import pandas as pd from openpyxl import load_workbook from os import path import os def write_to_excel(data_frame, excel_filename): writer = pd.ExcelWriter(excel_filename) start_row = 0 sheet_name = 'Sheet1' if path.exists(excel_filename): book = load_workbook(excel_filename) writer.book = book writer.sheets = {ws.title: ws for ws in book.worksheets} start_row = writer.sheets[sheet_name].max_row data_frame.to_excel(writer, sheet_name, startrow=start_row, index=False, header=False) writer.save() def process_hive_logs(complete_file, executing_file, query_match_criteria): df_e = pd.read_csv(executing_file, header=None, names=['e_state', 'e_start_time', 'hive_id', 'sql_query'], sep='\t') df_c = pd.read_csv(complete_file, header=None, names=['c_state', 'c_start_time', 'hive_id', 'time_taken'], sep='\t') df_inner = pd.merge(df_e, df_c, on='hive_id', how='inner') df_final = df_inner[df_inner['sql_query'].str.match(query_match_criteria, case=False) == True] return df_final def main(): c1 = r'c:\data-exchange\hive_log_analysis\2019-09-23\old\complete.txt' e1 = r'c:\data-exchange\hive_log_analysis\2019-09-23\old\executing.txt' df = process_hive_logs(c1, e1, 'select.*my_database*') write_to_excel(df, os.path.dirname(c1) + r'\output.xlsx') if __name__ == "__main__": main()