snakebite – Python HDFS client
Source: https://snakebite.readthedocs.io/en/latest/client.html
Example:
>>> from snakebite.client import Client
>>> client = Client("localhost", 8020, use_trash=False)
>>> for x in client.ls(['/']):
...     print x
Warning
Many methods return generators, which means they need to be consumed to execute! The documentation explicitly specifies which methods return generators.
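As an illustration of the warning above, a hedged sketch (the consumption pattern is what matters; the listing of '/' mirrors the earlier example):
>>> gen = client.ls(['/'])   # nothing has executed yet
>>> results = list(gen)      # consuming the generator is what runs the operation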
Note
paths parameters in methods are often passed as lists, since operations can work on multiple paths.
Note
Parameters like include_children and recurse are not used when paths contain globs.
Note
Different Hadoop distributions use different protocol versions. Snakebite defaults to 9, but this can be set by passing in the hadoop_version parameter to the constructor.
Parameters: - host (string) – Hostname or IP address of the NameNode
- port (int) – RPC Port of the NameNode
- hadoop_version (int) – What hadoop protocol version should be used (default: 9)
- use_trash (boolean) – Use a trash when removing files.
- effective_user (string) – Effective user for the HDFS operations (default: None – current user)
- use_sasl (boolean) – Use SASL authentication or not
- hdfs_namenode_principal (string) – Kerberos principal to use for HDFS
- sock_connect_timeout (int) – Socket connection timeout in seconds
- sock_request_timeout (int) – Request timeout in seconds
- use_datanode_hostname (boolean) – Use hostname instead of IP address to communicate with datanodes
cat(paths, check_crc=False)
Fetch all files that match the source file pattern and display their content on stdout.
Parameters: - paths (list of strings) – Paths to display
- check_crc (boolean) – Check for checksum errors
Returns: a generator that yields strings
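A hedged usage sketch; the path '/some/file.txt' is a hypothetical example, not from the library docs:
>>> for content in client.cat(['/some/file.txt']):  # hypothetical path
...     print content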
chgrp(paths, group, recurse=False)
Change the group of paths.
Parameters: - paths (list) – List of paths to chgrp
- group – New group
- recurse (boolean) – Recursive chgrp
Returns: a generator that yields dictionaries
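A hedged usage sketch; the path and group name are hypothetical:
>>> list(client.chgrp(['/some/dir'], 'analytics', recurse=True))  # hypothetical path and group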
chmod(paths, mode, recurse=False)
Change the mode for paths. This returns a list of maps containing the result of the operation.
Parameters: - paths (list) – List of paths to chmod
- mode (int) – Octal mode (e.g. 0o755)
- recurse (boolean) – Recursive chmod
Returns: a generator that yields dictionaries
Note
The top level directory is always included when recurse=True
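A hedged usage sketch; the path is hypothetical:
>>> list(client.chmod(['/some/dir'], 0o755, recurse=True))  # hypothetical path; 0o755 = rwxr-xr-x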
chown(paths, owner, recurse=False)
Change the owner for paths. The owner can be specified as user or user:group.
Parameters: - paths (list) – List of paths to chown
- owner (string) – New owner
- recurse (boolean) – Recursive chown
Returns: a generator that yields dictionaries
The top level is always included when recursing.
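A hedged usage sketch; the path and owner are hypothetical:
>>> list(client.chown(['/some/dir'], 'hdfs:supergroup', recurse=True))  # hypothetical path and owner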
copyToLocal(paths, dst, check_crc=False)
Copy files that match the file source pattern to the local filesystem. The source is kept. When copying multiple files, the destination must be a directory.
Parameters: - paths (list of strings) – Paths to copy
- dst (string) – Destination path
- check_crc (boolean) – Check for checksum errors
Returns: a generator that yields strings
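A hedged usage sketch; both paths are hypothetical:
>>> list(client.copyToLocal(['/some/file.txt'], '/tmp'))  # hypothetical HDFS path and local destination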
count(paths)
Count files in a path
Parameters: paths (list) – List of paths to count
Returns: a generator that yields dictionaries
Examples:
>>> list(client.count(['/']))
[{'spaceConsumed': 260185L, 'quota': 2147483647L, 'spaceQuota': 18446744073709551615L, 'length': 260185L, 'directoryCount': 9L, 'path': '/', 'fileCount': 34L}]
delete(paths, recurse=False)
Delete paths
Parameters: - paths (list) – Paths to delete
- recurse (boolean) – Recursive delete (use with care!)
Returns: a generator that yields dictionaries
Note
Recursive deletion uses the NameNode recursive deletion functionality instead of letting the client recurse. Hadoop's own client recurses by itself and thus shows all files and directories that are deleted; Snakebite doesn't.
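A hedged usage sketch; the path is hypothetical:
>>> list(client.delete(['/some/old_dir'], recurse=True))  # hypothetical path; recursive delete, use with care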
df()
Get FS information
Returns: a dictionary
Examples:
>>> client.df()
{'used': 491520L, 'capacity': 120137519104L, 'under_replicated': 0L, 'missing_blocks': 0L, 'filesystem': 'hdfs://www.robin.eu.org:8020', 'remaining': 19669295104L, 'corrupt_blocks': 0L}
du(paths, include_toplevel=False, include_children=True)
Returns size information for paths
Parameters: - paths (list) – Paths to du
- include_toplevel (boolean) – Include the given path in the result. If the path is a file, include_toplevel is always True.
- include_children (boolean) – Include child nodes in the result.
Returns: a generator that yields dictionaries
Examples:
Children:
>>> list(client.du(['/']))
[{'path': '/Makefile', 'length': 6783L}, {'path': '/build', 'length': 244778L}, {'path': '/index.asciidoc', 'length': 100L}, {'path': '/source', 'length': 8524L}]
Directory only:
>>> list(client.du(['/'], include_toplevel=True, include_children=False))
[{'path': '/', 'length': 260185L}]
getmerge(path, dst, newline=False, check_crc=False)
Get all the files in the directories that match the source file pattern, and merge and sort them into a single file on the local filesystem.
Parameters: - path (string) – Directory containing files that will be merged
- dst (string) – Path of the file that will be written
- newline (boolean) – Add a newline character at the end of each file.
Returns: string content of the merged file at dst
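A hedged usage sketch; the source directory and local destination file are hypothetical:
>>> client.getmerge('/some/parts', '/tmp/merged.txt', newline=True)  # hypothetical source dir and local file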
ls(paths, recurse=False, include_toplevel=False, include_children=True)
Issues an 'ls' command and returns a list of maps that contain file info.
Parameters: - paths (list) – Paths to list
- recurse (boolean) – Recursive listing
- include_toplevel (boolean) – Include the given path in the listing. If the path is a file, include_toplevel is always True.
- include_children (boolean) – Include child nodes in the listing.
Returns: a generator that yields dictionaries
Examples:
Directory listing
>>> list(client.ls(["/"])) [{'group': u'supergroup', 'permission': 420, 'file_type': 'f', 'access_time': 1367317324982L, 'block_replication': 1, 'modification_time': 1367317325346L, 'length': 6783L, 'blocksize': 134217728L, 'owner': u'wouter', 'path': '/Makefile'}, {'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1367317325431L, 'length': 0L, 'blocksize': 0L, 'owner': u'wouter', 'path': '/build'}, {'group': u'supergroup', 'permission': 420, 'file_type': 'f', 'access_time': 1367317326510L, 'block_replication': 1, 'modification_time': 1367317326522L, 'length': 100L, 'blocksize': 134217728L, 'owner': u'wouter', 'path': '/index.asciidoc'}, {'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1367317326628L, 'length': 0L, 'blocksize': 0L, 'owner': u'wouter', 'path': '/source'}]
File listing
>>> list(client.ls(["/Makefile"])) [{'group': u'supergroup', 'permission': 420, 'file_type': 'f', 'access_time': 1367317324982L, 'block_replication': 1, 'modification_time': 1367317325346L, 'length': 6783L, 'blocksize': 134217728L, 'owner': u'wouter', 'path': '/Makefile'}]
Get directory information
>>> list(client.ls(["/source"], include_toplevel=True, include_children=False)) [{'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1367317326628L, 'length': 0L, 'blocksize': 0L, 'owner': u'wouter', 'path': '/source'}]
mkdir(paths, create_parent=False, mode=493)
Create a directory.
Parameters: - paths (list of strings) – Paths to create
- create_parent (boolean) – Also create the parent directories
- mode (int) – Mode the directory should be created with
Returns: a generator that yields dictionaries
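A hedged usage sketch; the path is hypothetical:
>>> list(client.mkdir(['/some/new/dir'], create_parent=True, mode=0o755))  # hypothetical path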
rename(paths, dst)
Rename (move) path(s) to a destination
Parameters: - paths (list) – Source paths
- dst (string) – destination
Returns: a generator that yields dictionaries
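A hedged usage sketch; the source and destination paths are hypothetical:
>>> list(client.rename(['/some/old_name'], '/some/new_name'))  # hypothetical paths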
rename2(path, dst, overwriteDest=False)
Rename (but don't move) a path to a destination.
By "only renaming", we mean that you can't move a file or folder out of or into another folder. The renaming can only happen within the folder the file or folder lies in.
Note that this operation "always succeeds" unless an exception is raised; hence, the dict returned from this function doesn't have the 'result' key.
Since you can't move with this operation, and can only rename, it would not make sense to pass multiple paths to rename to a single destination. This method uses the underlying rename2 method.
Of the exceptions the underlying rename2 call can raise, this method only wraps the FileAlreadyExistsException. You will also get a FileAlreadyExistsException if you have overwriteDest=True and the destination folder is not empty. The other exceptions will just be passed along.
Parameters: - path (string) – Source path
- dst (string) – destination
Returns: A dictionary or None
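A hedged usage sketch; both paths are hypothetical and, as described above, stay within the same folder:
>>> client.rename2('/some/dir/old_name', '/some/dir/new_name', overwriteDest=False)  # hypothetical paths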
rmdir(paths)
Delete a directory
Parameters: paths (list) – Paths to delete
Returns: a generator that yields dictionaries
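A hedged usage sketch; the path is hypothetical:
>>> list(client.rmdir(['/some/empty_dir']))  # hypothetical path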
serverdefaults(force_reload=False)
Get server defaults, caching the results. If there are no results saved, or the force_reload flag is True, it will query the HDFS server for its default parameter values. Otherwise, it will simply return the results it has already queried.
Note: This function returns a copy of the results loaded from the server, so you can manipulate or change them as you’d like. If for any reason you need to change the results the client saves, you must access the property client._server_defaults directly.
Parameters: force_reload (bool) – Should the server defaults be reloaded even if they already exist?
Returns: dictionary with the following keys: blockSize, bytesPerChecksum, writePacketSize, replication, fileBufferSize, encryptDataTransfer, trashInterval, checksumType
Example:
>>> client.serverdefaults()
[{'writePacketSize': 65536, 'fileBufferSize': 4096, 'replication': 1, 'bytesPerChecksum': 512, 'trashInterval': 0L, 'blockSize': 134217728L, 'encryptDataTransfer': False, 'checksumType': 2}]
setrep(paths, replication, recurse=False)
Set the replication factor for paths
Parameters: - paths (list) – Paths
- replication – Replication factor
- recurse (boolean) – Apply the replication factor recursively
Returns: a generator that yields dictionaries
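A hedged usage sketch; the path and replication factor are hypothetical:
>>> list(client.setrep(['/some/file.txt'], 3))  # hypothetical path; replication factor 3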
stat(paths)
Stat a file.
Parameters: paths (string) – Path
Returns: a dictionary
Example:
>>> client.stat(['/index.asciidoc'])
{'blocksize': 134217728L, 'owner': u'wouter', 'length': 100L, 'access_time': 1367317326510L, 'group': u'supergroup', 'permission': 420, 'file_type': 'f', 'path': '/index.asciidoc', 'modification_time': 1367317326522L, 'block_replication': 1}
tail(path, tail_length=1024, append=False)
Show the end of the file – default 1KB, supports up to the Hadoop block size.
Parameters: - path (string) – Path to read
- tail_length (int) – The length to read from the end of the file – default 1KB, up to block size.
- append (bool) – Currently not implemented
Returns: a generator that yields strings
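A hedged usage sketch; the path and tail length are hypothetical:
>>> for chunk in client.tail('/some/logfile.log', tail_length=2048):  # hypothetical path
...     print chunk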
test(path, exists=False, directory=False, zero_length=False)
Test if a path exists, is a directory, or has zero length.
Parameters: - path (string) – Path to test
- exists (boolean) – Check if the path exists
- directory (boolean) – Check if the path is a directory
- zero_length (boolean) – Check if the path is zero-length
Returns: a boolean
Note
directory and zero length are AND’d.
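A hedged usage sketch; the path is hypothetical:
>>> client.test('/some/dir', exists=True, directory=True)  # hypothetical path; True only if it exists and is a directory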
text(paths, check_crc=False)
Takes a source file and outputs the file in text format. The allowed formats are gzip and bzip2
Parameters: - paths (list of strings) – Paths to display
- check_crc (boolean) – Check for checksum errors
Returns: a generator that yields strings
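A hedged usage sketch; the gzip-compressed path is hypothetical:
>>> for content in client.text(['/some/file.gz']):  # hypothetical path to a gzip file
...     print content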
touchz(paths, replication=None, blocksize=None)
Create a zero-length file or update the timestamp on a zero-length file.
Parameters: - paths (list) – Paths
- replication – Replication factor
- blocksize (int) – Block size (in bytes) of the newly created file
Returns: a generator that yields dictionaries
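A hedged usage sketch; the path is hypothetical:
>>> list(client.touchz(['/some/empty_marker']))  # hypothetical path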
- class snakebite.client.AutoConfigClient(hadoop_version=9, effective_user=None, use_sasl=False)
A pure Python HDFS client that supports HA and is auto-configured through the HADOOP_HOME environment variable. HAClient is fully backwards compatible with the vanilla Client and can be used for a non-HA cluster as well. This client tries to read ${HADOOP_HOME}/conf/hdfs-site.xml and ${HADOOP_HOME}/conf/core-site.xml to get the address of the namenode. The behaviour is the same as Client.
Example:
>>> from snakebite.client import AutoConfigClient
>>> client = AutoConfigClient()
>>> for x in client.ls(['/']):
...     print x
Note
Different Hadoop distributions use different protocol versions. Snakebite defaults to 9, but this can be set by passing in the hadoop_version parameter to the constructor.
Parameters: - hadoop_version (int) – What hadoop protocol version should be used (default: 9)
- effective_user (string) – Effective user for the HDFS operations (default: None – current user)
- use_sasl (boolean) – Use SASL authentication or not
- class snakebite.client.HAClient(namenodes, use_trash=False, effective_user=None, use_sasl=False, hdfs_namenode_principal=None, max_failovers=15, max_retries=10, base_sleep=500, max_sleep=15000, sock_connect_timeout=10000, sock_request_timeout=10000, use_datanode_hostname=False)
Snakebite client with support for High Availability
HAClient is fully backwards compatible with the vanilla Client and can be used for a non HA cluster as well.
Example:
>>> from snakebite.client import HAClient
>>> from snakebite.namenode import Namenode
>>> n1 = Namenode("namenode1.mydomain", 8020)
>>> n2 = Namenode("namenode2.mydomain", 8020)
>>> client = HAClient([n1, n2], use_trash=True)
>>> for x in client.ls(['/']):
...     print x
Note
Different Hadoop distributions use different protocol versions. Snakebite defaults to 9, but this can be set by passing in the version parameter to the Namenode class constructor.
Parameters: - namenodes (list) – Set of namenodes for HA setup
- use_trash (boolean) – Use a trash when removing files.
- effective_user (string) – Effective user for the HDFS operations (default: None – current user)
- use_sasl (boolean) – Use SASL authentication or not
- hdfs_namenode_principal (string) – Kerberos principal to use for HDFS
- max_failovers (int) – Number of failovers in case of connection issues
- max_retries (int) – Max number of retries for failures
- base_sleep (int) – Base sleep time for retries in milliseconds
- max_sleep (int) – Max sleep time for retries in milliseconds
- sock_connect_timeout (int) – Socket connection timeout in seconds
- sock_request_timeout (int) – Request timeout in seconds
- use_datanode_hostname (boolean) – Use hostname instead of IP address to communicate with datanodes