snakebite – Python HDFS client
Source: https://snakebite.readthedocs.io/en/latest/client.html
Example:
>>> from snakebite.client import Client
>>> client = Client("localhost", 8020, use_trash=False)
>>> for x in client.ls(['/']):
...     print x
Warning
Many methods return generators, which means they need to be consumed to execute! The documentation explicitly specifies which methods return generators.
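As an illustration of the warning above, a hedged sketch (the consumption pattern is what matters; the listing of '/' mirrors the earlier example):
>>> gen = client.ls(['/'])   # nothing has executed yet
>>> results = list(gen)      # consuming the generator is what runs the operation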
Note
paths parameters in methods are often passed as lists, since operations can work on multiple paths.
Note
Parameters like include_children and recurse are not used when paths contain globs.
Note
Different Hadoop distributions use different protocol versions. Snakebite defaults to 9, but this can be set by passing in the hadoop_version parameter to the constructor.
Parameters: - host (string) – Hostname or IP address of the NameNode
- port (int) – RPC Port of the NameNode
- hadoop_version (int) – What hadoop protocol version should be used (default: 9)
- use_trash (boolean) – Use a trash when removing files.
- effective_user (string) – Effective user for the HDFS operations (default: None – current user)
- use_sasl (boolean) – Use SASL authentication or not
- hdfs_namenode_principal (string) – Kerberos principal to use for HDFS
- sock_connect_timeout (int) – Socket connection timeout in seconds
- sock_request_timeout (int) – Request timeout in seconds
- use_datanode_hostname (boolean) – Use hostname instead of IP address to communicate with datanodes
cat(paths, check_crc=False)
Fetch all files that match the source file pattern and display their content on stdout.
Parameters: - paths (list of strings) – Paths to display
- check_crc (boolean) – Check for checksum errors
Returns: a generator that yields strings
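A hedged usage sketch; the path '/some/file.txt' is a hypothetical example, not from the library docs:
>>> for content in client.cat(['/some/file.txt']):  # hypothetical path
...     print content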
chgrp(paths, group, recurse=False)
Change the group of paths.
Parameters: - paths (list) – List of paths to chgrp
- group – New group
- recurse (boolean) – Recursive chgrp
Returns: a generator that yields dictionaries
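A hedged usage sketch; the path and group name are hypothetical:
>>> list(client.chgrp(['/some/dir'], 'analytics', recurse=True))  # hypothetical path and group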
chmod(paths, mode, recurse=False)
Change the mode for paths. This returns a list of maps containing the result of the operation.
Parameters: - paths (list) – List of paths to chmod
- mode (int) – Octal mode (e.g. 0o755)
- recurse (boolean) – Recursive chmod
Returns: a generator that yields dictionaries
Note
The top level directory is always included when recurse=True
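A hedged usage sketch; the path is hypothetical:
>>> list(client.chmod(['/some/dir'], 0o755, recurse=True))  # hypothetical path; 0o755 = rwxr-xr-x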
chown(paths, owner, recurse=False)
Change the owner for paths. The owner can be specified as user or user:group.
Parameters: - paths (list) – List of paths to chown
- owner (string) – New owner
- recurse (boolean) – Recursive chown
Returns: a generator that yields dictionaries
The top level is always included when recursing.
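A hedged usage sketch; the path and owner are hypothetical:
>>> list(client.chown(['/some/dir'], 'hdfs:supergroup', recurse=True))  # hypothetical path and owner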
copyToLocal(paths, dst, check_crc=False)
Copy files that match the file source pattern to the local filesystem. The source is kept. When copying multiple files, the destination must be a directory.
Parameters: - paths (list of strings) – Paths to copy
- dst (string) – Destination path
- check_crc (boolean) – Check for checksum errors
Returns: a generator that yields strings
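A hedged usage sketch; both paths are hypothetical:
>>> list(client.copyToLocal(['/some/file.txt'], '/tmp'))  # hypothetical HDFS path and local destination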
count(paths)
Count files in a path
Parameters: paths (list) – List of paths to count
Returns: a generator that yields dictionaries
Examples:
>>> list(client.count(['/']))
[{'spaceConsumed': 260185L, 'quota': 2147483647L, 'spaceQuota': 18446744073709551615L, 'length': 260185L, 'directoryCount': 9L, 'path': '/', 'fileCount': 34L}]
delete(paths, recurse=False)
Delete paths
Parameters: - paths (list) – Paths to delete
- recurse (boolean) – Recursive delete (use with care!)
Returns: a generator that yields dictionaries
Note
Recursive deletion uses the NameNode recursive deletion functionality instead of letting the client recurse. Hadoop's own client recurses by itself and thus shows all files and directories that are deleted; Snakebite doesn't.
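A hedged usage sketch; the path is hypothetical:
>>> list(client.delete(['/some/old_dir'], recurse=True))  # hypothetical path; recursive delete, use with care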
df()
Get FS information
Returns: a dictionary
Examples:
>>> client.df()
{'used': 491520L, 'capacity': 120137519104L, 'under_replicated': 0L, 'missing_blocks': 0L, 'filesystem': 'hdfs://www.robin.eu.org:8020', 'remaining': 19669295104L, 'corrupt_blocks': 0L}
du(paths, include_toplevel=False, include_children=True)
Returns size information for paths
Parameters: - paths (list) – Paths to du
- include_toplevel (boolean) – Include the given path in the result. If the path is a file, include_toplevel is always True.
- include_children (boolean) – Include child nodes in the result.
Returns: a generator that yields dictionaries
Examples:
Children:
>>> list(client.du(['/']))
[{'path': '/Makefile', 'length': 6783L}, {'path': '/build', 'length': 244778L}, {'path': '/index.asciidoc', 'length': 100L}, {'path': '/source', 'length': 8524L}]
Directory only:
>>> list(client.du(['/'], include_toplevel=True, include_children=False))
[{'path': '/', 'length': 260185L}]
getmerge(path, dst, newline=False, check_crc=False)
Get all the files in the directories that match the source file pattern, and merge and sort them into a single file on the local filesystem.
Parameters: - path (string) – Directory containing files that will be merged
- dst (string) – Path of the file that will be written
- newline (boolean) – Add a newline character at the end of each file.
Returns: string content of the merged file at dst
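A hedged usage sketch; the source directory and local destination file are hypothetical:
>>> client.getmerge('/some/parts', '/tmp/merged.txt', newline=True)  # hypothetical source dir and local file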
ls(paths, recurse=False, include_toplevel=False, include_children=True)
Issues an 'ls' command and returns a list of maps that contain file info.
Parameters: - paths (list) – Paths to list
- recurse (boolean) – Recursive listing
- include_toplevel (boolean) – Include the given path in the listing. If the path is a file, include_toplevel is always True.
- include_children (boolean) – Include child nodes in the listing.
Returns: a generator that yields dictionaries
Examples:
Directory listing
>>> list(client.ls(["/"])) [{'group': u'supergroup', 'permission': 420, 'file_type': 'f', 'access_time': 1367317324982L, 'block_replication': 1, 'modification_time': 1367317325346L, 'length': 6783L, 'blocksize': 134217728L, 'owner': u'wouter', 'path': '/Makefile'}, {'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1367317325431L, 'length': 0L, 'blocksize': 0L, 'owner': u'wouter', 'path': '/build'}, {'group': u'supergroup', 'permission': 420, 'file_type': 'f', 'access_time': 1367317326510L, 'block_replication': 1, 'modification_time': 1367317326522L, 'length': 100L, 'blocksize': 134217728L, 'owner': u'wouter', 'path': '/index.asciidoc'}, {'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1367317326628L, 'length': 0L, 'blocksize': 0L, 'owner': u'wouter', 'path': '/source'}]
File listing
>>> list(client.ls(["/Makefile"])) [{'group': u'supergroup', 'permission': 420, 'file_type': 'f', 'access_time': 1367317324982L, 'block_replication': 1, 'modification_time': 1367317325346L, 'length': 6783L, 'blocksize': 134217728L, 'owner': u'wouter', 'path': '/Makefile'}]
Get directory information
>>> list(client.ls(["/source"], include_toplevel=True, include_children=False)) [{'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1367317326628L, 'length': 0L, 'blocksize': 0L, 'owner': u'wouter', 'path': '/source'}]
mkdir(paths, create_parent=False, mode=493)
Create a directory.
Parameters: - paths (list of strings) – Paths to create
- create_parent (boolean) – Also create the parent directories
- mode (int) – Mode the directory should be created with
Returns: a generator that yields dictionaries
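A hedged usage sketch; the path is hypothetical:
>>> list(client.mkdir(['/some/new/dir'], create_parent=True, mode=0o755))  # hypothetical path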
rename(paths, dst)
Rename (move) path(s) to a destination
Parameters: - paths (list) – Source paths
- dst (string) – destination
Returns: a generator that yields dictionaries
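A hedged usage sketch; the source and destination paths are hypothetical:
>>> list(client.rename(['/some/old_name'], '/some/new_name'))  # hypothetical paths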
rename2(path, dst, overwriteDest=False)
Rename (but don't move) a path to a destination.
By "only renaming", we mean that you can't move a file or folder out of or into another folder. The renaming can only happen within the folder the file or folder lies in.
Note that this operation "always succeeds" unless an exception is raised; hence, the dict returned from this function doesn't have the 'result' key.
Since you can't move with this operation, and can only rename, it would not make sense to pass multiple paths to rename to a single destination. This method uses the underlying rename2 method.
Of the exceptions the underlying rename2 call can raise, this method only wraps the FileAlreadyExistsException. You will also get a FileAlreadyExistsException if you have overwriteDest=True and the destination folder is not empty. The other exceptions will just be passed along.
Parameters: - path (string) – Source path
- dst (string) – destination
Returns: A dictionary or None
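A hedged usage sketch; both paths are hypothetical and, as described above, stay within the same folder:
>>> client.rename2('/some/dir/old_name', '/some/dir/new_name', overwriteDest=False)  # hypothetical paths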
rmdir(paths)
Delete a directory
Parameters: paths (list) – Paths to delete
Returns: a generator that yields dictionaries
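A hedged usage sketch; the path is hypothetical:
>>> list(client.rmdir(['/some/empty_dir']))  # hypothetical path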
serverdefaults(force_reload=False)
Get server defaults, caching the results. If there are no results saved, or the force_reload flag is True, it will query the HDFS server for its default parameter values. Otherwise, it will simply return the results it has already queried.
Note: This function returns a copy of the results loaded from the server, so you can manipulate or change them as you’d like. If for any reason you need to change the results the client saves, you must access the property client._server_defaults directly.
Parameters: force_reload (bool) – Should the server defaults be reloaded even if they already exist?
Returns: dictionary with the following keys: blockSize, bytesPerChecksum, writePacketSize, replication, fileBufferSize, encryptDataTransfer, trashInterval, checksumType
Example:
>>> client.serverdefaults()
[{'writePacketSize': 65536, 'fileBufferSize': 4096, 'replication': 1, 'bytesPerChecksum': 512, 'trashInterval': 0L, 'blockSize': 134217728L, 'encryptDataTransfer': False, 'checksumType': 2}]
setrep(paths, replication, recurse=False)
Set the replication factor for paths
Parameters: - paths (list) – Paths
- replication – Replication factor
- recurse (boolean) – Apply the replication factor recursively
Returns: a generator that yields dictionaries
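A hedged usage sketch; the path and replication factor are hypothetical:
>>> list(client.setrep(['/some/file.txt'], 3))  # hypothetical path; replication factor 3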
stat(paths)
Stat a file.
Parameters: paths (string) – Path
Returns: a dictionary
Example:
>>> client.stat(['/index.asciidoc'])
{'blocksize': 134217728L, 'owner': u'wouter', 'length': 100L, 'access_time': 1367317326510L, 'group': u'supergroup', 'permission': 420, 'file_type': 'f', 'path': '/index.asciidoc', 'modification_time': 1367317326522L, 'block_replication': 1}
tail(path, tail_length=1024, append=False)
Show the end of the file – default 1KB, supports up to the Hadoop block size.
Parameters: - path (string) – Path to read
- tail_length (int) – The length to read from the end of the file – default 1KB, up to block size.
- append (bool) – Currently not implemented
Returns: a generator that yields strings
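A hedged usage sketch; the path and tail length are hypothetical:
>>> for chunk in client.tail('/some/logfile.log', tail_length=2048):  # hypothetical path
...     print chunk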
test(path, exists=False, directory=False, zero_length=False)
Test if a path exists, is a directory, or has zero length.
Parameters: - path (string) – Path to test
- exists (boolean) – Check if the path exists
- directory (boolean) – Check if the path is a directory
- zero_length (boolean) – Check if the path is zero-length
Returns: a boolean
Note
directory and zero length are AND’d.
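A hedged usage sketch; the path is hypothetical:
>>> client.test('/some/dir', exists=True, directory=True)  # hypothetical path; True only if it exists and is a directory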
text(paths, check_crc=False)
Takes a source file and outputs the file in text format. The allowed formats are gzip and bzip2
Parameters: - paths (list of strings) – Paths to display
- check_crc (boolean) – Check for checksum errors
Returns: a generator that yields strings
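A hedged usage sketch; the gzip-compressed path is hypothetical:
>>> for content in client.text(['/some/file.gz']):  # hypothetical path to a gzip file
...     print content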
touchz(paths, replication=None, blocksize=None)
Create a zero-length file or update the timestamp on a zero-length file.
Parameters: - paths (list) – Paths
- replication – Replication factor
- blocksize (int) – Block size (in bytes) of the newly created file
Returns: a generator that yields dictionaries
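A hedged usage sketch; the path is hypothetical:
>>> list(client.touchz(['/some/empty_marker']))  # hypothetical path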
- class snakebite.client.AutoConfigClient(hadoop_version=9, effective_user=None, use_sasl=False)
A pure Python HDFS client that supports HA and is auto-configured through the HADOOP_HOME environment variable. HAClient is fully backwards compatible with the vanilla Client and can be used for a non-HA cluster as well. This client tries to read ${HADOOP_HOME}/conf/hdfs-site.xml and ${HADOOP_HOME}/conf/core-site.xml to get the address of the namenode. The behaviour is the same as Client.
Example:
>>> from snakebite.client import AutoConfigClient
>>> client = AutoConfigClient()
>>> for x in client.ls(['/']):
...     print x
Note
Different Hadoop distributions use different protocol versions. Snakebite defaults to 9, but this can be set by passing in the hadoop_version parameter to the constructor.
Parameters: - hadoop_version (int) – What hadoop protocol version should be used (default: 9)
- effective_user (string) – Effective user for the HDFS operations (default: None – current user)
- use_sasl (boolean) – Use SASL authentication or not
- class snakebite.client.HAClient(namenodes, use_trash=False, effective_user=None, use_sasl=False, hdfs_namenode_principal=None, max_failovers=15, max_retries=10, base_sleep=500, max_sleep=15000, sock_connect_timeout=10000, sock_request_timeout=10000, use_datanode_hostname=False)
Snakebite client with support for High Availability
HAClient is fully backwards compatible with the vanilla Client and can be used for a non HA cluster as well.
Example:
>>> from snakebite.client import HAClient
>>> from snakebite.namenode import Namenode
>>> n1 = Namenode("namenode1.mydomain", 8020)
>>> n2 = Namenode("namenode2.mydomain", 8020)
>>> client = HAClient([n1, n2], use_trash=True)
>>> for x in client.ls(['/']):
...     print x
Note
Different Hadoop distributions use different protocol versions. Snakebite defaults to 9, but this can be set by passing in the version parameter to the Namenode class constructor.
Parameters: - namenodes (list) – Set of namenodes for HA setup
- use_trash (boolean) – Use a trash when removing files.
- effective_user (string) – Effective user for the HDFS operations (default: None – current user)
- use_sasl (boolean) – Use SASL authentication or not
- hdfs_namenode_principal (string) – Kerberos principal to use for HDFS
- max_failovers (int) – Number of failovers in case of connection issues
- max_retries (int) – Max number of retries for failures
- base_sleep (int) – Base sleep time for retries in milliseconds
- max_sleep (int) – Max sleep time for retries in milliseconds
- sock_connect_timeout (int) – Socket connection timeout in seconds
- sock_request_timeout (int) – Request timeout in seconds
- use_datanode_hostname (boolean) – Use hostname instead of IP address to communicate with datanodes