Data Science iPython Notebook 20160913
This notebook was prepared by Donne Martin. Source and license info is on GitHub.
- SSH to EC2
- Boto
- S3cmd
- s3-parallel-put
- S3DistCp
- Redshift
- Kinesis
- Lambda
Connect to an Ubuntu EC2 instance through SSH with the given key:
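A sketch with a placeholder key file and hostname (Ubuntu AMIs use the `ubuntu` login):

```bash
# Key file name and public DNS name are placeholders
ssh -i mykey.pem ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```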
Connect to an Amazon Linux EC2 instance through SSH with the given key:
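Same idea, except Amazon Linux AMIs use the `ec2-user` login:

```bash
# Key file name and public DNS name are placeholders
ssh -i mykey.pem ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```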
Boto is the official AWS SDK for Python.
Install Boto:
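For example, with pip:

```bash
pip install boto
```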
Configure boto by creating a ~/.boto file with the following:
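A minimal sketch of the file; the key values are placeholders, and boto reads them from the `[Credentials]` section:

```ini
[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```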
Work with S3:
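A minimal sketch using boto's S3 interface:

```python
import boto

# Uses the credentials configured in ~/.boto
s3 = boto.connect_s3()

# List all buckets in the account
for bucket in s3.get_all_buckets():
    print(bucket.name)
```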
Work with EC2:
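A minimal sketch using boto's EC2 interface (the region is an assumption; use the one your instances run in):

```python
import boto.ec2

ec2 = boto.ec2.connect_to_region('us-east-1')

# get_all_instances() returns reservations, each holding one or more instances
for reservation in ec2.get_all_instances():
    for instance in reservation.instances:
        print(instance.id, instance.state)
```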
Create a bucket and put an object in that bucket:
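A sketch with a placeholder bucket and key name (bucket names must be globally unique):

```python
import boto

s3 = boto.connect_s3()

# Placeholder bucket name; pick something globally unique
bucket = s3.create_bucket('my-example-bucket-20160913')

# Create a key (object) in the bucket and set its contents from a string
key = bucket.new_key('hello.txt')
key.set_contents_from_string('Hello from boto!')
```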
Each service supports a different set of commands. Refer to the following for more details:
- AWS Docs
- Boto Docs
Before I discovered S3cmd, I had been using the S3 console to do basic operations and boto to do more of the heavy lifting. However, sometimes I just want to hack away at a command line to do my work.
I've found S3cmd to be a great command line tool for interacting with S3 on AWS. S3cmd is written in Python, is open source, and is free even for commercial use. It offers more advanced features than those found in the AWS CLI.
Install s3cmd:
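For example, on a Debian/Ubuntu host (s3cmd is also available via pip):

```bash
sudo apt-get install s3cmd
```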
Running the following command will prompt you to enter your AWS access key and AWS secret key. To follow security best practices, make sure you are using an IAM account rather than the root account.
I also suggest enabling GPG encryption, which encrypts your data at rest, and enabling HTTPS, which encrypts your data in transit. Note that this might impact performance.
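The command is s3cmd's interactive configuration wizard:

```bash
s3cmd --configure
```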
Frequently used s3cmd commands:
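A few common operations (bucket and file names are placeholders):

```bash
# List all buckets
s3cmd ls

# List the contents of a bucket
s3cmd ls s3://my-bucket

# Upload a file to a bucket
s3cmd put myfile.txt s3://my-bucket/

# Download a file
s3cmd get s3://my-bucket/myfile.txt

# Delete a file
s3cmd del s3://my-bucket/myfile.txt
```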
s3-parallel-put is a great tool for uploading multiple files to S3 in parallel.
Install package dependencies:
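A sketch for a Debian/Ubuntu host; the tool needs git (to clone it) and Boto, and the exact dependency list may vary by version:

```bash
sudo apt-get install -y git python-pip
sudo pip install boto
```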
Clone the s3-parallel-put repo:
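Assuming the commonly referenced twpayne/s3-parallel-put repository on GitHub:

```bash
git clone https://github.com/twpayne/s3-parallel-put.git
```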
Setup AWS keys for s3-parallel-put:
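s3-parallel-put picks up Boto-style credentials from the environment; a sketch with placeholder values:

```bash
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
```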
Sample usage:
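A minimal sketch; the bucket, prefix, and source directory are placeholders:

```bash
s3-parallel-put --bucket=my-bucket --prefix=PREFIX SOURCE_DIR
```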
Dry run of putting the files in the current directory on S3 with the given S3 prefix, without first checking whether they already exist:
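A sketch of such a dry run; `--put=stupid` skips the existence check and `--dry-run` only reports what would be uploaded:

```bash
s3-parallel-put --bucket=my-bucket --put=stupid --dry-run --prefix=PREFIX ./
```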
S3DistCp is an extension of DistCp that is optimized to work with Amazon S3. S3DistCp is useful for combining smaller files into larger ones: given a pattern and a target size, it aggregates the matching input files. S3DistCp can also be used to transfer large volumes of data from S3 to your Hadoop cluster.
To run S3DistCp with the EMR command line, ensure you are using the proper version of Ruby:
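The legacy elastic-mapreduce CLI runs on Ruby 1.8.7; a sketch using rvm (the exact patch level is an assumption):

```bash
rvm use ruby-1.8.7-p374 --default
```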
The EMR command line below executes the following:
- Creates a master node and slave nodes of type m1.small
- Runs S3DistCp on the source bucket location and concatenates files that match the date regular expression, resulting in files that are roughly 1024 MB or 1 GB
- Places the results in the destination bucket
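A sketch of such a command using the legacy elastic-mapreduce CLI; the bucket names, date regular expression, and instance counts are placeholders, and the flags should be checked against your CLI version:

```bash
./elastic-mapreduce --create \
    --instance-group master --instance-type m1.small --instance-count 1 \
    --instance-group core --instance-type m1.small --instance-count 4 \
    --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    --args "--src,s3://my-source-bucket/logs/,\
--dest,s3://my-dest-bucket/combined/,\
--groupBy,.*([0-9]{4}-[0-9]{2}-[0-9]{2}).*,\
--targetSize,1024"
```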
For further optimization, compression can help save on AWS storage and bandwidth costs, speed up transfers between S3 and EMR, and reduce disk I/O. Note that compressed files are not always splittable by Hadoop. For example, Hadoop uses a single mapper per GZIP file, since it cannot split the compressed stream at arbitrary boundaries.
What type of compression should you use?
- Time sensitive job: Snappy or LZO
- Large amounts of data: GZIP
- General purpose: GZIP, as it's supported by most platforms
You can specify the compression codec (gzip, lzo, snappy, or none) to use for copied files with S3DistCp's outputCodec option. If no value is specified, files are copied with no compression change. The code below sets the compression to lzo:
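A sketch that adds the step to an existing job flow; the job flow id, buckets, and grouping pattern are placeholders:

```bash
./elastic-mapreduce --jobflow j-XXXXXXXXXXXX \
    --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    --args "--src,s3://my-source-bucket/logs/,\
--dest,s3://my-dest-bucket/compressed/,\
--groupBy,.*([0-9]{4}-[0-9]{2}-[0-9]{2}).*,\
--outputCodec,lzo"
```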
Copy values from the given S3 location containing CSV files to a Redshift cluster:
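A sketch of the Redshift COPY command; the table name, S3 path, and credentials are placeholders:

```sql
copy mytable
from 's3://my-bucket/data/csv/'
credentials 'aws_access_key_id=YOUR_ACCESS_KEY_ID;aws_secret_access_key=YOUR_SECRET_ACCESS_KEY'
csv;
```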
Copy values from the given location containing TSV files to a Redshift cluster:
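The same sketch for tab-separated files, using a tab delimiter instead of the csv option:

```sql
copy mytable
from 's3://my-bucket/data/tsv/'
credentials 'aws_access_key_id=YOUR_ACCESS_KEY_ID;aws_secret_access_key=YOUR_SECRET_ACCESS_KEY'
delimiter '\t';
```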