Are you looking to create your very own dataset for a new and innovative application? Or maybe you're trying to collect data for analysis for a college project and have become weary of manually downloading each image or CSV. Worry not: in this article I'll explain the building blocks needed to automate downloading files for these kinds of tasks.
Before you can create an application to download and create datasets for you, you'll need to know the basics of automating file downloads in Java code. Getting the basics right will help you adapt them to your own needs, whether for a backend server application or an Android app.
There are multiple ways to download a file using Java code. Here are just a few of them:
Java IO
The most readily available package for downloading a file from the internet using Java code is the Java IO package. Here we will use the BufferedInputStream and URL classes to read a file at a given address and write it to a file on our local system. We use BufferedInputStream rather than a plain InputStream because its buffering reduces the number of underlying read calls, which generally improves performance.
Before we dive deeper into the coding aspect, let's go over the classes and the individual methods we will be using in the process.
The java.net.URL class is part of the Java standard library and offers several methods for accessing resources on the internet. In this case, we will use its openStream() method, whose signature is:
public final InputStream openStream() throws IOException
The openStream() method is called on an object of the URL class. It opens a connection to the given URL and returns an input stream that is used to read data from that connection.
The other two classes we will be using are BufferedInputStream and FileOutputStream. They are used for reading from the input stream and writing to a file, respectively.
Here is the complete code:
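import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;

// Both streams are closed automatically by try-with-resources
try (BufferedInputStream inputStream = new BufferedInputStream(new URL("http://example.com/my-file-path.txt").openStream());
     FileOutputStream fileOS = new FileOutputStream("/Users/username/Documents/file_name.txt")) {
    byte[] data = new byte[1024];
    int byteContent;
    // read() returns the number of bytes read, or -1 at the end of the stream
    while ((byteContent = inputStream.read(data, 0, 1024)) != -1) {
        fileOS.write(data, 0, byteContent);
    }
} catch (IOException e) {
    // handles IO exceptions
}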
Note: You may need to add the 'User-Agent' header to the HTTP request since some servers don't allow downloads from unknown clients.
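The note above can be sketched as follows. URL.openStream() gives no chance to set headers, so this version goes through URL.openConnection() and sets the header before reading. The class name UserAgentDownload, the method name download, and the User-Agent value passed in are illustrative assumptions, not part of the original example.

```java
import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Path;

class UserAgentDownload {

    // Hypothetical helper: copies the resource at `source` into `target`,
    // sending the given User-Agent value with the request.
    // Returns the number of bytes written.
    static long download(URL source, Path target, String userAgent) throws IOException {
        URLConnection connection = source.openConnection();
        // Request properties have to be set before the connection is used.
        connection.setRequestProperty("User-Agent", userAgent);

        long total = 0;
        try (InputStream in = new BufferedInputStream(connection.getInputStream());
             OutputStream out = new FileOutputStream(target.toFile())) {
            byte[] data = new byte[1024];
            int bytesRead;
            while ((bytesRead = in.read(data, 0, data.length)) != -1) {
                out.write(data, 0, bytesRead);
                total += bytesRead;
            }
        }
        return total;
    }
}
```

A call might look like UserAgentDownload.download(new URL("http://example.com/my-file-path.txt"), Paths.get("/Users/username/Documents/file_name.txt"), "Mozilla/5.0"). Which User-Agent value a given server accepts is something you would have to check for that server.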
As you can see, we open a connection using the URL object and then read from it via the BufferedInputStream object. The contents are read as bytes and copied to a file in the local directory using the FileOutputStream.
To reduce the number of lines of code, we can use the Files class, available since Java 7. Its copy() method reads all the bytes from an input stream and writes them to a file. Here is how you can use it:
// try-with-resources closes the input stream once the copy finishes
try (InputStream inputStream = new URL("http://example.com/my-file-path.txt").openStream()) {
    Files.copy(inputStream, Paths.get("/Users/username/Documents/file_name.txt"), StandardCopyOption.REPLACE_EXISTING);
}
Java NIO
Java NIO is an alternative package for handling networking and input-output operations in Java. Its main advantages are support for non-blocking operation and its channel and buffer abstractions. With the Java IO library we work with streams, which read data sequentially as bytes. Java NIO works with channels and buffers instead. Depending on the operating system and the kind of channel involved, this can let the system copy contents toward the intended file without first staging every byte in an application-level buffer, which may improve performance. Note that a channel created from URL.openStream(), as in the example below, still reads in blocking mode.
In order to download the contents of a URL, we will use the ReadableByteChannel interface and the FileChannel class.
ReadableByteChannel readChannel = Channels.newChannel(new URL("http://example.com/my-file-path.txt").openStream());
Channels.newChannel() wraps the URL's input stream in a ReadableByteChannel, from which the content is read. The downloaded contents will be transferred to a file on the local system via the corresponding file channel.
FileOutputStream fileOS = new FileOutputStream("/Users/username/Documents/file_name.txt");
FileChannel writeChannel = fileOS.getChannel();
After defining the file channel, we will use the transferFrom() method to copy the contents read from the readChannel object to the file destination using the writeChannel object.
writeChannel.transferFrom(readChannel, 0, Long.MAX_VALUE);
The transferFrom() and transferTo() methods can be more efficient than copying through a buffer in a stream loop. According to the FileChannel documentation, many operating systems can move bytes directly between a channel and the file system cache without actually copying them through application memory. Where that optimization applies, it reduces the number of context switches required and can improve overall performance.
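Putting the NIO pieces together, a minimal sketch might look like the following. The class name ChannelDownload and the method name download are illustrative assumptions. The sketch uses FileChannel.open() instead of FileOutputStream.getChannel(), which is simply an alternative way to obtain a writable file channel, and it closes both channels with try-with-resources. Because transferFrom() is allowed to transfer fewer bytes than requested, the call is repeated until it reports that nothing more was transferred.

```java
import java.io.IOException;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ChannelDownload {

    // Hypothetical helper: copies the resource at `source` into `target`
    // using channels, and returns the number of bytes written.
    static long download(URL source, Path target) throws IOException {
        try (ReadableByteChannel readChannel = Channels.newChannel(source.openStream());
             FileChannel writeChannel = FileChannel.open(target,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long position = 0;
            long transferred;
            // Keep transferring from the current end of the file until
            // transferFrom() reports that no more bytes were copied.
            while ((transferred = writeChannel.transferFrom(readChannel, position, Long.MAX_VALUE)) > 0) {
                position += transferred;
            }
            return position;
        }
    }
}
```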
Now, in the following sections, we will look at ways to download files from a URL using third-party libraries instead of core Java classes.
Apache Commons IO
The Apache Commons IO library offers a set of utility classes for managing IO operations. You may be wondering why we would use it when Java has its own libraries for IO operations. The answer is that Apache Commons IO saves you from rewriting the same code and helps avoid boilerplate.
To start using the Apache Commons IO library, you will need to download the jar files from the official website and then add them to your project. If you are using an Integrated Development Environment (IDE) such as Eclipse, this means adding the files to the build path of your project. To do so, right-click the project, select the build path option by navigating through "configure build path-> build path", and then choose the add external archives option.
To download a file from a given URL using Apache Commons IO, we need the FileUtils class of the package. Only a single statement is required to download a file, which looks like:
FileUtils.copyURLToFile(
new URL("http://example.com/my-file-path.txt"),
new File("/Users/username/Documents/file_name.txt"),
CONNECTION_TIMEOUT,
READ_TIMEOUT);
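CONNECTION_TIMEOUT and READ_TIMEOUT are integer values in milliseconds that you define yourself. The connection timeout limits how long to wait for the connection to the server to be established, and the read timeout limits how long a read from the URL may wait for data before the download is abandoned.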
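To make the two timeouts concrete, here is a JDK-only sketch that applies a connect timeout and a read timeout through java.net.URLConnection. It illustrates what the two values control and is not Commons IO's actual source. The class name TimeoutDownload and the timeout values of 10 and 30 seconds are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class TimeoutDownload {

    // Hypothetical values, both in milliseconds.
    static final int CONNECTION_TIMEOUT = 10_000;
    static final int READ_TIMEOUT = 30_000;

    // Copies the resource at `source` into `target` and returns the byte count.
    static long download(URL source, Path target) throws IOException {
        URLConnection connection = source.openConnection();
        // Give up if the connection cannot be established in time.
        connection.setConnectTimeout(CONNECTION_TIMEOUT);
        // Give up if a read waits longer than this for data to arrive.
        connection.setReadTimeout(READ_TIMEOUT);
        try (InputStream in = connection.getInputStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

For HTTP connections, exceeding either limit surfaces as a java.net.SocketTimeoutException, which is an IOException, so existing IO error handling still applies.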
Another class of the Apache Commons IO package that can be used to download a file over the internet is the IOUtils class. We will use the copy(inputStream, fileOS) method to download a file into the local system.
InputStream inputStream = new URL("http://example.com/my-file-path.txt").openStream();
FileOutputStream fileOS = new FileOutputStream("/Users/username/Documents/file_name.txt");
int i = IOUtils.copy(inputStream, fileOS);
The function returns the number of bytes copied. If the value of the variable i is -1, the stream contained more than 2GB, which is too large a count to be returned as an int. For files of that size, use the function copyLarge(inputStream, fileOS) in place of copy(inputStream, fileOS), since it returns the count as a long. Both of these functions buffer the input stream internally, so we do not have to wrap it in a BufferedInputStream ourselves, which again helps us avoid writing boilerplate code.
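The relationship between the two functions can be sketched with plain JDK streams. This is an illustration of the behaviour described above, not the library's source. The class name CopySketch and the 8192-byte buffer size are assumptions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class CopySketch {

    // Copies everything from `in` to `out` through an internal buffer
    // and returns the exact number of bytes copied as a long.
    static long copyLarge(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    // Same copy, but the count is reported as an int.
    // A count that does not fit into an int is reported as -1.
    static int copy(InputStream in, OutputStream out) throws IOException {
        long count = copyLarge(in, out);
        return count > Integer.MAX_VALUE ? -1 : (int) count;
    }
}
```

In this sketch the data has already been written by the time -1 is returned, so the return value only signals that the count overflowed.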
Using Apache HTTP Components
Another library managed by the Apache organization is the HttpComponents package. This library uses the request-response mechanism to download the file from a given URL.
The first step to downloading a file is to create an HTTP client object that will issue the request to the server. For this, we will be using the CloseableHttpClient class. CloseableHttpClient is an abstract class, so instances are created through the HttpClientBuilder class. The code snippet that creates a new HTTP client is as follows:
CloseableHttpClient client = HttpClientBuilder.create().build();
We then need to create an HttpGet or HttpPost object to send the request to the server. The request is created by the following line of code:
HttpGet request = new HttpGet("url from where the file is intended to be downloaded");
Calling the execute(request) function on the client object sends the request to the server and returns the server's response. To hold that response we use an HttpResponse object.
HttpResponse response = client.execute(request);
The data sent by the server in the form of a message is obtained through the getEntity() function.
HttpEntity entity = response.getEntity();
You can also obtain the response code sent by the server through the response object, for example to verify that the request succeeded before saving anything.
int responseCode = response.getStatusLine().getStatusCode();
The data to be downloaded is encapsulated within the entity object and can be extracted using the getContent() function. The getContent() function returns an InputStream object that can be wrapped in a BufferedInputStream to enhance performance.
InputStream inputStream = entity.getContent();
Now all you need to do is read from the stream byte by byte and write the contents into a file using the FileOutputStream class.
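// Each backslash in a Windows path must be escaped in a Java string
String fileName = "D:\\Demo\\file.txt";
FileOutputStream fos = new FileOutputStream(fileName);
// "byte" is a reserved word in Java, so the variable needs another name
int byteRead;
while ((byteRead = inputStream.read()) != -1) {
    fos.write(byteRead);
}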
The last thing required to be done is closing all the open resources in order to ensure that the system resources are not overutilized and that there are no memory leaks.
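One way to make sure the streams are closed is try-with-resources, sketched here with JDK classes only. The content parameter stands in for the stream returned by entity.getContent(). The class name StreamSaver and the method name save are assumptions. Both streams are wrapped in buffered variants so that reading one byte at a time does not hit the underlying stream or file on every call.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class StreamSaver {

    // Hypothetical helper: writes everything from `content` into `target`
    // and returns the number of bytes written. Both streams are closed
    // when the try block exits, whether normally or with an exception.
    static long save(InputStream content, Path target) throws IOException {
        long total = 0;
        try (InputStream in = new BufferedInputStream(content);
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            int byteRead;
            while ((byteRead = in.read()) != -1) {
                out.write(byteRead);
                total++;
            }
        }
        return total;
    }
}
```

CloseableHttpClient implements java.io.Closeable, so the client itself can be declared in a try-with-resources statement in the same way.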
Conclusion
So there you have it - these are the simplest ways to download a file using core Java code and third-party libraries. Now that we are done with the basics, you can be as creative as you want and apply this knowledge to suit your needs. See you next time with a new set of concepts to help you become a better coder. We wish you happy coding till then.