
Microsoft: We’ve fixed Azure container flaw that could have leaked data – ZDNet

Microsoft has revealed that it has fixed a bug in its Azure Container Instances (ACI) service that may have allowed a user to access other customers' information in the ACI.

ACI lets customers run applications in containers on Azure using virtual machines that are managed by Microsoft rather than managing their own.


Researchers from Palo Alto Networks reported the security bug to Microsoft, which recently addressed the issue.


Microsoft said in a blog post that there was no indication any customer information was accessed as a result of the vulnerability, either in the cluster the researchers were using or in other clusters.

"Microsoft recently mitigated a vulnerability reported by a security researcher in the Azure Container Instances (ACI) that could potentially allow a user to access other customers' information in the ACI service. Our investigation surfaced no unauthorized access to customer data," it said.

Nonetheless, it has told customers who received a notification via the Azure Portal to revoke any privileged credentials that were deployed to the platform before August 31, 2021.

Ariel Zelivansky, researcher at Palo Alto, told Reuters his team used a known vulnerability to escape Azure's system for containers. Since it was not yet patched in Azure, this allowed them to gain full control of a cluster. Palo Alto reported the container escape to Microsoft in July.

Even without vulnerabilities, containerized applications, which are often hosted on cloud infrastructure, can be difficult to shield from attackers. The NSA and CISA recently issued guidance for organizations to harden containerized applications because their underlying infrastructure can be incredibly complex.


Microsoft noted that among other things admins should revoke privileged credentials on a regular basis.

Two weeks ago, Microsoft disclosed a separate Azure vulnerability affecting customers of Cosmos DB, Azure's managed NoSQL database service. The critical flaw, dubbed ChaosDB, allowed an attacker to read, modify or delete customers' databases.


The need for speed: Amazon uses SAN in the cloud to expedite users’ cloud migration process – SiliconANGLE News

Moving heavy workloads to the cloud, as essential as it is in today's digital economy, is a tough process. Technical debt and risk of error make migrating critical workloads a potential headache and stressor for a lot of companies seeking to join the digital revolution.

With Amazon Elastic Block Store (EBS) and its snapshots feature, capable of creating incremental backups, Amazon lessens the load of cloud migration and offers safe storage to those moving their data to the cloud, according to Cami Tavares (pictured, right), senior manager of Amazon EBS at Amazon Web Services Inc. She believes using SAN in the cloud is the future of cloud storage, offering customers the high performance and agility to move to the cloud, and thus to market, faster.

"When we look at the EBS portfolio and the evolution over the years, you can see that it was driven by customer need, and we have different volume types that have very specific performance characteristics that are built to meet these unique needs of customer workloads," Tavares said. "Every business is a data business, and block storage is a foundational part of that."

Tavares and Ashish Palekar (pictured, left), general manager of EBS snapshots at AWS, spoke with Dave Vellante, host of theCUBE, SiliconANGLE Media's livestreaming studio, during the AWS Storage Day event. They discussed Amazon EBS, EBS snapshots, the announcement of io2 Block Express and more. (* Disclosure below.)

AWS made a few announcements that support SAN in the cloud. One of these announcements is io2 Block Express, a modular storage system offering four times the input/output operations per second.

"It's a complete ground-up reinvention of our storage product offering, and gives customers the same availability, durability and performance characteristics that they're used to in their on-premises [environments]," Palekar explained.

With the sub-millisecond latency, performance and capacity from io2, customers can expect an easier and more efficient migration to the cloud.

Speed is one of the biggest motivators, according to customer feedback, for businesses moving to the cloud. SAN in the cloud specifically addresses this with its increased speed, with enterprises shelling out over $22 million on SANs in 2021, according to Tavares.

"It's transformational for businesses to be able to change the customer experience for their customers and innovate at a much faster pace," Tavares said. "With the Block Express product, you get to do that much faster. You can go from an idea to an implementation orders of magnitude faster."

Watch the complete video interview below, and be sure to check out more of SiliconANGLE's and theCUBE's coverage of the AWS Storage Day event. (* Disclosure: TheCUBE is a paid media partner for the AWS Storage Day. Neither Amazon Web Services Inc., the sponsor of theCUBE's event coverage, nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)



AWS execs speak on the top priorities for on-prem to cloud migration – SiliconANGLE News

While organizations are increasingly embracing the value of cloud and hybrid cloud infrastructures for data management, storage and analysis, a few are reluctant to make the big, bold move of migrating their data over to the cloud.

"We still see many customers that are evaluating how to do their cloud migration strategies, and they're looking for, you know, understanding what services can help them with those migrations," said Mat Mathews (pictured, left), general manager of Transfer Service at Amazon Web Services Inc.

Mathews; Siddhartha Roy (pictured, second from right), general manager of Snow Family at AWS; and Randy Boutin (pictured, right), GM of DataSync at AWS, spoke with Dave Vellante, host of theCUBE, SiliconANGLE Media's livestreaming studio, during the AWS Storage Day event. They discussed the current state of enterprise cloud migration from an inside perspective. (* Disclosure below.)

Several data points clearly signal a shift in favor of cloud storage over conventional on-prem solutions. However, moving petabytes of data at a time can often seem daunting (or even expensive) for some organizations. Where do they start? Which cloud provider is best suited for their needs? These are the sorts of questions that often make the rounds, according to the executive panel.

"I'd recommend customers look at their cool and cold data. If they look at their backups and archives and they have not been used for long, it doesn't make sense to keep them on-prem. Look at how you can move those and migrate those first, and then slowly work your way up into, like, warm data and then hot data," Roy stated.

Through its compelling cost savings to customers, long-standing durability record, and unwavering flexibility, AWS has proven itself time and again as the de-facto industry option in cloud storage services, according to the panel.

How do AWS customers figure out which services to use? It comes down to a combination of things, according to Boutin.

"First is the amount of available bandwidth that you have, the amount of data that you're looking to move, and the timeframe you have in which to do that," he said. "So if you have a high-speed, say, gigabit network, you can move data very quickly using DataSync. If you have a slower network, or perhaps you don't want to utilize your existing network for this purpose, then the Snow Family of products makes a lot of sense."

Watch the complete video interview below, and be sure to check out more of SiliconANGLE's and theCUBE's coverage of the AWS Storage Day event. (* Disclosure: TheCUBE is a paid media partner for the AWS Storage Day. Neither Amazon Web Services Inc., the sponsor of theCUBE's event coverage, nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)



Cloud is the ‘new normal’ as businesses boost resiliency and agility beyond COVID – SiliconANGLE News

The COVID-19 pandemic and the business challenges it caused were a catalyst for many companies to accelerate their migration to the cloud, and that trend is not likely to change anytime soon.

Amazon Web Services Inc. is betting on companies' growing interest in building resilience and agility in the cloud beyond pandemic times, according to Mai-Lan Tomsen Bukovec (pictured), vice president of AWS Storage.

"We're going to continue to see that rapid migration to the cloud, because companies now know that in the course of days and months the whole world of your expectations of where your business is going and where, what your customers are going to do, that can change," she said. "And that can change not just for a year, but maybe longer than that. That's the new normal."

Bukovec spoke with Dave Vellante, host of theCUBE, SiliconANGLE Media's livestreaming studio, during the AWS Storage Day event. They discussed how cloud is the new reality for enterprises, how AWS storage fits into the data fabric debate, and what AWS thinks about its storage strategy and about business going hybrid. (* Disclosure below.)

While the cloud is seen as the new normal for businesses, the paths enterprises use to get there remain diverse. AWS customers typically fall into one of three patterns, the fastest being where they choose to move their core business mission to the cloud because they can no longer scale on-premises, according to Bukovec.

"It's not technology that stops people from moving to the cloud as quick as they want to; it's culture, it's people, it's processes, it's how businesses work," she explained. "And when you move the crown jewels into the cloud, you are accelerating that cultural change."

Other companies follow what Bukovec sees as the slower path, which is to take a few applications across the organization and move them to the cloud as a reference implementation. In this model of cloud pilots, the goal is to try to get the people who have done this to generalize the learning across the company.

"It's actually counterproductive to a lot of companies that want to move quickly to the cloud," Bukovec said.

The third pattern is what AWS calls new applications or cloud-first, when a company decides that all new technology initiatives will be in the cloud. That allows the business to be able to see cloud ideas and technology in different parts of its structure, generating a decentralized learning process with a faster culture change than in the previous pattern.

While cloud storage is centralized, it fully fits into the emerging trend known as data mesh, according to Bukovec. As first defined by Zhamak Dehghani, a ThoughtWorks consultant, a data mesh is a type of decentralized data architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, self-serve design.

"Data mesh presupposes separating the data storage and the characteristics of data from the data services that interact and operate on that storage," Bukovec explained. The idea is to ensure that the decentralized business model can work with this data and innovate faster.

"Our AWS customers are putting their storage in a centralized place because it's easier to track, it's easier to view compliance, and it's easier to predict growth and control costs, but we started with building blocks and we deliberately built our storage services separate from our data services," Bukovec said. "We have a number of these data services that our customers are using to build that customized data mesh on top of that centralized storage."

Here's the complete video interview, part of SiliconANGLE's and theCUBE's coverage of the AWS Storage Day event. (* Disclosure: TheCUBE is a paid media partner for the AWS Storage Day. Neither AWS, the sponsor of theCUBE's event coverage, nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)



Microsoft teases new super simple OneDrive interface – TechRadar

Finding the right files in Microsoft's cloud storage service will soon be even easier as the software giant is currently working on a new interface for OneDrive.

According to a new post on the Microsoft 365 Roadmap, OneDrive will soon be getting a new command bar later this month.

With this update, OneDrive users will be able to identify the right file and access primary commands more easily. The simplified view in OneDrive's new interface should also help boost productivity, as it allows users to focus on the content they're working on without being distracted by additional menus.

In two separate posts, Microsoft also revealed that OneDrive will be getting a new sharing experience in November of this year.

The company is updating OneDrive's Share menu to provide easy access to additional sharing options such as email, copy link and Teams chat, as well as manage access settings.

However, the Copy link button is set to be replaced by a footer where users will be able to set permissions before copying links and sharing them with recipients.

After releasing the 64-bit version of OneDrive earlier this year, Microsoft has continually updated its cloud storage service and it will be interesting to see how these visual and sharing updates pan out.


Pure Storage tantalises with reinvention possibilities – Blocks and Files

Pure Storage has flagged a major announcement on September 28th. A financial analysts' briefing is scheduled to follow the announcement, suggesting news that will affect investors' views of Pure's future revenues, costs and underlying profitability measures. The company is saying the announcement is about AIOps, the future of storage, storage and DevOps product innovations, and its as-a-Service offerings. What could it announce that could cause analysts to take stock and form a different view of the company?

We ignored the AIOps aspect, as that would be a fairly incremental move, and came up with a list of potential developments:

Hardware array refreshes would be good: using the latest Xeon processors, for example, supporting PCIe Gen 4, that sort of thing. But they would hardly move the needle for financial analysts. Possibly committing to support DPUs from Pensando or Fungible might do that. Still, not exactly that much impact on a financial analyst's twitch-ometer.

Porting FlashBlade software to one or more public clouds would seem both logical and good sense. It would be additive to the FlashBlade market and we think analysts would concur, nod approvingly and move on. Ditto porting Cloud Block Store to the Google Cloud Platform. Expansion into an adjacent market? Tick. Stronger competition for NetApp's data fabric idea? Tick. What's not to like? Nothing. Move on.

Adding file and object support to Cloud Block Store? Trivially, there would be a naming problem: do we call it Cloud Block File Object Store? It would seem a logical extension of Pure's public cloud capabilities and an improvement in the cross-cloud consistency of Pure's hybrid cloud story. We can't imagine analysts would see a downside here.

It could be achieved with another strategy: make the Purity OS software cloud-native and have it run in the public clouds. That would be a big deal, with a common code tree and four deployment layers: on-premises arrays, AWS, Azure and GCP. It would be a very large software effort and would give Pure a great hybrid cloud story with lots of scope for software revenue growth. Cue making sure the analysts understand this. An AIOps extension could be added in to strengthen the story as well.

How about doing a Silk, Qumulo or VAST Data, and walking away from hardware manufacturing, using a close relationship with a contract manufacturer/distributor and certified configurations instead? This would be a major business change, and both analysts and customers would want reassurance that Pure would not lose its hardware design mojo.

A lesser hardware change would be to use commodity SSDs instead of Pure designing its own flash storage drives and controllers. Our instant reaction is a thumbs down, as Pure has consistently said its hardware is superior to that of COTS SSD vendors such as Dell, HPE and NetApp because it optimises flash efficiency, performance and endurance better than it could if it were limited by SSD constraints.

Such a change would still get analysts in a tizzy. But we don't think it likely, even if Pure could pitch a good cost-saving and no-performance-impact story.

How about a strategic deal with a public cloud vendor, similar to the AWS-NetApp FSx for ONTAP deal? That would indeed be a coup: having, say, Pure's block storage available alongside the cloud vendor's native block storage. We don't think it likely, though it has to be on the possibles list.

Expanding the Pure-as-a-Service strategy to include all of Pure's products would be an incremental move and so no big deal to people who had taken the basic idea on board already. Analysts would need a talking-to perhaps, to be persuaded that this was worth doing in Annual Recurring Revenue growth terms. This could be thought of as Pure doing a me-too with HPE's GreenLake and Dell's APEX strategies.

How about Pure acting as a cloud access broker and front-end concierge supplier, rather like NetApp with its Spot-based products? That would be big news and require new software and a concerted marketing and sales effort. AIOps could play a role here too. Our view, based on gut feelings alone, is that this is an unlikely move, although it would be good to see NetApp getting competition.

We are left thinking that the likeliest announcements will be about making more of Pure's software available in the public clouds, plus an extension of Pure's as-a-Service offerings and a by-the-way set of hardware refreshes. We'll see how well our predictions match up with reality on September 28, and mentally prepare for a kicking just in case we are way off base.


Italy says bids for national cloud hub expected this month – iTnews

Italy expects to receive bids by the end of September from companies interested in building a national cloud hub, a 900-million-euro (A$1.4 billion) project to upgrade the country's data storage facilities, a government minister said.

Part of EU-funded projects to help Italy's economy recover from the pandemic, the cloud hub initiative reflects European efforts to make the 27-member bloc less dependent on large overseas tech companies for cloud services.

"I'm confident we will receive some expressions of interest by the end of the month," Innovation Minister Vittorio Colao, a former head of telecom giant Vodafone, told reporters during an annual business conference in Cernobbio on Lake Como.

"Technological independence of Europe is important because it allows the bloc to negotiate (with foreign partners) on an equal footing," Colao said, adding he had discussed the issue with French Finance Minister Bruno Le Maire at the conference.

In the Recovery Plan sent to Brussels in April to access EU funds, Rome earmarked 900 million euros for the cloud hub project, according to sources and documents seen by Reuters.

Sources told Reuters in June that Italian state lender Cassa Depositi e Prestiti was considering an alliance with Telecom Italia and defence group Leonardo in the race to create the cloud hub.

US tech giants such as Google, Microsoft and Amazon, which dominate the data storage industry, could provide their cloud technology to the cloud hub, if licensed to companies taking part in the hub project, officials have said.

Such a structure would be aimed at soothing concerns over the risk of US surveillance in the wake of the adoption of the US CLOUD Act of 2018, which can require US-based tech firms to provide data to Washington even if it is stored abroad.


Broadcom server-storage connectivity sales down but recovery coming – Blocks and Files

Although Broadcom saw an overall rise in revenues and profit in its latest quarter, sales in the server-to-storage connectivity area were down. It expects a recovery and has cash for an acquisition.

Revenues in Broadcom's third fiscal 2021 quarter, ended August 1, were $6.78 billion, up 16 per cent on the year. There was a $1.88 billion profit, more than doubling last year's $688 million.

We're interested because Broadcom makes server-storage connectivity products such as Brocade host bus adapters (HBAs), and SAS and NVMe connectivity products.

President and CEO Hock Tan's announcement statement said: "Broadcom delivered record revenues in the third quarter, reflecting our product and technology leadership across multiple secular growth markets in cloud, 5G infrastructure, broadband and wireless. We are projecting the momentum to continue in the fourth quarter."

There are two segments to its business: Semiconductor Solutions, which brought in $5.02 billion, up 19 per cent on the year; and Infrastructure Software, which reported $1.76 billion, an increase of ten per cent.

Tan said in the earnings call: "Demand continued to be strong from hyper-cloud and service provider customers. Wireless continued to have a strong year-on-year compare. And while enterprise has been on a trajectory of recovery, we believe Q3 is still early in that cycle, and that enterprise was down year on year."

Inside Semiconductor Solutions, the server storage connectivity area had revenues of $673 million, which was nine per cent down on the year-ago quarter. Tan noted: "Within this, Brocade grew 27 per cent year on year, driven by the launch of new Gen 7 Fibre Channel SAN products."

Overall, Tan said: "Our [Infrastructure Solutions] products here supply mission-critical applications largely to enterprise, which, as I said earlier, was in a state of recovery. That being said, we have seen a very strong booking trajectory from traditional enterprise customers within this segment. We expect such enterprise recovery in server storage."

This will come from aggressive migration in cloud to 18TB disk drives and a transition to next-generation SAS and NVMe products. Tan expects Q4 server storage connectivity revenue to be up by a low double-digit percentage year on year. Think two to five per cent.

The enterprise segment will grow more, with Tan saying: "Because of strong bookings that we have been seeing now for the last three months, at least from enterprise, which is going through largely on the large OEMs, who particularly integrate the products and sell it to end users, we are going to likely expect enterprise to grow double digits year on year in Q4."

That enterprise business growth should continue throughout 2022, Tan believes: "In fact, I would say that the engine for growth for our semiconductor business in 2022 will likely be enterprise spending, whether it's coming from networking, one sector for us, and/or from server storage, which is largely enterprise. We see both of these showing strong growth as we go into 2022."

Broadcom is accumulating cash and could make an acquisition or indulge in more share buybacks. Tan said: "By the end of October, our fiscal year, we'll probably see the cash net of dividends and our cash pool to be up to close to $13 billion, which is something like $6 billion, $7 billion, $8 billion above what we would, otherwise, like to carry on our books."

Let us pronounce that HBAs are NICs (Network Interface Cards) and that an era of SmartNICs is starting. It might be that Broadcom could have an acquisitive interest in the SmartNIC area.

Broadcom is already participating in the DPU (Data Processing Unit) market, developing and shipping specialised silicon engines to drive specialised workloads for hyperscalers. Answering an analyst question, Tan said: "We have the scale. We have a lot of the IP [cores] and the capability to do all those chips for those multiple hyperscalers who can afford and are willing to push the envelope on specialised (I used to call it offload computing) engines, be they video transcoding, machine learning, even what people call DPUs, smart NICs, otherwise called, and various other specialised engines and security hardware that we put in place in multiple cloud guys."

Better add Broadcom to the list of DPU vendors such as Fungible, Intel and Pensando, and watch out for any SmartNIC acquisition interest.


"Rockset is on a mission to deliver fast and flexible real-time analytics" – JAXenter

JAXenter: Thank you for taking the time to speak with us! Can you tell us more about Rockset and how it works? How does it help us achieve real-time analytics?

Venkat Venkataramani: Rockset is a real-time analytics database that serves low latency applications. Think real-time logistics tracking, personalized experiences, anomaly detection and more.

Rockset employs the same indexing approach used by the systems behind the Facebook News Feed and Google Search, which were built to make data retrieval for millions of users, on terabytes of data, instantaneous. It goes a step further by building a Converged Index (a search index, a columnar store and a row index) on all data. This means sub-second search, aggregations and joins without any performance engineering.

You can point Rockset at any data (structured, semi-structured and time-series) and it will index the data in real time and enable fast SQL analytics. This frees teams from time-consuming and inflexible data preparation. Teams can now onboard new datasets and run new experiments without being constrained by data operations. And, Rockset is fully managed and cloud-native, making a massively distributed real-time data platform accessible to all.


JAXenter: What data sources does it currently support?

Venkat Venkataramani: Rockset has built-in data connectors to data streams, OLTP databases and data lakes. These connectors are all fully-managed and stay in sync with the latest data. That means you can run millisecond-latency SQL queries within 2 seconds of data being generated. Rockset has built-in connectors to Amazon DynamoDB, MongoDB, Apache Kafka, Amazon Kinesis, PostgreSQL, MySQL, Amazon S3 and Google Cloud Storage. Rockset also has a Write API to ingest and index data from other sources.

JAXenter: What's new at Rockset and how will it continue to improve analytics for streaming data?

Venkat Venkataramani: We recently announced a series of product releases to make real-time analytics on streaming data affordable and accessible. With this launch, teams can use SQL to transform and pre-aggregate data in real-time from Apache Kafka, Amazon Kinesis and more.

This makes real-time analytics up to 100X more cost-effective on streaming data. And, we free engineering teams from needing to construct and manage complex data pipelines to onboard new streaming data and experiment on queries. Here's what we've released:

You can delve further into this release by watching a live Q&A with Tudor Bosman, Rockset's Chief Architect. He explains how we support complex aggregations on rolled-up data and ensure accuracy even in the face of dupes and latecomers.

JAXenter: What are some common use cases for real-time data analytics? When is it useful to implement?

Venkat Venkataramani: You experience real-time analytics every day whether you realize it or not. The content displayed in Instagram newsfeeds, the personalized recommendations on Amazon and the promotional offers from Uber Eats are all examples of real-time analytics. Real-time analytics encourages users to take desired actions: from reading more content, to adding items to our cart, to using takeout and delivery services for more of our meals.

We think real-time analytics isn't just useful to the big tech giants. It's useful across all technology companies to drive faster time to insight and build engaging experiences. We're seeing SaaS companies in the logistics space provide real-time visibility into the end-to-end supply chain, route shipments and predict ETAs. This ensures that materials arrive on time and within schedule, even in the face of an increasingly complex chain. Or, there are marketing analytics software companies that need to unify data across a number of interaction points to create a single view of the customer. This view is then used for segmentation, personalization and automation of different actions to create more compelling customer experiences.

There's a big misperception in the space that a) real-time analytics is too expensive and b) real-time analytics is only accessible to large tech companies. That's just not true anymore. The cloud offerings, availability of real-time data and the changing resource economics are making this within reach of any digital disrupter.

JAXenter: How is Rockset built under the hood?

Venkat Venkataramani: The Converged Index, mentioned previously, is the key component in enabling real-time analytics. Rockset stores all its data in the search, column-based and row-based index structures that are part of the Converged Index, and so we have to ensure that the underlying storage can handle both reads and writes efficiently. To meet this requirement, Rockset uses RocksDB as its embedded storage engine, with some modifications for use in the cloud. RocksDB enables Rockset to handle high write rates, leverage SSDs for optimal price-performance and support updates to any field.
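As a rough illustration of what an embedded storage engine provides (a sketch using the third-party python-rocksdb bindings, not Rockset's actual code), RocksDB exposes a simple key-value API that runs inside the host process rather than as a separate database server:

import rocksdb  # third-party python-rocksdb bindings; illustrative only

# Open (or create) a local RocksDB instance. There is no separate server process:
# the storage engine is embedded directly in this application.
opts = rocksdb.Options(create_if_missing=True)
db = rocksdb.DB("example.db", opts)

# Writes land in an in-memory memtable backed by a write-ahead log, which is
# what lets an engine like this absorb high write rates.
db.put(b"doc:42:amount", b"129.95")

# Reads check the memtable first, then the on-disk SST files.
print(db.get(b"doc:42:amount"))  # b'129.95'

Rockset layers its Converged Index structures on top of primitives like these; the sketch only shows the kind of interface an embedded engine offers.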

Another core part of Rockset's design is its use of a disaggregated architecture to maximize resource efficiency. We use an Aggregator-Leaf-Tailer (ALT) architecture, common at companies like Facebook and LinkedIn, where resources for ingest compute, query compute and storage can be scaled independently of each other based on the workload in the system. This allows Rockset users to exploit cloud efficiencies to the full.


JAXenter: Personally, what are some of your favorite open source tools that you can't do without?

Venkat Venkataramani: RocksDB! The team at Rockset built and open-sourced RocksDB at Facebook, a high-performance embedded storage engine used by other modern data stores like CockroachDB, Kafka and Flink. RocksDB was a project at Facebook that abstracted access to local stable storage so that developers could focus their energies on building out other aspects of the system. RocksDB has been used at Facebook as the embedded storage for spam detection, graph search and message queuing. At Rockset, we've continued to contribute to the project as well as release RocksDB-cloud to the community.

We are also fans of the dbt community, an open-source tool that lets data teams collaborate on transforming data in their database to ship higher-quality data sets, faster. We share a similar outlook on the data space: we think data pipelines are challenging to build and maintain, respect SQL as the lingua franca of analytics, and want to make it easy for data to be shared across an organization.

JAXenter: Can you share anything about Rockset's future? What's on the roadmap next, what features and/or improvements are being worked on?

Venkat Venkataramani: Rockset is on a mission to deliver fast and flexible real-time analytics, without the cost and complexity. Our product roadmap is geared towards enabling all digital disrupters to realize real-time analytics.

This requires taking steps to make real-time analytics more affordable and accessible than ever before. A first step towards affordability was the release of SQL-based rollups and transformations, which cut the cost of real-time analytics up to 100X for streaming data. As part of our expansion initiative, we're also expanding Rockset to users across the globe. Follow us as we continue to put real-time analytics within reach of all engineers.


Lessons Learned: Training and Deploying State of the Art Transformer Models at Digits – insideBIGDATA

In this blog post, we want to provide a peek behind the curtains at how we extract information with Natural Language Processing (NLP). You'll learn how to apply state-of-the-art Transformer models to this problem and how to go from an ML model idea to integration in the Digits app.

Our Plan

Information can be extracted from unstructured text through a process called Named Entity Recognition (NER). This NLP concept has been around for many years, and its goal is to classify tokens into predefined categories, such as dates, persons, locations, and entities.

For example, the transaction below could be transformed into the following structured format:
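As a made-up illustration (the transaction text, entity labels and values below are ours, not Digits' actual schema), an NER model can turn a raw transaction string into a handful of typed fields:

# Hypothetical example: a raw transaction string and the entities an NER model
# might extract from it. The field names are illustrative only.
raw_transaction = "SQ *BLUE BOTTLE COFFEE PORTLAND OR 09/02"

extracted_entities = {
    "vendor":   "Blue Bottle Coffee",
    "location": "Portland, OR",
    "date":     "09/02",
}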

We had seen outstanding results from NER implementations applied to other industries and we were eager to implement our own banking-related NER model. Rather than adopting a pre-trained NER model, we envisioned a model built with a minimal number of dependencies. That avenue would allow us to continuously update the model while remaining in control of all moving parts. With this in mind, we discarded available tools like the SpaCy NER implementation or HuggingFace models for NER. We ended up building our internal NER model based only on TensorFlow 2.x and the ecosystem library TensorFlow Text.

The Data

Every Machine Learning project starts with the data, and so did this one. We decided which relevant information we wanted to extract (e.g., location, website URLs, party names, etc.) and, in the absence of an existing public data set, we decided to annotate the data ourselves.

There are a number of commercial and open-source tools available for data annotation, including:

The optimal tool varies with each project, and is a question of cost, speed, and useful UI. For this project, our key driver for our tool selection was the quality of the UI and the speed of the sample processing, and we chose doccano.

At least one human reviewer then evaluated each selected transaction, and that person would mark the relevant sub-strings as shown above. The end-product of this processing step was a data set of annotated transactions together with the start- and end-character of each entity within the string.
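A single record coming out of that annotation step could look roughly like the following (a hypothetical layout; the exact schema Digits uses is not described in the post):

# Hypothetical annotation record: each entity keeps its label plus the start
# and end character offsets inside the original transaction string.
annotated_transaction = {
    "text": "SQ *BLUE BOTTLE COFFEE PORTLAND OR 09/02",
    "entities": [
        {"label": "vendor",   "start": 4,  "end": 22},
        {"label": "location", "start": 23, "end": 34},
        {"label": "date",     "start": 35, "end": 40},
    ],
}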

Selecting an Architecture

While NER models can also be based on statistical methods, we established our NER models on an ML architecture called Transformers. This decision was based on two major factors:

The initial attention-based model architecture was Bidirectional Encoder Representations from Transformers (BERT, for short), published in 2019. In the original paper by Google AI, the authors already highlighted potential applications to NER, which gave us confidence that our Transformer approach might work.

Furthermore, we had previously implemented various other deep-learning applications based on BERT architectures and we were able to reuse our existing shared libraries. This allowed us to develop a prototype in a short amount of time.

BERT models can be used as pre-trained models, which are initially trained on multilingual corpora on two general tasks: predicting masked tokens and predicting whether the next sentence has a connection to the previous one. Such general training creates a general language understanding within the model. Pre-trained models are provided by various companies, for example by Google via TensorFlow Hub. The pre-trained model can then be fine-tuned during a task-specific training phase, which requires fewer computational resources than training a model from scratch.

The BERT architecture can process up to 512 tokens simultaneously. BERT requires WordPiece tokenization, which splits words and sentences into frequent sub-word chunks. The example sentence below would be tokenized as follows:

Digits builds a real-time engine

[dig, ##its, builds, a, real, -, time, engine]
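A minimal sketch of reproducing this tokenization with TensorFlow Text (the vocabulary path is a placeholder; a real setup would point at the vocab.txt that ships with the chosen pre-trained BERT checkpoint):

import tensorflow as tf
import tensorflow_text as tf_text

# Placeholder: the WordPiece vocabulary must match the pre-trained checkpoint.
VOCAB_PATH = "vocab.txt"

tokenizer = tf_text.BertTokenizer(
    VOCAB_PATH,
    lower_case=True,           # match an uncased checkpoint
    token_out_type=tf.string,  # emit wordpiece strings instead of vocab ids
)

wordpieces = tokenizer.tokenize(["Digits builds a real-time engine"])
# Expect something along the lines of:
# [[[b'dig', b'##its'], [b'builds'], [b'a'], [b'real'], [b'-'], [b'time'], [b'engine']]]
print(wordpieces.to_list())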

There are a variety of pre-trained BERT models available online, but each has a different focus. Some models are language-specific (e.g., CamemBERT for French or Beto for Spanish), and other models have been reduced in their size through model distillation or pruning (e.g., ALBERT or DistilBERT).

Time to Prototype

Our prototype model was designed to classify the sequence of tokens which represents the transaction in question. For training, we converted the annotated data into a sequence of labels that matched the number of tokens generated from each transaction. Then, we trained the model to classify each token's label.

Tokens that carry no relevant information are labeled O, and we trained the classifier to detect those as well.
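A simplified sketch of such a token classifier in TensorFlow (the TF Hub handle, sequence length and label count are assumptions for illustration, not Digits' production configuration):

import tensorflow as tf
import tensorflow_hub as hub

SEQ_LEN = 128     # assumed maximum number of wordpiece tokens per transaction
NUM_LABELS = 9    # assumed size of the tag set, including the "O" label

# Pre-trained BERT encoder from TensorFlow Hub, fine-tuned during training.
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4",
    trainable=True)

input_word_ids = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32, name="input_mask")
input_type_ids = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32, name="input_type_ids")

outputs = encoder({
    "input_word_ids": input_word_ids,
    "input_mask": input_mask,
    "input_type_ids": input_type_ids,
})

# One prediction per wordpiece: project each per-token encoder state onto the tag set.
logits = tf.keras.layers.Dense(NUM_LABELS, name="ner_logits")(outputs["sequence_output"])

model = tf.keras.Model(
    inputs=[input_word_ids, input_mask, input_type_ids], outputs=logits)
model.compile(
    optimizer=tf.keras.optimizers.Adam(3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))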

The prototype model helped us demonstrate a business fit of the ML solution before engaging in the full model integration. At Digits, we develop our prototypes in GPU-backed Jupyter notebooks. Such a process helps us to iterate quickly. Then, once we confirm a business use-case for the model, we focus on the model integration and the automation of the model version updates via our MLOps pipelines.

Moving to Production

In general, we use TensorFlow Extended (TFX) to update our model versions. In this step, we convert the notebook code into TensorFlow Ops; here, we converted our prototype data preprocessing steps into TensorFlow Transform Ops. This extra step allows us to train our model versions effectively later on, avoid training-serving skew, and bake our internal business logic into our ML models. This last benefit helps us reduce the dependencies between our ML models and our data pipeline or back-end integrations.
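A rough sketch of what moving the tokenization into a TFX Transform preprocessing_fn can look like (the feature name, vocabulary path and sequence length are assumptions, not Digits' actual code):

import tensorflow as tf
import tensorflow_text as tf_text

_VOCAB_PATH = "vocab.txt"  # placeholder: vocabulary matching the BERT checkpoint
_SEQ_LEN = 128             # assumed maximum sequence length


def preprocessing_fn(inputs):
    """TFX Transform entry point: the same TensorFlow graph runs during training
    and inside the serving signature, which avoids training-serving skew."""
    tokenizer = tf_text.BertTokenizer(_VOCAB_PATH, lower_case=True)

    # "description" is an assumed dense string feature holding the raw transaction text.
    token_ids = tokenizer.tokenize(inputs["description"])
    token_ids = token_ids.merge_dims(-2, -1)  # flatten words -> wordpieces
    dense_ids = token_ids.to_tensor(default_value=0, shape=[None, _SEQ_LEN])

    return {
        "input_word_ids": tf.cast(dense_ids, tf.int32),
        "input_mask": tf.cast(dense_ids > 0, tf.int32),
        "input_type_ids": tf.zeros_like(dense_ids, dtype=tf.int32),
    }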

We are running our TFX pipelines on Google Cloud's Vertex AI Pipelines. This managed service frees us from maintaining a Kubernetes cluster for Kubeflow Pipelines (which we had done prior to using Vertex AI).

Our production models are stored in Google Cloud Storage buckets, and TFServing allows us to load model versions directly from cloud storage. Because of the dynamic loading of the model versions, we don't need to build custom containers for our model serving setup; we can use the pre-built images from the TensorFlow team.

Here is a minimal setup for Kubernetes deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-serving-deployment
spec:
  template:
    spec:
      containers:
        - name: tensorflow-serving-container
          image: tensorflow/serving:2.5.1
          command:
            - /usr/local/bin/tensorflow_model_server
          args:
            - --port=8500
            - --model_config_file=/serving/models/config/models.conf
            - --file_system_poll_wait_seconds=120

Note the additional argument file_system_poll_wait_seconds in the list above. By default, TFServing will check the file system for new model versions every 2s. This can generate large Cloud Storage costs since every check triggers a list operation, and storage costs are billed based on the used network volume. For most applications, it is fine to reduce the file system check to every 2 minutes (set the value to 120 seconds) or disable it entirely (set the value to 0).

For maintainability, we keep all model-specific configurations in a specific ConfigMap. The generated file is then consumed by TFServing on boot-up.

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: ml-deployments
  name: <model-name>-config
data:
  models.conf: |+
    model_config_list: {
      config: {
        name: "<model-name>",
        base_path: "gs://<bucket>/<model-path>",
        model_platform: "tensorflow",
        model_version_policy: {
          specific: {
            versions: 1607628093,
            versions: 1610301633
          }
        }
        version_labels { key: "canary", value: 1610301633 }
        version_labels { key: "release", value: 1607628093 }
      }
    }
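With those labels in place, callers can pin a request to either version through TFServing's REST API, which exposes paths of the form /v1/models/<model>/labels/<label>:predict on the REST port (8501 by default). A sketch (the host, model name and dummy inputs are placeholders):

import requests

# Placeholders: adjust the host, REST port and model name for your deployment.
CANARY_URL = "http://tensorflow-serving:8501/v1/models/ner_model/labels/canary:predict"

# Zero-padded dummy inputs purely to show the request shape; a real caller would
# send the tokenized transaction produced by the Transform graph.
instance = {
    "input_word_ids": [0] * 128,
    "input_mask": [0] * 128,
    "input_type_ids": [0] * 128,
}

response = requests.post(CANARY_URL, json={"instances": [instance]}, timeout=10)
response.raise_for_status()
print(response.json()["predictions"])

Routing a small slice of traffic at the canary label while the bulk of requests stays on release is what makes a side-by-side comparison of the two versions possible.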

After the initial deployment, we started iterating to optimize the model architecture for high throughput and low latency results. This meant optimizing our deployment setup for BERT-like architectures and optimizing the trained BERT models. For example, we optimized the integration between our data processing Dataflow jobs and our ML deployments, and shared our approach in our recent talk at the Apache Beam Summit 2021.

Results

The deployed NER model allows us to extract a multitude of information from unstructured text and make it available through Digits Search.

Here are some examples of our NER model extractions:

The Final Product

At Digits, an ML model is never itself the final product. We strive to delight our customers with well-designed experiences that are tightly integrated with ML models, and only then do we witness the final product. Many additional factors come into play:

Latency vs. Accuracy

A more recent pre-trained model (e.g., BART or T5) could have provided higher model accuracy, but it would have also increased the model latency substantially. Since we are processing millions of transactions daily, it became clear that model latency is critical for us. Therefore, we spent a significant amount of time on the optimization of our trained models.

Design for false-positive scenarios

There will always be false positives, regardless of how good the model's accuracy was before deployment. Product design efforts that focus on communicating ML-predicted results to end users are critical. At Digits, this is especially important because we cannot risk customers' confidence in how Digits handles their financial data.

Automation of model deployments

The investment in our automated model deployment setup helped us provide model rollback support. All changes to deployed models are version controlled, and deployments are automatically executed from our CI/CD system. This provides a consistent and transparent deployment workflow for our engineering team.

Devise a versioning strategy for release and rollback

To assist smooth model rollouts and a holistic quantitative analysis prior to rollout, we deploy two versions of the same ML model and use TFServing's version labels (e.g., release and pre-release tags) to differentiate between them. Additionally, we use an active version table that allows for version rollbacks, made as simple as updating a database record.

Assist customers, don't alienate them

Last but not least, the goal for our ML models should always be to assist our customers in their tasks instead of alienating them. That means our goal is not to replace humans or their functions, but to help our customers with cumbersome tasks. Instead of asking people to extract information manually from every transaction, we'll assist our customers by pre-filling extracted vendors, but they will always stay in control. If we make a mistake, Digits makes it easy to overwrite our suggestions. In fact, we will learn from our mistakes and update our ML models accordingly.

Further Reading

Check out these great resources for even more on NER and Transformer models:

About the Author

Hannes Hapke is a Machine Learning Engineer at Digits. As a Google Developer Expert, Hannes has co-authored two machine learning publications: NLP in Action by Manning Publishing, and Building Machine Learning Pipelines by O'Reilly Media. At Digits, he focuses on ML engineering and applies his experience in NLP to advance the understanding of financial transactions.

