Preserving the Web in the Age of AI
Download MP3

[00:00:14] Chris Freeland: We tend to think of the web as a living archive, this vast, searchable record of who we were, what we knew, and how we understood the world at any given moment in time. But that assumption is starting to crack as publishers move to block AI scraping and restrict access to their content. The tools that quietly preserve our online history, like the Internet Archive's Wayback Machine, are getting caught in the crossfire.
[00:00:39] What started as a fight over AI training data is quickly becoming something bigger: a question about whether the web itself will be able to be archived in the future. Because if preserving the web becomes a threat, what happens to our memory when the past can't be saved? Hi everyone. I'm Chris Freeland.
[00:00:59] I'm a librarian at the Internet Archive, and I wanna welcome you to today's discussion. We have assembled an excellent panel of experts to discuss how efforts to limit AI access are reshaping the boundaries of preservation, and what's at stake if those boundaries continue to close. Today we're joined by Mike Masnick, the founder of Techdirt; Mark Graham, the director of the Wayback Machine; and Kendra Albert, a tech and media policy expert at Albert Sellers, LLP.
[00:01:27] Here to introduce our speakers and to set the stage for today's conversation is Dave Hansen, the Executive Director of Authors Alliance.
[00:01:37] Dave Hansen: Thanks, Chris. Hi, everyone. So this is a little bit different for us today. Usually we're doing book talks, but we thought this was such an important issue, and such a fast-moving issue, that no one has yet
[00:01:47] written the book on what is happening with the crisis in web archiving and preserving the web. From the Authors Alliance perspective, we care about this 'cause we're quite fond of the internet and of being able to research adequately what has happened over time across the web.
[00:02:06] It is so important for any sort of journalistic writing, and for history. I refer to the Wayback Machine weekly, at least, when I'm writing and looking at back versions of documents and things like that. And I think we really do face a real crisis at this moment. This has never been an easy task, and I think we'll hear from Mark about how the Wayback
[00:02:28] Machine takes a lot of work, and other web preservation efforts take a lot of work. It's never been easy, but in the current moment it is particularly challenging, when we have news publishers and other platforms online making it not just legally or technically complicated, but in some cases almost impossible, to really engage in preserving content in an automated way.
[00:02:52] So we're here to talk about that, and I hope the three perspectives we've assembled here are going to fill in some of the pieces: what's happening on the ground with web preservation, what's happening in the broader policy sphere that's driving some of this, and also what the law has to say about this,
[00:03:09] 'cause often what we see happening is sort of a reflection or a shadow of the legal rights that exist. So with that, I'm gonna turn it over to our speakers here. And Mark, how about we start off with you, to just talk a little bit about what you see as the major challenge right now in terms of preserving the web in the age of AI?
[00:03:32] Mark Graham: Sure. Well, first of all, for close to 30 years, the Internet Archive's Wayback Machine has been archiving much of the public web, including journalism, and making that material available to people. I should note that a large percentage of this material is no longer available on the live web.
[00:03:51] And indeed, thousands and thousands of news sites that we have archived over the decades are no longer available. What has been going on recently, as reported by Andrew Deck of Nieman Lab at Harvard and others, is that some news organizations and other platforms, most notably the New York Times and Reddit, have begun blocking
[00:04:14] the Internet Archive's Wayback Machine from producing archives of their material and making it available. Indeed, it has been suggested that we are victims, if you will, collateral damage, caught up in the conflict between AI companies and publishers.
[00:04:35] Dave Hansen: Thanks, Mark. Mike, how about we hear from you, and take this whatever direction you want, but I'm particularly interested in your take on the broader picture, like what's happening in the policy sphere around this.
[00:04:46] Mike Masnick: I mean, it's a really tricky space, because I think a lot of people certainly recognize the value and importance of preserving culture and understanding culture, of building institutions like libraries and related institutions. And yet there's been this kind of struggle, in large part because of the rise of AI, which Mark sort of hinted at in his opening.
[00:05:14] Until just a few years ago, most people considered the archiving of the web, and of other resources, as something akin to the role of the library, which makes sense. But with the rise of AI tools, there is this interesting challenge: you know, all of the major frontier LLMs have been trained on huge corpuses of data, and they're always looking for more.
[00:05:41] And the question is where and how. There are all sorts of other related discussions on, you know, whether training is fair use, and things like that, which I don't think we need to get into here, but that is the backdrop behind all of this. And so companies, especially the media companies, are certainly very concerned about the way that the AI companies have gotten access to their data for training purposes, and they feel that they are
[00:06:08] uncompensated, and that they need to be compensated. Some of them have been working out deals. You mentioned the New York Times and Reddit; the New York Times is suing OpenAI and has cut deals with others, and Reddit has cut deals with Google and some others. There's all sorts of back and forth and negotiations and all of this debate.
[00:06:30] The archiving then becomes collateral damage to that, because the fear is, and I think it's an overblown and misguided fear, that because you have organizations like the Internet Archive building the Wayback Machine, which, again, they've built over decades, and which I think most people recognize is just a generally useful tool for the preservation of culture, for researchers, and for the journalists at some of the news organizations who are complaining, they feel that something like the Wayback Machine
[00:06:59] offers a way to go around these negotiations and undercut the negotiations in some form or another. Which I think, at some point, in retrospect, will be seen as a huge mistake by those media organizations, an overreaction to a problem that probably isn't really a problem. But in this sort of rush to deal with this concern of, oh my gosh, the AI companies are taking over everything,
[00:07:26] they're looking to plug any hole and block any opportunity for the AI companies to train on their material, and the collateral damage of that is that some of them are now blocking the Wayback Machine. And I think, you know, there are other efforts then to see whether there will be, either
[00:07:53] on the legal side, which Kendra can talk about, or just through technical measures, ways to put up effectively toll booths on the internet. If you wanna archive this, or if you wanna make use of the larger corpus of data that various news organizations have put together, do you first have to pay a toll? Do the AI scraping companies effectively have to pay for the right to read that content? And that leads to a whole bunch of other downstream
[00:08:11] issues, but I'll cut it off there and give Kendra a chance to talk as well.
[00:08:16] Kendra Albert: First of all, I'm really excited for this conversation. And, you know, before I get to the law, I wanna make the obligatory note that my experience with web archiving is a little bit unique, in the sense that I was on the founding team for Perma.cc, which is basically another web archiving service, aimed at providing permanent links for use in
[00:08:35] scholarly work, court filings, et cetera, a project that owes a lot to the Wayback Machine. So I think, you know, this is a topic that's near and dear to my heart, even outside of the intellectually interesting parts of the legal analysis. I think there are sort of two sets of legal issues that you can think about when you're thinking about web archiving.
[00:08:53] One of them, the one that everyone thinks of, and I realize probably most people do not have an instinctive legal reaction to a conversation about web archiving, is basically copyright law: the question of whether you can make a copy of someone's work and save it, even if the purpose you're saving it for is quite different from the purpose it was originally created for.
[00:09:11] And there's some good case law, primarily actually from the early two thousands, suggesting that making cached copies of websites, even full websites, by Google, is fair use, and that use of images for search results is fair use. Fair use, basically, is a limitation on copyright law that allows people to make use of copyrighted works without permission from the original owner.
[00:09:32] And I think that's how many folks think about many of these large-scale web archiving projects: they operate under fair use. And oftentimes fair use asks questions like, hey, are you harming the market for the original copyrighted works? What are you really doing with this use of the copyrighted works?
[00:09:50] The test asks questions about what you're doing, and I think in the context of web archiving, especially for memory institutions, for the kinds of criticism and accountability journalism that we're gonna talk about probably a little later, there are some really strong fair use arguments. Although, like most things in this space,
[00:10:06] it's not like we have a Supreme Court case on point about this specifically. That's, in some ways, the easier question, and when fair use is the easier question, you know you're in a bad situation. From a legal perspective, the much harder question having to do with web scraping is the process of scraping material online itself, and this falls under a kind of separate
[00:10:28] set of legal regimes, including things like the Computer Fraud and Abuse Act. If you're a little confused as to why the federal anti-hacking statute applies to certain kinds of web scraping: originally the theory had more to do with the fact that terms of service for websites would prohibit
[00:10:44] web scraping, and thus those terms of service could be used to argue that there was a CFAA violation. Now we're thinking more about the technical restrictions that folks are placing on accessing websites, or even things like robots.txt, which I'm sure we'll talk about, but which is meant to convey signals about whether websites want to be scraped.
[00:11:06] Courts do have to take up the question of whether violating those signals constitutes breaking the law. And this is where it gets even trickier, because in the fair use context you get to talk about things like, hey, it's really good for general knowledge that folks can access archived versions of websites,
[00:11:24] this isn't harming the market. In many conversations around web scraping, whether it's under the CFAA or some other legal theory, you're often much less focused on the question of why you're doing this, right? And this, I think, gets to Mike's point about the broader environment, and the good-guy archival institutions being collateral damage.
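As a concrete illustration of the robots.txt signals under discussion, here is a minimal sketch of how a well-behaved crawler can check them before fetching, using Python's standard urllib.robotparser. The rules and the "ExampleAITrainerBot" name are invented for this example; archive.org_bot is the user-agent string the Internet Archive's crawler is generally known by, though the exact directives a publisher uses will vary.

```python
# Sketch: checking hypothetical robots.txt rules before crawling.
# The directives below illustrate the pattern publishers now use:
# block a (fictional) AI training bot entirely, while letting an
# archival crawler in everywhere except a private area.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ExampleAITrainerBot
Disallow: /

User-agent: archive.org_bot
Disallow: /private/

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler consults can_fetch() before every request.
print(parser.can_fetch("ExampleAITrainerBot", "/news/story.html"))  # False
print(parser.can_fetch("archive.org_bot", "/news/story.html"))      # True
print(parser.can_fetch("archive.org_bot", "/private/draft.html"))   # False
```

Note that, as the panel discusses, robots.txt is purely advisory: nothing technically stops a crawler from ignoring it, which is why the legal status of violating these signals is contested.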
[00:11:43] Much of what we're seeing is backlash around AI training and web scraping, and there are legitimate concerns about just bandwidth use, right? Like, I'm sure Mark can speak to this more than I can, but I think it's really important to say: part of the reason people are concerned about web scraping is just that they are paying for companies to access their websites at a scale that is not sustainable.
[00:12:01] So when we think about the law, it's not an area where we have super clear answers as to legality. A lot of it depends on the particulars, and it has been defended, I think, for a long time, by the fact that most folks doing web archiving have been good actors, like the Internet Archive, like Perma, like other folks who are responsive to requests to delist material from their online archives, who are thoughtful about their engagement with people who, you know, wanna have a conversation about, oh God, are we spamming your website?
[00:12:32] Like, let's not do that, right? And so I think that's meant that actually we haven't had a ton of litigation over exactly how this is legal under copyright. In the web scraping context, there has been much more direct litigation, including a fair amount having to do with scraping of LinkedIn, especially by commercial providers.
[00:12:50] And so when we think about the law of web scraping in particular, you're thinking less about why you're doing it, although there are some slight exceptions, and more just about whether you're trying to get around technological barriers, whether you're sort of trespassing on a website: the kinds of questions that aren't typically how we think about access to things on the internet.
[00:13:11] Dave Hansen: Thanks, Kendra. The CFAA piece of this is just bewildering to me: to think that, if you follow that path to its logical conclusion, we've potentially criminalized being an archivist online. It's just wild to me that that's the world we're now in. So I wanna talk a little bit more about the motivation for blocking.
[00:13:34] I mean, we've gotten into AI as, I guess, ostensibly the driver here. But not everybody, not every news organization, not every website, has seen this pot of gold and said we must protect it at all costs. They haven't shut this down across the board, and so I wanted to probe that a little bit.
[00:13:52] What's going on there, and why is there significant variation, at least at this point, across policies? Let's, I guess, focus on news; I know there are other websites as well.
[00:14:02] Mark Graham: First of all, it should be noted that very few news organizations have actually taken these kinds of measures, like the New York Times has.
[00:14:10] The vast majority of the news organizations in the world are very happy for their resources to be archived. Indeed, if they hadn't been over the decades, we would not have access to them today. Examples would include Gawker Media, or MTV News, nearly half a million articles, from the US. Or maybe in Hong Kong, where news organizations like Apple Daily
[00:14:33] or Stand News were shut down for political reasons, and indeed editors are in jail today; the only way one can access that material is from the Wayback Machine. In addition, we partner with Bard College and PEN America on a project, the Russian Independent Media Archive, focusing on archiving Russian-language journalism in exile, and other places in the world where journalism
[00:15:00] is at risk. And I also note that Andrew Deck, in his reporting, could not find any examples where any publisher found evidence that material from the Wayback Machine was in fact being exploited by AI companies. So there's, I think, a great deal at risk here, and frankly very little, if any, evidence whatsoever of a threat to these news organizations.
[00:15:24] And at the same time, I wanna emphasize that the Internet Archive has been working collaboratively and supportively with journalists for decades; journalism often is based on references to other journalism. I was in the offices of the New York Times just a few weeks ago, and a senior researcher came up to me and said, oh my god, Mark, thank you so much for the Wayback Machine, we use you
[00:15:48] all the time; there is material available that we've used from the Wayback Machine that we can't even find in our own archives. I get those stories all the time. And I wanna emphasize that we're not static; we don't just do our thing and then nothing ever changes. The web is constantly changing.
[00:16:06] Business environments are changing, et cetera, and we change as well. We have implemented a whole variety of mechanisms over the last few years, especially with the rise of AI company scraping, to make it such that, by and large, the Wayback Machine is limited to use by humans, that the system is
[00:16:26] optimized for use by humans. We've taken specific measures to reduce, if not eliminate, bulk access to materials, especially from certain news organizations: limiting functionality in the Wayback Machine's UI, collaborating with entities like Cloudflare, putting in place rate-limiting mechanisms, and a whole variety of measures.
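For readers curious what rate limiting means in practice, here is a generic token-bucket sketch in Python. This is an illustration of the technique, not the Wayback Machine's actual implementation; the capacity and refill numbers are arbitrary.

```python
# Sketch of a token-bucket rate limiter: each client gets a bucket
# that allows short bursts (up to `capacity` requests) but refills
# slowly, so sustained bulk access is throttled.
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity            # tokens currently available
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the request."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A burst of 7 near-instant requests against a 5-token bucket:
bucket = TokenBucket(capacity=5, refill_per_sec=1)
results = [bucket.allow() for _ in range(7)]
print(results)  # the first 5 are allowed, the rest throttled
```

The design choice here is that occasional human browsing (small bursts) passes untouched, while bulk scrapers quickly drain the bucket and get refused until tokens refill.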
[00:16:48] Some of these we've taken in collaboration with news organizations who have expressed specific, legitimate concerns. So the conversation is very much ongoing, and we welcome it, as we look for ways to continue to provide the vital service that we provide: to archive and make available what is considered by many, you know, the first rough draft of history.
[00:17:11] By definition, that rough draft needs to be available, to be able to be examined, interrogated, reviewed, cited, and referenced. Just one more thought there, on the citation side of it: today, there are millions of URLs from news sites in Wikipedia articles, and a large percentage of them are only available because they're in the Wayback Machine, because the news sites they came from just don't exist anymore.
[00:17:40] Mike Masnick: The point I'll add, in terms of why news organizations are doing this, is that I think it's all part of a negotiation, right? I mean, if you look at the ones who are at the forefront of trying to do this blocking, the New York Times and Gnet mainly being some of the big ones, their own business model has changed quite a lot in the last few decades, and certainly in the last few years as well.
[00:18:06] And they're very, very focused on trying to figure out how they're going to continue to make money. And lately, that has been through negotiating with large tech and AI players and trying to cut deals. And the concern, which, again, I would argue is misplaced, is that anything that might undercut the negotiations to make a deal and prop up their business model is seen as a threat.
[00:18:33] And so the few of them that are going around and saying that the Internet Archive is a problem, or needs to be blocked from scraping their content, for the most part they're using that just to help them in their negotiations, out of the fear that, oh, if the AI companies have a backdoor into getting our content, then the negotiation with us over a deal is a different proposition.
[00:18:58] I think this is a mistake on multiple levels, but that is kind of where their thinking seems to be.
[00:19:04] Kendra Albert: I also think, to that point, that there can be this sort of way of thinking about it; I'm reminded of the famous dril tweet, like, there's no difference between good and bad things, right? This idea that, you know, in order to take a stance about bot access on your platform, you have to block all of them.
[00:19:20] Great, it doesn't matter what they're there for; doesn't matter, you know, whether they're well behaved in terms of bandwidth use or crawling, right? Like, I don't know, I haven't had conversations with folks. Some of it may be lawyer brain. I think there is a world in which lawyers, looking at their legal positions with regard to
[00:19:37] scraping that might be occurring through AI sites, might say, well, actually, this is simpler if we don't have to explain: we allow it for these folks 'cause we actually think that they're okay, or we think that the uses may be fair or whatever, but we don't allow it for these folks. I can imagine that making an argument more complicated, even if I sort of agree with Mike that I don't think it's a particularly good way to do things.
[00:19:57] And I also think websites generally, whether they're news publishers or not, should be careful about the degree to which we throw the baby out with the bathwater and say, well, actually, some of these entities are behaving badly, or doing things that we don't like with our content, whether they're entitled to legally or not, and therefore we're just going to take a broad stance across the board.
[00:20:16] The other thing I can imagine, and I don't think this is true for the New York Times, and Mark can correct me if I'm wrong, is that I think there are some circumstances, with smaller sites, where they actually may not have the specific technical expertise to really understand what's fully going on,
[00:20:30] right? Where you have a site that just knows their bandwidth costs have gone crazy; they are aware that folks are scraping the web for AI training; they're not necessarily in a position to go through and actually distinguish different sorts of bots or actors, different sorts of folks who are accessing content, and so may take a kind of uniform approach.
[00:20:49] But I mean, I think part of this has to be a conversation about, hey, you know, independent of what you think about the training data copyright fight, which I'm not gonna get into, the archival uses are really important, right? Right now I think it does this via RSS feeds, 'cause it's for headlines, but there's a bot on Bluesky and Mastodon that looks at changes to New York Times headlines, from, like, the first post to, like, 20 minutes later.
[00:21:14] And you can see this sort of diff between those headlines, and, like, that provides valuable media criticism, frankly. And we're not even talking about 20 years from now; we're talking about, like, within the day of posting, being able to see how
[00:21:30] stories have changed. So I don't think folks are doing it, I don't think the New York Times is doing it, 'cause they don't want people seeing how their headlines have changed, or, you know, that they're stealth-correcting things in the text, although they certainly do do that. But I think that some of it comes from this general framing of enclosure, as Mike was talking about, or can come from a lack of going through the details to understand the differences between different types of actors who may be using somewhat similar technologies.
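The headline-diff idea described here can be approximated in a few lines with Python's standard difflib. The two headlines below are invented for illustration; a real bot would pull them from RSS feeds or archived captures.

```python
# Sketch: diffing two captures of the same headline, the way the
# headline-tracking bots mentioned above surface editorial changes.
import difflib

earlier = "Officials Say Policy Change Likely"
later = "Officials Deny Policy Change Is Planned"

diff = list(difflib.unified_diff(
    [earlier], [later],
    fromfile="first capture", tofile="20 minutes later",
    lineterm=""))
print("\n".join(diff))
```

The unified-diff output marks the removed wording with `-` and the replacement with `+`, which is exactly the before/after view that makes this kind of media criticism legible.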
[00:21:52] Mike Masnick: Yeah, can I just add something to that? You know, one of the things that I think is important, that overlays a whole bunch of this, is the general feeling of many people, somewhat reasonably, about the entire AI space right now: there's a large sort of backlash. You know, there was this study recently that ICE has a higher approval rating than AI technology right now.
[00:22:16] There is a general sort of conceptual backlash to this technology, some of it based on perhaps good reasons, some of it based on perhaps not-good reasons; none of that matters. Culturally, there is this general backlash, and especially for smaller, less sophisticated sites that don't want to go through the process of, you know, having to deal with that and the nuance, or that just say, I want to opt out,
[00:22:40] if I can, of this technology that I feel is problematic and bad. And if they don't have a clear and easy way to do that, one response might be: well, I'm going to block any and all scraping, because I vaguely know that that is being used to allow these companies that I hate to do something with my content.
[00:23:00] And therefore, for some of them, it is not a well-thought-out "I am taking a stand against archives." They're not thinking that far. They're just saying, like, AI bad, I have no control over this situation; the only thing I can do is, you know, someone has made it easy for me to block archiving or scraping, and therefore I have to do that as a stand against this technology.
[00:23:24] Mark Graham: That's true. And at the same time, I want to note that there are many other news organizations that take the opposite approach. Specifically, for example, the Poynter Institute, along with the organization behind the Investigative Reporters and Editors conference, has partnered with the Internet Archive on a project called Today's News for Tomorrow.
[00:23:44] And what we are doing specifically is providing free archival services to more than 300 local newsrooms across the United States, to help them archive their material. They have chosen to participate in this project because they value and appreciate the importance of the archiving. And at the same time, I note that more than 200 journalists have recently signed a letter endorsing the work of the Wayback Machine, celebrating it.
[00:24:13] In fact, Rachel Maddow and others are on record supporting this and have signed the letter of support. So we are focusing here on some of the pushback from a very small number of influential and well-known news and other sites. But I want to put across the point here that, generally speaking, we are able to continue to provide the service that we have for decades with the
[00:24:36] active support of, first of all, the patrons of the Internet Archive, the folks that are curious enough to wanna learn, and of journalists writ large and of media platforms.
[00:24:46] Mike Masnick: And I signed that letter, and I completely agree with that thinking; I'm just sort of explaining some of the other thinking. And I would even go a little bit further: you know, beyond just the importance of archiving and of journalists being able to use these tools for research, I do worry a little bit, even if we're talking about the AI technologies as well, that when you have major publications like the New York Times trying to block
[00:25:11] any and every possible way in which their writing might be read by AI tools, that actually has problematic downstream consequences as well, where what's left out there are, you know, more problematic publications, while the ones that have done more careful reporting are blocked. The New York Times sometimes does careful reporting,
[00:25:32] not always, I would say, but, like, you want to have good reporting in these archives, and in the AI tools, as people are using them, so that they're not overrun by more problematic content.
[00:25:44] Mark Graham: It does, and if I could build on this just a bit: it sets a very bad precedent, and that then bleeds into other areas of publishing. For example,
[00:25:54] the US government, the world's largest publisher, uses large commercial platforms for much of its publishing. The US Agency for Global Media, the folks behind Radio Free Europe, et cetera, use YouTube to publish videos, millions of videos, thousands of which have been taken down since this new Trump administration.
[00:26:12] A couple of months ago, the State Department said that they were gonna remove all of the social media posts from prior to the Trump administration, and as we were racing to archive more than 2 million social media posts, we were watching accounts from embassies, ambassadors, and others over the years literally disappear from our screens as we were trying to archive them.
[00:26:34] So I think it's a dangerous precedent, and something that we should be paying attention to in all dimensions of how we are working to preserve the materials that are published, and to never trust a publisher to do the job of a library.
[00:26:48] Dave Hansen: As you're talking, I'm really thinking here about some of the business model stuff that underlies so much of these concerns. And I was recalling, it was like three or four years ago, I guess, at this point, working with a library that was doing a licensing deal with a rather large newspaper. And I mean, the numbers that they showed me: they're talking about six-figure data licenses for access to the newspaper data.
[00:27:15] We've had people talk about this before on here. Sarah Lamdan did a talk about her book Data Cartels, where, you know, a lot of it focuses on Reed Elsevier, the academic publisher, and there's this real disconnect between how authors and contributors and journalists think of those outlets and what those outlets actually are from a business perspective.
[00:27:34] I think the New York Times at this point is as much a data and analytics company as it is a newspaper. Reed Elsevier specifically calls itself a data analytics company, even though it is, on paper, an academic publisher. And I think it doesn't really help solve the situation, but it at least explains a little bit more to me why they are making the moves that they are around restricting access to this content.
[00:27:58] If that's, like, the core of your business, I still don't like it, but that explains a little bit. So I do wanna talk about some other companies; outside of news, I guess, is where I'd like to go. So Reddit has been, you know, pretty public about blocking access; they have a lawsuit right now against Anthropic that's been a kind of interesting one to watch. And there are lots of other commercial platforms, social media platforms, for instance, that are restricting access for web scraping and preservation.
[00:28:28] So, like, what's going on there? Kendra, maybe we can start with you, to just talk a little bit about what's happening in litigation with some of these other platforms.
[00:28:52] Kendra Albert: Right. To Mark's point, and your point about these sorts of platforms: I think it's really valuable in some ways to think about the actual rights to the content, or your legal right to use the content, as a functionally, totally separate question from how scraping is treated. And specifically with Reddit, right, Reddit doesn't have the right to sue someone for copyright infringement for copying Reddit posts. Like, you know, I haven't read their terms of service recently, but I'm pretty sure you're not allowing Reddit to sue on your behalf for copyright infringement.
[00:29:06] But oftentimes, the way this litigation is framed is around access to the platform, around circumventing technological measures (that's the anti-circumvention part of the copyright statute, Section 1201), or through things like, you know, this fantastic, and I'm not sure in what sense I mean that word, tort
[00:29:26] that used to be for when you, like, touched someone's car without permission, called trespass to chattels, which has historically been used in some contexts for web scraping, although usually you need to show that there's some form of harm to, you know, the infrastructure in order to bring it.
[00:29:41] We talked about the CFAA; there's, you know, trade secret; there's all kinds of other legal claims. So in some ways, when you're thinking about how some of these platforms are choosing to back up their business model goals: you know, Reddit has done licensing deals with AI companies.
[00:29:57] I forget which one off the top of my head. But, you know, there is a very real conversation about, hey, why should we pay you for this data if we could scrape it for much less money? Now, of course, the version that you're gonna get from Reddit if you pay them for it is probably going to have
[00:30:14] other advantages, just in terms of the metadata, the infrastructure, you know, being able to ask Reddit questions about how the data works, right? All that kind of stuff. But when we're thinking about the legal reality behind these decisions, I think part of it has to do with the business model.
[00:30:29] And part of it has to do with, you know, I think some degree to which some of these platforms may be genuinely responding to their own users being upset. And LinkedIn's scraping litigation, from before the current generative AI days, is actually a really good example of this, because LinkedIn brought a lot of scraping litigation against primarily business competitors that were using LinkedIn data in order to, you know, run a recruiting tool or do other things that one might wanna do with professional information.
[00:31:00] To some extent that was protective of their business model, right? These were effectively their competitors, or they would roll out a product that was competing with whatever that company was doing. But also legitimately, sometimes folks had real privacy concerns about the fact that, Hey, I shared this data on LinkedIn.
[00:31:14] I didn't assume that it was going to go everywhere; now it's gone everywhere, right? I think, you know, that is different than the web archiving context, and I'm not saying, oh, this is the same thing. But why I bring it up is to say that you have this circumstance under which there's a variety of different incentives for limiting access to data, and it's impossible to disentangle them.
[00:31:36] Right? You know, it's impossible to say, oh, this is only because of business models, or, oh, this is only because people have privacy or usage concerns, that this goes outside where it was supposed to be. And that's often how tech companies frame it: LinkedIn has long said, actually, that their primary reason for a lot of their anti-scraping tooling is to protect users' privacy.
[00:31:57] Now, I think it's a hard position to defend, given the business model stuff, that that's the only reason. But I don't think it's not part of it. So I think that when we think about the moves by companies like Reddit to restrict all kinds of access, including the Internet Archive and the Wayback Machine,
[00:32:14] You know, you can't just pin it to one thing, and it's not always based on one specific like legal theory, because oftentimes they're trying a bunch of different stuff simultaneously, of which copyright might be one of the tools, but often is actually not the most useful if you're talking about really significant amounts of web scraping.
[00:32:30] I hope that sort of answered your question, Dave.
[00:32:34] Mike Masnick: The one thing I was gonna add, in the Reddit context, is that it is an example of where this can lead in terms of starting to test out questionable or extreme legal theories, right? So, like, one of the cases that Reddit has is against this company called SerpApi, which,
[00:32:51] I don't know how they pronounce their name, and you can argue that this is perhaps not a good company. But basically what they do is they scrape Google results and create an API so that you can programmatically make use of Google results. Google is also suing them, but that's a separate case. But you have Reddit suing this company over copyrights that Reddit doesn't own.
[00:33:14] As Kendra noted, it's the users who hold any copyright interest in most cases, if there's any copyright interest at all. And they're suing this company for scraping Google's results, which, again, is not Reddit, and claiming that it's a DMCA 1201 anti-circumvention violation over a technological protection measure that Reddit itself hasn't set up.
[00:33:34] The only thing that they've done is cut a $40 million deal with Google. And so you get these sort of stacking legal theories and questionable things. While you can see, like, okay, Reddit is upset that perhaps AI companies are routing around doing a deal with Reddit or with Google, because they can use a company like SerpApi to get Google results that include Reddit content, since Google has a deal with Reddit,
[00:34:01] it leads to really questionable places in terms of other types of scraping, or other uses that are important and useful culturally. But because everybody's sort of trying to figure out how do we do these things and how do we cut these deals, you see these somewhat stretched legal definitions, I think, or attempts at questionable cases.
[00:34:22] Kendra Albert: And can I just say one more thing about Mike's point real fast? Which is, I think that's entirely true. And I think the other thing to point out there is, right, you know, I think it's good to distinguish between good things and bad things, right? I'll go on the record as being in favor of that.
[00:34:36] Right. I think when we're talking about making case law, oftentimes the factors judges look at, or the decisions judges make, don't say, okay, well, I don't like this company 'cause I think their business model's bad, and so I'm gonna find that they violated the CFAA because of that, but for the good guys, it's not a CFAA violation.
[00:34:52] That's not how that part of the law usually works. We actually get to do that way more in fair use, because of that. In my current job, we often work with researchers who scrape internet platforms to look for things like, you know, bias, discrimination, to understand how platforms work, right? Like, that kind of thing.
[00:35:07] Those folks are subject to all the same bodies of law that get made by, well, Reddit is pissed off that you can get Reddit results from Google at this company, or, you know, Reddit feels like they're channeling their users' outrage that users' data is being used for these purposes they didn't intend.
[00:35:25] Right. So I think it is really important to note that, like, you know, archiving, research, all of these kinds of uses often basically require exactly the same tools. Just like the Wayback Machine uses, you know, bots to view webpages and archive them, right? Sorry, Mark, I am wildly dumbing down the complexity of what you do.
[00:35:45] But, you know, researchers are using the same tools to scrape data and to understand how tech works, right? So I think, you know, it's not actually easy to just be like, okay, great, this technology, this way of doing it, is good or bad, and we should just, like, make a rule generally.
[00:36:00] Mark Graham: Let me explain a little bit, too, about what's at risk here beyond just news.
[00:36:04] The Internet Archive archives more than a billion URLs a day, and one of the signals that we follow is links added to Wikipedia articles, for example, all of them. And as a result of that, we have been able to identify and fix, that is, edit and replace, otherwise broken URLs that would return a 404,
[00:36:26] with archives of those references that human beings had added to Wikipedia articles over the years. More than 30 million links have been fixed in this way. Pew Research, for example, identified, for a collection of URLs they looked at that were 10 years old, that 38% of them were no longer available on the live web.
[00:36:46] So what does that mean if we can't have access to this material anymore? A variety of things. Hundreds of times a year, the Wayback Machine team produces an affidavit to attest to the veracity of our web archives for use by lawyers in courts, and often these are cases of product liability, maybe a misrepresentation by a company, et cetera, and this material is often the critical evidence that is used to determine the outcome of the case.
[00:37:16] So there are any number of applications of web archives beyond just news that are vital to our society, to be able to hold those in power accountable and to help those curious enough to learn to inform themselves.
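The link-fixing workflow Mark describes, finding a dead link and swapping in an archived snapshot, can be sketched against the Wayback Machine's public Availability API at archive.org/wayback/available. The endpoint and the JSON shape below follow the API's public documentation, but this is only a minimal illustration: the sample response is invented here rather than fetched live, and a real tool would use an HTTP client to issue the request.

```python
import json
from urllib.parse import urlencode

WAYBACK_API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build a Wayback Availability API request URL for a (possibly dead) link."""
    params = {"url": url}
    if timestamp:  # YYYYMMDDhhmmss prefix: ask for the snapshot closest to this date
        params["timestamp"] = timestamp
    return f"{WAYBACK_API}?{urlencode(params)}"

def closest_snapshot(response_text):
    """Return the closest archived snapshot URL from an API response, or None."""
    data = json.loads(response_text)
    snap = data.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

# Sample response shaped like the API's documented JSON (invented, not a live call):
sample = json.dumps({
    "url": "http://example.com/",
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20240101000000/http://example.com/",
            "timestamp": "20240101000000",
            "status": "200",
        }
    },
})
print(availability_query("http://example.com/", "20240101"))
print(closest_snapshot(sample))
```

A link-rot fixer along the lines of the Wikipedia work would fetch the first URL for each broken reference and, when `closest_snapshot` returns a replacement, edit that snapshot URL into the article.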
[00:37:30] Mike Masnick: I do think that it is important to just remember the concept of the open web itself, and sort of how we got here in the first place.
[00:37:38] I think it gets very easy, I mean, I even sort of got bogged down immediately on the AI aspect of all this. But, you know, the open web has been around for more than three decades at this point, and I think many of us are here because we believe in the promise of the open web and what it enabled in terms of
[00:37:56] community and culture and sharing of information and meeting people, everything. You know, so much of what we rely on today was built on this open web, and the concept of the open web is this idea that it's not controlled by any one entity, and it is not locked down and limited, but that we can build on it and do more with it, and we can share with each other and build culture.
[00:38:21] Culture is about multiple people understanding the same concepts, and that is built very much on the open web these days. And so much of where this unfortunately potentially leads is a locking down of the open web, just because, you know, of concerns about how it might be used in one particular way.
[00:38:41] And so, just as I know we're sort of getting to the Q&A part, I felt like we should emphasize that aspect of why we're all here.
[00:38:52] Chris Freeland: Thank you, Mike, for acknowledging that. I'd say long live the open web. I 100% agree with everything you said. The open web is an important part of our culture, and I hope that it remains that way.
[00:39:03] And Mark, I think it may be helpful if you can explain: how does the Wayback Machine make data available in bulk? And what kinds of protections are in place to prevent some of the abuses that have been mentioned here?
[00:39:16] Mark Graham: Sure. Generally speaking, we don't make material available in bulk. The underlying files behind the Wayback Machine are generally not
[00:39:25] publicly accessible. We do provide an ability to replay individual web pages, through what I refer to as the thin straw of the Wayback Machine. For those of you who have used the service, you understand what I mean: it's pretty slow. There are certain features where one can list large numbers of URLs for a given site.
[00:39:47] For example, at the request of some publishers, including the New York Times, we've disabled that capability for those particular sites. We do some archiving of material that is generally considered to be publicly available, in particular material from governments. We participate with many others, including Kendra with Perma.cc at Harvard, on doing a deep dive on material from the US government.
[00:40:13] And we do package that material up, and we do make bulk access to that particular collection of web archives available to researchers and others. And also, as I noted, that's on the playback side. On the archiving side, and in how we serve material out to the world, there are a variety of mechanisms that we put in place to do rate limiting, to detect and deter access to the service that is not human-originated.
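For a concrete sense of the "thin straw" on the playback side, the Wayback Machine exposes a public CDX server that lists the captures of a single URL, which is the kind of per-URL listing Mark describes. The endpoint and field names below follow the CDX server's public documentation, but the sample rows are invented for illustration, not real captures, and a real client would fetch the built URL over HTTP.

```python
from urllib.parse import urlencode

CDX_API = "https://web.archive.org/cdx/search/cdx"

def cdx_query(url, limit=10):
    """Build a CDX server query listing captures of a single URL."""
    params = {"url": url, "output": "json", "limit": str(limit)}
    return f"{CDX_API}?{urlencode(params)}"

def parse_captures(rows):
    """CDX JSON output: the first row is the field header, the rest are captures."""
    header, *captures = rows
    return [dict(zip(header, c)) for c in captures]

# Invented sample rows shaped like the CDX server's documented JSON output:
rows = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20240101000000", "http://example.com/",
     "text/html", "200", "ABC123", "1234"],
]
caps = parse_captures(rows)
print(cdx_query("example.com", limit=1))
print(caps[0]["timestamp"], caps[0]["original"])
```

Each capture's timestamp and original URL can then be combined into a replay URL of the form `web.archive.org/web/<timestamp>/<original>`, which is what the playback interface serves one page at a time.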
[00:40:41] Chris Freeland: Very helpful, thank you. Question for everyone. If the Wayback Machine and other archival institutions get blocked, people are probably still gonna do some archiving, but they're gonna do so in maybe less legitimate ways, with screenshots and other things. And so I'd be interested in your thoughts on this issue of, like, maybe the non-legitimate archives, or the preservation by organizations that are outside of the traditional library sphere.
[00:41:05] What does that mean for the historical record?
[00:41:07] Kendra Albert: I'm gonna just leave "non-legitimate archives" over there. Well, so I think there's a couple things to think about there. One is, I think, like, yes, certainly, you know, screenshots are not as good as a more interactive capture of the page, but I think, like, ultimately having something of it is better than having nothing at all.
[00:41:23] Right. One area I work on a lot is video game preservation, where we encounter a lot of somewhat similar challenges in terms of the technological complexity, the challenges of permissions from rights holders, that kind of thing. And one thing that I think about a lot
[00:41:37] there is, like, in some ways, when you make it really hard for institutions to legitimately preserve things, for institutions that are big and public, who are very clear about what they do and how they do it, right, you do in some ways cede ground to smaller institutions that may have different practices, right?
[00:41:55] Some of those institutions are, like, really good at what they do, and they're just quiet about it, and that's great. And some of those institutions, I think we've maybe all followed: there was a whole kerfuffle about, I think, archive.is, which was sort of a tool that people used for archiving webpages, often, you know, getting around paywalls, that was, like, allegedly running a fake CAPTCHA that was DDoSing a critic of the site.
[00:42:18] I think that's a really good example of one of the potential downsides of some of the more aggressive attempts to limit automated access, or access generally. Because folks were not going to that site because they necessarily would've preferred that site. You know, they were going to that site because they could view content there that they weren't able to view elsewhere, or they could access an archived page that they couldn't access elsewhere.
[00:42:39] And so I think there is a real risk, in a lot of these spaces, of making it very hard for institutions that wanna do the right thing to effectively preserve or save works, and then that sort of causes challenges both for the historical record and for sort of who's left.
[00:42:57] Mike Masnick: I mean, I think there are good actors in this space, and obviously the Wayback Machine and the Internet Archive are a very clear example of a good actor, and if you continue to make life difficult for them, it is only going to push people to those who maybe are less good actors, and there are other kinds of collateral damage that come along with that.
[00:43:18] Chris Freeland: Leaving the non-legitimate archives on the floor, but something of a related question: should preservation institutions be treated differently from AI companies in law or policy? And are there then proactive policies that libraries need to be able to continue doing this work in the digital age?
[00:43:36] Kendra Albert: I mean, in some ways they already are, right? You know, Section 107, the reason I kept saying that you actually get to talk about what people do: Section 107, which is fair use within the US, like, does actually care about what you're doing with the content. Section 108 of the Copyright Act is specific to libraries
[00:43:53] and certain kinds of archival and preservation institutions, and allows them to do things that other institutions can't do. You know, it's not a question of, like, should we treat them differently; we already do. It then becomes a question of, hey, should we treat them differently anywhere else, right, is maybe the sort of question I'm asking. And I think it's really hard, in the existing scraping law context, to see how that would quite work.
[00:44:12] Although I think we did see some of that in a case called Sandvig v. DOJ, where some researchers sued the DOJ over the Computer Fraud and Abuse Act's criminal components making it harder to do First Amendment-protected kinds of research. So I think there are some inklings of that, and it would be fantastic, I think, to
[00:44:28] see more engagement with this question of what are the actual uses we think are good and important, and how do we promote those, versus sort of, you know, okay, just get rid of the whole thing.
[00:44:38] Mark Graham: Yeah, I'm gonna add, first of all, that I'm not a lawyer, and I do recognize, you know, that existing copyright and fair use allowances
[00:44:45] substantiate the work of the Wayback Machine and support it. But at the same time, you know, there was the Vanderbilt clause, added to carve out specific, explicit protections in the area of television news archiving, and I should note that the Internet Archive has a robust television news archiving program as well.
[00:45:03] But I wanna flip it around a little bit and say that, you know, news is a very special category of online material. It plays a vital role in our democracy; indeed, it's referred to as the fourth estate, and various measures of privilege are given to news organizations. And I might suggest that with those privileges and rights come certain responsibilities related to access and availability.
[00:45:33] I think we're living in a world that's awash with mis- and disinformation. The Internet Archive recently co-published a paper that suggests that up to a third of new websites and web pages appearing on the public web today are at least partially AI-generated. And so this is a time of rapid change. In fact, if we're paywalling and making quality journalism generally
[00:45:57] unavailable to people unless they have a subscription, which is a teeny, teeny percentage of the population, then we're gonna end up more and more with a world where the truth, the quality journalism, is paywalled and therefore generally inaccessible to people, but the lies will proliferate, and they will become, as they are in many cases, the dominant presence in the conversation. When I was growing up,
[00:46:22] I had a library, a physical library, and I had access to the New York Times and other magazines that I was able to read. If that library hadn't had access to that material, I simply wouldn't have had access.
[00:46:34] Chris Freeland: Hat tip to Nathan J. Robinson and Current Affairs: the truth is paywalled, but the lies are free. I wanna leave with our final question here for each of the panelists.
[00:46:43] What can anyone who's listening here today do to help change this trajectory?
[00:46:49] Mike Masnick: I mean, speak about it, talk about it. You know, obviously, use the tools well and intelligently, and explain to others how you're using these tools and why they matter. Certainly, when it comes to things like potential policy or legislation, being aware of what's happening, and being willing to speak out and make sure that there is nothing that will then get in the way of important cultural institutions like the Internet Archive. But really, just, you know, being a part of the conversation.
[00:47:19] I think a lot of people don't understand, you know, where this is leading, and sort of the impact on organizations like the Internet Archive and tools like the Wayback Machine. And so making sure that more people are aware, I think, is the most important thing that, at an individual level, you can do.
[00:47:36] Obviously, at institutional levels, if you do work for a news organization that is blocking access to the Internet Archive, maybe try to convince people that that is a bad idea and will have downstream cultural impacts that are not good for society. But that all depends more on where people are situated.
[00:47:54] Kendra Albert: Mike stole one of the things I was gonna say, which is, I think that, you know, for folks who have institutional affiliations, making sure that, a, like, you can still access the Internet Archive, that you're still able to access its pages from your institution, and then, if not, making the case internally that, hey, this is why it's important
[00:48:11] for my work, for the things that I do, for the things that I care about. Which I think is gonna be much more powerful coming from folks who are internal to an institution than necessarily coming from those of us who are sort of out here being like, doom is coming, you know, archiving is stopping. So, to the extent that folks have an institutional role where they can bring attention to these issues, I think that's really valuable.
[00:48:33] Chris Freeland: Mark, how about you?
[00:48:34] Mark Graham: Just a few things. First of all, use our service. We're a public library, and we love it when people are able to benefit from the resources that are available from our library, and give us feedback about how we can do a better job at providing those services. Subscribe to our newsletters, follow us on the socials.
[00:48:54] If you're a journalist, or know a journalist, I'd recommend that you check out the Fight for the Future letter that, Chris, you shared here. And then, if you're in the Bay Area, come visit us. We host more than a hundred events a year at our facility in San Francisco, and every Friday at one o'clock, except for, I think, Thanksgiving and Christmas, we host a tour,
[00:49:16] so you can kind of get an in-depth and personal look at what we do and how we do it.
[00:49:23] Chris Freeland: Thank you for that, Mark. Thank you to Mark and to Mike and to Kendra for such a fascinating conversation today, and to Dave Hansen and Authors Alliance, as always, for facilitating and co-hosting this session. Thanks everyone.
[00:49:35] Have a great day.
[00:49:36] Thanks for joining us on this journey into the future of knowledge. Be sure to follow the show. New episodes drop every other Wednesday, with bold ideas, fresh insights, and the voices shaping tomorrow.
